One problem that might arise with read is what to do when there’s no data yet, but we’re not at end-of-file.
The default answer is “go to sleep waiting for data.” This section shows how a process is put to sleep, how it is awakened, and how an application can ask if there is data without just blindly issuing a read call and blocking. We then apply the same concepts to write.
As usual, before we show actual code, we’ll explain a few concepts.
Whenever a process must wait for an event (such as the arrival of data or the termination of a process), it should go to sleep. Sleeping causes the process to suspend execution, freeing the processor for other uses. At some future time, when the event being waited for occurs, the process will be woken up and will continue with its job. This section discusses the 2.4 machinery for putting a process to sleep and waking it up. Earlier versions are discussed in Section 5.7 later in this chapter.
There are several ways of handling sleeping and waking up in Linux, each suited to different needs. All, however, work with the same basic data type, a wait queue (wait_queue_head_t). A wait queue is exactly that—a queue of processes that are waiting for an event. Wait queues are declared and initialized as follows:
wait_queue_head_t my_queue;
init_waitqueue_head (&my_queue);
When a wait queue is declared statically (i.e., not as an automatic variable of a procedure or as part of a dynamically-allocated data structure), it is also possible to initialize the queue at compile time:
DECLARE_WAIT_QUEUE_HEAD (my_queue);
It is a common mistake to neglect to initialize a wait queue (especially since earlier versions of the kernel did not require this initialization); if you forget, the results will usually not be what you intended.
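When the wait queue lives inside a dynamically allocated structure, the runtime initializer is the only option. The following is a minimal sketch; struct my_dev and its readq field are hypothetical names for illustration, not part of any example module:

```c
#include <linux/sched.h>
#include <linux/slab.h>
#include <linux/wait.h>

/* Hypothetical per-device structure with an embedded wait queue. */
struct my_dev {
    wait_queue_head_t readq;
    /* ... other per-device state ... */
};

static struct my_dev *my_dev_alloc(void)
{
    struct my_dev *dev = kmalloc(sizeof(struct my_dev), GFP_KERNEL);
    if (!dev)
        return NULL;
    init_waitqueue_head(&dev->readq); /* must happen before the first sleep */
    return dev;
}
```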
Once the wait queue is declared and initialized, a process may use it to go to sleep. Sleeping is accomplished by calling one of the variants of sleep_on, depending on how deep a sleep is called for.
- sleep_on(wait_queue_head_t *queue);
Puts the process to sleep on this queue. sleep_on has the disadvantage of not being interruptible; as a result, the process can end up being stuck (and un-killable) if the event it’s waiting for never happens.
- interruptible_sleep_on(wait_queue_head_t *queue);
The interruptible variant works just like sleep_on, except that the sleep can be interrupted by a signal. This is the form that device driver writers have been using for a long time, before wait_event_interruptible (described later) appeared.
- sleep_on_timeout(wait_queue_head_t *queue, long timeout);
  interruptible_sleep_on_timeout(wait_queue_head_t *queue, long timeout);
These two functions behave like the previous two, with the exception that the sleep will last no longer than the given timeout period. The timeout is specified in “jiffies,” which are covered in Chapter 6.
- void wait_event(wait_queue_head_t queue, int condition);
  int wait_event_interruptible(wait_queue_head_t queue, int condition);
These macros are the preferred way to sleep on an event. They combine waiting for an event and testing for its arrival in a way that avoids race conditions. They will sleep until the condition, which may be any boolean C expression, evaluates true. The macros expand to a while loop, and the condition is reevaluated over time—the behavior is different from that of a function call or a simple macro, where the arguments are evaluated only at call time. The latter macro is implemented as an expression that evaluates to 0 in case of success and -ERESTARTSYS if the loop is interrupted by a signal.
It is worth repeating that driver writers should almost always use the interruptible instances of these functions/macros. The noninterruptible version exists for the small number of situations in which signals cannot be dealt with, for example, when waiting for a data page to be retrieved from swap space. Most drivers do not present such special situations.
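In a driver, the interruptible macro is typically used as follows. This is only a sketch: dev->inq and the data_ready flag are hypothetical names, and the flag would be set elsewhere by whatever code produces the data.

```c
/* Sleep until data arrives, honoring signals.
 * dev->inq and dev->data_ready are hypothetical. */
if (wait_event_interruptible(dev->inq, dev->data_ready))
    return -ERESTARTSYS; /* a signal woke us; let upper layers handle it */
/* here, dev->data_ready was true at the moment we were awakened */
```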
Of course, sleeping is only half of the problem; something, somewhere will have to wake the process up again. When a device driver sleeps directly, there is usually code in another part of the driver that performs the wakeup, once it knows that the event has occurred. Typically a driver will wake up sleepers in its interrupt handler once new data has arrived. Other scenarios are possible, however.
Just as there is more than one way to sleep, so there is also more than one way to wake up. The high-level functions provided by the kernel to wake up processes are as follows:
- wake_up(wait_queue_head_t *queue);
This function will wake up all processes that are waiting on this event queue.
- wake_up_interruptible(wait_queue_head_t *queue);
wake_up_interruptible wakes up only the processes that are in interruptible sleeps. Any process that sleeps on the wait queue using a noninterruptible function or macro will continue to sleep.
- wake_up_sync(wait_queue_head_t *queue);
  wake_up_interruptible_sync(wait_queue_head_t *queue);
Normally, a wake_up call can cause an immediate reschedule, meaning that other processes might run before wake_up returns. The “synchronous” variants instead make any awakened processes runnable but do not reschedule the CPU. They are used to avoid rescheduling when the current process is known to be about to sleep anyway, which forces a reschedule regardless. Note that awakened processes could run immediately on a different processor, so these functions should not be expected to provide mutual exclusion.
If your driver is using interruptible_sleep_on, there is little difference between wake_up and wake_up_interruptible. Calling the latter is a common convention, however, to preserve consistency between the two calls.
As an example of wait queue usage, imagine you want to put a process to sleep when it reads your device and awaken it when someone else writes to the device. The following code does just that:
DECLARE_WAIT_QUEUE_HEAD(wq);

ssize_t sleepy_read (struct file *filp, char *buf, size_t count,
        loff_t *pos)
{
    printk(KERN_DEBUG "process %i (%s) going to sleep\n",
           current->pid, current->comm);
    interruptible_sleep_on(&wq);
    printk(KERN_DEBUG "awoken %i (%s)\n", current->pid, current->comm);
    return 0; /* EOF */
}

ssize_t sleepy_write (struct file *filp, const char *buf, size_t count,
        loff_t *pos)
{
    printk(KERN_DEBUG "process %i (%s) awakening the readers...\n",
           current->pid, current->comm);
    wake_up_interruptible(&wq);
    return count; /* succeed, to avoid retrial */
}
The code for this device is available as sleepy in the example programs and can be tested using cat and input/output redirection, as usual.
An important thing to remember with wait queues is that being woken up does not guarantee that the event you were waiting for has occurred; a process can be woken for other reasons, mainly because it received a signal. Any code that sleeps should do so in a loop that tests the condition after returning from the sleep, as discussed in Section 5.2.5 later in this chapter.
The previous discussion is all that most driver writers will need to know to get their job done. Some, however, will want to dig deeper. This section attempts to get the curious started; everybody else can skip to the next section without missing much that is important.
The wait_queue_head_t type is a fairly simple structure, defined in <linux/wait.h>. It contains only a lock variable and a linked list of sleeping processes. The individual data items in the list are of type wait_queue_t, and the list is the generic list defined in <linux/list.h> and described in Section 10.5 in Chapter 10. Normally the wait_queue_t structures are allocated on the stack by functions like interruptible_sleep_on; the structures end up in the stack because they are simply declared as automatic variables in the relevant functions. In general, the programmer need not deal with them.
Some advanced applications, however, can require dealing with wait_queue_t variables directly. For these, it’s worth a quick look at what actually goes on inside a function like interruptible_sleep_on. The following is a simplified version of the implementation of interruptible_sleep_on to put a process to sleep:
void simplified_sleep_on(wait_queue_head_t *queue)
{
    wait_queue_t wait;

    init_waitqueue_entry(&wait, current);
    current->state = TASK_INTERRUPTIBLE;

    add_wait_queue(queue, &wait);
    schedule();
    remove_wait_queue (queue, &wait);
}
The code here creates a new wait_queue_t variable (wait, which gets allocated on the stack) and initializes it. The state of the task is set to TASK_INTERRUPTIBLE, meaning that it is in an interruptible sleep. The wait queue entry is then added to the queue (the wait_queue_head_t * argument). Then schedule is called, which relinquishes the processor to somebody else. schedule returns only when somebody else has woken up the process and set its state to TASK_RUNNING. At that point, the wait queue entry is removed from the queue, and the sleep is done.
Figure 5-1 shows the internals of the data structures involved in wait queues and how they are used by processes.
A quick look through the kernel shows that a great many procedures do their sleeping “manually” with code that looks like the previous example. Most of those implementations date back to kernels prior to 2.2.3, before wait_event was introduced. As suggested, wait_event is now the preferred way to sleep on an event, because interruptible_sleep_on is subject to unpleasant race conditions. A full description of how that can happen will have to wait until Section 9.8.4 in Chapter 9; the short version is that things can change in the time between when your driver decides to sleep and when it actually gets around to calling interruptible_sleep_on.
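The manual alternative closes that window by setting the task state before testing the condition, so a wake_up that arrives between the test and schedule simply makes the subsequent schedule return immediately instead of being lost. A sketch of the race-free pattern follows; cond() is a placeholder for the driver's own boolean test:

```c
/* Race-free manual sleep (illustrative sketch, 2.4-style API). */
static void sleep_on_cond(wait_queue_head_t *queue, int (*cond)(void))
{
    wait_queue_t wait;

    init_waitqueue_entry(&wait, current);
    add_wait_queue(queue, &wait);
    while (1) {
        set_current_state(TASK_INTERRUPTIBLE); /* set state *before* testing */
        if (cond())
            break; /* a wake_up between the test and schedule() is not lost */
        schedule();
    }
    set_current_state(TASK_RUNNING);
    remove_wait_queue(queue, &wait);
}
```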
One other reason for calling the scheduler explicitly, however, is to do exclusive waits. There can be situations in which several processes are waiting on an event; when wake_up is called, all of those processes will try to execute. Suppose that the event signifies the arrival of an atomic piece of data. Only one process will be able to read that data; all the rest will simply wake up, see that no data is available, and go back to sleep.
This situation is sometimes referred to as the “thundering herd problem.” In high-performance situations, thundering herds can waste resources in a big way. The creation of a large number of runnable processes that can do no useful work generates a large number of context switches and processor overhead, all for nothing. Things would work better if those processes simply remained asleep.
For this reason, the 2.3 development series added the concept of an exclusive sleep. If processes sleep in an exclusive mode, they are telling the kernel to wake only one of them. The result is improved performance in some situations.
The code to perform an exclusive sleep looks very similar to that for a regular sleep:
void simplified_sleep_exclusive(wait_queue_head_t *queue)
{
    wait_queue_t wait;

    init_waitqueue_entry(&wait, current);
    current->state = TASK_INTERRUPTIBLE | TASK_EXCLUSIVE;

    add_wait_queue_exclusive(queue, &wait);
    schedule();
    remove_wait_queue (queue, &wait);
}
Adding the TASK_EXCLUSIVE flag to the task state indicates that the process is in an exclusive wait. The call to add_wait_queue_exclusive is also necessary, however. That function adds the process to the end of the wait queue, behind all others. The purpose is to leave any processes in nonexclusive sleeps at the beginning, where they will always be awakened. As soon as wake_up hits the first exclusive sleeper, it knows it can stop.
The attentive reader may have noticed another reason to manipulate wait queues and the scheduler explicitly. Whereas functions like sleep_on will block a process on exactly one wait queue, working with the queues directly allows sleeping on multiple queues simultaneously. Most drivers need not sleep on more than one queue; if yours is the exception, you will need to use code like what we’ve shown.
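Sleeping on two queues at once follows the same manual pattern, with one wait_queue_t entry per queue. This is only an illustrative sketch of the technique, mirroring the simplified_sleep_on shown earlier:

```c
/* Sketch: sleep until a wake_up arrives on either of two queues. */
void sleep_on_two(wait_queue_head_t *q1, wait_queue_head_t *q2)
{
    wait_queue_t wait1, wait2;

    init_waitqueue_entry(&wait1, current);
    init_waitqueue_entry(&wait2, current);
    current->state = TASK_INTERRUPTIBLE;

    add_wait_queue(q1, &wait1);
    add_wait_queue(q2, &wait2);
    schedule(); /* a wake_up on either queue makes the process runnable */
    remove_wait_queue(q1, &wait1);
    remove_wait_queue(q2, &wait2);
}
```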
Those wanting to dig even deeper into the wait queue code can look at <linux/sched.h> and kernel/sched.c.
When a process is put to sleep, the driver is still alive and can be called by another process. Let’s consider the console driver as an example. While an application is waiting for keyboard input on tty1, the user switches to tty2 and spawns a new shell. Now both shells are waiting for keyboard input within the console driver, although they sleep on different wait queues: one on the queue associated with tty1 and the other on the queue associated with tty2. Each process is blocked within the interruptible_sleep_on function, but the driver can still receive and answer requests from other ttys.
Of course, on SMP systems, multiple simultaneous calls to your driver can happen even when you do not sleep.
Such situations can be handled painlessly by writing reentrant code. Reentrant code is code that doesn’t keep status information in global variables and thus is able to manage interwoven invocations without mixing anything up. If all the status information is process specific, no interference will ever happen.
If status information is needed, it can either be kept in local variables within the driver function (each process has a different stack page in kernel space where local variables are stored), or it can reside in private_data within the filp accessing the file. Using local variables is preferred because sometimes the same filp can be shared between two processes (usually parent and child).
If you need to save large amounts of status data, you can keep the pointer in a local variable and use kmalloc to retrieve the actual storage space. In this case you must remember to kfree the data, because there’s no equivalent to “everything is released at process termination” when you’re working in kernel space. Using local variables for large items is not good practice, because the data may not fit the single page of memory allocated for stack space.
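The kmalloc approach usually means allocating in open and freeing in release. A sketch follows; struct my_state and all the names in it are hypothetical, chosen only to show the pairing of allocation and cleanup:

```c
/* Hypothetical per-open state kept behind filp->private_data. */
struct my_state {
    long bytes_done;
    /* ... more per-open status ... */
};

int my_open(struct inode *inode, struct file *filp)
{
    struct my_state *st = kmalloc(sizeof(struct my_state), GFP_KERNEL);
    if (!st)
        return -ENOMEM;
    st->bytes_done = 0;
    filp->private_data = st;
    return 0;
}

int my_release(struct inode *inode, struct file *filp)
{
    kfree(filp->private_data); /* nothing is freed automatically in kernel space */
    return 0;
}
```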
You need to make reentrant any function that matches either of two conditions. First, if it calls schedule, possibly by calling sleep_on or wake_up. Second, if it copies data to or from user space, because access to user space might page-fault, and the process will be put to sleep while the kernel deals with the missing page. Every function that calls any such functions must be reentrant as well. For example, if sample_read calls sample_getdata, which in turn can block, then sample_read must be reentrant as well as sample_getdata, because nothing prevents another process from calling it while it is already executing on behalf of a process that went to sleep.
Finally, of course, code that sleeps should always keep in mind that the state of the system can change in almost any way while a process is sleeping. The driver should be careful to check any aspect of its environment that might have changed while it wasn’t paying attention.
Another point we need to touch on before we look at the implementation of full-featured read and write methods is the role of the O_NONBLOCK flag in filp->f_flags. The flag is defined in <linux/fcntl.h>, which is automatically included by <linux/fs.h>.
The flag gets its name from “open-nonblock,” because it can be specified at open time (and originally could only be specified there). If you browse the source code, you’ll find some references to an O_NDELAY flag; this is an alternate name for O_NONBLOCK, accepted for compatibility with System V code. The flag is cleared by default, because the normal behavior of a process waiting for data is just to sleep. In the case of a blocking operation, which is the default, the following behavior should be implemented in order to adhere to the standard semantics:
- If a process calls read but no data is (yet) available, the process must block. The process is awakened as soon as some data arrives, and that data is returned to the caller, even if there is less than the amount requested in the count argument to the method.
- If a process calls write and there is no space in the buffer, the process must block, and it must be on a different wait queue from the one used for reading. When some data has been written to the hardware device, and space becomes free in the output buffer, the process is awakened and the write call succeeds, although the data may be only partially written if there isn’t room in the buffer for the count bytes that were requested.
Both these statements assume that there are both input and output buffers; in practice, almost every device driver has them. The input buffer is required to avoid losing data that arrives when nobody is reading. In contrast, data can’t be lost on write, because if the system call doesn’t accept data bytes, they remain in the user-space buffer. Even so, the output buffer is almost always useful for squeezing more performance out of the hardware.
The performance gain of implementing an output buffer in the driver results from the reduced number of context switches and user-level/kernel-level transitions. Without an output buffer (assuming a slow device), only one or a few characters are accepted by each system call, and while one process sleeps in write, another process runs (that’s one context switch). When the first process is awakened, it resumes (another context switch), write returns (kernel/user transition), and the process reiterates the system call to write more data (user/kernel transition); the call blocks, and the loop continues. If the output buffer is big enough, the write call succeeds on the first attempt—the buffered data will be pushed out to the device later, at interrupt time—without control needing to go back to user space for a second or third write call. The choice of a suitable size for the output buffer is clearly device specific.
We didn’t use an input buffer in scull, because data is already available when read is issued. Similarly, no output buffer was used, because data is simply copied to the memory area associated with the device. Essentially, the device is a buffer, so the implementation of additional buffers would be superfluous. We’ll see the use of buffers in Chapter 9, in Section 9.7.
The behavior of read and write is different if O_NONBLOCK is specified. In this case, the calls simply return -EAGAIN if a process calls read when no data is available or if it calls write when there’s no space in the buffer. As you might expect, nonblocking operations return immediately, allowing the application to poll for data. Applications must be careful when using the stdio functions while dealing with nonblocking files, because they can easily mistake a nonblocking return for EOF. They always have to check errno.
Naturally, O_NONBLOCK is meaningful in the open method also. This happens when the call can actually block for a long time; for example, when opening a FIFO that has no writers (yet), or accessing a disk file with a pending lock. Usually, opening a device either succeeds or fails, without the need to wait for external events. Sometimes, however, opening the device requires a long initialization, and you may choose to support O_NONBLOCK in your open method by returning immediately with -EAGAIN (“try it again”) if the flag is set, after initiating device initialization. The driver may also implement a blocking open to support access policies in a way similar to file locks. We’ll see one such implementation in Section 5.6.4 later in this chapter.
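Such an open method might be sketched as follows; the readiness and initialization helpers here are hypothetical placeholders, not real kernel functions:

```c
int my_open(struct inode *inode, struct file *filp)
{
    if (!device_ready()) {              /* hypothetical readiness test */
        if (filp->f_flags & O_NONBLOCK) {
            start_initialization();     /* hypothetical: kick it off, don't wait */
            return -EAGAIN;             /* "try it again" later */
        }
        wait_for_initialization();      /* hypothetical blocking path */
    }
    return 0;
}
```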
Some drivers may also implement special semantics for O_NONBLOCK; for example, an open of a tape device usually blocks until a tape has been inserted. If the tape drive is opened with O_NONBLOCK, the open succeeds immediately regardless of whether the media is present or not.
Only the read, write, and open file operations are affected by the nonblocking flag.
The /dev/scullpipe devices (there are four of them by default) are part of the scull module and are used to show how blocking I/O is implemented.
Within a driver, a process blocked in a read call is awakened when data arrives; usually the hardware issues an interrupt to signal such an event, and the driver awakens waiting processes as part of handling the interrupt. The scull driver works differently, so that it can be run without requiring any particular hardware or an interrupt handler. We chose to use another process to generate the data and wake the reading process; similarly, reading processes are used to wake sleeping writer processes. The resulting implementation is similar to that of a FIFO (or named pipe) filesystem node, whence the name.
The device driver uses a device structure that embeds two wait queues and a buffer. The size of the buffer is configurable in the usual ways (at compile time, load time, or runtime).
typedef struct Scull_Pipe {
    wait_queue_head_t inq, outq;       /* read and write queues */
    char *buffer, *end;                /* begin of buf, end of buf */
    int buffersize;                    /* used in pointer arithmetic */
    char *rp, *wp;                     /* where to read, where to write */
    int nreaders, nwriters;            /* number of openings for r/w */
    struct fasync_struct *async_queue; /* asynchronous readers */
    struct semaphore sem;              /* mutual exclusion semaphore */
    devfs_handle_t handle;             /* only used if devfs is there */
} Scull_Pipe;
The read implementation manages both blocking and nonblocking input and looks like this (the puzzling first line of the function is explained later, in Section 5.5):
ssize_t scull_p_read (struct file *filp, char *buf, size_t count,
                loff_t *f_pos)
{
    Scull_Pipe *dev = filp->private_data;

    if (f_pos != &filp->f_pos) return -ESPIPE;

    if (down_interruptible(&dev->sem))
        return -ERESTARTSYS;
    while (dev->rp == dev->wp) { /* nothing to read */
        up(&dev->sem); /* release the lock */
        if (filp->f_flags & O_NONBLOCK)
            return -EAGAIN;
        PDEBUG("\"%s\" reading: going to sleep\n", current->comm);
        if (wait_event_interruptible(dev->inq, (dev->rp != dev->wp)))
            return -ERESTARTSYS; /* signal: tell the fs layer to handle it */
        /* otherwise loop, but first reacquire the lock */
        if (down_interruptible(&dev->sem))
            return -ERESTARTSYS;
    }
    /* ok, data is there, return something */
    if (dev->wp > dev->rp)
        count = min(count, dev->wp - dev->rp);
    else /* the write pointer has wrapped, return data up to dev->end */
        count = min(count, dev->end - dev->rp);
    if (copy_to_user(buf, dev->rp, count)) {
        up (&dev->sem);
        return -EFAULT;
    }
    dev->rp += count;
    if (dev->rp == dev->end)
        dev->rp = dev->buffer; /* wrapped */
    up (&dev->sem);

    /* finally, awaken any writers and return */
    wake_up_interruptible(&dev->outq);
    PDEBUG("\"%s\" did read %li bytes\n",current->comm, (long)count);
    return count;
}
As you can see, we left some PDEBUG statements in the code. When you compile the driver, you can enable messaging to make it easier to follow the interaction of different processes.
Note also, once again, the use of semaphores to protect critical regions of the code. The scull code has to be careful to avoid going to sleep when it holds a semaphore—otherwise, writers would never be able to add data, and the whole thing would deadlock. This code uses wait_event_interruptible to wait for data if need be; it has to check for available data again after the wait, though. Somebody else could grab the data between when we wake up and when we get the semaphore back.
It’s worth repeating that a process can go to sleep both when it calls schedule, either directly or indirectly, and when it copies data to or from user space. In the latter case the process may sleep if the user array is not currently present in main memory. If scull sleeps while copying data between kernel and user space, it will sleep with the device semaphore held. Holding the semaphore in this case is justified since it will not deadlock the system, and since it is important that the device memory array not change while the driver sleeps.
The if statement around the call to wait_event_interruptible takes care of signal handling. This statement ensures the proper and expected reaction to signals, which could have been responsible for waking up the process (since we were in an interruptible sleep). If a signal has arrived and it has not been blocked by the process, the proper behavior is to let upper layers of the kernel handle the event. To this end, the driver returns -ERESTARTSYS to the caller; this value is used internally by the virtual filesystem (VFS) layer, which either restarts the system call or returns -EINTR to user space. We’ll use the same statement to deal with signal handling for every read and write implementation. Because signal_pending was introduced only in version 2.1.57 of the kernel, sysdep.h defines it for earlier kernels to preserve portability of source code.
The implementation for write is quite similar to that for read (and, again, its first line will be explained later). Its only “peculiar” feature is that it never completely fills the buffer, always leaving a hole of at least one byte. Thus, when the buffer is empty, wp and rp are equal; when there is data there, they are always different.
static inline int spacefree(Scull_Pipe *dev)
{
    if (dev->rp == dev->wp)
        return dev->buffersize - 1;
    return ((dev->rp + dev->buffersize - dev->wp) % dev->buffersize) - 1;
}

ssize_t scull_p_write(struct file *filp, const char *buf, size_t count,
                loff_t *f_pos)
{
    Scull_Pipe *dev = filp->private_data;

    if (f_pos != &filp->f_pos) return -ESPIPE;

    if (down_interruptible(&dev->sem))
        return -ERESTARTSYS;

    /* Make sure there's space to write */
    while (spacefree(dev) == 0) { /* full */
        up(&dev->sem);
        if (filp->f_flags & O_NONBLOCK)
            return -EAGAIN;
        PDEBUG("\"%s\" writing: going to sleep\n",current->comm);
        if (wait_event_interruptible(dev->outq, spacefree(dev) > 0))
            return -ERESTARTSYS; /* signal: tell the fs layer to handle it */
        if (down_interruptible(&dev->sem))
            return -ERESTARTSYS;
    }
    /* ok, space is there, accept something */
    count = min(count, spacefree(dev));
    if (dev->wp >= dev->rp)
        count = min(count, dev->end - dev->wp); /* up to end-of-buffer */
    else /* the write pointer has wrapped, fill up to rp-1 */
        count = min(count, dev->rp - dev->wp - 1);
    PDEBUG("Going to accept %li bytes to %p from %p\n",
           (long)count, dev->wp, buf);
    if (copy_from_user(dev->wp, buf, count)) {
        up (&dev->sem);
        return -EFAULT;
    }
    dev->wp += count;
    if (dev->wp == dev->end)
        dev->wp = dev->buffer; /* wrapped */
    up(&dev->sem);

    /* finally, awaken any reader */
    wake_up_interruptible(&dev->inq); /* blocked in read() and select() */

    /* and signal asynchronous readers, explained later in Chapter 5 */
    if (dev->async_queue)
        kill_fasync(&dev->async_queue, SIGIO, POLL_IN);
    PDEBUG("\"%s\" did write %li bytes\n",current->comm, (long)count);
    return count;
}
The device, as we conceived it, doesn’t implement blocking open and is simpler than a real FIFO. If you want to look at the real thing, you can find it in fs/pipe.c, in the kernel sources.
To test the blocking operation of the scullpipe device, you can run some programs on it, using input/output redirection as usual. Testing nonblocking activity is trickier, because the conventional programs don’t perform nonblocking operations. The misc-progs source directory contains the following simple program, called nbtest, for testing nonblocking operations. All it does is copy its input to its output, using nonblocking I/O and delaying between retrials. The delay time is passed on the command line and is one second by default.
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <fcntl.h>
#include <errno.h>

char buffer[4096];

int main(int argc, char **argv)
{
    int delay = 1, n, m = 0;

    if (argc > 1)
        delay = atoi(argv[1]);
    fcntl(0, F_SETFL, fcntl(0, F_GETFL) | O_NONBLOCK); /* stdin */
    fcntl(1, F_SETFL, fcntl(1, F_GETFL) | O_NONBLOCK); /* stdout */

    while (1) {
        n = read(0, buffer, 4096);
        if (n >= 0)
            m = write(1, buffer, n);
        if ((n < 0 || m < 0) && (errno != EAGAIN))
            break;
        sleep(delay);
    }
    perror(n < 0 ? "stdin" : "stdout");
    exit(1);
}