Applications that use nonblocking I/O often use the poll and select system calls as well. poll and select have essentially the same functionality: both allow a process to determine whether it can read from or write to one or more open files without blocking. They are thus often used in applications that must use multiple input or output streams without blocking on any one of them. The same functionality is offered by two separate functions because they were implemented in Unix almost at the same time by two different groups: select was introduced in BSD Unix, whereas poll was the System V solution.
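Seen from user space, waiting on several streams at once might look like the following minimal sketch. It assumes a POSIX system and watches two descriptors; the helper name first_readable and its return convention are illustrative and do not come from any driver or library.

```c
#include <poll.h>
#include <unistd.h>

/*
 * Wait up to timeout_ms milliseconds for either descriptor to become
 * readable. Returns the index (0 or 1) of the first descriptor on which
 * a read would not block, -1 on timeout, or -2 if poll() failed or
 * reported only an invalid descriptor.
 * (Hypothetical helper, written for this illustration only.)
 */
int first_readable(int fd0, int fd1, int timeout_ms)
{
    struct pollfd fds[2] = {
        { .fd = fd0, .events = POLLIN },
        { .fd = fd1, .events = POLLIN },
    };
    int n = poll(fds, 2, timeout_ms);

    if (n < 0)
        return -2;              /* poll() itself failed */
    if (n == 0)
        return -1;              /* timeout: nothing is ready */

    for (int i = 0; i < 2; i++) {
        /* POLLHUP and POLLERR also mean that read() returns at once. */
        if (fds[i].revents & (POLLIN | POLLHUP | POLLERR))
            return i;
    }
    return -2;                  /* e.g. POLLNVAL on a bad descriptor */
}
```

A caller would typically loop on such a helper and read only from the descriptor it reports, so that no single idle stream stalls the program.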
Both system calls require support from the device driver. In version 2.0 of the kernel the device method was modeled on select (and no poll was available to user programs); from version 2.1.23 onward both were offered, and the device method was based on the newly introduced poll system call because poll offered more detailed control than select.
The poll method, which implements both the poll and select system calls, has the following prototype:
unsigned int (*poll) (struct file *, poll_table *);
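The driver’s method will be called whenever the user-space program performs a poll or select system call involving a file descriptor associated with the driver. The device method is in charge of these two steps:
1. Call poll_wait on one or more wait queues that could indicate a change in the poll status.
2. Return a bit mask describing the operations that could be performed immediately without blocking.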
Both of these operations are usually straightforward, and tend to look very similar from one driver to the next. They rely, however, on information that only the driver can provide, and thus must be implemented individually by each driver.
The poll_table structure, the second argument to the poll method, is used within the kernel to implement the poll and select calls; it is declared in <linux/poll.h>, which must be included by the driver source. Driver writers need know nothing about its internals and must use it as an opaque object; it is passed to the driver method so that every event queue that could wake up the process and change the status of the poll operation can be added to the poll_table structure by calling the function poll_wait:
void poll_wait (struct file *, wait_queue_head_t *, poll_table *);
The second task performed by the poll method is returning the bit mask describing which operations could be completed immediately; this is also straightforward. For example, if the device has data available, a read would complete without sleeping; the poll method should indicate this state of affairs. Several flags (defined in <linux/poll.h>) are used to indicate the possible operations:
- POLLIN
  This bit must be set if the device can be read without blocking.
- POLLRDNORM
  This bit must be set if “normal” data is available for reading. A readable device returns (POLLIN | POLLRDNORM).
- POLLRDBAND
  This bit indicates that out-of-band data is available for reading from the device. It is currently used only in one place in the Linux kernel (the DECnet code) and is not generally applicable to device drivers.
- POLLPRI
  High-priority data (out-of-band) can be read without blocking. This bit causes select to report that an exception condition occurred on the file, because select reports out-of-band data as an exception condition.
- POLLHUP
  When a process reading this device sees end-of-file, the driver must set POLLHUP (hang-up). A process calling select will be told that the device is readable, as dictated by the select functionality.
- POLLERR
  An error condition has occurred on the device. When poll is invoked, the device is reported as both readable and writable, since both read and write will return an error code without blocking.
- POLLOUT
  This bit is set in the return value if the device can be written to without blocking.
- POLLWRNORM
  This bit has the same meaning as POLLOUT, and sometimes it actually is the same number. A writable device returns (POLLOUT | POLLWRNORM).
- POLLWRBAND
  Like POLLRDBAND, this bit means that data with nonzero priority can be written to the device. Only the datagram implementation of poll uses this bit, since a datagram can transmit out-of-band data.
It’s worth noting that POLLRDBAND and POLLWRBAND are meaningful only with file descriptors associated with sockets: device drivers won’t normally use these flags.
The description of poll takes up a lot of space for something that is relatively simple to use in practice. Consider the scullpipe implementation of the poll method:
unsigned int scull_p_poll(struct file *filp, poll_table *wait)
{
    Scull_Pipe *dev = filp->private_data;
    unsigned int mask = 0;

    /*
     * The buffer is circular; it is considered full
     * if "wp" is right behind "rp". "left" is 0 if the
     * buffer is empty, and it is "1" if it is completely full.
     */
    int left = (dev->rp + dev->buffersize - dev->wp) % dev->buffersize;

    poll_wait(filp, &dev->inq, wait);
    poll_wait(filp, &dev->outq, wait);
    if (dev->rp != dev->wp) mask |= POLLIN | POLLRDNORM;  /* readable */
    if (left != 1)          mask |= POLLOUT | POLLWRNORM; /* writable */

    return mask;
}
This code simply adds the two scullpipe wait queues to the poll_table, then sets the appropriate mask bits depending on whether data can be read or written.
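The readiness arithmetic can be tried outside the kernel. The following is only a user-space model of the mask computation, under the assumption that the read and write positions are kept as offsets rather than pointers; struct ring_state and ring_poll_mask are hypothetical names that do not appear in the scull sources.

```c
#define _XOPEN_SOURCE 700   /* expose POLLRDNORM and POLLWRNORM in <poll.h> */
#include <poll.h>
#include <stddef.h>

/* Hypothetical user-space stand-in for the ring state of the device. */
struct ring_state {
    size_t rp;          /* offset of the next byte to read  */
    size_t wp;          /* offset of the next byte to write */
    size_t buffersize;  /* ring size; one slot always stays unused */
};

/* Compute the same readiness mask as the poll method shown above. */
unsigned int ring_poll_mask(const struct ring_state *r)
{
    unsigned int mask = 0;
    /* 0 when the ring is empty, 1 when "wp" is right behind "rp". */
    size_t left = (r->rp + r->buffersize - r->wp) % r->buffersize;

    if (r->rp != r->wp)
        mask |= POLLIN | POLLRDNORM;    /* at least one byte to read */
    if (left != 1)
        mask |= POLLOUT | POLLWRNORM;   /* at least one free slot */
    return mask;
}
```

With a ring of eight slots, rp == wp yields only the writable bits, while rp = 3 and wp = 2 (the write position right behind the read position) yields only the readable bits.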
The poll code as shown is missing end-of-file support. The poll method should return POLLHUP when the device is at the end of the file. If the caller used the select system call, the file will be reported as readable; in both cases the application will know that it can actually issue the read without waiting forever, and the read method will return 0 to signal end-of-file.
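How an application observes end-of-file can be sketched with an ordinary pipe standing in for a device. The sketch assumes Linux pipe behavior, where the read end reports POLLHUP once every writer has closed; poll_revents and select_readable are illustrative helper names.

```c
#include <poll.h>
#include <stddef.h>
#include <sys/select.h>
#include <unistd.h>

/* Return the revents bits poll() reports for fd without waiting,
 * or -1 if poll() fails. (Hypothetical helper.) */
int poll_revents(int fd)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN };

    if (poll(&pfd, 1, 0) < 0)
        return -1;
    return pfd.revents;
}

/* Return 1 if select() reports fd as readable without waiting,
 * 0 if it does not, or -1 if select() fails. (Hypothetical helper.) */
int select_readable(int fd)
{
    fd_set rfds;
    struct timeval tv = { 0, 0 };   /* zero timeout: do not sleep */

    FD_ZERO(&rfds);
    FD_SET(fd, &rfds);
    if (select(fd + 1, &rfds, NULL, NULL, &tv) < 0)
        return -1;
    return FD_ISSET(fd, &rfds) ? 1 : 0;
}
```

On an empty pipe with the write end still open, both helpers report nothing ready. After the write end is closed, poll_revents includes POLLHUP, select_readable returns 1, and a subsequent read returns 0.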
With real FIFOs, for example, the reader sees an end-of-file when all the writers close the file, whereas in scullpipe the reader never sees end-of-file. The behavior is different because a FIFO is intended to be a communication channel between two processes, while scullpipe is a trashcan where everyone can put data as long as there’s at least one reader. Moreover, it makes no sense to reimplement what is already available in the kernel.
Implementing end-of-file in the same way as FIFOs do would mean checking dev->nwriters, both in read and in poll, and reporting end-of-file (as just described) if no process has the device opened for writing. Unfortunately, though, if a reader opened the scullpipe device before the writer, it would see end-of-file without having a chance to wait for data. The best way to fix this problem would be to implement blocking within open; this task is left as an exercise for the reader.
The purpose of the poll and select calls is to determine in advance if an I/O operation will block. In that respect, they complement read and write. More important, poll and select are useful because they let the application wait simultaneously for several data streams, although we are not exploiting this feature in the scull examples.
A correct implementation of the three calls (read, write, and poll) is essential to make applications work correctly. Though the following rules have more or less already been stated, we’ll summarize them here.
If there is data in the input buffer, the read call should return immediately, with no noticeable delay, even if less data is available than the application requested and the driver is sure the remaining data will arrive soon. You can always return less data than you’re asked for if this is convenient for any reason (we did it in scull), provided you return at least one byte.
If there is no data in the input buffer, by default read must block until at least one byte is there. If O_NONBLOCK is set, on the other hand, read returns immediately with a return value of -EAGAIN (although some old versions of System V return 0 in this case). In these cases poll must report that the device is unreadable until at least one byte arrives. As soon as there is some data in the buffer, we fall back to the previous case.
If we are at end-of-file, read should return immediately with a return value of 0, independent of O_NONBLOCK. poll should report POLLHUP in this case.
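The three read cases can be observed from user space. The sketch below assumes a POSIX system and uses a pipe in place of a device; note that the driver's -EAGAIN return value reaches the application as a read result of -1 with errno set to EAGAIN. The names set_nonblocking, read_state, and classify_read are illustrative.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Set O_NONBLOCK on fd while keeping its other status flags.
 * Returns 0 on success, -1 on failure. (Hypothetical helper.) */
int set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL);

    if (flags < 0)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0 ? -1 : 0;
}

enum read_state { READ_DATA, READ_WOULD_BLOCK, READ_EOF, READ_ERROR };

/* Issue one read() and classify its outcome according to the rules above.
 * The raw return value of read() is stored in *got. */
enum read_state classify_read(int fd, char *buf, size_t len, ssize_t *got)
{
    ssize_t n = read(fd, buf, len);

    *got = n;
    if (n > 0)
        return READ_DATA;           /* possibly fewer bytes than len */
    if (n == 0)
        return READ_EOF;            /* end-of-file, with or without O_NONBLOCK */
    if (errno == EAGAIN || errno == EWOULDBLOCK)
        return READ_WOULD_BLOCK;    /* no data yet and O_NONBLOCK is set */
    return READ_ERROR;
}
```

On an empty nonblocking pipe the first call yields READ_WOULD_BLOCK; after three bytes are written, a request for sixteen bytes yields READ_DATA with three bytes; once the writer closes and the data is consumed, the result is READ_EOF.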
If there is space in the output buffer, write should return without delay. It can accept less data than the call requested, but it must accept at least one byte. In this case, poll reports that the device is writable.
If the output buffer is full, by default write blocks until some space is freed. If O_NONBLOCK is set, write returns immediately with a return value of -EAGAIN (older System V Unices returned 0). In these cases poll should report that the file is not writable. If, on the other hand, the device is not able to accept any more data, write returns -ENOSPC (“No space left on device”), independently of the setting of O_NONBLOCK.
Never make a write call wait for data transmission before returning, even if O_NONBLOCK is clear. This is because many applications use select to find out whether a write will block. If the device is reported as writable, the call must consistently not block. If the program using the device wants to ensure that the data it enqueues in the output buffer is actually transmitted, the driver must provide an fsync method. For instance, a removable device should have an fsync entry point.
Although these are a good set of general rules, one should also recognize that each device is unique and that sometimes the rules must be bent slightly. For example, record-oriented devices (such as tape drives) cannot execute partial writes.
We’ve seen how the write method by itself doesn’t account for all data output needs. The fsync function, invoked by the system call of the same name, fills the gap. This method’s prototype is
int (*fsync) (struct file *file, struct dentry *dentry, int datasync);
If some application will ever need to be assured that data has been sent to the device, the fsync method must be implemented. A call to fsync should return only when the device has been completely flushed (i.e., the output buffer is empty), even if that takes some time, regardless of whether O_NONBLOCK is set. The datasync argument, present only in the 2.4 kernel, is used to distinguish between the fsync and fdatasync system calls; as such, it is only of interest to filesystem code and can be ignored by drivers.
The fsync method has no unusual features. The call isn’t time critical, so every device driver can implement it to the author’s taste. Most of the time, char drivers just have a NULL pointer in their fops. Block devices, on the other hand, always implement the method with the general-purpose block_fsync, which in turn flushes all the blocks of the device, waiting for I/O to complete.
The actual implementation of the poll and select system calls is reasonably simple, for those who are interested in how it works. Whenever a user application calls either function, the kernel invokes the poll method of all files referenced by the system call, passing the same poll_table to each of them. The structure is, for all practical purposes, an array of poll_table_entry structures allocated for a specific poll or select call. Each poll_table_entry contains the struct file pointer for the open device, a wait_queue_head_t pointer, and a wait_queue_t entry. When a driver calls poll_wait, one of these entries gets filled in with the information provided by the driver, and the wait queue entry gets put onto the driver’s queue. The pointer to wait_queue_head_t is used to track the wait queue where the current poll table entry is registered, in order for free_wait to be able to dequeue the entry before the wait queue is awakened.
If none of the drivers being polled indicates that I/O can occur without blocking, the poll call simply sleeps until one of the (perhaps many) wait queues it is on wakes it up.
What’s interesting in the implementation of poll is that the file operation may be called with a NULL pointer as poll_table argument. This situation can come about for a couple of reasons. If the application calling poll has provided a timeout value of 0 (indicating that no wait should be done), there is no reason to accumulate wait queues, and the system simply does not do it. The poll_table pointer is also set to NULL immediately after any driver being polled indicates that I/O is possible. Since the kernel knows at that point that no wait will occur, it does not build up a list of wait queues.
When the poll call completes, the poll_table structure is deallocated, and all wait queue entries previously added to the poll table (if any) are removed from the table and their wait queues.
Actually, things are somewhat more complex than depicted here, because the poll table is not a simple array but rather a set of one or more pages, each hosting an array. This complication is meant to avoid putting too low a limit (dictated by the page size) on the maximum number of file descriptors involved in a poll or select system call.
We tried to show the data structures involved in polling in Figure 5-2; the figure is a simplified representation of the real data structures because it ignores the multipage nature of a poll table and disregards the file pointer that is part of each poll_table_entry. The reader interested in the actual implementation is urged to look in <linux/poll.h> and fs/select.c.