The sbull driver as described earlier works very well. In simple situations (as with sbull), the macros from <linux/blk.h> can be used to easily set up a request function and get a working driver. As has already been mentioned, however, block drivers are often a performance-critical part of the kernel. Drivers based on the simple code shown earlier will likely not perform very well in many situations, and can also be a drag on the system as a whole. In this section we get into the details of how the I/O request queue works with an eye toward writing a faster, more efficient driver.
Each block driver works with at least one I/O request queue. This queue contains, at any given time, all of the I/O operations that the kernel would like to see done on the driver’s devices. The management of this queue is complicated; the performance of the system depends on how it is done.
The queue is designed with physical disk drives in mind. With disks, the amount of time required to transfer a block of data is typically quite small. The amount of time required to position the head (seek) to do that transfer, however, can be very large. Thus the Linux kernel works to minimize the number and extent of the seeks performed by the device.
Two things are done to achieve those goals. One is the clustering of requests to adjacent sectors on the disk. Most modern filesystems will attempt to lay out files in consecutive sectors; as a result, requests to adjoining parts of the disk are common. The kernel also applies an “elevator” algorithm to the requests. An elevator in a skyscraper is either going up or down; it will continue to move in those directions until all of its “requests” (people wanting on or off) have been satisfied. In the same way, the kernel tries to keep the disk head moving in the same direction for as long as possible; this approach tends to minimize seek times while ensuring that all requests get satisfied eventually.
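The elevator idea can be illustrated with a small userspace model (not kernel code; all names here are invented for the sketch, and the real 2.4 elevator also merges adjacent requests): pending sector numbers are kept sorted, so the head sweeps across them in one pass instead of seeking back and forth in arrival order.

```c
#include <assert.h>
#include <stdlib.h>

#define MAX_PENDING 16

static long pending[MAX_PENDING];  /* queued sector numbers, kept sorted */
static int npending;

/* Insert a request so that the queue stays sorted by sector number. */
void elevator_add(long sector)
{
    int i = npending++;
    while (i > 0 && pending[i - 1] > sector) {
        pending[i] = pending[i - 1];
        i--;
    }
    pending[i] = sector;
}

/* Total head movement when servicing the queue front to back. */
long total_seek(long start)
{
    long pos = start, dist = 0;
    for (int i = 0; i < npending; i++) {
        dist += labs(pending[i] - pos);
        pos = pending[i];
    }
    return dist;
}
```

With requests arriving for sectors 100, 10, and 50, sorted servicing from sector 0 moves the head 100 sectors in total, versus 230 in arrival order.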
A Linux I/O request queue is represented by a structure of type request_queue, declared in <linux/blkdev.h>. The request_queue structure looks somewhat like file_operations and other such objects, in that it contains pointers to a number of functions that operate on the queue; for example, the driver's request function is stored there. There is also a queue head (using the functions from <linux/list.h> described in Section 10.5 in Chapter 10), which points to the list of outstanding requests to the device.

These requests are, of course, of type struct request; we have already looked at some of the fields in this structure. The reality of the request structure is a little more complicated, however; understanding it requires a brief digression into the structure of the Linux buffer cache.
The design of the request structure is driven by the Linux memory management scheme. Like most Unix-like systems, Linux maintains a buffer cache, a region of memory that is used to hold copies of blocks stored on disk. A great many “disk” operations performed at higher levels of the kernel (such as in the filesystem code) act only on the buffer cache and do not generate any actual I/O operations. Through aggressive caching the kernel can avoid many read operations altogether, and multiple writes can often be merged into a single physical write to disk.
One unavoidable aspect of the buffer cache, however, is that blocks that are adjacent on disk are almost certainly not adjacent in memory. The buffer cache is a dynamic thing, and blocks end up being scattered widely. In order to keep track of everything, the kernel manages the buffer cache through buffer_head structures. One buffer_head is associated with each data buffer. This structure contains a great many fields, most of which do not concern a driver writer. There are a few that are important, however, including the following:
- char *b_data;
  The actual data block associated with this buffer head.

- unsigned long b_size;
  The size of the block pointed to by b_data.

- kdev_t b_rdev;
  The device holding the block represented by this buffer head.

- unsigned long b_rsector;
  The sector number where this block lives on disk.

- struct buffer_head *b_reqnext;
  A pointer to a linked list of buffer head structures in the request queue.

- void (*b_end_io)(struct buffer_head *bh, int uptodate);
  A pointer to a function to be called when I/O on this buffer completes. bh is the buffer head itself, and uptodate is nonzero if the I/O was successful.
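To see how these fields fit together, here is a userspace sketch; the mini_bh type is an invented teaching stand-in, not the kernel's definition, but the field names match those described above. It chains buffers through b_reqnext the way a request does and walks the list:

```c
#include <assert.h>
#include <stddef.h>

/* A stripped-down stand-in for struct buffer_head, keeping only the
 * fields discussed in the text.  Illustration only. */
struct mini_bh {
    char *b_data;                 /* the data block itself            */
    unsigned long b_size;         /* size of b_data in bytes          */
    unsigned long b_rsector;      /* starting sector on disk          */
    struct mini_bh *b_reqnext;    /* next buffer in the same request  */
    void (*b_end_io)(struct mini_bh *bh, int uptodate);
};

/* Sum the bytes covered by a request's buffer list, following
 * b_reqnext just as a clustered driver would. */
unsigned long request_bytes(struct mini_bh *bh)
{
    unsigned long total = 0;
    for (; bh; bh = bh->b_reqnext)
        total += bh->b_size;
    return total;
}
```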
Every block passed to a driver’s request function either lives in the buffer cache, or, on rare occasion, lives elsewhere but has been made to look as if it lived in the buffer cache.[49] As a result, every request passed to the driver deals with one or more buffer_head structures. The request structure contains a member (called simply bh) that points to a linked list of these structures; satisfying the request requires performing the indicated I/O operation on each buffer in the list. Figure 12-2 shows how the request queue and buffer_head structures fit together.
Requests are not made of random lists of buffers; instead, all of the buffer heads attached to a single request will belong to a series of adjacent blocks on the disk. Thus a request is, in a sense, a single operation referring to a (perhaps long) group of blocks on the disk. This grouping of blocks is called clustering, and we will look at it in detail after completing our discussion of how the request list works.
The header <linux/blkdev.h> defines a small number of functions that manipulate the request queue, most of which are implemented as preprocessor macros. Not all drivers will need to work with the queue at this level, but a familiarity with how it all works can be helpful. Most request queue functions will be introduced as we need them, but a few are worth mentioning here.
- struct request *blkdev_entry_next_request(struct list_head *head);
  Returns the next entry in the request list. Usually the head argument is the queue_head member of the request_queue structure; in this case the function returns the first entry in the queue. The function uses the list_entry macro to look in the list.

- struct request *blkdev_next_request(struct request *req);
- struct request *blkdev_prev_request(struct request *req);
  Given a request structure, return the next or previous structure in the request queue.

- blkdev_dequeue_request(struct request *req);
  Removes a request from its request queue.

- blkdev_release_request(struct request *req);
  Releases a request structure back to the kernel when it has been completely executed. Each request queue maintains its own free list of request structures (two, actually: one for reads and one for writes); this function places a structure back on the proper free list. blkdev_release_request will also wake up any processes that are waiting on a free request structure.
All of these functions require that the io_request_lock be held; we will discuss that lock next.
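As the text notes, blkdev_entry_next_request is built on the list_entry macro from <linux/list.h>, which recovers the structure containing an embedded list_head by pointer arithmetic. The following userspace sketch models that idiom; the struct layouts are simplified stand-ins, not the kernel's definitions:

```c
#include <assert.h>
#include <stddef.h>

/* Simplified versions of the <linux/list.h> building blocks. */
struct list_head { struct list_head *next, *prev; };

/* Recover the containing structure from a pointer to its embedded
 * list_head member (the container_of idea). */
#define list_entry(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

struct request {
    long sector;
    struct list_head queue;   /* links this request into the queue */
};

/* The moral equivalent of blkdev_entry_next_request(): given the
 * queue head, return the first request on the queue. */
struct request *first_request(struct list_head *head)
{
    return list_entry(head->next, struct request, queue);
}
```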
The I/O request queue is a complex data structure that is accessed in many places in the kernel. It is entirely possible that the kernel needs to add more requests to the queue at the same time that your driver is taking requests off. The queue is thus subject to the usual sort of race conditions, and must be protected accordingly.
In Linux 2.2 and 2.4, all request queues are protected with a single global spinlock called io_request_lock. Any code that manipulates a request queue must hold that lock and disable interrupts, with one small exception: the very first entry in the request queue is (by default) considered to be owned by the driver. Failure to acquire the io_request_lock prior to working with the request queue can cause the queue to be corrupted, with a system crash following shortly thereafter.
The simple request function shown earlier did not need to worry about this lock because the kernel always calls the request function with the io_request_lock held. A driver is thus protected against corrupting the request queue; it is also protected against reentrant calls to the request function. This scheme was designed to enable drivers that are not SMP aware to function on multiprocessor systems.
Note, however, that the io_request_lock is an expensive resource to hold. As long as your driver holds this lock, no other requests may be queued to any block driver in the system, and no other request functions may be called. A driver that holds this lock for a long time may well slow down the system as a whole.
Thus, well-written block drivers often drop this lock as soon as
possible. We will see an example of how this can be done shortly.
Block drivers that drop the io_request_lock must be written with a couple of important things in mind, however. First, the request function must always reacquire this lock before returning, since the calling code expects it to still be held. The other concern is that, as soon as the io_request_lock is dropped, the possibility of reentrant calls to the request function is very real; the function must be written to handle that eventuality.
A variant of this latter case can also occur if your request function returns while an I/O request is still active. Many drivers for real hardware will start an I/O operation, then return; the work is completed in the driver’s interrupt handler. We will look at interrupt-driven block I/O in detail later in this chapter; for now it is worth mentioning, however, that the request function can be called while these operations are still in progress.
Some drivers handle request function reentrancy by maintaining an internal request queue. The request function simply removes any new requests from the I/O request queue and adds them to the internal queue, which is then processed through a combination of tasklets and interrupt handlers.
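The internal-queue technique can be sketched in user space: while a stand-in for the io_request_lock is held, the driver splices everything off the shared queue onto a private list, then drops the lock and processes the private list at leisure. The types and the flag "lock" here are invented for the illustration; the real lock is a spinlock.

```c
#include <assert.h>
#include <stddef.h>

struct req { int id; struct req *next; };

static int io_request_lock_held;   /* stand-in flag for the real spinlock */
static struct req *shared_queue;   /* "the I/O request queue"             */

static void lock(void)   { assert(!io_request_lock_held); io_request_lock_held = 1; }
static void unlock(void) { assert(io_request_lock_held);  io_request_lock_held = 0; }

/* Detach the entire shared queue while holding the lock; the caller
 * may then walk the returned private list without any locking. */
struct req *grab_all_requests(void)
{
    struct req *mine;

    lock();
    mine = shared_queue;
    shared_queue = NULL;
    unlock();
    return mine;
}
```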
In our simple request function earlier, we were not concerned with buffer_head structures or linked lists. The macros and functions in <linux/blk.h> hide the structure of the I/O request queue in order to make the task of writing a block driver simpler. In many cases, however, getting reasonable performance requires a deeper understanding of how the queue works. In this section we look at the actual steps involved in manipulating the request queue; subsequent sections show some more advanced techniques for writing block request functions.
The fields of the request structure that we looked at earlier (sector, current_nr_sectors, and buffer) are really just copies of the analogous information stored in the first buffer_head structure on the list. Thus, a request function that uses this information from the CURRENT pointer is just processing the first of what might be many buffers within the request. The task of splitting up a multibuffer request into (seemingly) independent, single-buffer requests is handled by two important definitions in <linux/blk.h>: the INIT_REQUEST macro and the end_request function.
Of the two, INIT_REQUEST is the simpler; all it really does is make a couple of consistency checks on the request queue and cause a return from the request function if the queue is empty. It is simply making sure that there is still work to do.
The bulk of the queue management work is done by end_request. This function, remember, is called when the driver has processed a single “request” (actually one buffer); it has several tasks to perform:
1. Complete the I/O processing on the current buffer; this involves calling the b_end_io function with the status of the operation, thus waking any process that may be sleeping on the buffer.

2. Remove the buffer from the request’s linked list. If there are further buffers to be processed, the sector, current_nr_sectors, and buffer fields in the request structure are updated to reflect the contents of the next buffer_head structure in the list. In this case (there are still buffers to be transferred), end_request is finished for this iteration and steps 3 to 5 are not executed.

3. Call add_blkdev_randomness to update the entropy pool, unless DEVICE_NO_RANDOM has been defined (as is done in the sbull driver).

4. Remove the finished request from the request queue by calling blkdev_dequeue_request. This step modifies the request queue, and thus must be performed with the io_request_lock held.

5. Release the finished request back to the system; io_request_lock is required here too.
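The per-buffer part of this sequence can be modeled in user space with toy types (toy_bh and toy_req are invented names): complete the current buffer, unlink it, and report whether any buffers remain, mirroring what end_that_request_first returns.

```c
#include <assert.h>
#include <stddef.h>

struct toy_bh { struct toy_bh *b_reqnext; int done; };
struct toy_req { struct toy_bh *bh; };

/* Model of the first two steps above.  Returns nonzero while buffers
 * remain in the request; zero once the request is finished (at which
 * point the real code would dequeue and release it). */
int toy_end_that_request_first(struct toy_req *req)
{
    struct toy_bh *bh = req->bh;

    bh->done = 1;                /* step 1: complete I/O on this buffer */
    req->bh = bh->b_reqnext;     /* step 2: remove it from the list     */
    return req->bh != NULL;      /* more buffers?  then stop here       */
}
```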
The kernel defines a couple of helper functions that are used by end_request to do most of this work. The first one is called end_that_request_first, which handles the first two steps just described. Its prototype is

int end_that_request_first(struct request *req, int status, char *name);

status is the status of the request as passed to end_request; the name parameter is the device name, to be used when printing error messages. The return value is nonzero if there are more buffers to be processed in the current request; in that case the work is done. Otherwise, the request is dequeued and released with end_that_request_last:

void end_that_request_last(struct request *req);

In end_request this step is handled with this code:

struct request *req = CURRENT;
blkdev_dequeue_request(req);
end_that_request_last(req);
That is all there is to it.
The time has come to look at how to apply all of that background material to the task of writing better block drivers. We’ll start with a look at the handling of clustered requests. Clustering, as mentioned earlier, is simply the practice of joining together requests that operate on adjacent blocks on the disk. There are two advantages to doing things this way. First, clustering speeds up the transfer; second, clustering can save some memory in the kernel by avoiding the allocation of redundant request structures.
As we have seen, block drivers need not be aware of clustering at all; <linux/blk.h> transparently splits each clustered request into its component pieces. In many cases, however, a driver can do better by explicitly acting on clustering. It is often possible to set up the I/O for several consecutive blocks at the same time, with an improvement in throughput. For example, the Linux floppy driver attempts to write an entire track to the diskette in a single operation. Most high-performance disk controllers can do “scatter/gather” I/O as well, leading to large performance gains.
To take advantage of clustering, a block driver must look directly at the list of buffer_head structures attached to the request. This list is pointed to by CURRENT->bh; subsequent buffers can be found by following the b_reqnext pointers in each buffer_head structure. A driver performing clustered I/O should follow roughly this sequence of operations with each buffer in the cluster:
1. Arrange to transfer the data block at address bh->b_data, of size bh->b_size bytes. The direction of the data transfer is CURRENT->cmd (i.e., either READ or WRITE).

2. Retrieve the next buffer head in the list (bh->b_reqnext), then detach the buffer just transferred from the list by zeroing its b_reqnext pointer.

3. Update the request structure to reflect the I/O done with the buffer that has just been removed. Both CURRENT->hard_nr_sectors and CURRENT->nr_sectors should be decremented by the number of sectors (not blocks) transferred from the buffer. The sector numbers CURRENT->hard_sector and CURRENT->sector should be incremented by the same amount. Performing these operations keeps the request structure consistent.

4. Loop back to the beginning to transfer the next adjacent block.
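Assuming 512-byte sectors and invented toy structures (cbh, creq), the four steps look roughly like this in code; each buffer is transferred (here merely counted), detached, and the request's sector accounting updated:

```c
#include <assert.h>
#include <stddef.h>

struct cbh { unsigned long b_size; struct cbh *b_reqnext; };
struct creq {
    unsigned long sector;        /* next sector to transfer      */
    unsigned long nr_sectors;    /* sectors remaining in request */
    struct cbh *bh;
};

/* Walk a clustered request's buffer list, following the four steps
 * described in the text.  Returns the total bytes "transferred". */
unsigned long transfer_cluster(struct creq *req)
{
    unsigned long bytes = 0;
    struct cbh *bh = req->bh;

    while (bh) {
        struct cbh *next = bh->b_reqnext;
        unsigned long sect = bh->b_size / 512;

        bytes += bh->b_size;       /* step 1: "transfer" the data   */
        bh->b_reqnext = NULL;      /* step 2: detach the buffer     */
        req->sector += sect;       /* step 3: update the accounting */
        req->nr_sectors -= sect;
        bh = next;                 /* step 4: loop to next buffer   */
    }
    req->bh = NULL;
    return bytes;
}
```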
When the I/O on each buffer completes, your driver should notify the kernel by calling the buffer’s I/O completion routine:

bh->b_end_io(bh, status);

status is nonzero if the operation was successful.
You also, of course, need to remove the request structure for the completed operations from the queue. The processing steps just described can be done without holding the io_request_lock, but that lock must be reacquired before changing the queue itself.

Your driver can still use end_request (as opposed to manipulating the queue directly) at the completion of the I/O operation, as long as it takes care to set the CURRENT->bh pointer properly. This pointer should either be NULL or it should point to the last buffer_head structure that was transferred. In the latter case, the b_end_io function should not have been called on that last buffer, since end_request will make that call.
A full-featured implementation of clustering appears in drivers/block/floppy.c, while a summary of the operations required appears in end_request, in blk.h. Neither floppy.c nor blk.h is easy to understand, but the latter is a better place to start.
One other detail regarding the behavior of the I/O request queue is relevant for block drivers that are dealing with clustering. It has to do with the queue head—the first request on the queue. For historical compatibility reasons, the kernel (almost) always assumes that a block driver is processing the first entry in the request queue. To avoid corruption resulting from conflicting activity, the kernel will never modify a request once it gets to the head of the queue. No further clustering will happen on that request, and the elevator code will not put other requests in front of it.
Many block drivers remove requests from the queue entirely before beginning to process them. If your driver works this way, the request at the head of the queue should be fair game for the kernel. In this case, your driver should inform the kernel that the head of the queue is not active by calling blk_queue_headactive:
blk_queue_headactive(request_queue_t *queue, int active);
If active is 0, the kernel will be able to make changes to the head of the request queue.
As we have seen, the kernel, by default, maintains a single I/O request queue for each major number. The single queue works well for devices like sbull, but it is not always optimal for real-world situations.
Consider a driver that is handling real disk devices. Each disk is capable of operating independently; the performance of the system is sure to be better if the drives could be kept busy in parallel. A simple driver based on a single queue will not achieve that—it will perform operations on a single device at a time.
It would not be all that hard for a driver to walk through the request queue and pick out requests for independent drives. But the 2.4 kernel makes life easier by allowing the driver to set up independent queues for each device. Most high-performance drivers take advantage of this multiqueue capability. Doing so is not difficult, but it does require moving beyond the simple <linux/blk.h> definitions.
The sbull driver, when compiled with the SBULL_MULTIQUEUE symbol defined, operates in a multiqueue mode. It works without the <linux/blk.h> macros, and demonstrates a number of the features that have been described in this section.
To operate in a multiqueue mode, a block driver must define its own request queues. sbull does this by adding a queue member to the Sbull_Dev structure:

request_queue_t queue;
int busy;

The busy flag is used to protect against request function reentrancy, as we will see.
Request queues must be initialized, of course. sbull initializes its device-specific queues in this manner:
for (i = 0; i < sbull_devs; i++) {
    blk_init_queue(&sbull_devices[i].queue, sbull_request);
    blk_queue_headactive(&sbull_devices[i].queue, 0);
}
blk_dev[major].queue = sbull_find_queue;
The call to blk_init_queue is as we have seen before, only now we pass in the device-specific queues instead of the default queue for our major device number. This code also marks the queues as not having active heads.
You might be wondering how the kernel manages to find the request queues, which are buried in a device-specific, private structure. The key is the last line just shown, which sets the queue member in the global blk_dev structure. This member points to a function that has the job of finding the proper request queue for a given device number. Devices using the default queue have no such function, but multiqueue devices must implement it. sbull’s queue function looks like this:
request_queue_t *sbull_find_queue(kdev_t device)
{
    int devno = DEVICE_NR(device);

    if (devno >= sbull_devs) {
        static int count = 0;
        if (count++ < 5) /* print the message at most five times */
            printk(KERN_WARNING "sbull: request for unknown device\n");
        return NULL;
    }
    return &sbull_devices[devno].queue;
}
Like the request function, sbull_find_queue must be atomic (no sleeping allowed).
Each queue has its own request function, though usually a driver will use the same function for all of its queues. The kernel passes the actual request queue into the request function as a parameter, so the function can always figure out which device is being operated on. The multiqueue request function used in sbull looks a little different from the ones we have seen so far because it manipulates the request queue directly. It also drops the io_request_lock while performing transfers to allow the kernel to execute other block operations. Finally, the code must take care to avoid two separate perils: multiple calls of the request function and conflicting access to the device itself.
void sbull_request(request_queue_t *q)
{
    Sbull_Dev *device;
    struct request *req;
    int status;

    /* Find our device */
    device = sbull_locate_device(blkdev_entry_next_request(&q->queue_head));
    if (device->busy) /* no race here - io_request_lock held */
        return;
    device->busy = 1;

    /* Process requests in the queue */
    while (!list_empty(&q->queue_head)) {
        /* Pull the next request off the list. */
        req = blkdev_entry_next_request(&q->queue_head);
        blkdev_dequeue_request(req);
        spin_unlock_irq(&io_request_lock);
        spin_lock(&device->lock);

        /* Process all of the buffers in this (possibly clustered) request. */
        do {
            status = sbull_transfer(device, req);
        } while (end_that_request_first(req, status, DEVICE_NAME));

        spin_unlock(&device->lock);
        spin_lock_irq(&io_request_lock);
        end_that_request_last(req);
    }
    device->busy = 0;
}
Instead of using INIT_REQUEST, this function tests its specific request queue with the list function list_empty. As long as requests exist, it removes each one in turn from the queue with blkdev_dequeue_request. Only then, once the removal is complete, is it able to drop the io_request_lock and obtain the device-specific lock. The actual transfer is done using sbull_transfer, which we have already seen.

Each call to sbull_transfer handles exactly one buffer_head structure attached to the request. The function then calls end_that_request_first to dispose of that buffer, and, if the request is complete, goes on to end_that_request_last to clean up the request as a whole.
The management of concurrency here is worth a quick look. The busy flag is used to prevent multiple invocations of sbull_request. Since sbull_request is always called with the io_request_lock held, it is safe to test and set the busy flag with no additional protection. (Otherwise, an atomic_t could have been used.) The io_request_lock is dropped before the device-specific lock is acquired. It is possible to acquire multiple locks without risking deadlock, but it is harder; when the constraints allow, it is better to release one lock before obtaining another.
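The release-before-acquire discipline can be made concrete with flag "locks" (pure illustration; all names invented): the assertion in take_device encodes the rule that the global and per-device locks are never held at the same time.

```c
#include <assert.h>

static int global_held, device_held;   /* stand-in lock state */

static void take_global(void) { global_held = 1; }
static void drop_global(void) { global_held = 0; }
static void take_device(void) { assert(!global_held); device_held = 1; }
static void drop_device(void) { device_held = 0; }

/* The locking shape of sbull_request's inner loop: the global lock is
 * released before the per-device lock is taken, and reacquired only
 * after the per-device lock is dropped. */
void locked_transfer(void)
{
    take_global();   /* caller holds io_request_lock              */
    /* ...dequeue the request here, under the global lock...      */
    drop_global();   /* release before the slow part              */
    take_device();   /* per-device lock protects the transfer     */
    /* ...perform the transfer...                                 */
    drop_device();
    take_global();   /* reacquire before touching the queue again */
    drop_global();
}
```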
end_that_request_first is called without the io_request_lock held. Since this function operates only on the given request structure, calling it this way is safe, as long as the request is not on the queue. The call to end_that_request_last, however, requires that the lock be held, since it returns the request to the request queue’s free list. The function also always exits from the outer loop (and the function as a whole) with the io_request_lock held and the device lock released.
Multiqueue drivers must, of course, clean up all of their queues at module removal time:
for (i = 0; i < sbull_devs; i++)
    blk_cleanup_queue(&sbull_devices[i].queue);
blk_dev[major].queue = NULL;
It is worth noting, briefly, that this code could be made more efficient. It allocates a whole set of request queues at initialization time, even though some of them may never be used. A request queue is a large structure, since many (perhaps thousands) of request structures are allocated when the queue is initialized. A more clever implementation would allocate a request queue when needed, in either the open method or the queue function. We chose a simpler implementation for sbull in order to avoid complicating the code.
That covers the mechanics of multiqueue drivers. Drivers handling real hardware may have other issues to deal with, of course, such as serializing access to a controller. But the basic structure of multiqueue drivers is as we have seen here.
Much of the discussion to this point has centered around the manipulation of the I/O request queue. The purpose of the request queue is to improve performance by allowing the driver to act asynchronously and, crucially, by allowing the merging of contiguous (on the disk) operations. For normal disk devices, operations on contiguous blocks are common, and this optimization is necessary.
Not all block devices benefit from the request queue, however. sbull, for example, processes requests synchronously and has no problems with seek times. For sbull, the request queue actually ends up slowing things down. Other types of block devices also can be better off without a request queue. For example, RAID devices, which are made up of multiple disks, often spread “contiguous” blocks across multiple physical devices. Block devices implemented by the logical volume manager (LVM) capability (which first appeared in 2.4) also have an implementation that is more complex than the block interface that is presented to the rest of the kernel.
In the 2.4 kernel, block I/O requests are placed on the queue by the function __make_request, which is also responsible for invoking the driver’s request function. Block drivers that need more control over request queueing, however, can replace that function with their own “make request” function. The RAID and LVM drivers do so, providing their own variant that, eventually, requeues each I/O request (with different block numbers) to the appropriate low-level device (or devices) that make up the higher-level device. A RAM-disk driver, instead, can execute the I/O operation directly.
sbull, when loaded with the noqueue=1 option on 2.4 systems, will provide its own “make request” function and operate without a request queue. The first step in this scenario is to replace __make_request. The “make request” function pointer is stored in the request queue, and can be changed with blk_queue_make_request:

void blk_queue_make_request(request_queue_t *queue, make_request_fn *func);

The make_request_fn type, in turn, is defined as follows:

typedef int (make_request_fn) (request_queue_t *q, int rw, struct buffer_head *bh);
The “make request” function must arrange to transfer the given block, and see to it that the b_end_io function is called when the transfer is done. The kernel does not hold the io_request_lock when calling the make_request_fn function, so the function must acquire the lock itself if it will be manipulating the request queue. If the transfer has been set up (not necessarily completed), the function should return 0.
The phrase “arrange to transfer” was chosen carefully; often a driver-specific make request function will not actually transfer the data. Consider a RAID device. What the function really needs to do is to map the I/O operation onto one of its constituent devices, then invoke that device’s driver to actually do the work. This mapping is done by setting the b_rdev member of the buffer_head structure to the number of the “real” device that will do the transfer, then signaling that the block still needs to be written by returning a nonzero value.
When the kernel sees a nonzero return value from the make request function, it concludes that the job is not done and will try again. But first it will look up the make request function for the device indicated in the b_rdev field. Thus, in the RAID case, the RAID driver’s “make request” function will not be called again; instead, the kernel will pass the block to the appropriate function for the underlying device.
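This remap-and-retry protocol can be modeled in user space. A RAID-like make request function rewrites b_rdev and returns nonzero; the "kernel" loop then re-dispatches to the device named by b_rdev. The device numbers, the two-entry table, and all function names are invented for the demonstration.

```c
#include <assert.h>
#include <stddef.h>

struct mbh { int b_rdev; int transferred; };

/* A RAID-style make request function: map, don't transfer. */
static int raid_make_request(struct mbh *bh)
{
    bh->b_rdev = 1;      /* remap onto the underlying disk  */
    return 1;            /* nonzero: kernel must try again  */
}

/* A real device's make request function: do the work. */
static int disk_make_request(struct mbh *bh)
{
    bh->transferred = 1; /* actually perform the I/O        */
    return 0;            /* zero: transfer has been set up  */
}

typedef int (*make_request_fn)(struct mbh *bh);
static make_request_fn devices[] = { raid_make_request, disk_make_request };

/* The kernel's retry loop, in miniature: keep dispatching to the
 * function for the (possibly remapped) b_rdev until one returns 0. */
void submit(struct mbh *bh)
{
    while (devices[bh->b_rdev](bh))
        ;
}
```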
sbull, at initialization time, sets up its make request function as follows:
if (noqueue)
    blk_queue_make_request(BLK_DEFAULT_QUEUE(major), sbull_make_request);
It does not call blk_init_queue when operating in this mode, because the request queue will not be used.
When the kernel generates a request for an sbull device, it will call sbull_make_request, which is as follows:
int sbull_make_request(request_queue_t *queue, int rw, struct buffer_head *bh)
{
    u8 *ptr;

    /* Figure out what we are doing */
    Sbull_Dev *device = sbull_devices + MINOR(bh->b_rdev);
    ptr = device->data + bh->b_rsector * sbull_hardsect;

    /* Paranoid check; this apparently can really happen */
    if (ptr + bh->b_size > device->data + sbull_blksize*sbull_size) {
        static int count = 0;
        if (count++ < 5)
            printk(KERN_WARNING "sbull: request past end of device\n");
        bh->b_end_io(bh, 0);
        return 0;
    }

    /* This could be a high-memory buffer; shift it down */
#if CONFIG_HIGHMEM
    bh = create_bounce(rw, bh);
#endif

    /* Do the transfer */
    switch (rw) {
      case READ:
      case READA:  /* Read ahead */
        memcpy(bh->b_data, ptr, bh->b_size); /* from sbull to buffer */
        bh->b_end_io(bh, 1);
        break;
      case WRITE:
        refile_buffer(bh);
        memcpy(ptr, bh->b_data, bh->b_size); /* from buffer to sbull */
        mark_buffer_uptodate(bh, 1);
        bh->b_end_io(bh, 1);
        break;
      default:
        /* can't happen */
        bh->b_end_io(bh, 0);
        break;
    }

    /* A return of 0 tells the kernel the transfer has been handled */
    return 0;
}
For the most part, this code should look familiar. It contains the usual calculations to determine where the block lives within the sbull device and uses memcpy to perform the operation. Because the operation completes immediately, it is able to call bh->b_end_io to indicate the completion of the operation, and it returns 0 to the kernel.
There is, however, one detail that the “make request” function must take care of. The buffer to be transferred could be resident in high memory, which is not directly accessible by the kernel. High memory is covered in detail in Chapter 13. We won’t repeat the discussion here; suffice it to say that one way to deal with the problem is to replace a high-memory buffer with one that is in accessible memory. The function create_bounce will do so, in a way that is transparent to the driver. The kernel normally uses create_bounce before placing buffers in the driver’s request queue; if the driver implements its own make_request_fn, however, it must take care of this task itself.
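The write-side half of what create_bounce accomplishes can be sketched in ordinary C: the driver cannot touch the original buffer, so the data is first copied into an allocation the driver can reach, and the driver then works on the copy. The function name and the restriction to writes are ours; the real kernel helper also arranges the copy-back for reads.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Produce a "bounce" copy of a buffer the driver cannot access
 * directly.  The caller frees the returned buffer when the I/O is
 * complete.  Returns NULL on allocation failure. */
char *bounce_for_write(const char *high_buf, size_t size)
{
    char *low = malloc(size);        /* "low memory" stand-in  */
    if (low)
        memcpy(low, high_buf, size); /* driver now sees a copy */
    return low;
}
```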
[49] The RAM-disk driver, for example, makes its memory look as if it were in the buffer cache. Since the “disk” buffer is already in system RAM, there’s no need to keep a copy in the buffer cache. Our sample code is thus much less efficient than a properly implemented RAM disk, not being concerned with RAM-disk-specific performance issues.