We have already seen race conditions come up a number of times in the previous chapters. Whereas race conditions can happen at any time on SMP systems, uniprocessor systems, to this point, have had to worry about them rather less.[38] Interrupts, however, can bring with them a whole new set of race conditions, even on uniprocessor systems. Since an interrupt can happen at any time, it can cause the interrupt handler to be executed in the middle of an arbitrary piece of driver code. Thus, any device driver that is working with interrupts—and that is most of them—must be very concerned with race conditions. For this reason, we look more closely at race conditions and their prevention in this chapter.
Dealing with race conditions is one of the trickiest aspects of programming, because the related bugs are subtle and very difficult to reproduce, and it’s hard to tell when there is a race condition between interrupt code and the driver methods. The programmer must take great care to avoid corruption of data or metadata.
Different techniques can be employed to prevent data corruption, and we will introduce the most common ones. We won’t show complete code because the best code for each situation depends on the operating mode of the device being driven, and on the programmer’s taste. All of the drivers in this book, however, protect themselves against race conditions, so examples can be found in the sample code.
The most common ways of protecting data from concurrent access are as follows:
- Using a circular buffer and avoiding shared variables
- Using spinlocks to enforce mutual exclusion
- Using lock variables that are atomically incremented and decremented
Note that semaphores are not listed here. Because locking a semaphore can put a process to sleep, semaphores may not be used in interrupt handlers.
Whatever approach you choose, you still need to decide what to do when accessing a variable that can be modified at interrupt time. In simple cases, such a variable can simply be declared as volatile to prevent the compiler from optimizing access to its value (for example, it prevents the compiler from holding the value in a register for the whole duration of a function). However, the compiler generates suboptimal code whenever volatile variables are involved, so you might choose to resort to some sort of locking instead. In more complicated situations, there is no choice but to use some sort of locking.
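For instance (a minimal sketch; the variable name is made up for illustration), a flag shared with an interrupt handler could be declared this way:

/* Hypothetical flag set at interrupt time and polled by driver code;
   volatile forces the compiler to reread it from memory on every access. */
static volatile int data_ready;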
Using a circular buffer is an effective way of handling concurrent-access problems; the best way to deal with concurrent access is to perform no concurrent access whatsoever.
The circular buffer uses an algorithm called “producer and consumer”: one player pushes data in and the other pulls data out. Concurrent access is avoided if there is exactly one producer and exactly one consumer. The short driver contains two examples of producer and consumer: in one case, the reading process waits to consume data that is produced at interrupt time; in the other, the bottom half consumes data produced by the top half.
Two pointers are used to address a circular buffer: head and tail. head is the point at which data is being written and is updated only by the producer of the data. Data is being read from tail, which is updated only by the consumer.

As mentioned earlier, if data is written at interrupt time, you must be careful when accessing head multiple times. You should either declare it as volatile or use some sort of locking.
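As a minimal sketch of this arrangement (the buffer size, the names, and the power-of-two masking are assumptions for illustration, not code from short), the producer and consumer might look like this; note that, like short, this version performs no overflow check:

#define BUF_SIZE 1024   /* assumed size; a power of two keeps the wraparound cheap */

static unsigned char buffer[BUF_SIZE];
static volatile int head;   /* written only by the producer (interrupt time) */
static volatile int tail;   /* written only by the consumer (process context) */

/* Producer, called at interrupt time: store one byte and advance head. */
static void produce(unsigned char c)
{
    buffer[head] = c;
    head = (head + 1) & (BUF_SIZE - 1);
}

/* Consumer, called from process context: returns nonzero if a byte was read. */
static int consume(unsigned char *c)
{
    if (head == tail)
        return 0;           /* buffer empty */
    *c = buffer[tail];
    tail = (tail + 1) & (BUF_SIZE - 1);
    return 1;
}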
The circular buffer runs smoothly, except when it fills up. If that happens, things become hairy, and you can choose among different possible solutions. The short implementation just loses data; there’s no check for overflow, and if head goes beyond tail, a whole buffer of data is lost. Some alternative implementations are to drop the last item; to overwrite the buffer tail, as printk does (see Section 4.1.2 in Chapter 4); to hold up the producer, as scullpipe does; or to allocate a temporary extra buffer to back up the main buffer. The best solution depends on the importance of your data and other situation-specific questions, so we won’t cover it here.
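For instance, one of the policies just mentioned, dropping incoming data when the buffer is full, could be grafted onto the hypothetical producer shown earlier with a simple check:

/* Producer that drops the new byte when the buffer is full (one slot is
   sacrificed so that "full" can be told apart from "empty"). */
static void produce(unsigned char c)
{
    int next = (head + 1) & (BUF_SIZE - 1);

    if (next == tail)
        return;             /* buffer full: lose the newest item */
    buffer[head] = c;
    head = next;
}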
Although the circular buffer appears to solve the problem of concurrent access, there is still the possibility of a race condition when the read function goes to sleep. This code shows where the problem appears in short:
while (short_head == short_tail) {
    interruptible_sleep_on(&short_queue);
    /* ... */
}
When executing this statement, it is possible that new data will arrive after the while condition is evaluated as true and before the process goes to sleep. Information carried in by the interrupt won’t be read by the process; the process goes to sleep even though head != tail, and it isn’t awakened until the next data item arrives.
We didn’t implement correct locking for short because the source of short_read is included in Section 8.3.2 in Chapter 8, and at that point this discussion was not worth introducing. Also, the data involved is not worth the effort.
Although the data that short collects is not vital, and the likelihood of getting an interrupt in the time lapse between two successive instructions is often negligible, sometimes you just can’t take the risk of going to sleep when data is pending. This problem is general enough to deserve special treatment and is delayed to Section 9.8.4 later in this chapter, where we’ll discuss it in detail.
It’s interesting to note that only a producer-and-consumer situation can be addressed with a circular buffer. A programmer must often deal with more complex data structures to solve the concurrent-access problem. The producer/consumer situation is actually the simplest class of these problems; other structures, such as linked lists, simply don’t lend themselves to a circular buffer implementation.
We have seen spinlocks before, for example, in the scull driver. The discussion thus far has looked only at a few uses of spinlocks; in this section we cover them in rather more detail.
A spinlock, remember, works through a shared variable. A function may acquire the lock by setting the variable to a specific value. Any other function needing the lock will query it and, seeing that it is not available, will “spin” in a busy-wait loop until it is available. Spinlocks thus need to be used with care. A function that holds a spinlock for too long can waste much time because other CPUs are forced to wait.
Spinlocks are represented by the type spinlock_t, which, along with the various spinlock functions, is declared in <asm/spinlock.h>. Normally, a spinlock is declared and initialized to the unlocked state with a line like:
spinlock_t my_lock = SPIN_LOCK_UNLOCKED;
If, instead, it is necessary to initialize a spinlock at runtime, use spin_lock_init:
spin_lock_init(&my_lock);
There are a number of functions (actually macros) that work with spinlocks:
- spin_lock(spinlock_t *lock);
Acquire the given lock, spinning if necessary until it is available. On return from spin_lock, the calling function owns the lock.
- spin_lock_irqsave(spinlock_t *lock, unsigned long flags);
This version also acquires the lock; in addition, it disables interrupts on the local processor and stores the current interrupt state in flags. Note that all of the spinlock primitives are defined as macros, and that the flags argument is passed directly, not as a pointer.
- spin_lock_irq(spinlock_t *lock);
This function acts like spin_lock_irqsave, except that it does not save the current interrupt state. This version is slightly more efficient than spin_lock_irqsave, but it should only be used in situations in which you know that interrupts will not have already been disabled.
- spin_lock_bh(spinlock_t *lock);
Obtains the given lock and prevents the execution of bottom halves.
- spin_unlock(spinlock_t *lock);
- spin_unlock_irqrestore(spinlock_t *lock, unsigned long flags);
- spin_unlock_irq(spinlock_t *lock);
- spin_unlock_bh(spinlock_t *lock);
These functions are the counterparts of the various locking primitives described previously. spin_unlock unlocks the given lock and nothing else. spin_unlock_irqrestore possibly enables interrupts, depending on the flags value (which should have come from spin_lock_irqsave). spin_unlock_irq enables interrupts unconditionally, and spin_unlock_bh reenables bottom-half processing. In each case, your function should be in possession of the lock before calling one of the unlocking primitives, or serious disorder will result.
- spin_is_locked(spinlock_t *lock);
- spin_trylock(spinlock_t *lock);
- spin_unlock_wait(spinlock_t *lock);
spin_is_locked queries the state of a spinlock without changing it. It returns nonzero if the lock is currently busy. To attempt to acquire a lock without waiting, use spin_trylock, which returns nonzero if it obtained the lock and zero if the lock was busy. spin_unlock_wait waits until the lock becomes free, but does not take possession of it.
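As a brief, hedged illustration (the lock, the counter, and the helper function are invented for the example), spin_trylock lets you skip noncritical work rather than spin:

static spinlock_t my_lock = SPIN_LOCK_UNLOCKED;
static unsigned long skipped_updates;

/* Update shared statistics only if the lock is immediately available. */
static void try_update_statistics(void)
{
    if (spin_trylock(&my_lock)) {
        update_statistics();    /* hypothetical helper protected by the lock */
        spin_unlock(&my_lock);
    } else {
        skipped_updates++;      /* lock busy: note the miss and move on */
    }
}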
Many users of spinlocks stick to spin_lock and spin_unlock. If you are using spinlocks in interrupt handlers, however, you must use the IRQ-disabling versions (usually spin_lock_irqsave and spin_unlock_irqrestore) in the noninterrupt code. To do otherwise is to invite a deadlock situation.
It is worth considering an example here. Assume that your driver is running in its read method, and it obtains a lock with spin_lock. While the read method is holding the lock, your device interrupts, and your interrupt handler is executed on the same processor. If it attempts to use the same lock, it will go into a busy-wait loop, since your read method already holds the lock. But, since the interrupt routine has preempted that method, the lock will never be released and the processor deadlocks, which is probably not what you wanted.
This problem can be avoided by using spin_lock_irqsave to disable interrupts on the local processor while the lock is held. When in doubt, use the _irqsave versions of the primitives and you will not need to worry about deadlocks. Remember, though, that the flags value from spin_lock_irqsave must not be passed to other functions.
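A minimal sketch of that division of labor follows; the lock, the buffer comments, and the function names are assumptions, not code from any driver in this book:

static spinlock_t dev_lock = SPIN_LOCK_UNLOCKED;

/* Process context (for example, the read method): disable local interrupts
   while holding the lock so the interrupt handler cannot deadlock against us. */
static void consume_from_buffer(void)
{
    unsigned long flags;

    spin_lock_irqsave(&dev_lock, flags);
    /* ... pull data out of the shared buffer ... */
    spin_unlock_irqrestore(&dev_lock, flags);
}

/* Interrupt handler: the noninterrupt code never holds the lock with
   interrupts enabled on this CPU, so a plain spin_lock is sufficient here. */
static void sample_interrupt(int irq, void *dev_id, struct pt_regs *regs)
{
    spin_lock(&dev_lock);
    /* ... push new data into the shared buffer ... */
    spin_unlock(&dev_lock);
}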
Regular spinlocks work well for most situations encountered by device driver writers. In some cases, however, there is a particular pattern of access to critical data that is worth treating specially. If you have a situation in which numerous threads (processes, interrupt handlers, bottom-half routines) need to access critical data in a read-only mode, you may be worried about the overhead of using spinlocks. Numerous readers cannot interfere with each other; only a writer can create problems. In such situations, it is far more efficient to allow all readers to access the data simultaneously.
Linux has a different type of spinlock, called a reader-writer spinlock, for this case. These locks have a type of rwlock_t and should be initialized to RW_LOCK_UNLOCKED. Any number of threads can hold the lock for reading at the same time. When a writer comes along, however, it waits until it can get exclusive access.
The functions for working with reader-writer locks are as follows:
- read_lock(rwlock_t *lock);
- read_lock_irqsave(rwlock_t *lock, unsigned long flags);
- read_lock_irq(rwlock_t *lock);
- read_lock_bh(rwlock_t *lock);
- read_unlock(rwlock_t *lock);
- read_unlock_irqrestore(rwlock_t *lock, unsigned long flags);
- read_unlock_irq(rwlock_t *lock);
- read_unlock_bh(rwlock_t *lock);
- write_lock(rwlock_t *lock);
- write_lock_irqsave(rwlock_t *lock, unsigned long flags);
- write_lock_irq(rwlock_t *lock);
- write_lock_bh(rwlock_t *lock);
- write_unlock(rwlock_t *lock);
- write_unlock_irqrestore(rwlock_t *lock, unsigned long flags);
- write_unlock_irq(rwlock_t *lock);
- write_unlock_bh(rwlock_t *lock);
If your interrupt handler uses read locks only, then all of your code may acquire read locks with read_lock and not disable interrupts. Any write locks must be acquired with write_lock_irqsave, however, to avoid deadlocks.
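A hedged sketch of such an arrangement (the statistics structure, the helper, and the function names are assumptions):

static rwlock_t stats_lock = RW_LOCK_UNLOCKED;
static struct { unsigned long packets; unsigned long errors; } stats;

/* Interrupt handler: only reads the shared data, so a plain read lock will do. */
static void report_stats(int irq, void *dev_id, struct pt_regs *regs)
{
    read_lock(&stats_lock);
    tell_the_device(stats.packets, stats.errors);   /* hypothetical helper */
    read_unlock(&stats_lock);
}

/* Process context: the writer must disable interrupts to avoid deadlock. */
static void update_stats(unsigned long packets, unsigned long errors)
{
    unsigned long flags;

    write_lock_irqsave(&stats_lock, flags);
    stats.packets = packets;
    stats.errors = errors;
    write_unlock_irqrestore(&stats_lock, flags);
}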
It is worth noting that in kernels built for uniprocessor systems, the spinlock functions expand to nothing. They thus have no overhead (other than possibly disabling interrupts) on those systems, where they are not needed.
The kernel provides a set of functions that may be used to provide atomic (noninterruptible) access to variables. Use of these functions can occasionally eliminate the need for a more complicated locking scheme, when the operations to be performed are very simple. The atomic operations may also be used to provide a sort of “poor person’s spinlock” by manually testing and looping. It is usually better, however, to use spinlocks directly, since they have been optimized for this purpose.
The Linux kernel exports two sets of functions to deal with locks: bit operations and access to the “atomic” data type.
It’s quite common to have single-bit lock variables or to update device status flags at interrupt time—while a process may be accessing them. The kernel offers a set of functions that modify or test single bits atomically. Because the whole operation happens in a single step, no interrupt (or other processor) can interfere.
Atomic bit operations are very fast, since they perform the operation using a single machine instruction without disabling interrupts whenever the underlying platform can do that. The functions are architecture dependent and are declared in <asm/bitops.h>. They are guaranteed to be atomic even on SMP computers and are useful to keep coherence across processors.
Unfortunately, data typing in these functions is architecture dependent as well. The nr argument is mostly defined as int but is unsigned long for a few architectures. Here is the list of bit operations as they appear in 2.1.37 and later:
- void set_bit(nr, void *addr);
This function sets bit number nr in the data item pointed to by addr. The function acts on an unsigned long, even though addr is a pointer to void.

- void clear_bit(nr, void *addr);
The function clears the specified bit in the unsigned long datum that lives at addr. Its semantics are otherwise the same as set_bit.
- void change_bit(nr, void *addr);
This function toggles the specified bit.

- test_bit(nr, void *addr);
This function is the only bit operation that doesn’t need to be atomic; it simply returns the current value of the bit.
- int test_and_set_bit(nr, void *addr);
- int test_and_clear_bit(nr, void *addr);
- int test_and_change_bit(nr, void *addr);
These functions behave atomically like those listed previously, except that they also return the previous value of the bit.
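As a quick, hedged illustration of the simple case, here is a single status bit shared between interrupt and process context (the bit number, the variable, and the helper are made up for the example):

static unsigned long dev_flags;   /* hypothetical flags word; bit 0 means "data ready" */

/* Called from the interrupt handler when new data arrives. */
static void mark_data_ready(void)
{
    set_bit(0, &dev_flags);
}

/* Called from process context; the test and the clear happen in one atomic step. */
static int fetch_data_if_ready(void)
{
    if (test_and_clear_bit(0, &dev_flags)) {
        /* ... copy the new data out of the device ... */
        return 1;
    }
    return 0;
}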
When these functions are used to access and modify a shared flag, you don’t have to do anything except call them. Using bit operations to manage a lock variable that controls access to a shared variable, on the other hand, is more complicated and deserves an example. Most modern code will not use bit operations in this way, but code like the following still exists in the kernel.
A code segment that needs to access a shared data item tries to atomically acquire a lock using either test_and_set_bit or test_and_clear_bit. The usual implementation is shown here; it assumes that the lock lives at bit nr of address addr. It also assumes that the bit is either 0 when the lock is free or nonzero when the lock is busy.
/* try to set lock */
while (test_and_set_bit(nr, addr) != 0)
    wait_for_a_while();

/* do your work */

/* release lock, and check... */
if (test_and_clear_bit(nr, addr) == 0)
    something_went_wrong(); /* already released: error */
If you read through the kernel source, you will find code that works like this example. As mentioned before, however, it is better to use spinlocks in new code, unless you need to perform useful work while waiting for the lock to be released (e.g., in the wait_for_a_while() instruction of this listing).
Kernel programmers often need to share an integer variable between an interrupt handler and other functions. A separate set of functions has been provided to facilitate this sort of sharing; they are defined in <asm/atomic.h>.
The facility offered by atomic.h is much stronger than the bit operations just described. atomic.h defines a new data type, atomic_t, which can be accessed only through atomic operations. An atomic_t holds an int value on all supported architectures. Because of the way this type works on some processors, however, the full integer range may not be available; thus, you should not count on an atomic_t holding more than 24 bits. The following operations are defined for the type and are guaranteed to be atomic with respect to all processors of an SMP computer. The operations are very fast because they compile to a single machine instruction whenever possible.
- void atomic_set(atomic_t *v, int i);
Set the atomic variable v to the integer value i.

- int atomic_read(atomic_t *v);
Return the current value of v.
- void atomic_add(int i, atomic_t *v);
Add i to the atomic variable pointed to by v. The return value is void, because most of the time there’s no need to know the new value. This function is used by the networking code to update statistics about memory usage in sockets.
- void atomic_sub(int i, atomic_t *v);
Subtract i from the atomic variable pointed to by v.

- void atomic_inc(atomic_t *v);
- void atomic_dec(atomic_t *v);
Increment or decrement the atomic variable pointed to by v.
- int atomic_inc_and_test(atomic_t *v);
- int atomic_dec_and_test(atomic_t *v);
- int atomic_add_and_test(int i, atomic_t *v);
- int atomic_sub_and_test(int i, atomic_t *v);
These functions behave like their counterparts listed earlier, but they also test the result: they return a true (nonzero) value if the atomic variable is zero after the operation, and false otherwise.
As stated earlier, atomic_t data items must be accessed only through these functions. If you pass an atomic item to a function that expects an integer argument, you’ll get a compiler error.
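As a hedged sketch (the counter and its use are invented for illustration, and a single consumer is assumed), an atomic_t can count events shared between an interrupt handler and process context without an explicit lock:

static atomic_t pending_events = ATOMIC_INIT(0);

/* Interrupt handler: record one more event. */
static void note_event(void)
{
    atomic_inc(&pending_events);
}

/* Process context (single consumer): consume one event, if any are pending. */
static int consume_event(void)
{
    if (atomic_read(&pending_events) == 0)
        return 0;
    atomic_dec(&pending_events);
    return 1;
}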
The one race condition that has been omitted so far in this discussion is the problem of going to sleep. Generally stated, the event you are waiting for can arrive in the time between when your driver decides to sleep and when the sleep_on call is actually performed; the condition you are sleeping for may thus come about before you actually go to sleep, leading to a longer sleep than expected. It is a problem far more general than interrupt-driven I/O, and an efficient solution requires a little knowledge of the internals of sleep_on.
As an example, consider again the following code from the short driver:
while (short_head == short_tail) {
    interruptible_sleep_on(&short_queue);
    /* ... */
}
In this case, the value of short_head could change between the test in the while statement and the call to interruptible_sleep_on. In that case, the driver will sleep even though new data is available; this condition leads to delays in the best case, and a lockup of the device in the worst.
The way to solve this problem is to go halfway to sleep before performing the test. The idea is that the process can add itself to the wait queue, declare itself to be sleeping, and then perform its tests. This is the typical implementation:
wait_queue_t wait;

init_waitqueue_entry(&wait, current);
add_wait_queue(&short_queue, &wait);
while (1) {
    set_current_state(TASK_INTERRUPTIBLE);
    if (short_head != short_tail) /* whatever test your driver needs */
        break;
    schedule();
}
set_current_state(TASK_RUNNING);
remove_wait_queue(&short_queue, &wait);
This code is somewhat like an unrolling of the internals of sleep_on; we’ll step through it here.
The code starts by declaring a wait_queue_t variable, initializing it, and adding it to the driver’s wait queue (which, as you may remember, is of type wait_queue_head_t). Once these steps have been performed, a call to wake_up on short_queue will wake this process.
The process is not yet asleep, however. It gets closer to that state with the call to set_current_state, which sets the process’s state to TASK_INTERRUPTIBLE. The rest of the system now thinks that the process is asleep, and the scheduler will not try to run it. This is an important step in the “going to sleep” process, but things still are not done.
What happens now is that the code tests for the condition for which it is waiting, namely, that there is data in the buffer. If no data is present, a call to schedule is made, causing some other process to run and truly putting the current process to sleep. Once the process is woken up, it will test for the condition again, and possibly exit from the loop.
Beyond the loop, there is just a bit of cleaning up to do. The current state is set to TASK_RUNNING to reflect the fact that we are no longer asleep; this is necessary because if we exited the loop without ever sleeping, we may still be in TASK_INTERRUPTIBLE. Then remove_wait_queue is used to take the process off the wait queue.
So why is this code free of race conditions? When new data comes in, the interrupt handler will call wake_up on short_queue, which has the effect of setting the state of every sleeping process on the queue to TASK_RUNNING. If the wake_up call happens after the buffer has been tested, the state of the task will be changed and schedule will cause the current process to continue running, after a short delay if not immediately.
This sort of “test while half asleep” pattern is so common in the kernel source that a pair of macros, wait_event and wait_event_interruptible, was added during 2.1 development to make life easier.
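As a hedged sketch (reusing short’s variables from the listing above, inside a hypothetical read method), the hand-rolled loop could be replaced with a single call:

/* Sleeps until the condition is true; returns nonzero if a signal arrived. */
if (wait_event_interruptible(short_queue, short_head != short_tail))
    return -ERESTARTSYS;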
[38] Note, however, that the kernel developers are seriously considering making all kernel code preemptable at almost any time, making locking mandatory even on uniprocessor systems.