We mentioned earlier in Section 4.6 that several tasks among those executed by the kernel are not critical: they can be deferred for a long period of time, if necessary. Remember that the interrupt service routines of an interrupt handler are serialized, and often a new occurrence of an interrupt cannot be processed until the corresponding interrupt handler has terminated. Conversely, deferrable tasks can execute with all interrupts enabled. Taking them out of the interrupt handler helps keep kernel response time small. This is a very important property for many time-critical applications that expect their interrupt requests to be serviced in a few milliseconds.
Linux 2.4 answers such a challenge by using three kinds of deferrable and interruptible kernel functions (in short, deferrable functions [35]): softirqs , tasklets , and bottom halves . Although these three kinds of deferrable functions work in different ways, they are strictly correlated. Tasklets are implemented on top of softirqs, and bottom halves are implemented by means of tasklets. As a matter of fact, the term “softirq,” which appears in the kernel source code, often denotes all kinds of deferrable functions.
As a general rule, no softirq can be interrupted to run another softirq on the same CPU; the same rule holds for tasklets and bottom halves built on top of softirqs. On a multiprocessor system, however, several deferrable functions can run concurrently on different CPUs. The degree of concurrency depends on the type of deferrable function, as shown in Table 4-6.
Table 4-6. Differences between softirqs, tasklets, and bottom halves

Deferrable function | Dynamic allocation | Concurrency
---|---|---
Softirq | No | Softirqs of the same type can run concurrently on several CPUs.
Tasklet | Yes | Tasklets of different types can run concurrently on several CPUs, but tasklets of the same type cannot.
Bottom half | No | Bottom halves cannot run concurrently on several CPUs.
Softirqs and bottom halves are statically allocated (i.e., defined at compile time), while tasklets can also be allocated and initialized at runtime (for instance, when loading a kernel module).
Softirqs can always be executed concurrently on several CPUs, even softirqs of the same type. Consequently, softirq functions must be re-entrant and must explicitly protect their data structures with spin locks.
Tasklets differ from softirqs because a tasklet is always serialized with respect to itself; in other words, a tasklet cannot be executed by two CPUs at the same time. However, different tasklets can be executed concurrently on several CPUs. Serializing the tasklet simplifies the life of device driver developers, since the tasklet function need not be re-entrant.
Finally, bottom halves are globally serialized. While one bottom half is executing on some CPU, the other CPUs cannot execute any bottom half, even one of a different type. This is a quite strong limitation, since it degrades the performance of the Linux kernel on multiprocessor systems. As a matter of fact, bottom halves continue to be supported by the kernel for compatibility reasons only, and device driver developers are expected to update their old drivers and replace bottom halves with tasklets. Therefore, it is likely that bottom halves will disappear in a future version of Linux.
In any case, deferrable functions execute serially on any given CPU: a deferrable function cannot be interleaved with other deferrable functions on the same CPU.
Generally speaking, four kinds of operations can be performed on deferrable functions:
- Initialization
Defines a new deferrable function; this operation is usually done when the kernel initializes itself.
- Activation
Marks a deferrable function as “pending” — to be run in the next round of executions of the deferrable functions. Activation can be done at any time (even while handling interrupts).
- Masking
Selectively disables a deferrable function in such a way that it will not be executed by the kernel even if activated. We’ll see in Section 5.3.11 that disabling deferrable functions is sometimes essential.
- Execution
Executes a pending deferrable function together with all other pending deferrable functions of the same type; execution is performed at well-specified times, explained later in Section 4.7.1.
Activation and execution are somehow bound together: a deferrable function that has been activated by a given CPU must be executed on the same CPU. There is no self-evident reason why this rule should benefit system performance. Binding the deferrable function to the activating CPU could in theory make better use of the CPU hardware cache. After all, it is conceivable that the activating kernel thread accesses some data structures that will also be used by the deferrable function. However, the relevant lines could easily be no longer in the cache when the deferrable function is run, because its execution can be delayed a long time. Moreover, binding a function to a CPU is always a potentially "dangerous" operation, since a CPU might end up very busy while the others are mostly idle.
Linux 2.4 uses a limited number of softirqs. For most purposes, tasklets are good enough and are much easier to write because they do not need to be re-entrant.
As a matter of fact, only the four kinds of softirqs listed in Table 4-7 are currently defined.
Table 4-7. Softirqs used in Linux 2.4

Softirq | Index (priority) | Description
---|---|---
HI_SOFTIRQ | 0 | Handles high-priority tasklets and bottom halves
NET_TX_SOFTIRQ | 1 | Transmits packets to network cards
NET_RX_SOFTIRQ | 2 | Receives packets from network cards
TASKLET_SOFTIRQ | 3 | Handles tasklets
The index of a softirq determines its priority: a lower index means higher priority, because softirq functions are executed starting from index 0.
The main data structure used to represent softirqs is the softirq_vec array, which includes 32 elements of type softirq_action. The priority of a softirq is the index of the corresponding softirq_action element inside the array. As shown in Table 4-7, only the first four entries of the array are effectively used. The softirq_action data structure consists of two fields: a pointer to the softirq function and a pointer to a generic data structure that may be needed by the softirq function.
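For reference, the 2.4 declaration (from include/linux/interrupt.h) is essentially:

```c
struct softirq_action
{
    void (*action)(struct softirq_action *);  /* the softirq function */
    void *data;                               /* data passed to the function */
};
```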
The irq_stat array, already introduced in Section 4.6.1.2, includes several fields used by the kernel to implement softirqs (and also tasklets and bottom halves, which depend on softirqs). Each element of the array, corresponding to a given CPU, includes:

- A __softirq_pending field that stores a bit mask of the pending softirqs: the bit at index n is set when the softirq with index n has been activated. This field may easily be accessed through the softirq_pending macro.
- A __local_bh_count field that disables the execution of the softirqs (as well as tasklets and bottom halves). This field may easily be accessed through the local_bh_count macro. If it is set to zero, the softirqs are enabled; if the field stores a positive integer, the softirqs are disabled. The local_bh_disable macro increments the field, while the local_bh_enable macro decrements it. If the kernel invokes local_bh_disable twice, it must also call local_bh_enable twice to re-enable softirqs.[36]
- A __ksoftirqd_task field that stores the process descriptor address of the ksoftirqd_CPUn kernel thread, which is devoted to the execution of deferrable functions. (There is one such thread per CPU, and the n in ksoftirqd_CPUn represents the CPU index, as described later in Section 4.7.1.1.) This field can be accessed through the ksoftirqd_task macro.
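On the IA-32 architecture, the element type of irq_stat is irq_cpustat_t, which in the 2.4 sources (include/asm-i386/hardirq.h) looks roughly like this; field order and the extra counters vary with the architecture and kernel release:

```c
typedef struct {
    unsigned int __softirq_pending;        /* bit mask of pending softirqs */
    unsigned int __local_irq_count;        /* nesting level of interrupt handlers */
    unsigned int __local_bh_count;         /* nonzero: deferrable functions disabled */
    unsigned int __syscall_count;
    struct task_struct *__ksoftirqd_task;  /* descriptor of ksoftirqd_CPUn */
    unsigned int __nmi_count;
} ____cacheline_aligned irq_cpustat_t;
```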
The open_softirq() function takes care of softirq initialization. It uses three parameters: the softirq index, a pointer to the softirq function to be executed, and a second pointer to a data structure that may be required by the softirq function. open_softirq() limits itself to initializing the proper entry of the softirq_vec array.
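In fact, the whole 2.4 implementation (kernel/softirq.c) is essentially:

```c
void open_softirq(int nr, void (*action)(struct softirq_action *), void *data)
{
    softirq_vec[nr].data = data;      /* optional data for the softirq function */
    softirq_vec[nr].action = action;  /* the softirq function itself */
}
```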
Softirqs are activated by invoking the __cpu_raise_softirq macro, which receives as parameters the CPU number cpu and the softirq index nr, and sets the nr-th bit of softirq_pending(cpu). The cpu_raise_softirq() function is similar to the __cpu_raise_softirq macro, except that it might also wake up the ksoftirqd_CPUn kernel thread.
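In essence, activation boils down to setting one bit; a sketch of the semantics (the exact definition varies slightly among 2.4 releases):

```c
/* Mark the softirq with index nr as pending on the given CPU. */
#define __cpu_raise_softirq(cpu, nr)  (softirq_pending(cpu) |= (1UL << (nr)))
```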
Checks for pending softirqs are performed at a few points of the kernel code. Currently, this is done in the following cases (be warned that the number and position of the softirq checkpoints change both with the kernel version and with the supported hardware architecture):
- When the local_bh_enable macro re-enables the softirqs
- When the do_IRQ() function finishes handling an I/O interrupt
- When the smp_apic_timer_interrupt() function finishes handling a local timer interrupt (see Section 6.2.2)
- When one of the special ksoftirqd_CPUn kernel threads is awoken
- When a packet is received on a network interface card (see Chapter 18)
At each checkpoint, the kernel reads softirq_pending(cpu); if this field is not zero, the kernel invokes do_softirq() to execute the softirq functions. do_softirq() performs the following actions:
1. Gets the logical number cpu of the CPU that executes the function.
2. Returns if local_irq_count(cpu) is not set to zero. In this case, do_softirq() was invoked while terminating a nested interrupt handler, and we know that deferrable functions must run outside of interrupt service routines.
3. Returns if local_bh_count(cpu) is not set to zero. In this case, all deferrable functions are disabled.
4. Saves the state of the IF flag and clears it to disable local interrupts.
5. Checks the softirq_pending(cpu) field of irq_stat. If no softirqs are pending, restores the value of the IF flag saved in the previous step and then returns.
6. Invokes local_bh_disable(cpu) to increment the local_bh_count(cpu) field of irq_stat. In this way, deferrable functions are effectively serialized on the CPU, because any further invocation of do_softirq() returns without executing the softirq functions (see the check at Step 3).
7. Executes the following loop:
```c
pending = softirq_pending(cpu);
mask = ~0;
do {
    mask &= ~pending;
    softirq_pending(cpu) = 0;
    asm("sti");
    for (i = 0; pending; pending >>= 1, i++)
        if (pending & 1)
            softirq_vec[i].action(softirq_vec + i);
    asm("cli");
    pending = softirq_pending(cpu);
} while (pending & mask);
```
As you may see, the function stores the pending softirqs in the pending local variable. In each iteration of the loop, the function:

a. Updates the mask local variable by clearing the bits of the softirqs that are about to be executed; mask thus keeps track of the softirqs that have not yet been handled in this invocation of the do_softirq() function.
b. Resets the softirq_pending(cpu) field to zero, so that new activations can be detected.
c. Enables local interrupts.
d. Executes the softirq functions of all pending softirqs (inner loop).
e. Disables local interrupts.
f. Reloads the pending local variable with the contents of the softirq_pending(cpu) field. An interrupt handler, or even a softirq function, could have invoked cpu_raise_softirq() while softirq functions were executing.
g. Performs another iteration of the loop if a softirq that has not yet been handled in this invocation of do_softirq() has been activated.
8. Decrements the local_bh_count(cpu) field, thus re-enabling the softirqs.
9. Checks the value of the pending local variable. If it is not zero, a softirq that was already handled in this invocation of do_softirq() has been activated again. To trigger another execution of the do_softirq() function, the function wakes up the ksoftirqd_CPUn kernel thread.
10. Restores the status of the IF flag (local interrupts enabled or disabled) saved in Step 4 and returns.
In recent kernel versions, each CPU has its own ksoftirqd_CPUn kernel thread (where n is the logical number of the CPU). Each ksoftirqd_CPUn kernel thread runs the ksoftirqd() function, which essentially executes the following loop:
```c
for (;;) {
    set_current_state(TASK_INTERRUPTIBLE);
    schedule();
    /* now in TASK_RUNNING state */
    while (softirq_pending(cpu)) {
        do_softirq();
        if (current->need_resched)
            schedule();
    }
}
```
When awoken, the kernel thread checks the softirq_pending(n) field and invokes, if necessary, do_softirq().
The ksoftirqd_CPUn kernel threads represent a solution for a critical trade-off problem.
Softirq functions may re-activate themselves; actually, both the networking softirqs and the tasklet softirqs do this. Moreover, external events, like packet flooding on a network card, may activate softirqs at very high frequency.
The potential for a continuous high-volume flow of softirqs creates a problem that is solved by introducing kernel threads. Without them, developers are essentially faced with two alternative strategies.
The first strategy consists of ignoring new softirqs that occur while do_softirq() is running. In other words, the do_softirq() function determines what softirqs are pending when the function is started, and then executes their functions. Next, it terminates without rechecking the pending softirqs. This solution is not good enough. Suppose that a softirq function is re-activated during the execution of do_softirq(). In the worst case, the softirq is not executed again until the next timer interrupt, even if the machine is idle. As a result, softirq latency is unacceptable for networking developers.
The second strategy consists of continuously rechecking for pending softirqs. The do_softirq() function keeps checking the pending softirqs and terminates only when none of them is pending. While this solution might satisfy networking developers, it can certainly upset normal users of the system: if a high-frequency flow of packets is received by a network card or a softirq function keeps activating itself, the do_softirq() function never returns and the User Mode programs are virtually stopped.
The ksoftirqd_CPUn kernel threads try to solve this difficult trade-off problem. The do_softirq() function determines what softirqs are pending and executes their functions. If an already executed softirq is activated again, the function wakes up the kernel thread and terminates (Step 9 of do_softirq()). The kernel thread has low priority, so user programs have a chance to run; but if the machine is idle, the pending softirqs are executed quickly.
Tasklets are the preferred way to implement deferrable functions in I/O drivers. As already explained, tasklets are built on top of two softirqs named HI_SOFTIRQ and TASKLET_SOFTIRQ. Several tasklets may be associated with the same softirq, each tasklet carrying its own function. There is no real difference between the two softirqs, except that do_softirq() executes HI_SOFTIRQ's tasklets before TASKLET_SOFTIRQ's tasklets.
Tasklets and high-priority tasklets are stored in the tasklet_vec and tasklet_hi_vec arrays, respectively. Both of them include NR_CPUS elements of type tasklet_head, and each element consists of a pointer to a list of tasklet descriptors. The tasklet descriptor is a data structure of type tasklet_struct, whose fields are shown in Table 4-8.
Table 4-8. The fields of the tasklet descriptor

Field name | Description
---|---
next | Pointer to next descriptor in the list
state | Status of the tasklet
count | Lock counter
func | Pointer to the tasklet function
data | An unsigned long integer that may be used by the tasklet function
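The corresponding 2.4 declaration (include/linux/interrupt.h) is essentially:

```c
struct tasklet_struct
{
    struct tasklet_struct *next;  /* next descriptor in the list */
    unsigned long state;          /* TASKLET_STATE_SCHED / TASKLET_STATE_RUN flags */
    atomic_t count;               /* lock counter; zero means enabled */
    void (*func)(unsigned long);  /* the tasklet function */
    unsigned long data;           /* argument passed to func */
};
```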
The state field of the tasklet descriptor includes two flags:

- TASKLET_STATE_SCHED
When set, this indicates that the tasklet is pending (has been scheduled for execution); it also means that the tasklet descriptor is inserted in one of the lists of the tasklet_vec and tasklet_hi_vec arrays.
- TASKLET_STATE_RUN
When set, this indicates that the tasklet is being executed; on a uniprocessor system this flag is not used because there is no need to check whether a specific tasklet is running.
Let's suppose you're writing a device driver and you want to use a tasklet: what has to be done? First of all, you should allocate a new tasklet_struct data structure and initialize it by invoking tasklet_init(); this function receives as parameters the address of the tasklet descriptor, the address of your tasklet function, and its optional integer argument.
Your tasklet may be selectively disabled by invoking either tasklet_disable_nosync() or tasklet_disable(). Both functions increment the count field of the tasklet descriptor, but the latter function does not return until an already running instance of the tasklet function has terminated. To re-enable your tasklet, use tasklet_enable().
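Putting it all together, here is a minimal sketch of how a 2.4 driver might define, initialize, and activate a tasklet (activation via tasklet_schedule() is described next); all foo_* names are hypothetical:

```c
#include <linux/init.h>
#include <linux/interrupt.h>

/* Hypothetical per-device data structure. */
struct foo_dev {
    int irq;
    /* ... data shared between the interrupt handler and the tasklet ... */
};

static struct foo_dev foo_device;
static struct tasklet_struct foo_tasklet;

/* The tasklet function: runs with interrupts enabled and is never
 * executed on two CPUs at the same time. */
static void foo_tasklet_fn(unsigned long data)
{
    struct foo_dev *dev = (struct foo_dev *) data;
    /* ... deferred processing for dev ... */
}

static int __init foo_init(void)
{
    tasklet_init(&foo_tasklet, foo_tasklet_fn, (unsigned long) &foo_device);
    return 0;
}

/* The interrupt handler (2.4 prototype) defers the bulk of the work: */
static void foo_interrupt(int irq, void *dev_id, struct pt_regs *regs)
{
    /* ... acknowledge the device, fetch urgent data ... */
    tasklet_schedule(&foo_tasklet);
}
```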
To activate the tasklet, you should invoke either the tasklet_schedule() function or the tasklet_hi_schedule() function, according to the priority that you require for your tasklet. The two functions are very similar; each of them performs the following actions:

1. Checks the TASKLET_STATE_SCHED flag; if it is set, returns (the tasklet has already been scheduled).
2. Gets the logical number of the CPU that is executing the function.
3. Saves the state of the IF flag and clears it to disable local interrupts.
4. Adds the tasklet descriptor at the beginning of the list pointed to by tasklet_vec[cpu] or tasklet_hi_vec[cpu].
5. Invokes cpu_raise_softirq() to activate either the TASKLET_SOFTIRQ softirq or the HI_SOFTIRQ softirq.
6. Restores the value of the IF flag saved in Step 3 (local interrupts enabled or disabled).
Finally, let's see how your tasklet is executed. We know from the previous section that, once activated, softirq functions are executed by the do_softirq() function. The softirq function associated with the HI_SOFTIRQ softirq is named tasklet_hi_action(), while the function associated with TASKLET_SOFTIRQ is named tasklet_action(). Once again, the two functions are very similar; each of them:
1. Gets the logical number of the CPU that is executing the function.
2. Disables local interrupts, saving the previous state of the IF flag.
3. Stores the address of the list pointed to by tasklet_vec[cpu] or tasklet_hi_vec[cpu] in the list local variable.
4. Puts a NULL address in tasklet_vec[cpu] or tasklet_hi_vec[cpu]; thus, the list of scheduled tasklet descriptors is emptied.
5. Enables local interrupts.
6. For each tasklet descriptor in the list pointed to by list:
a. In multiprocessor systems, checks the TASKLET_STATE_RUN flag of the tasklet. If it is set, a tasklet of the same type is already running on another CPU, so the function reinserts the tasklet descriptor in the list pointed to by tasklet_vec[cpu] or tasklet_hi_vec[cpu] and activates the TASKLET_SOFTIRQ or HI_SOFTIRQ softirq again. In this way, execution of the tasklet is deferred while other tasklets of the same type are running on other CPUs.
b. If the TASKLET_STATE_RUN flag is not set, the tasklet is not running on other CPUs. In multiprocessor systems, the function sets the flag so that the tasklet function cannot be executed on other CPUs.
c. Checks whether the tasklet is disabled by looking at the count field of the tasklet descriptor. If it is disabled, the function reinserts the tasklet descriptor in the list pointed to by tasklet_vec[cpu] or tasklet_hi_vec[cpu] and activates the TASKLET_SOFTIRQ or HI_SOFTIRQ softirq again.
d. If the tasklet is enabled, clears the TASKLET_STATE_SCHED flag and executes the tasklet function.
Notice that, unless the tasklet function re-activates itself, every tasklet activation triggers at most one execution of the tasklet function.
A bottom half is essentially a high-priority tasklet that cannot be executed concurrently with any other bottom half, even if it is of a different type and on another CPU. The global_bh_lock spin lock is used to ensure that at most one bottom half is running.
Linux uses an array called the bh_base table to group all bottom halves together. It is an array of pointers to bottom halves and can include up to 32 entries, one for each type of bottom half. In practice, Linux uses about half of them; the types are listed in Table 4-9. As you can see from the table, some of the bottom halves are associated with hardware devices that are not necessarily installed in the system or that are specific to platforms other than the IBM PC compatible. But TIMER_BH, TQUEUE_BH, SERIAL_BH, and IMMEDIATE_BH still see widespread use. We describe the TQUEUE_BH and IMMEDIATE_BH bottom halves later in this chapter and the TIMER_BH bottom half in Chapter 6.
Table 4-9. The Linux bottom halves

Bottom half | Peripheral device
---|---
TIMER_BH | Timer
TQUEUE_BH | Periodic task queue
DIGI_BH | DigiBoard PC/Xe
SERIAL_BH | Serial port
RISCOM8_BH | RISCom/8
SPECIALIX_BH | Specialix IO8+
AURORA_BH | Aurora multiport card (SPARC)
ESP_BH | Hayes ESP serial card
SCSI_BH | SCSI interface
IMMEDIATE_BH | Immediate task queue
CYCLADES_BH | Cyclades Cyclom-Y serial multiport
CM206_BH | CD-ROM Philips/LMS cm206 disk
MACSERIAL_BH | Power Macintosh's serial port
ISICOM_BH | MultiTech's ISI cards
The bh_task_vec array stores 32 tasklet descriptors, one for each bottom half. During kernel initialization, these tasklet descriptors are initialized in the following way:

```c
for (i = 0; i < 32; ++i)
    tasklet_init(bh_task_vec + i, bh_action, i);
```
As usual, before a bottom half is invoked for the first time, it must be initialized. This is done by invoking init_bh(n, routine), which inserts the routine address as the n-th entry of bh_base. Conversely, remove_bh(n) removes the n-th bottom half from the table.
Bottom-half activation is done by invoking mark_bh(). Since bottom halves are high-priority tasklets, mark_bh(n) just reduces to tasklet_hi_schedule(bh_task_vec + n).
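In other words, the 2.4 definition is essentially:

```c
static inline void mark_bh(int nr)
{
    tasklet_hi_schedule(bh_task_vec + nr);
}
```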
The bh_action() function is the tasklet function common to all bottom halves. It receives as a parameter the index of the bottom half and performs the following steps:

1. Gets the logical number of the CPU executing the tasklet function.
2. Checks whether the global_bh_lock spin lock has already been acquired. If so, another CPU is running a bottom half; the function invokes mark_bh() to re-activate the bottom half and returns.
3. Otherwise, the function acquires the global_bh_lock spin lock so that no other bottom half can be executed in the system.
4. Checks that the local_irq_count field is set to zero (bottom halves are supposed to run outside interrupt service routines) and that global interrupts are enabled (see Chapter 5). If either condition doesn't hold, the function releases the global_bh_lock spin lock and terminates.
5. Invokes the bottom half function stored in the proper entry of the bh_base array.
6. Releases the global_bh_lock spin lock and returns.
The motivation for introducing deferrable functions is to allow a limited number of functions related to interrupt handling to be executed in a deferred manner. This approach has been stretched in two directions:

- To allow not only a function that services an interrupt, but also a generic kernel function to be executed as a bottom half
- To allow several kernel functions, instead of a single one, to be associated with a bottom half
Groups of functions are represented by task queues, which are lists of tq_struct structures whose fields are shown in Table 4-10.
Table 4-10. The fields of the tq_struct structure

Field name | Description
---|---
list | Links for doubly linked list
sync | Used to prevent multiple activations
routine | Function to call
data | Argument for the function
As we shall see in Chapter 13, I/O device drivers use task queues to request the execution of several related functions when a specific interrupt occurs.
The DECLARE_TASK_QUEUE macro allocates a new task queue, while queue_task() inserts a new function in a task queue. The run_task_queue() function executes all the functions included in a given task queue.
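As a minimal sketch of this API (all foo_* names are hypothetical), a driver could declare and run its own task queue like this; the labeled-field initializers are the GNU C style commonly used in 2.4-era code:

```c
#include <linux/tqueue.h>

static void foo_fn(void *data)
{
    /* ... deferred work ... */
}

/* Declare an empty task queue. */
static DECLARE_TASK_QUEUE(foo_queue);

static struct tq_struct foo_task = {
    routine: foo_fn,   /* function to call */
    data:    NULL,     /* its argument */
};

static void foo_defer(void)
{
    /* Insert the function in the queue; the sync field guarantees
     * that it is queued at most once until it runs. */
    queue_task(&foo_task, &foo_queue);
}

static void foo_flush(void)
{
    /* Execute (and remove) all functions queued so far. */
    run_task_queue(&foo_queue);
}
```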
It’s worth mentioning three particular task queues:
- The tq_immediate task queue, run by the IMMEDIATE_BH bottom half, includes kernel functions to be executed together with the standard bottom halves. The kernel invokes mark_bh() to activate the IMMEDIATE_BH bottom half whenever a function is added to the tq_immediate task queue. It is executed as soon as do_softirq() is invoked (see the usage sketch below).
- The tq_timer task queue is run by the TQUEUE_BH bottom half, which is activated at every timer interrupt. As we'll see in Chapter 6, that means it runs about every 10 ms.
- The tq_context task queue is not associated with a bottom half, but it is run by the keventd kernel thread. The schedule_task() function adds a function to the task queue; its execution is deferred until the scheduler selects the keventd kernel thread as the next process to run.

The main advantage of tq_context, with respect to the other task queues based on deferrable functions, is that its functions can freely perform blocking operations. On the other hand, softirqs (and therefore tasklets and bottom halves) are similar to interrupt handlers in that kernel developers cannot make any assumptions about the process that will execute the deferrable functions. From a practical point of view, this means that softirqs cannot perform blocking operations like accessing a file, acquiring a semaphore, or sleeping in a wait queue. The price to pay is that, once scheduled for execution in tq_context, a function might be delayed for quite a long time interval.
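For instance, here is a sketch of the canonical tq_immediate pattern in a 2.4 driver (the foo_* names are hypothetical):

```c
#include <linux/tqueue.h>
#include <linux/interrupt.h>

static void foo_deferred(void *data)
{
    /* ... work deferred from the interrupt handler ... */
}

static struct tq_struct foo_task = {
    routine: foo_deferred,
    data:    NULL,
};

/* Typically invoked from an interrupt handler: */
static void foo_defer_work(void)
{
    queue_task(&foo_task, &tq_immediate);
    mark_bh(IMMEDIATE_BH);   /* activate the IMMEDIATE_BH bottom half */
}
```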
[35] These are also called software interrupts, but we denote them as "deferrable functions" to avoid confusion with programmed exceptions, which are referred to as "software interrupts" in Intel manuals.
[36] Better names for these two macros could be local_softirq_disable and local_softirq_enable. The actual names are vestiges of old kernel versions.