We mentioned earlier in the section "Interrupt Handling" that several tasks among those executed by the kernel are not critical: they can be deferred for a long period of time, if necessary. Remember that the interrupt service routines of an interrupt handler are serialized, and often there should be no occurrence of an interrupt until the corresponding interrupt handler has terminated. Conversely, the deferrable tasks can execute with all interrupts enabled. Taking them out of the interrupt handler helps keep kernel response time small. This is a very important property for many time-critical applications that expect their interrupt requests to be serviced in a few milliseconds.
Linux 2.6 answers such a challenge by using two kinds of non-urgent interruptible kernel functions: the so-called deferrable functions [*] (softirqs and tasklets), and those executed by means of some work queues (we will describe them in the section "Work Queues" later in this chapter).
Softirqs and tasklets are strictly correlated, because tasklets are implemented on top of softirqs. As a matter of fact, the term “softirq,” which appears in the kernel source code, often denotes both kinds of deferrable functions. Another widely used term is interrupt context: it specifies that the kernel is currently executing either an interrupt handler or a deferrable function.
Softirqs are statically allocated (i.e., defined at compile time), while tasklets can also be allocated and initialized at runtime (for instance, when loading a kernel module). Softirqs can run concurrently on several CPUs, even if they are of the same type. Thus, softirqs are reentrant functions and must explicitly protect their data structures with spin locks. Tasklets do not have to worry about this, because their execution is controlled more strictly by the kernel. Tasklets of the same type are always serialized: in other words, the same type of tasklet cannot be executed by two CPUs at the same time. However, tasklets of different types can be executed concurrently on several CPUs. Serializing the tasklet simplifies the life of device driver developers, because the tasklet function need not be reentrant.
Generally speaking, four kinds of operations can be performed on deferrable functions:
- Initialization
Defines a new deferrable function; this operation is usually done when the kernel initializes itself or a module is loaded.
- Activation
Marks a deferrable function as “pending” — to be run the next time the kernel schedules a round of executions of deferrable functions. Activation can be done at any time (even while handling interrupts).
- Masking
Selectively disables a deferrable function so that it will not be executed by the kernel even if activated. We’ll see in the section "Disabling and Enabling Deferrable Functions" in Chapter 5 that disabling deferrable functions is sometimes essential.
- Execution
Executes a pending deferrable function together with all other pending deferrable functions of the same type; execution is performed at well-specified times, explained later in the section "Softirqs."
Activation and execution are bound together: a deferrable function that has been activated by a given CPU must be executed on the same CPU. There is no self-evident reason suggesting that this rule is beneficial for system performance. Binding the deferrable function to the activating CPU could in theory make better use of the CPU hardware cache. After all, it is conceivable that the activating kernel thread accesses some data structures that will also be used by the deferrable function. However, the relevant lines could easily be no longer in the cache when the deferrable function is run because its execution can be delayed a long time. Moreover, binding a function to a CPU is always a potentially “dangerous” operation, because one CPU might end up very busy while the others are mostly idle.
Linux 2.6 uses a limited number of softirqs. For most purposes, tasklets are good enough and are much easier to write because they do not need to be reentrant.
As a matter of fact, only the six kinds of softirqs listed in Table 4-9 are currently defined.
Table 4-9. Softirqs used in Linux 2.6
Softirq | Index (priority) | Description |
---|---|---|
HI_SOFTIRQ | 0 | Handles high priority tasklets |
TIMER_SOFTIRQ | 1 | Tasklets related to timer interrupts |
NET_TX_SOFTIRQ | 2 | Transmits packets to network cards |
NET_RX_SOFTIRQ | 3 | Receives packets from network cards |
SCSI_SOFTIRQ | 4 | Post-interrupt processing of SCSI commands |
TASKLET_SOFTIRQ | 5 | Handles regular tasklets |
The index of a softirq determines its priority: a lower index means higher priority because softirq functions will be executed starting from index 0.
The main data structure used to represent softirqs is the softirq_vec array, which includes 32 elements of type softirq_action. The priority of a softirq is the index of the corresponding softirq_action element inside the array. As shown in Table 4-9, only the first six entries of the array are effectively used. The softirq_action data structure consists of two fields: an action pointer to the softirq function and a data pointer to a generic data structure that may be needed by the softirq function.
Another critical field used to keep track both of kernel preemption and of nesting of kernel control paths is the 32-bit preempt_count field stored in the thread_info field of each process descriptor (see the section "Identifying a Process" in Chapter 3). This field encodes three distinct counters plus a flag, as shown in Table 4-10.
Table 4-10. Subfields of the preempt_count field
Bits | Description |
---|---|
0–7 | Preemption counter (max value = 255) |
8–15 | Softirq counter (max value = 255) |
16–27 | Hardirq counter (max value = 4095) |
28 | PREEMPT_ACTIVE flag |
The first counter keeps track of how many times kernel preemption has been explicitly disabled on the local CPU; the value zero means that kernel preemption has not been explicitly disabled at all. The second counter specifies how many levels deep the disabling of deferrable functions is (level 0 means that deferrable functions are enabled). The third counter specifies the number of nested interrupt handlers on the local CPU (the value is increased by irq_enter( ) and decreased by irq_exit( ); see the section "I/O Interrupt Handling" earlier in this chapter).
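The layout of the three counters can be illustrated with shifts and masks. The following is only a user-space sketch under stated assumptions: every name starting with PC_ or pc_ is hypothetical and is not the kernel's own macro set; only the bit positions and widths come from Table 4-10.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical names; positions and widths follow Table 4-10. */
#define PC_PREEMPT_SHIFT 0
#define PC_SOFTIRQ_SHIFT 8
#define PC_HARDIRQ_SHIFT 16
#define PC_PREEMPT_MASK  (0xffu  << PC_PREEMPT_SHIFT)  /* bits 0-7   */
#define PC_SOFTIRQ_MASK  (0xffu  << PC_SOFTIRQ_SHIFT)  /* bits 8-15  */
#define PC_HARDIRQ_MASK  (0xfffu << PC_HARDIRQ_SHIFT)  /* bits 16-27 */

/* Extract each counter from a 32-bit preempt_count-style value. */
static unsigned pc_preempt_count(uint32_t pc)
{
    return (pc & PC_PREEMPT_MASK) >> PC_PREEMPT_SHIFT;
}

static unsigned pc_softirq_count(uint32_t pc)
{
    return (pc & PC_SOFTIRQ_MASK) >> PC_SOFTIRQ_SHIFT;
}

static unsigned pc_hardirq_count(uint32_t pc)
{
    return (pc & PC_HARDIRQ_MASK) >> PC_HARDIRQ_SHIFT;
}

/* Build a value from three counters, checking that each fits its subfield. */
static uint32_t pc_pack(unsigned preempt, unsigned softirq, unsigned hardirq)
{
    assert(preempt <= 0xffu && softirq <= 0xffu && hardirq <= 0xfffu);
    return ((uint32_t)preempt << PC_PREEMPT_SHIFT) |
           ((uint32_t)softirq << PC_SOFTIRQ_SHIFT) |
           ((uint32_t)hardirq << PC_HARDIRQ_SHIFT);
}
```

For instance, pc_pack(2, 1, 3) describes a CPU that has disabled preemption twice, disabled deferrable functions once, and is three interrupt handlers deep; the three accessors recover those values independently.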
There is a good reason for the name of the preempt_count field: kernel preemptability has to be disabled either when it has been explicitly disabled by the kernel code (preemption counter not zero) or when the kernel is running in interrupt context. Thus, to determine whether the current process can be preempted, the kernel quickly checks for a zero value in the preempt_count field. Kernel preemption will be discussed in depth in the section "Kernel Preemption" in Chapter 5.
The in_interrupt( ) macro checks the hardirq and softirq counters in the current_thread_info( )->preempt_count field. If either one of these two counters is positive, the macro yields a nonzero value, otherwise it yields the value zero. If the kernel does not make use of multiple Kernel Mode stacks, the macro always looks at the preempt_count field of the thread_info descriptor of the current process. If, however, the kernel makes use of multiple Kernel Mode stacks, the macro might look at the preempt_count field in the thread_info descriptor contained in an irq_ctx union associated with the local CPU. In this case, the macro returns a nonzero value because the field is always set to a positive value.
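The difference between "cannot be preempted" and "in interrupt context" can be made concrete with a small simulation. This is a sketch under stated assumptions, not the kernel macro: a single global variable stands in for the preempt_count field, and all sim_ and SIM_ names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define SIM_SOFTIRQ_MASK   0x0000ff00u  /* bits 8-15  */
#define SIM_HARDIRQ_MASK   0x0fff0000u  /* bits 16-27 */
#define SIM_HARDIRQ_OFFSET 0x00010000u  /* one nested interrupt handler */

/* Stand-in for the preempt_count field of the current thread_info. */
static uint32_t sim_preempt_count;

/* Nonzero if either the hardirq or the softirq counter is positive. */
static uint32_t sim_in_interrupt(void)
{
    return sim_preempt_count & (SIM_HARDIRQ_MASK | SIM_SOFTIRQ_MASK);
}

/* The current process may be preempted only when the whole field is zero. */
static int sim_preemptible(void)
{
    return sim_preempt_count == 0;
}

/* Entering and leaving an interrupt handler adjust the hardirq counter. */
static void sim_irq_enter(void)
{
    sim_preempt_count += SIM_HARDIRQ_OFFSET;
}

static void sim_irq_exit(void)
{
    assert(sim_preempt_count & SIM_HARDIRQ_MASK);  /* must be inside a handler */
    sim_preempt_count -= SIM_HARDIRQ_OFFSET;
}
```

Note that a nonzero preemption counter alone makes sim_preemptible() fail without making sim_in_interrupt() succeed, which mirrors the distinction drawn in the text.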
The last crucial data structure for implementing the softirqs is a per-CPU 32-bit mask describing the pending softirqs; it is stored in the __softirq_pending field of the irq_cpustat_t data structure (recall that there is one such structure per CPU in the system; see Table 4-8). To get and set the value of the bit mask, the kernel makes use of the local_softirq_pending( ) macro that selects the softirq bit mask of the local CPU.
The open_softirq( ) function takes care of softirq initialization. It uses three parameters: the softirq index, a pointer to the softirq function to be executed, and a second pointer to a data structure that may be required by the softirq function. open_softirq( ) limits itself to initializing the proper entry of the softirq_vec array.
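A user-space sketch may help show how little initialization involves. It models the array and the three-parameter registration described above; the handler signature (a pointer to the softirq_action entry), the constant NR_SOFTIRQ_SLOTS, and the demo_action handler are assumptions made for illustration and should not be read as the kernel's exact declarations.

```c
#include <assert.h>
#include <stddef.h>

#define NR_SOFTIRQ_SLOTS 32  /* hypothetical name; the text gives 32 elements */

struct softirq_action {
    void (*action)(struct softirq_action *);  /* softirq function (assumed signature) */
    void *data;                               /* generic data for the function */
};

static struct softirq_action softirq_vec[NR_SOFTIRQ_SLOTS];

/* Initialization only fills in the array entry selected by the index. */
static void open_softirq(int nr, void (*action)(struct softirq_action *),
                         void *data)
{
    assert(nr >= 0 && nr < NR_SOFTIRQ_SLOTS);
    softirq_vec[nr].action = action;
    softirq_vec[nr].data = data;
}

/* Hypothetical handler: counts its own invocations through the data pointer. */
static void demo_action(struct softirq_action *a)
{
    int *counter = a->data;
    (*counter)++;
}
```

Registering demo_action at index 5 and then calling softirq_vec[5].action(&softirq_vec[5]) increments the counter that was passed as the data argument; the other entries stay empty.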
Softirqs are activated by means of the raise_softirq( ) function. This function, which receives as its parameter the softirq index nr, performs the following actions:
1. Executes the local_irq_save macro to save the state of the IF flag of the eflags register and to disable interrupts on the local CPU.
2. Marks the softirq as pending by setting the bit corresponding to the index nr in the softirq bit mask of the local CPU.
3. If in_interrupt() yields the value 1, it jumps to step 5. This situation indicates either that raise_softirq( ) has been invoked in interrupt context, or that the softirqs are currently disabled.
4. Otherwise, invokes wakeup_softirqd() to wake up, if necessary, the ksoftirqd kernel thread of the local CPU (see later).
5. Executes the local_irq_restore macro to restore the state of the IF flag saved in step 1.
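The activation logic can be mimicked in user space. In this sketch, offered under stated assumptions, plain variables stand in for the IF flag, the per-CPU pending mask, and preempt_count; waking the kernel thread is reduced to incrementing a counter. All sim_ names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

static uint32_t sim_pending;          /* pending mask of the simulated local CPU */
static uint32_t sim_preempt_count;    /* stand-in for preempt_count */
static int sim_irqs_enabled = 1;      /* stand-in for the IF flag */
static int sim_wakeups;               /* times the helper thread was woken */

/* Nonzero when the hardirq or softirq counter (bits 8-27) is positive. */
static int sim_in_interrupt(void)
{
    return (sim_preempt_count & 0x0fffff00u) != 0;
}

static void sim_raise_softirq(unsigned nr)
{
    assert(nr < 32);

    int saved = sim_irqs_enabled;     /* save the flag and disable interrupts */
    sim_irqs_enabled = 0;

    sim_pending |= 1u << nr;          /* mark the softirq as pending */

    if (!sim_in_interrupt())          /* not in interrupt context, softirqs enabled */
        sim_wakeups++;                /* stands in for waking the kernel thread */

    sim_irqs_enabled = saved;         /* restore the saved flag */
}
```

Raising a softirq from process context both sets the bit and requests a wakeup, whereas raising it while the hardirq counter is positive only sets the bit, since the pending mask will be checked anyway when the interrupt handler finishes.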
Checks for active (pending) softirqs should be performed periodically, but without inducing too much overhead. They are performed in a few points of the kernel code. Here is a list of the most significant points (be warned that number and position of the softirq checkpoints change both with the kernel version and with the supported hardware architecture):
- When the kernel invokes the local_bh_enable( ) function[*] to enable softirqs on the local CPU
- When the do_IRQ( ) function finishes handling an I/O interrupt and invokes the irq_exit( ) macro
- If the system uses an I/O APIC, when the smp_apic_timer_interrupt( ) function finishes handling a local timer interrupt (see the section "Timekeeping Architecture in Multiprocessor Systems" in Chapter 6)
- In multiprocessor systems, when a CPU finishes handling a function triggered by a CALL_FUNCTION_VECTOR interprocessor interrupt
- When one of the special ksoftirqd/n kernel threads is awakened (see later)
If pending softirqs are detected at one such checkpoint (local_softirq_pending() is not zero), the kernel invokes do_softirq( ) to take care of them. This function performs the following actions:
1. If in_interrupt( ) yields the value one, this function returns. This situation indicates either that do_softirq( ) has been invoked in interrupt context or that the softirqs are currently disabled.
2. Executes local_irq_save to save the state of the IF flag and to disable the interrupts on the local CPU.
3. If the size of the thread_union structure is 4 KB, it switches to the soft IRQ stack, if necessary. This step is very similar to step 2 of do_IRQ( ) in the earlier section "I/O Interrupt Handling"; of course, the softirq_ctx array is used instead of hardirq_ctx.
4. Invokes the __do_softirq( ) function (see the following section).
5. If the soft IRQ stack has been effectively switched in step 3 above, it restores the original stack pointer into the esp register, thus switching back to the exception stack that was in use before.
6. Executes local_irq_restore to restore the state of the IF flag (local interrupts enabled or disabled) saved in step 2 and returns.
The __do_softirq( ) function reads the softirq bit mask of the local CPU and executes the deferrable functions corresponding to every set bit. While executing a softirq function, new pending softirqs might pop up; in order to ensure a low latency time for the deferrable functions, __do_softirq( ) keeps running until all pending softirqs have been executed. This mechanism, however, could force __do_softirq( ) to run for long periods of time, thus considerably delaying User Mode processes. For that reason, __do_softirq( ) performs a fixed number of iterations and then returns. The remaining pending softirqs, if any, will be handled in due time by the ksoftirqd kernel thread described in the next section. Here is a short description of the actions performed by the function:
1. Initializes the iteration counter to 10.
2. Copies the softirq bit mask of the local CPU (selected by local_softirq_pending( )) in the pending local variable.
3. Invokes local_bh_disable( ) to increase the softirq counter. It is somewhat counterintuitive that deferrable functions should be disabled before starting to execute them, but it really makes a lot of sense. Because the deferrable functions mostly run with interrupts enabled, an interrupt can be raised in the middle of the __do_softirq( ) function. When do_IRQ( ) executes the irq_exit( ) macro, another instance of the __do_softirq( ) function could be started. This has to be avoided, because deferrable functions must execute serially on the CPU. Thus, the first instance of __do_softirq( ) disables deferrable functions, so that every new instance of the function will exit at step 1 of do_softirq( ).
4. Clears the softirq bitmap of the local CPU, so that new softirqs can be activated (the value of the bit mask has already been saved in the pending local variable in step 2).
5. Executes local_irq_enable( ) to enable local interrupts.
6. For each bit set in the pending local variable, it executes the corresponding softirq function; recall that the function address for the softirq with index n is stored in softirq_vec[n].action.
7. Executes local_irq_disable() to disable local interrupts.
8. Copies the softirq bit mask of the local CPU into the pending local variable and decreases the iteration counter by one.
9. If pending is not zero (at least one softirq has been activated since the start of the last iteration) and the iteration counter is still positive, it jumps back to step 4.
10. If there are more pending softirqs, it invokes wakeup_softirqd( ) to wake up the kernel thread that takes care of the softirqs for the local CPU (see next section).
11. Subtracts 1 from the softirq counter, thus reenabling the deferrable functions.
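The bounded-iteration idea is easier to see in a stripped-down simulation. The sketch below keeps only the bit-mask handling and the iteration budget; enabling and disabling interrupts and the softirq counter are omitted, and waking the kernel thread is reduced to a counter. Every sim_ and SIM_ name is hypothetical, and handlers take no argument for brevity.

```c
#include <assert.h>
#include <stdint.h>

#define SIM_MAX_RESTART 10          /* iteration budget, as in the text */
#define SIM_NR_SOFTIRQS 32

typedef void (*sim_handler_t)(void);

static sim_handler_t sim_vec[SIM_NR_SOFTIRQS];
static uint32_t sim_pending;        /* pending mask of the simulated CPU */
static int sim_wakeups;             /* times the helper thread was woken */
static int sim_runs;                /* executions of the demo handler */

static void sim_raise(unsigned nr)
{
    assert(nr < SIM_NR_SOFTIRQS);
    sim_pending |= 1u << nr;
}

static void sim_do_softirq_loop(void)
{
    int budget = SIM_MAX_RESTART;
    uint32_t pending = sim_pending;         /* snapshot of the mask */

    do {
        sim_pending = 0;                    /* allow new activations */

        /* Run handlers in index order, lowest index first. */
        for (unsigned nr = 0; pending; nr++, pending >>= 1)
            if ((pending & 1u) && sim_vec[nr])
                sim_vec[nr]();

        pending = sim_pending;              /* anything raised meanwhile? */
    } while (pending && --budget > 0);

    if (pending)                            /* budget exhausted: defer the rest */
        sim_wakeups++;
}

/* Hypothetical handler that always reactivates itself. */
static void sim_greedy(void)
{
    sim_runs++;
    sim_raise(3);
}
```

With sim_greedy installed at index 3 and raised once, the loop executes the handler ten times, gives up, and requests one wakeup while leaving bit 3 pending for the helper thread. A handler that does not reactivate itself is run once and no wakeup is requested.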
In recent kernel versions, each CPU has its own ksoftirqd/n kernel thread (where n is the logical number of the CPU). Each ksoftirqd/n kernel thread runs the ksoftirqd( ) function, which essentially executes the following loop:
    for (;;) {
        set_current_state(TASK_INTERRUPTIBLE);
        schedule( );
        /* now in TASK_RUNNING state */
        while (local_softirq_pending( )) {
            preempt_disable();
            do_softirq( );
            preempt_enable();
            cond_resched( );
        }
    }
When awakened, the kernel thread checks the local_softirq_pending() softirq bit mask and invokes, if necessary, do_softirq( ). After each invocation, it calls the cond_resched() function to perform a process switch if required by the current process (flag TIF_NEED_RESCHED of the current thread_info set). If there are no softirqs pending, the function puts the current process in the TASK_INTERRUPTIBLE state and invokes schedule( ), thus giving up the CPU until the thread is awakened again.
The ksoftirqd/n kernel threads represent a solution for a critical trade-off problem.
Softirq functions may reactivate themselves; in fact, both the networking softirqs and the tasklet softirqs do this. Moreover, external events, such as packet flooding on a network card, may activate softirqs at very high frequency.
The potential for a continuous high-volume flow of softirqs creates a problem that is solved by introducing kernel threads. Without them, developers are essentially faced with two alternative strategies.
The first strategy consists of ignoring new softirqs that occur while do_softirq( ) is running. In other words, the do_softirq( ) function could determine what softirqs are pending when the function is started and then execute their functions. Next, it would terminate without rechecking the pending softirqs. This solution is not good enough. Suppose that a softirq function is reactivated during the execution of do_softirq( ). In the worst case, the softirq is not executed again until the next timer interrupt, even if the machine is idle. As a result, softirq latency time is unacceptable for networking developers.
The second strategy consists of continuously rechecking for pending softirqs. The do_softirq( ) function could keep checking the pending softirqs and would terminate only when none of them is pending. While this solution might satisfy networking developers, it can certainly upset normal users of the system: if a high-frequency flow of packets is received by a network card or a softirq function keeps activating itself, the do_softirq( ) function never returns, and the User Mode programs are virtually stopped.
The ksoftirqd/n kernel threads try to solve this difficult trade-off problem. The do_softirq( ) function determines what softirqs are pending and executes their functions. After a few iterations, if the flow of softirqs does not stop, the function wakes up the kernel thread and terminates (step 10 of __do_softirq( )). The kernel thread has low priority, so user programs have a chance to run; but if the machine is idle, the pending softirqs are executed quickly.
Tasklets are the preferred way to implement deferrable functions in I/O drivers. As already explained, tasklets are built on top of two softirqs named HI_SOFTIRQ and TASKLET_SOFTIRQ. Several tasklets may be associated with the same softirq, each tasklet carrying its own function. There is no real difference between the two softirqs, except that do_softirq( ) executes HI_SOFTIRQ’s tasklets before TASKLET_SOFTIRQ’s tasklets.
Tasklets and high-priority tasklets are stored in the tasklet_vec and tasklet_hi_vec arrays, respectively. Both of them include NR_CPUS elements of type tasklet_head, and each element consists of a pointer to a list of tasklet descriptors. The tasklet descriptor is a data structure of type tasklet_struct, whose fields are shown in Table 4-11.
Table 4-11. The fields of the tasklet descriptor
Field name | Description |
---|---|
next | Pointer to next descriptor in the list |
state | Status of the tasklet |
count | Lock counter |
func | Pointer to the tasklet function |
data | An unsigned long integer that may be used by the tasklet function |
The state field of the tasklet descriptor includes two flags:
- TASKLET_STATE_SCHED
When set, this indicates that the tasklet is pending (has been scheduled for execution); it also means that the tasklet descriptor is inserted in one of the lists of the tasklet_vec and tasklet_hi_vec arrays.
- TASKLET_STATE_RUN
When set, this indicates that the tasklet is being executed; on a uniprocessor system this flag is not used because there is no need to check whether a specific tasklet is running.
Let’s suppose you’re writing a device driver and you want to use a tasklet: what has to be done? First of all, you should allocate a new tasklet_struct data structure and initialize it by invoking tasklet_init( ); this function receives as its parameters the address of the tasklet descriptor, the address of your tasklet function, and its optional integer argument.
The tasklet may be selectively disabled by invoking either tasklet_disable_nosync( ) or tasklet_disable( ). Both functions increase the count field of the tasklet descriptor, but the latter function does not return until an already running instance of the tasklet function has terminated. To reenable the tasklet, use tasklet_enable( ).
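The descriptor and the role of its count field can be modeled in a few lines of user-space C. This is a sketch under stated assumptions rather than the kernel's code: names prefixed with sim_ are hypothetical, the field types are simplified (the counter is a plain int here), the field names are taken from the kernel's tasklet_struct as best understood here, and the waiting behavior of tasklet_disable( ) is not modeled.

```c
#include <assert.h>
#include <stddef.h>

struct sim_tasklet {
    struct sim_tasklet *next;       /* next descriptor in the list */
    unsigned long state;            /* status flags */
    int count;                      /* lock counter: nonzero means disabled */
    void (*func)(unsigned long);    /* tasklet function */
    unsigned long data;             /* argument handed to the function */
};

/* Mirrors the three parameters of tasklet_init( ) described in the text. */
static void sim_tasklet_init(struct sim_tasklet *t,
                             void (*func)(unsigned long), unsigned long data)
{
    assert(t != NULL && func != NULL);
    t->next = NULL;
    t->state = 0;
    t->count = 0;
    t->func = func;
    t->data = data;
}

/* Disabling nests: each call must be balanced by one enable. */
static void sim_tasklet_disable(struct sim_tasklet *t)
{
    t->count++;
}

static void sim_tasklet_enable(struct sim_tasklet *t)
{
    assert(t->count > 0);
    t->count--;
}

static int sim_tasklet_is_enabled(const struct sim_tasklet *t)
{
    return t->count == 0;
}

/* Hypothetical tasklet function used only to have something to register. */
static void sim_demo_func(unsigned long data)
{
    (void)data;
}
```

Because the field is a counter rather than a flag, a tasklet disabled twice stays disabled after a single enable, which lets independent code paths disable it without coordinating with each other.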
To activate the tasklet, you should invoke either the tasklet_schedule( ) function or the tasklet_hi_schedule( ) function, according to the priority that you require for the tasklet. The two functions are very similar; each of them performs the following actions:
1. Checks the TASKLET_STATE_SCHED flag; if it is set, returns (the tasklet has already been scheduled).
2. Invokes local_irq_save to save the state of the IF flag and to disable local interrupts.
3. Adds the tasklet descriptor at the beginning of the list pointed to by tasklet_vec[n] or tasklet_hi_vec[n], where n denotes the logical number of the local CPU.
4. Invokes raise_softirq_irqoff( ) to activate either the TASKLET_SOFTIRQ or the HI_SOFTIRQ softirq (this function is similar to raise_softirq( ), except that it assumes that local interrupts are already disabled).
5. Invokes local_irq_restore to restore the state of the IF flag.
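A single-CPU, user-space sketch of the scheduling path follows. It is offered under stated assumptions: one list head stands in for the local CPU's entry in tasklet_vec, raising the softirq is reduced to a counter, saving and restoring the IF flag is omitted, and the flag is tested and set non-atomically. The sim_ names and the SIM_STATE_SCHED bit value are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define SIM_STATE_SCHED 0x1UL       /* hypothetical bit for "already pending" */

struct sim_tasklet {
    struct sim_tasklet *next;
    unsigned long state;
};

static struct sim_tasklet *sim_list_head;  /* list of the simulated local CPU */
static int sim_softirq_raised;             /* number of softirq activations */

static void sim_tasklet_schedule(struct sim_tasklet *t)
{
    assert(t != NULL);

    if (t->state & SIM_STATE_SCHED)   /* already scheduled: nothing to do */
        return;
    t->state |= SIM_STATE_SCHED;

    t->next = sim_list_head;          /* insert at the beginning of the list */
    sim_list_head = t;

    sim_softirq_raised++;             /* stands in for raising the softirq */
}
```

Scheduling the same descriptor twice before it runs therefore queues it only once, which is why a tasklet activation cannot pile up multiple pending executions.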
Finally, let’s see how the tasklet is executed. We know from the previous section that, once activated, softirq functions are executed by the do_softirq( ) function. The softirq function associated with the HI_SOFTIRQ softirq is named tasklet_hi_action( ), while the function associated with TASKLET_SOFTIRQ is named tasklet_action( ). Once again, the two functions are very similar; each of them:
1. Disables local interrupts.
2. Gets the logical number n of the local CPU.
3. Stores the address of the list pointed to by tasklet_vec[n] or tasklet_hi_vec[n] in the list local variable.
4. Puts a NULL address in tasklet_vec[n] or tasklet_hi_vec[n], thus emptying the list of scheduled tasklet descriptors.
5. Enables local interrupts.
6. For each tasklet descriptor in the list pointed to by list:
   a. In multiprocessor systems, checks the TASKLET_STATE_RUN flag of the tasklet.
      - If it is set, a tasklet of the same type is already running on another CPU, so the function reinserts the tasklet descriptor in the list pointed to by tasklet_vec[n] or tasklet_hi_vec[n] and activates the TASKLET_SOFTIRQ or HI_SOFTIRQ softirq again. In this way, execution of the tasklet is deferred until no other tasklets of the same type are running on other CPUs.
      - Otherwise, the tasklet is not running on another CPU: sets the flag so that the tasklet function cannot be executed on other CPUs.
   b. Checks whether the tasklet is disabled by looking at the count field of the tasklet descriptor. If the tasklet is disabled, it clears its TASKLET_STATE_RUN flag and reinserts the tasklet descriptor in the list pointed to by tasklet_vec[n] or tasklet_hi_vec[n]; then the function activates the TASKLET_SOFTIRQ or HI_SOFTIRQ softirq again.
   c. If the tasklet is enabled, it clears the TASKLET_STATE_SCHED flag and executes the tasklet function.
Notice that, unless the tasklet function reactivates itself, every tasklet activation triggers at most one execution of the tasklet function.
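The execution path can likewise be sketched in user space on a single simulated CPU. The assumptions are the same as before: all sim_ names and flag values are hypothetical, interrupt masking is omitted, flags are manipulated non-atomically, and clearing the running flag after the function returns is an assumption implied by, but not spelled out in, the description above.

```c
#include <assert.h>
#include <stddef.h>

#define SIM_STATE_SCHED 0x1UL       /* hypothetical flag values */
#define SIM_STATE_RUN   0x2UL

struct sim_tasklet {
    struct sim_tasklet *next;
    unsigned long state;
    int count;                      /* nonzero means disabled */
    void (*func)(unsigned long);
    unsigned long data;
};

static struct sim_tasklet *sim_list_head;  /* list of the simulated local CPU */
static int sim_softirq_raised;             /* softirq re-activations */
static unsigned long sim_sum;              /* side effect of the demo function */

/* Put a descriptor back on the list and ask for another softirq round. */
static void sim_requeue(struct sim_tasklet *t)
{
    t->next = sim_list_head;
    sim_list_head = t;
    sim_softirq_raised++;
}

static void sim_tasklet_action(void)
{
    struct sim_tasklet *list = sim_list_head;  /* take the whole list */
    sim_list_head = NULL;                      /* and leave it empty */

    while (list) {
        struct sim_tasklet *t = list;
        list = list->next;

        if (t->state & SIM_STATE_RUN) {        /* running elsewhere: retry later */
            sim_requeue(t);
            continue;
        }
        t->state |= SIM_STATE_RUN;

        if (t->count != 0) {                   /* disabled: retry later */
            t->state &= ~SIM_STATE_RUN;
            sim_requeue(t);
            continue;
        }

        t->state &= ~SIM_STATE_SCHED;          /* may be rescheduled from now on */
        assert(t->func != NULL);
        t->func(t->data);
        t->state &= ~SIM_STATE_RUN;            /* assumed: released after running */
    }
}

/* Hypothetical tasklet function. */
static void sim_add(unsigned long v)
{
    sim_sum += v;
}
```

Given one enabled and one disabled tasklet on the list, a single pass runs the first exactly once and puts the second back on the list with its scheduled flag still set, so it will be retried on the next round.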
[*] These are also called software interrupts, but we denote them as “deferrable functions” to avoid confusion with programmed exceptions, which are referred to as “software interrupts” in Intel manuals.
[*] The name local_bh_enable( ) refers to a special type of deferrable function called “bottom half” that no longer exists in Linux 2.6.