Even if you’ve used all the monitoring and debugging techniques, sometimes bugs remain in the driver and the system faults when the driver is executed. When this happens it’s important to be able to collect as much information as possible to solve the problem.
Note that ``fault'' doesn’t mean ``panic.'' The Linux code is robust enough to respond gracefully to most errors: a fault usually results in the destruction of the current process, while the system goes on working. The system can panic, and it may if a fault happens outside of a process’s context or if some vital part of the system is compromised. But when the problem is due to a driver error, it usually results only in the sudden death of the faulty process--the one using the driver. The only unrecoverable damage when a process is destroyed is that some memory allocated in the process’s context is lost; for instance, dynamic lists allocated by the driver through kmalloc. However, since the kernel calls the close operation for any open device anyway, your driver can release what was allocated by the open method.
We’ve already said that when kernel code misbehaves, an informative message is printed on the console. The next section explains how to decode and use such messages. Even though they appear rather obscure to the novice, processor dumps are full of interesting information, often sufficient to pinpoint a program bug without the need for additional testing.
Most bugs show themselves as NULL pointer dereferences or as the use of other incorrect pointer values. The usual outcome of such bugs is an oops message.
Any address used by the processor is a ``virtual'' address, mapped to a physical address through a complex structure of so-called page tables (see Section 13.1.1 in Chapter 13). When an invalid pointer is dereferenced, the paging mechanism fails to map it to a physical address, and the processor signals a ``page fault'' to the operating system. If the address is indeed invalid, the kernel has no way to ``page in'' the missing address; if this happens while the processor is in supervisor mode, the result is an ``oops.'' It’s interesting to note that the way the kernel deals with faults changed in version 2.1, so that it can handle references to invalid addresses while in supervisor mode. The new implementation is described in Section 17.10 in Chapter 17.
An oops displays the processor status at the time of the fault, including the contents of the CPU registers, the location of page descriptor tables, and other seemingly incomprehensible information. The message is generated by printk statements in the fault handler (arch/*/kernel/traps.c) and is dispatched as described earlier in Section 4.1.1.
Let’s look at one such message. Here’s how an oops appears on a conventional personal computer (an x86 platform), running Linux 2.0 or newer--version 1.2 has a slightly different layout.
    Unable to handle kernel paging request at virtual address c202e000
    current->tss.cr3 = 012c0000, %cr3 = 012c0000
    *pde = 00001067
    *pte = 00000000
    Oops: 0000
    CPU:    0
    EIP:    0010:[<0202d079>]
    EFLAGS: 00010216
    eax: 00000041   ebx: 00001000   ecx: 0000004b   edx: 0156b018
    esi: 0202e000   edi: 0800aed4   ebp: 01106f90   esp: 01106f6c
    ds: 0018   es: 002b   fs: 002b   gs: 002b   ss: 0018
    Process cat (pid: 597, process nr: 31, stackpage=01106000)
    Stack: 02020018 0202d004 00f992e8 01257440 0800a000 00001000 01257440
           00001000 00f992e8 00120e7a 00f992e8 01257440 0800a000 00001000
           0800a000 00001000 bffffbd8 0010a602 00000003 0800a000 00001000
    Call Trace: [<02020018>] [<0202d004>] [<00120e7a>] [<0010a602>]
    Code: f3 a5 83 e3 03 89 d9 f3 a4 07 8b 45 14 8d 65 f4 5b 5e 5f 89
The message above was generated by running cat on a device managed by the faulty module, a module built deliberately to demonstrate the error. The source file faulty.c includes the following code:
    char faulty_buf[1024];

    read_write_t faulty_read(struct inode *inode, struct file *filp,
                             char *buf, count_t count)
    {
        printk(KERN_DEBUG "read: inode %p, file %p, buf %p, count %li\n",
               inode, filp, buf, (long)count);
        memcpy_tofs(buf, faulty_buf, count);
        return count;
    }
Since read copies data to user space from its
small buffer (faulty_buf), we can expect reading the file in small pieces to work.
Reading more than one kilobyte at a time, on the other
hand, might cross a page boundary, and the read will fail if it
accesses an invalid page. Indeed, the oops shown earlier
happened during a
read call asking for 4 kilobytes, as shown by the line appearing
before the oops message in /var/log/messages
(the default file
where syslogd stores kernel messages):
read: inode 00f992e8, file 01257440, buf 0800a000, count 4096
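On a busy system the matching printk line may be buried among other log traffic; a grep and tail pull out the most recent report. The snippet below demonstrates the pattern on a small stand-in file, since the real path is /var/log/messages:

```shell
# A stand-in for /var/log/messages with two reports from the module.
cat > /tmp/messages.sample <<'EOF'
kernel: read: inode 00f992e8, file 01257440, buf 0800a000, count 1024
kernel: read: inode 00f992e8, file 01257440, buf 0800a000, count 4096
EOF

# Keep only the last "read:" report -- the one just before the oops.
grep 'read: inode' /tmp/messages.sample | tail -n 1
```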
The same cat command doesn’t generate an oops on an Alpha
because reading 4 kilobytes from faulty_buf
doesn’t cross the
page boundary (pages are 8KB on the Alpha and the buffer sits
near the beginning of the page). If reading faulty doesn’t generate
an oops message on your system, try using wc (word count) instead,
or specify an explicit block size to dd.
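Since dd issues one read call per input block, an explicit bs value forces a single large transfer. The snippet below demonstrates the access pattern on a scratch file so the record counts are visible; substitute your faulty device node for the real test (the /tmp paths are scratch names):

```shell
# Make a 4 KB scratch file, then read it back in one 4 KB block --
# the same single large read() that makes faulty_read cross a page
# boundary when run against the device itself.
dd if=/dev/zero of=/tmp/scratch bs=1k count=4 2>/dev/null
dd if=/tmp/scratch of=/dev/null bs=4k 2>&1 | head -n 1
```

With GNU dd, the first line of the transfer report shows that exactly one full 4 KB block was read.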
The main problem with oops messages is that the hex values are not meaningful for the programmer; they need to be resolved to symbols.
The kernel sources help the developer by including the ksymoops utility--but note that the program was missing from version 1.2 of the sources. The tool resolves numeric addresses found in the oops message to kernel symbols, but only for oops messages generated by PCs. Each architecture has its own message format, because the available information is processor-dependent.
ksymoops gets the oops message on standard
input and the name of the kernel symbol table on the command line. The
symbol table is usually /usr/src/linux/System.map. The
program prints the call trace and program code in a more readable
format than the original oops message. The following snapshot was
produced by feeding ksymoops the oops message shown in
the previous section:
    Trace: 2020018
    Trace: 202d004
    Trace: 120e7a <sys_read+8a/b0>
    Trace: 10a602 <system_call+52/80>
    Code:  repz movsl %ds:(%esi),%es:(%edi)
    Code:  andl $0x3,%ebx
    Code:  movl %ebx,%ecx
    Code:  repz movsb %ds:(%esi),%es:(%edi)
    Code:  popl %es
    Code:  movl 0x14(%ebp),%eax
    Code:  leal 0xfffffff4(%ebp),%esp
    Code:  popl %ebx
    Code:  popl %esi
    Code:  popl %edi
    Code:  movl %eax,(%eax)
The code disassembly produced by ksymoops shows the instruction
that failed and the following ones. It’s apparent here--for
those who know a little assembler--that the repz movsl
instruction (REPeat till cx is Zero, MOVe a String of Longs) hit an
unmapped page with the source index (esi, shown as 0x202e000).
The command ksyms -m, used to retrieve module information,
shows that the
module is mapped to a single page at 0x0202dxxx, thus confirming that esi is out of range.
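The mapping ksymoops performs is mechanical: for each numeric address, find the map symbol with the highest address not above it, and print the difference as an offset. A minimal shell sketch of the idea follows, using two map entries reconstructed from the decoded trace above (sys_read+8a at 120e7a puts sys_read at 00120df0; system_call+52 at 10a602 puts system_call at 0010a5b0):

```shell
# A two-line sample symbol table in System.map format
# (addresses reconstructed from the decoded call trace).
cat > /tmp/map.sample <<'EOF'
0010a5b0 T system_call
00120df0 T sys_read
EOF

# Resolve one address: pick the symbol with the largest address
# that is not greater than the address being looked up.
addr=$((0x00120e7a)); best=0; name=unknown
while read a t n; do
    a=$((0x$a))
    if [ "$a" -le "$addr" ] && [ "$a" -gt "$best" ]; then
        best=$a; name=$n
    fi
done < /tmp/map.sample
printf '%x <%s+%x>\n' "$addr" "$name" "$((addr - best))"
# -> 120e7a <sys_read+8a>
```

The real tool also prints the symbol’s total length after a slash (the /b0 part), which it derives from the distance to the next symbol in the map.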
The decoded call trace still includes two numeric addresses, because
the memory area occupied by the faulty module isn’t
described in the system map. The values can be supplemented manually,
either by inspecting the output of the ksyms command
or by grepping for the module name in /proc/ksyms.
For this particular fault, however, the two addresses
don’t correspond to code addresses. If you look in
arch/i386/kernel/traps.c
, you’ll find that the call trace is
extracted from the whole stack dump by using some heuristics to
look at the memory layout and distinguish between data values (local
variables and function arguments) and return addresses. Only
addresses that refer to kernel code and the ones that might
refer to modules are shown in the call trace. Since module pages contain both
code and data, extraneous stack frames
can slip through the heuristics, and this is exactly what
happened for the two 0x202xxxx addresses above.
If you’d rather not look for module addresses by hand, the following pipeline can be used to create a new symbol table that encompasses both kernel and module symbols. Whenever you reload the module, you must recreate this symbol table.
    cat /proc/ksyms /usr/src/linux/System.map | sed 's/ . / /' | \
        awk '{print $1,"T",$2}' | sort -u > /tmp/System.map
The pipeline combines the complete system map and the public
kernel symbols from /proc/ksyms; the latter file lists module
symbols in the current kernel, in addition to kernel symbols.
Such addresses are shown as they appear after insmod has relocated the
code. Since the two files have different formats, sed and
awk are used to convert all the lines into a suitable format.
The map is then sorted, removing duplicates, so that ksymoops can
use it.
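The conversion can be seen by pushing one sample line in each format through the pipeline. The module address below is illustrative; note also that a real /proc/ksyms entry for a module symbol carries a trailing bracketed module name, which the awk selection of the first two fields simply drops:

```shell
# One sample line in each of the two formats.
printf '0202d034 faulty_read\n'   > /tmp/ksyms.frag    # /proc/ksyms style
printf '0010a602 T system_call\n' > /tmp/map.frag      # System.map style

# sed strips the one-letter type field from System.map lines
# (/proc/ksyms lines don't match the pattern and pass through);
# awk then re-emits everything as "address T name".
cat /tmp/ksyms.frag /tmp/map.frag | sed 's/ . / /' | \
    awk '{print $1,"T",$2}' | sort -u
```

Both lines come out in the uniform "address T name" shape that ksymoops expects.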
If we rerun ksymoops, it extracts the following information from the new symbol table:
    >>EIP: 202d079 <faulty_read+45/60>
    Trace: 2020018 <M_wacom_proto+1fb8/d8e4>
    Trace: 202d004 <kmouse_wait+15d8/1608>
    Trace: 120e7a <sys_read+8a/b0>
    Trace: 10a602 <system_call+52/80>
    Code:  202d079 <faulty_read+45/60>  repz movsl %ds:(%esi),%es:(%edi)
    Code:  202d07b <faulty_read+47/60>  andl $0x3,%ebx
    Code:  202d07e <faulty_read+4a/60>  movl %ebx,%ecx
    Code:  202d080 <faulty_read+4c/60>  repz movsb %ds:(%esi),%es:(%edi)
    Code:  202d082 <faulty_read+4e/60>  popl %es
    Code:  202d083 <faulty_read+4f/60>  movl 0x14(%ebp),%eax
    Code:  202d086 <faulty_read+52/60>  leal 0xfffffff4(%ebp),%esp
    Code:  202d089 <faulty_read+55/60>  popl %ebx
    Code:  202d08a <faulty_read+56/60>  popl %esi
    Code:  202d08b <faulty_read+57/60>  popl %edi
    Code:  202d08c <faulty_read+58/60>  movl %eax,(%eax)
As you can see, creating a modified system map is quite helpful when tracing oops messages related to modules: ksymoops can now decode the instruction pointer and the complete call trace. Note also that the format used to show disassembled code is the same as that used by the objdump program. objdump is a powerful utility; if you want to look at instructions before the one that failed, you can invoke the command objdump -d faulty.o.
In the resulting assembly listing of the file, the string
faulty_read+45/60
marks the faulty line. For more information
on objdump and its command-line options, see the man page for the
command.
Even if you build your own modified symbol table, the concern mentioned
above regarding the call trace applies:
the 0x202xxxx pointers have been decoded, but are still spurious.
Learning to decode an oops message requires some practice, but it’s worth doing. The time spent learning will be quickly repaid. The only issue is where to get the relevant documentation about assembly language, because the Unix syntax for machine instructions is different from the Intel syntax; even if you know PC assembly language, your experience has probably been with Intel-syntax programs.
Using ksymoops is somewhat burdensome. You need the C++ compiler to build it, you must build your own symbol table to fully exploit the capabilities of the program, and you have to merge the original message and the ksymoops output to have all the information handy.
If you don’t want to go to all that trouble, you can use the
oops program. oops is provided in the source
files for this book on the
O’Reilly FTP site. It is derived from the original ksymoops
tool, which is no longer maintained by its author. oops is
written in C and looks in /proc/ksyms
without requiring the
user to build a new symbol table every time a module is loaded.
The program tries to decode all the processor registers and the
stack trace to symbolic values. Its disadvantage is that it is more
verbose than ksymoops, but usually the more information you
have available, the sooner you find the bug.
Like the original ksymoops, the program is able to decode oops messages generated on Intel platforms only. Porting to other platforms shouldn’t be difficult. The program is released under the GPL, like the kernel sources.
Output generated by oops is similar to the ksymoops output, but more complete. Here is the beginning of its output for the oops shown above--I don’t think it’s worth showing the entire stack trace because the stack holds nothing interesting in this particular oops message:
    EIP:    0010:0202d079 <faulty_read+45/60>
    EFLAGS: 00010216
    eax: 00000041   ebx: 00001000   ecx: 0000004b   edx: 0156b018
    esi: 0202e000   edi: 0800aed4   ebp: 01106f90 <%esp+24>   esp: 01106f6c <%esp+0>
    ds: 0018   es: 002b   fs: 002b   gs: 002b   ss: 0018
    Process cat (pid: 597, process nr: 31, stackpage=01106000)
    esp+00: 02020018 <M_wacom_proto+1fb8/d8e4>
    esp+04: 0202d004
    esp+08: 00f992e8
Having the registers and the stack decoded is helpful when you are
debugging ``real'' modules (faulty is too
short for the difference to be meaningful), and is
particularly useful if all the symbols of the module being debugged
are exported. It’s not unusual for processor registers to point to
symbols in the module at the time of the fault, and you can identify
them from the output
only if the symbol table is exported to /proc/ksyms.
We can make a complete symbol table available by taking the following steps. First, we won’t declare symbols static in the module, since static symbols aren’t exported by insmod. Second, we can compile the call to register_symtab conditionally on SCULL_DEBUG or a similar macro, as shown in the code below, extracted from scull’s init_module function.
    #ifndef SCULL_DEBUG
        register_symtab(NULL); /* otherwise, leave global symbols visible */
    #endif
We saw in Section 2.3.1 in Chapter 2 that if the module doesn’t register a symbol table, all of its global symbols are exported. Although this feature is exploited only if SCULL_DEBUG is active, all global symbols should be correctly prefixed to avoid namespace pollution in the kernel (see Section 2.1 in Chapter 2).
Recent versions of the klogd daemon can decode oops
messages before they reach the log files. Decoding is performed only
by version 1.3 or newer of the daemon and only if -k
/usr/src/linux/System.map is passed as a
command-line option to the daemon. (You can replace System.map
with
another map file.)
A dump of the oops for faulty, produced by the new klogd and written to the system log, looks like this (note the decoded symbols in the stack trace):
    EIP:    0010:[<0202d079>]
    EFLAGS: 00010216
    eax: 00000041   ebx: 00001000   ecx: 0000004b   edx: 00ee2414
    esi: 0202e000   edi: 0800aed4   ebp: 0032ff90   esp: 0032ff6c
    ds: 0018   es: 002b   fs: 002b   gs: 002b   ss: 0018
    Process cat (pid: 861, process nr: 10, stackpage=0032f000)
    Stack: 02020018 0202d004 00f992e8 01257c40 0800a000 00001000 01257c40
           00001000 00f992e8 00120e7a 00f992e8 01257c40 0800a000 00001000
           0800a000 00001000 bffffbd8 0010a602 00000003 0800a000 00001000
    Call Trace: [<02020018>] [<0202d004>] [sys_read+138/176] [system_call+82/128]
    Code: f3 a5 83 e3 03 89 d9 f3 a4 07 8b 45 14 8d 65 f4 5b 5e 5f 89
I consider the decoding klogd a great utility for the average Linux installation to help in debugging the kernel, but we’ll see how it is of less use for debugging modules. The message decoded by klogd includes most ksymoops features and doesn’t force the user to compile additional tools or merge two outputs in order to submit a complete bug report should something go wrong with the system. The daemon also correctly decodes the instruction pointer when the oops happens in the kernel proper. It doesn’t disassemble the program code, but this is not a problem when the message accompanies a bug report because the binary data is still there, and disassembled code can be generated offline.
Another great feature of the daemon is that it refuses to decode symbols if the symbol table doesn’t match the current kernel. If a symbol is decoded on the system log, you can be reasonably sure it is decoded correctly.
However, the tool, despite its usefulness for Linux users, is not that
helpful when debugging modules. I personally don’t use the decoding
options on the computers where I develop my software. The problem with
klogd is that it doesn’t decode symbols in modules; even reading
/proc/ksyms
doesn’t help, because the daemon is run before the
programmer loads the module. The presence of decoded
symbols in the log file, then, confuses both oops and
ksymoops, and it’s hard to perform additional decoding.
If you want to use klogd for debugging your modules, specific support is being added to the newest versions of the daemon, but as I’m writing this, it needs a small kernel patch to be effective.
Although most bugs in kernel code end up as oops messages, sometimes they can completely hang the system. If the system hangs, no message is printed. For example, if the code enters an endless loop, the kernel stops scheduling, and the system doesn’t respond to any action, including the magic Ctrl-Alt-Del combination. You have two choices to deal with system hangs--either you need to prevent them beforehand, or you need to be able to debug them after the fact.
You can prevent an endless loop by inserting schedule invocations at strategic points. The schedule call (as you might guess) invokes the scheduler and thus allows other processes to steal CPU time from the current one. If a process is looping in kernel space due to a bug in your driver, you’ll then be able to kill it after tracing what is happening.
Inserting schedule calls in a driver function creates a new ``problem'' for the programmer: the function, and all the functions in its call trace, must be reentrant. Under normal circumstances, the driver as a whole is reentrant because different processes can access the device concurrently, but it’s not necessary for each function to be reentrant, because the Linux kernel is not preemptible. But if a driver function allows the scheduler to interrupt the current process, a different process might enter the same function. The reentrancy issue is not really important if the schedule calls are enabled only during debugging, because you can avoid accessing the driver from two concurrent processes if you know you aren’t allowed to. When blocking operations are introduced (in Section 5.2.2 in Chapter 5), the reentrancy problem will be dealt with in more detail.
To debug infinite loops, we can make use of the special functions of the Linux keyboard. By default, the PrScr key (keycode 70), if pressed with a modifier key, prints to the current console useful information about the machine’s status. This works on both x86 and Alpha systems. The Sparc port of Linux features the same capability, but uses the key marked ``Break/Scroll Lock'' (keycode 30).
Each of the special functions has a name and is associated with a keypress event, as shown in the following list. The function name appears in parentheses after the key combination.
- Shift-PrScr (Show_Memory): Prints several lines of information about memory usage, particularly the use of the buffer cache.

- Control-PrScr (Show_State): Prints one line for each process in the system, with information about the internal process tree. The current process is marked as such.

- RightAlt-PrScr (Show_Registers): This is the most important key when the system hangs, because it dumps the contents of the processor registers at the time the key is hit. Looking at the instruction pointer and how it changes over time can be extremely useful in understanding where the code is looping, provided there exists a system map for the current kernel.
The name of each function can be passed to loadkeys in order to remap the binding to a different key. The keyboard map can be modified at will (it is ``policy-free'').
The messages printed by these functions
appear on the console if console_loglevel
is
high enough. The default level should be high enough, unless you run an old
klogd and a new kernel. If the messages don’t appear, you can
raise the loglevel as explained earlier. The definition of ``high enough''
depends on the kernel version you use. It’s 5 for Linux 2.0 and later.
It’s important to be sure the loglevel is high enough, because the messages will display on the console even when the computer is hung. The messages are generated at the time of the interrupt and therefore can slip through even if a faulty process is executing a tight loop without releasing the CPU--that is, unless interrupts are disabled, which is both unlikely and unlucky.
Sometimes the system may appear to be hung, but it isn’t. This can happen,
for example, if the keyboard remains locked in some strange way.
These false hangs can be detected by looking at the output of a
program you keep running for just this purpose. I have a program
that updates the clock on an LED display, and I discovered that the
program is also useful as evidence that the scheduler is still working.
You can check the scheduler without using external
hardware, by implementing a program that flashes
the keyboard LEDs, turns on the floppy motor every now and then, or
ticks the speaker--conventional beeps are quite annoying and should
be avoided, in my opinion. Look for the KDMKTONE
ioctl
command instead. A sample program (misc-progs/heartbeat.c)
that flashes a keyboard LED in a
heartbeat fashion is available in the sources on the O’Reilly FTP site.
If the keyboard isn’t accepting input, the best thing to do is log into the system through your network in order to kill any offending processes, or reset the keyboard (with kbd_mode -a). However, discovering that the hang is only a keyboard lockup is of little use if you don’t have a network available to help you recover. If this is the case, you should set up alternative input devices to be able at least to reboot the system cleanly. A shutdown and reboot cycle is easier on your computer than hitting the so-called ``big red button,'' and it saves you from the lengthy fsck scanning of your filesystems.
Such alternative input devices can be a joystick or the mouse. There is a joystick-reboot daemon on sunsite.unc.edu, and the gpm-1.10 or newer mouse server features a command-line option to enable a similar capability. If the keyboard is erroneously in ``raw'' mode instead of being locked, you can resort to the tricks described in the documentation of the kbd package. I suggest that you read the documentation before the problem arises and it’s too late. Another possibility is to configure the gpm-root menus to have a ``reboot'' or ``reset keyboard'' entry; gpm-root is a daemon that responds to control-mouse events in order to draw menus on the screen and perform configurable actions.
Finally, you can hit the ``Secure Attention
Key'' (SAK), a special key meant to recover the system to a
usable state. The current Linux versions don’t have an entry for the
key in the default keyboard map because the implementation is not
guaranteed always to succeed. You can nonetheless map SAK to
your keyboard by using loadkeys. You should also look at
the implementation
of SAK in the drivers/char
directory. The comments in
the code explain why the key doesn’t always work with Linux 2.0,
so I won’t say any more about it.
If you run 2.1.9 or newer, on the other hand, you’ll enjoy having a reliable Secure Attention Key. Moreover, 2.1.43 and newer kernels have a compile-time option to enable a ``Magic System Request Key''; I urge you to look in drivers/char/sysrq.c and enjoy the new technology.
If your driver really hangs the system, and you don’t know
where to insert schedule calls, the best way to go is to
add some print messages and write them to the console (by changing the
console_loglevel
value). It’s also wise to mount all your
disks read-only (or unmount them) before reproducing the hang. If the
disks are read-only or unmounted, there’s no risk of damaging the
filesystem or leaving it in an inconsistent state. At least you’ll
avoid the fsck pass after resetting the
computer. Another possibility is using an NFS-root computer to test
new modules. In this case you’ll avoid any filesystem corruption, as
filesystem coherence is managed by the NFS server, which is not
brought down by your device driver.