Even if you’ve used all the monitoring and debugging techniques, sometimes bugs remain in the driver, and the system faults when the driver is executed. When this happens it’s important to be able to collect as much information as possible to solve the problem.
Note that “fault” doesn’t mean “panic.” The Linux code is robust enough to respond gracefully to most errors: a fault usually results in the destruction of the current process while the system goes on working. The system can panic, and it may if a fault happens outside of a process’s context, or if some vital part of the system is compromised. But when the problem is due to a driver error, it usually results only in the sudden death of the process unlucky enough to be using the driver. The only unrecoverable damage when a process is destroyed is that some memory allocated to the process’s context is lost; for instance, dynamic lists allocated by the driver through kmalloc might be lost. However, since the kernel calls the close operation for any open device when a process dies, your driver can release what was allocated by the open method.
We’ve already said that when kernel code misbehaves, an informative message is printed on the console. The next section explains how to decode and use such messages. Even though they appear rather obscure to the novice, processor dumps are full of interesting information, often sufficient to pinpoint a program bug without the need for additional testing.
Most bugs show themselves in NULL
pointer
dereferences or by the use of other incorrect pointer values. The
usual outcome of such bugs is an oops message.
Any address used by the processor is a virtual address and is mapped to physical addresses through a complex structure of so-called page tables (see Section 13.1.4 in Chapter 13). When an invalid pointer is dereferenced, the paging mechanism fails to map the pointer to a physical address and the processor signals a page fault to the operating system. If the address is not valid, the kernel is not able to “page in” the missing address; it generates an oops if this happens while the processor is in supervisor mode.
It’s worth noting that the first enhancement introduced after version 2.0 was automatic handling of invalid address faults when moving data to and from user space. Linus chose to let the hardware catch erroneous memory references, so that the normal case (where the addresses are correct) is handled more efficiently.
An oops displays the processor status at the time of the fault,
including the contents of the CPU registers, the location of page
descriptor tables, and other seemingly incomprehensible
information. The message is generated by printk
statements in the fault handler
(arch/*/kernel/traps.c
) and is dispatched as
described earlier, in Section 4.1.1.
Let’s look at one such message. Here’s what results from dereferencing
a NULL
pointer on a PC running version 2.4 of the
kernel. The most relevant information here is the instruction pointer
(EIP
), the address of the faulty instruction.
Unable to handle kernel NULL pointer dereference at virtual address \ 00000000 printing eip: c48370c3 *pde = 00000000 Oops: 0002 CPU: 0 EIP: 0010:[<c48370c3>] EFLAGS: 00010286 eax: ffffffea ebx: c2281a20 ecx: c48370c0 edx: c2281a40 esi: 4000c000 edi: 4000c000 ebp: c38adf8c esp: c38adf8c ds: 0018 es: 0018 ss: 0018 Process ls (pid: 23171, stackpage=c38ad000) Stack: 0000010e c01356e6 c2281a20 4000c000 0000010e c2281a40 c38ac000 \ 0000010e 4000c000 bffffc1c 00000000 00000000 c38adfc4 c010b860 00000001 \ 4000c000 0000010e 0000010e 4000c000 bffffc1c 00000004 0000002b 0000002b \ 00000004 Call Trace: [<c01356e6>] [<c010b860>] Code: c7 05 00 00 00 00 00 00 00 00 31 c0 89 ec 5d c3 8d b6 00 00
This message was generated by writing to a device owned by the
faulty module, a module built deliberately
to demonstrate failures. The implementation of the
write method of faulty.c
is
trivial:
ssize_t faulty_write (struct file *filp, const char *buf, size_t count, loff_t *pos) { /* make a simple fault by dereferencing a NULL pointer */ *(int *)0 = 0; return 0; }
As you can see, what we do here is dereference a
NULL
pointer. Since 0 is never a valid pointer
value, a fault occurs, which the kernel turns into the oops message
shown earlier. The calling process is then killed.
The faulty module has more interesting fault conditions in its read implementation:
char faulty_buf[1024]; ssize_t faulty_read (struct file *filp, char *buf, size_t count, loff_t *pos) { int ret, ret2; char stack_buf[4]; printk(KERN_DEBUG "read: buf %p, count %li\n", buf, (long)count); /* the next line oopses with 2.0, but not with 2.2 and later */ ret = copy_to_user(buf, faulty_buf, count); if (!ret) return count; /* we survived */ printk(KERN_DEBUG "didn't fail: retry\n"); /* For 2.2 and 2.4, let's try a buffer overflow */ sprintf(stack_buf, "1234567\n"); if (count > 8) count = 8; /* copy 8 bytes to the user */ ret2 = copy_to_user(buf, stack_buf, count); if (!ret2) return count; return ret2; }
It first reads from a global buffer without checking the size of the
data, and then performs a buffer overrun by writing to a local buffer.
The first situation results in an oops only in version 2.0 of the
kernel, because later versions automatically deal with user copy
functions. The buffer overflow results in an oops with all kernel
versions; however, since the return
instruction
brings the instruction pointer to nowhere land, this kind of fault is
much harder to trace, and you can get something like the following:
EIP: 0010:[<00000000>] [...] Call Trace: [<c010b860>] Code: Bad EIP value.
The main problem with users dealing with oops messages is in the
little intrinsic meaning carried by hexadecimal values; to be
meaningful to the programmer they need to be resolved to symbols. A
couple of utilities are available to perform this resolution for
developers: klogd and
ksymoops. The former tool performs symbol
decoding by itself whenever it is running; the latter needs to be
purposely invoked by the user. In the following discussion we use the
data generated in our first oops example by dereferencing a
NULL
pointer.
The klogd daemon can decode oops messages before they reach the log files. In many situations, klogd can provide all the information a developer needs to track down a problem, though sometimes the developer must give it a little help.
A dump of the oops for faulty, as it
reaches the system log, looks like this (note the decoded symbols on
the EIP
line and in the stack trace):
Unable to handle kernel NULL pointer dereference at virtual address \ 00000000 printing eip: c48370c3 *pde = 00000000 Oops: 0002 CPU: 0 EIP: 0010:[faulty:faulty_write+3/576] EFLAGS: 00010286 eax: ffffffea ebx: c2c55ae0 ecx: c48370c0 edx: c2c55b00 esi: 0804d038 edi: 0804d038 ebp: c2337f8c esp: c2337f8c ds: 0018 es: 0018 ss: 0018 Process cat (pid: 23413, stackpage=c2337000) Stack: 00000001 c01356e6 c2c55ae0 0804d038 00000001 c2c55b00 c2336000 \ 00000001 0804d038 bffffbd4 00000000 00000000 bffffbd4 c010b860 00000001 \ 0804d038 00000001 00000001 0804d038 bffffbd4 00000004 0000002b 0000002b \ 00000004 Call Trace: [sys_write+214/256] [system_call+52/56] Code: c7 05 00 00 00 00 00 00 00 00 31 c0 89 ec 5d c3 8d b6 00 00
klogd provides most of the necessary
information to track down the problem. In this case we see that the
instruction pointer (EIP
) was executing in the
function faulty_write, so we know where to start
looking. The 3/576
string tells us that the
processor was at byte 3 of a function that appears to be 576 bytes
long. Note that the values are decimal, not hex.
The developer must exercise some care, however, to get useful
information for errors that occur within loadable modules.
klogd loads all of the available symbol
information when it starts, and uses those symbols thereafter. If you
load a module after klogd has initialized
itself (usually at system boot), klogd will
not have your module’s symbol information. To force
klogd to go out and get that information,
send the klogd process a
SIGUSR1
signal after your module has been loaded
(or reloaded), and before you do anything that could cause it to oops.
It is also possible to run klogd with the -p (“paranoid”) option, which will cause it to reread symbol information anytime it sees an oops message. The klogd manpage recommends against this mode of operation, however, since it makes klogd query the kernel for information after the problem has occurred. Information obtained after an error could be plain wrong.
For klogd to work properly, it must have a
current copy of the System.map
symbol table file.
Normally this file is found in /boot
; if you have
built and installed a kernel from a nonstandard location you may have
to copy System.map
into
/boot
, or tell klogd
to look elsewhere. klogd refuses to decode
symbols if the symbol table doesn’t match the current kernel. If a
symbol is decoded on the system log, you can be reasonably sure it is
decoded correctly.
At times klogd may not be enough for your tracing purposes. Usually, you need to get both the hexadecimal address and the associated symbol, and you often need offsets printed as hex numbers. You may need more information than address decoding. Also, it is common for klogd to get killed during the fault. In such situations, a stronger oops analyzer may be called for; ksymoops is such a tool.
Prior to the 2.3 development series, ksymoops was distributed with the kernel source, in the scripts directory. It now lives on its own FTP site and is maintained independently of the kernel. Even if you are working with an older kernel, you probably should go to ftp://ftp.ocs.com.au/pub/ksymoops and get an updated version of the tool.
To operate at its best, ksymoops needs a lot of information in addition to the error message; you can use command-line options to tell it where to find the various items. The program needs the following items:
- A System.map file
This map must correspond to the kernel that was running at the time the oops occurred. The default is
/usr/src/linux/System.map
.- A list of modules
ksymoops needs to know what modules were loaded when the oops occurred, in order to extract symbolic information from them. If you do not supply this list, ksymoops will look at
/proc/modules
.- A list of kernel symbols defined when the oops occurred
- A copy of the kernel image that was running
Note that ksymoops needs a straight kernel image, not the compressed version (vmlinuz, zImage, or bzImage) that most systems boot. The default is to use no kernel image because most people don’t keep it. If you have the exact image handy, you should tell the program where it is by using the -v option.
- The locations of the object files for any kernel modules that were loaded
ksymoops will look in the standard directories for modules, but during development you will almost certainly have to tell it where your module lives using the -o option
Although ksymoops will go to files in
/proc
for some of its needed information, the
results can be unreliable. The system, of course, will almost
certainly have been rebooted between the time the oops occurs and when
ksymoops is run, and the information from
/proc
may not match the state of affairs when the
failure occurred. When possible, it is better to save copies of
/proc/modules
and
/proc/ksyms
prior to causing the oops to
happen.
We urge driver developers to read the manual page for ksymoops because it is a very informative document.
The last argument on the tool’s command line is the location of the
oops message; if it is missing, the tool will read
stdin
in the best Unix tradition. The message can
be recovered from the system logs with luck; in the case of a very bad
crash you may end up writing it down off the screen and typing it back
in (unless you were using a serial console, a nice tool for kernel
developers).
Note that ksymoops will be confused by an oops message that has already been processed by klogd. If you are running klogd, and your system is still running after an oops occurs, a clean oops message can often be obtained by invoking the dmesg command.
If you do not provide all of the listed information explicitly, ksymoops will issue warnings. It will also issue warnings about things like loaded modules that define no symbols. A warning-free run of ksymoops is rare.
Output from ksymoops tends to look like the following:
>>EIP; c48370c3 <[faulty]faulty_write+3/20> <===== Trace; c01356e6 <sys_write+d6/100> Trace; c010b860 <system_call+34/38> Code; c48370c3 <[faulty]faulty_write+3/20> 00000000 <_EIP>: Code; c48370c3 <[faulty]faulty_write+3/20> <===== 0: c7 05 00 00 00 movl $0x0,0x0 <===== Code; c48370c8 <[faulty]faulty_write+8/20> 5: 00 00 00 00 00 Code; c48370cd <[faulty]faulty_write+d/20> a: 31 c0 xorl %eax,%eax Code; c48370cf <[faulty]faulty_write+f/20> c: 89 ec movl %ebp,%esp Code; c48370d1 <[faulty]faulty_write+11/20> e: 5d popl %ebp Code; c48370d2 <[faulty]faulty_write+12/20> f: c3 ret Code; c48370d3 <[faulty]faulty_write+13/20> 10: 8d b6 00 00 00 leal 0x0(%esi),%esi Code; c48370d8 <[faulty]faulty_write+18/20> 15: 00
As you can see, ksymoops provides
EIP
and kernel stack information much like
klogd does, but more precisely and in
hexadecimal. You’ll note that the faulty_write
function is correctly reported to be 0x20 bytes long. This is because
ksymoops reads the object file of your
module and extracts all available information.
In this case, moreover, you also get an assembly language dump of the code where the fault occurred. This information can often be used to figure out exactly what was happening; here it’s clearly an instruction that writes a 0 to address 0.
One interesting feature of ksymoops is that it is ported to nearly all the platforms where Linux runs and exploits the bfd (binary format description) library in order to support several computer architectures at the same time. To step outside of the PC world, let’s see how the same oops message appears on the SPARC64 platform (several lines have been broken for typographical needs):
Unable to handle kernel NULL pointer dereference tsk->mm->context = 0000000000000734 tsk->mm->pgd = fffff80003499000 \|/ ____ \|/ "@'/ .. \`@" /_| \__/ |_\ \__U_/ ls(16740): Oops TSTATE: 0000004400009601 TPC: 0000000001000128 TNPC: 0000000000457fbc \ Y: 00800000 g0: 000000007002ea88 g1: 0000000000000004 g2: 0000000070029fb0 \ g3: 0000000000000018 g4: fffff80000000000 g5: 0000000000000001 g6: fffff8000119c000 \ g7: 0000000000000001 o0: 0000000000000000 o1: 000000007001a000 o2: 0000000000000178 \ o3: fffff8001224f168 o4: 0000000001000120 o5: 0000000000000000 sp: fffff8000119f621 \ ret_pc: 0000000000457fb4 l0: fffff800122376c0 l1: ffffffffffffffea l2: 000000000002c400 \ l3: 000000000002c400 l4: 0000000000000000 l5: 0000000000000000 l6: 0000000000019c00 \ l7: 0000000070028cbc i0: fffff8001224f140 i1: 000000007001a000 i2: 0000000000000178 \ i3: 000000000002c400 i4: 000000000002c400 i5: 000000000002c000 i6: fffff8000119f6e1 \ i7: 0000000000410114 Caller[0000000000410114] Caller[000000007007cba4] Instruction DUMP: 01000000 90102000 81c3e008 <c0202000> \ 30680005 01000000 01000000 01000000 01000000
Note how the instruction dump doesn’t start from the instruction that caused the fault but three instructions earlier: that’s because the RISC platforms execute several instructions in parallel and may generate deferred exceptions, so one must be able to look back at the last few instructions.
This is what ksymoops prints when fed with
input data starting at the TSTATE
line:
>>TPC; 0000000001000128 <[faulty].text.start+88/a0> <===== >>O7; 0000000000457fb4 <sys_write+114/160> >>I7; 0000000000410114 <linux_sparc_syscall+34/40> Trace; 0000000000410114 <linux_sparc_syscall+34/40> Trace; 000000007007cba4 <END_OF_CODE+6f07c40d/????> Code; 000000000100011c <[faulty].text.start+7c/a0> 0000000000000000 <_TPC>: Code; 000000000100011c <[faulty].text.start+7c/a0> 0: 01 00 00 00 nop Code; 0000000001000120 <[faulty].text.start+80/a0> 4: 90 10 20 00 clr %o0 ! 0 <_TPC> Code; 0000000001000124 <[faulty].text.start+84/a0> 8: 81 c3 e0 08 retl Code; 0000000001000128 <[faulty].text.start+88/a0> <===== c: c0 20 20 00 clr [ %g0 ] <===== Code; 000000000100012c <[faulty].text.start+8c/a0> 10: 30 68 00 05 b,a %xcc, 24 <_TPC+0x24> \ 0000000001000140 <[faulty]faulty_write+0/20> Code; 0000000001000130 <[faulty].text.start+90/a0> 14: 01 00 00 00 nop Code; 0000000001000134 <[faulty].text.start+94/a0> 18: 01 00 00 00 nop Code; 0000000001000138 <[faulty].text.start+98/a0> 1c: 01 00 00 00 nop Code; 000000000100013c <[faulty].text.start+9c/a0> 20: 01 00 00 00 nop
To print the disassembled code shown we had to tell ksymoops the target file format and architecture (this is needed because the native architecture for SPARC64 user space is 32 bit). In this case, the options -t elf64-sparc -a sparc:v9 did the job.
You may complain that this call trace doesn’t carry any interesting
information; however, the SPARC processors don’t
save all the call trace on the stack: the O7
and
I7
registers hold the instruction pointers of the
last two calling functions, which is why they are shown near the call
trace. In this case, the faulty instruction was in a function invoked
by sys_write.
Note that, whatever the platform/architecture pair, the format used to show disassembled code is the same as that used by the objdump program. objdump is a powerful utility; if you want to look at the whole function that failed, you can invoke the command objdump -d faulty.o (once again, on SPARC64, you need special options: -- target elf64-sparc—architecture sparc:v9). For more information on objdump and its command-line options, see the manpage for the command.
Learning to decode an oops message requires some practice and an understanding of the target processor you are using, as well as of the conventions used to represent assembly language, but it’s worth doing. The time spent learning will be quickly repaid. Even if you have previous expertise with the PC assembly language under non-Unix operating systems, you may need to devote some time to learning, because the Unix syntax is different from Intel syntax. (A good description of the differences is in the Info documentation file for as, in the chapter called “i386-specific.”)
Although most bugs in kernel code end up as oops messages, sometimes they can completely hang the system. If the system hangs, no message is printed. For example, if the code enters an endless loop, the kernel stops scheduling, and the system doesn’t respond to any action, including the magic CTRL-ALT-DEL combination. You have two choices for dealing with system hangs—either prevent them beforehand or be able to debug them after the fact.
You can prevent an endless loop by inserting schedule invocations at strategic points. The schedule call (as you might guess) invokes the scheduler and thus allows other processes to steal CPU time from the current process. If a process is looping in kernel space due to a bug in your driver, the schedule calls enable you to kill the process, after tracing what is happening.
You should be aware, of course, that any call to schedule may create an additional source of reentrant calls to your driver, since it allows other processes to run. This reentrancy should not normally be a problem, assuming that you have used suitable locking in your driver. Be sure, however, not to call schedule any time that your driver is holding a spinlock.
If your driver really hangs the system, and you don’t know where to
insert schedule calls, the best way to go is to
add some print messages and write them to the console (by changing the
console_loglevel
value).
Sometimes the system may appear to be hung, but it isn’t. This can
happen, for example, if the keyboard remains locked in some strange
way. These false hangs can be detected by looking at the output of a
program you keep running for just this purpose. A clock or system
load meter on your display is a good status monitor; as long as it
continues to update, the scheduler is working. If you are not using a
graphic display, you can check the scheduler by running a program that
flashes the keyboard LEDs, turns on the floppy motor every now and
then, or ticks the speaker—conventional beeps are quite annoying
and should be avoided; look for the KDMKTONE
ioctl command instead. A sample program
(misc-progs/heartbeat.c
) that flashes a keyboard
LED in a heartbeat fashion is available in the sources on the O’Reilly
FTP site.
If the keyboard isn’t accepting input, the best thing to do is log into the system through your network and kill any offending processes, or reset the keyboard (with kbd_mode -a). However, discovering that the hang is only a keyboard lockup is of little use if you don’t have a network available to help you recover. If this is the case, you could set up alternative input devices to be able at least to reboot the system cleanly. A shutdown and reboot cycle is easier on your computer than hitting the so-called big red button, and it saves you from the lengthy fsck scanning of your disks.
Such an alternative input device can be, for example, the mouse. Version 1.10 or newer of the gpm mouse server features a command-line option to enable a similar capability, but it works only in text mode. If you don’t have a network connection and run in graphics mode, we suggest running some custom solution, like a switch connected to the DCD pin of the serial line and a script that polls for status change.
An indispensable tool for these situations is the “magic SysRq key,” which is available on more architectures in 2.2 and later kernels. Magic SysRq is invoked with the combination of the ALT and SysRq keys on the PC keyboard, or with the ALT and Stop keys on SPARC keyboards. A third key, pressed along with these two, performs one of a number of useful actions, as follows:
-
r
Turns off keyboard raw mode in situations where you cannot run kbd_mode.
-
k
Invokes the “secure attention” (SAK) function. SAK will kill all processes running on the current console, leaving you with a clean terminal.
-
s
Performs an emergency synchronization of all disks.
-
u
Attempts to remount all disks in a read-only mode. This operation, usually invoked immediately after s, can save a lot of filesystem checking time in cases where the system is in serious trouble.
-
b
Immediately reboots the system. Be sure to synchronize and remount the disks first.
-
p
Prints the current register information.
-
t
Prints the current task list.
-
m
Prints memory information.
Other magic SysRq functions exist; see sysrq.txt in the Documentation directory of the kernel source for the full list. Note that magic SysRq must be explicitly enabled in the kernel configuration, and that most distributions do not enable it, for obvious security reasons. For a system used to develop drivers, however, enabling magic SysRq is worth the trouble of building a new kernel in itself. Magic SysRq must be enabled at runtime with a command like the following:
echo 1 > /proc/sys/kernel/sysrq
Another precaution to use when reproducing system hangs is to mount all your disks as read-only (or unmount them). If the disks are read-only or unmounted, there’s no risk of damaging the filesystem or leaving it in an inconsistent state. Another possibility is using a computer that mounts all of its filesystems via NFS, the network file system. The “NFS-Root” capability must be enabled in the kernel, and special parameters must be passed at boot time. In this case you’ll avoid any filesystem corruption without even resorting to SysRq, because filesystem coherence is managed by the NFS server, which is not brought down by your device driver.
Get Linux Device Drivers, Second Edition now with the O’Reilly learning platform.
O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.