Understanding Linux Network Internals
Understanding Linux Network Internals By Christian Benvenuti
December 2005
Pages: 1062

Table of Contents

Chapter 1: Introduction
To do research in the source code of a large project is to enter a strange, new land with its own customs and unspoken expectations. It is useful to learn some of the major conventions up front, and to try interacting with the inhabitants instead of merely standing back and observing.
The bulk of this chapter is devoted to introducing you to a few of the common programming patterns and tricks that you'll often meet in the networking code.
I encourage you, when possible, to try interacting with a given part of the kernel networking code by means of user-space tools. So in this chapter, I'll give you a few pointers as to where you can download those tools if they're not already installed on your preferred Linux distribution, or if you simply want to upgrade them to the latest versions.
I'll also describe some tools that let you find your way gracefully through the enormous kernel code. Finally, I'll explain briefly why a kernel feature may not be integrated into the official kernel releases, even if it is widely used in the Linux community.
Basic Terminology
In this section, I'll introduce terms and abbreviations that are going to be used extensively in this book.
Eight-bit quantities are normally called octets in the networking literature. In this book, however, I use the more familiar term byte. After all, the book describes the behavior of the kernel rather than some network abstraction, and kernel developers are used to thinking in terms of bytes.
The terms vector and array will be used interchangeably.
When referring to the layers of the TCP/IP network stack, I will use the abbreviations L2, L3, and L4 to refer to the link, network, and transport layers, respectively. The numbers are based on the famous (if not exactly current) seven-layer OSI model. In most cases, L2 will be a synonym for Ethernet, L3 for IP Version 4 or 6, and L4 for UDP, TCP, or ICMP. When I need to refer to a specific protocol, I'll use its name (e.g., TCP) rather than the generic Ln protocol term.
In different chapters, we will see how data units are received and transmitted by the protocols that sit at a given layer in the network stack. In those contexts, the terms ingress and input will be used interchangeably. The same applies to egress and output. The action of receiving or transmitting a data unit may be referred to with the abbreviations RX and TX, respectively.
A data unit is given different names, such as frame, packet, segment, and message, depending on the layer where it is used (see Chapter 13 for more details). Table 1-1 summarizes the major abbreviations you'll see in the book.
Table 1-1: Abbreviations used frequently in this book
Abbreviation    Meaning
L2              Link layer (e.g., Ethernet)
Common Coding Patterns
Each networking feature, like any other kernel feature, is just one of the citizens inside the kernel. As such, it must make proper and fair use of memory, CPU, and all other shared resources. Most features are not written as standalone pieces of kernel code, but interact with other kernel components more or less heavily depending on the feature. They therefore try, as much as possible, to follow similar mechanisms to implement similar functionalities (there is no need to reinvent the wheel every time).
Some requirements are common to several kernel components, such as the need to allocate several instances of the same data structure type, the need to keep track of references to an instance of a data structure to avoid unsafe memory deallocations, etc. In the following subsections, we will view common ways in Linux to handle such requirements. I will also talk about common coding tricks that you may come across while browsing the kernel's code.
This book uses subsystem as a loose term to describe a collection of files that implement a major set of features—such as IP or routing—and that tend to be maintained by the same people and to change in lockstep. In the rest of the chapter, I'll also use the term kernel component to refer to these subsystems, because the conventions discussed here apply to most parts of the kernel, not just those involved in networking.
The kernel uses the kmalloc and kfree functions to allocate and free a memory block, respectively. The syntax of those two functions is similar to that of the two sister calls, malloc and free, from the libc user-space library. For more details on kmalloc and kfree, please refer to Linux Device Drivers (O'Reilly).
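The two requirements mentioned above, allocating instances of a structure and tracking references so that memory is not freed while still in use, can be sketched in user space. This is only an analogy built on malloc and free: kernel code would use kmalloc and kfree instead, and the toy_obj type and the toy_obj_alloc, toy_obj_hold, and toy_obj_put helpers are hypothetical names invented for this illustration, not kernel APIs.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical object type used only for this sketch; not a kernel structure. */
struct toy_obj {
    int refcnt;   /* number of outstanding references */
    int value;    /* placeholder payload */
};

/* Allocate one instance; the caller holds the first reference. */
static struct toy_obj *toy_obj_alloc(int value)
{
    struct toy_obj *obj = malloc(sizeof(*obj));

    if (obj == NULL)
        return NULL;
    obj->refcnt = 1;
    obj->value = value;
    return obj;
}

/* Take one more reference to an object that is still alive. */
static void toy_obj_hold(struct toy_obj *obj)
{
    assert(obj->refcnt > 0);
    obj->refcnt++;
}

/* Drop one reference; free the object when the last one goes away.
 * Returns 1 if the object was freed, 0 otherwise. */
static int toy_obj_put(struct toy_obj *obj)
{
    assert(obj->refcnt > 0);
    if (--obj->refcnt == 0) {
        free(obj);
        return 1;
    }
    return 0;
}
```

The point of the pattern is that no code path calls free directly: each user only releases its own reference, and the memory is returned when the count reaches zero.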
It is common for a kernel component to allocate several instances of the same data structure type. When allocation and deallocation are expected to happen often, the associated kernel component initialization routine (for example,
User-Space Tools
Different tools can be used to configure the many networking features available on Linux. As mentioned at the beginning of the chapter, you can make thoughtful use of these tools to manipulate the kernel for learning purposes and to discover the effects of these changes.
The following tools are the ones I will refer to often in this book:
iputils
Besides the perennial command ping, iputils includes arping (used to generate ARP requests), the Network Router Discovery daemon rdisc, and others.
net-tools
This is a suite of networking tools, where you can find the well-known ifconfig, route, netstat, and arp, but also ipmaddr, iptunnel, ether-wake, netplugd, etc.
IPROUTE2
This is the new-generation networking configuration suite (although it has been around for a few years already). Through an omnibus command named ip, the suite can be used to configure IP addresses and routing along with all of its advanced features, neighboring protocols, etc.
IPROUTE2's source code can be downloaded from http://linux-net.osdl.org/index.php/Iproute2, and the other packages can be downloaded from the download server of most Linux distributions.
These packages are included by default on most (if not all) Linux distributions. Whenever you do not understand how the kernel code processes a command from user space, I encourage you to look at the user-space tool source code and see how the command from the user is packaged and sent to the kernel.
At the following URLs, you can find good documentation on how to use the aforementioned tools, including active mailing lists:
  • http://lartc.org
  • http://www.policyrouting.org
  • http://www.netfilter.org
If you want to follow the latest changes in the networking code, keep an eye on the following mailing list:
  • The Linux Network Development List Archives (
Browsing the Source Code
The Linux kernel has gotten pretty big, and browsing the code with our old friend grep is definitely not a good idea anymore. Nowadays you can count on different pieces of software to make your journey into the kernel code a better experience.
One that I would like to suggest to those that do not know it already is cscope, which you can download from http://cscope.sourceforge.net. It is a simple yet powerful tool for searching, for example, where a function or variable is defined, where it is called, etc. Installing the tool is straightforward and you can find all the necessary instructions on the web site.
Each of us has a preferred editor, and probably the majority of us are fans of some form of either Emacs or vi. Both editors can use a special file, called a "tags" file, that lets the user move through source code. (cscope uses a similar database file.) You can easily create such files with a target of the same name in the makefile at the root of the kernel tree. The three databases (TAGS, tags, and cscope.out) are created, respectively, with make TAGS, make tags, and make cscope.
Be aware that those files are pretty big, especially the one used by cscope. Therefore, make sure before building the file that you have a lot of free disk space.
If you are already using other source navigation tools, fine. But if you are not using any and have been lazy so far, it is time to say goodbye to grep and invest 15 minutes in learning how to use the aforementioned tools—they are well worth it.
The kernel, like any other large and dynamic piece of software, includes pieces of code that are no longer invoked. Unfortunately, you rarely see comments in the code that tell you this. You may sometimes find yourself having trouble trying to understand how a given function is used or a given variable is initialized simply because you are looking at dead code. If you are lucky, that code does not compile and you can guess its out-of-date status. Other times you may not be that lucky.
When a Feature Is Offered as a Patch
The kernel networking code is continuously evolving. Not only does it integrate new features, but existing components sometimes undergo design changes to achieve more modularity and higher performance. This obviously makes Linux very attractive as an embedded operating system for network appliance products (routers, switches, firewalls, load balancers, etc.).
Because anyone can develop a new feature for the Linux kernel, or extend or reimplement an existing one, the greatest thrill for any "open" developer is to see her work make it to the official kernel release. Sometimes, however, that is not possible or it may take a long time, even when a project has valuable features and is well implemented. Common reasons include:
  • The code may not have been written following the guidelines in Documentation/CodingStyle.
  • Another major project that provides the same functionality has been around for some time and has already received the green light from the Linux community and from the key kernel developers that maintain the associated kernel area.
  • There is too much overlap with another kernel component. In a case like this, the best approach is to remove the redundant functionality and use existing functionality where possible, or to extend the latter so that it can be used in new contexts. This situation underlines the importance of modularity.
  • The size of the project and the amount of work required to maintain it in a quick-changing kernel may lead the new project's developers to keep it as a separate patch and release a new version only once in a while.
  • The feature would be used only in very specific scenarios, considered not necessary in a general-purpose operating system. In this case, a separate patch is often the best solution.
  • The overall design may not satisfy some key kernel developers. These experts usually have the big picture in mind, concerning both where the kernel is and where it is going. Often, they request design changes to make a feature fit into the kernel the right way.
Chapter 2: Critical Data Structures
A few key data structures are referenced throughout the Linux networking code. Both when reading this book and when studying the source code directly, you'll need to understand the fields in these data structures. To be sure, going over data structures field by field is less fun than unraveling functions, but it's an important foundation to have. "Show me your data," said the legendary software engineer, Frederick P. Brooks.
This chapter introduces the following data structures, and mentions some of the functions and macros that manipulate them:
struct sk_buff
This is where a packet is stored. The structure is used by all the network layers to store their headers, information about the user data (the payload), and other information needed internally for coordinating their work.
struct net_device
Each network device is represented in the Linux kernel by this data structure, which contains information about both its hardware and its software configuration. See Chapter 8 for details on when and how net_device data structures are allocated.
Another critical data structure for Linux networking is struct sock, which stores the networking information for sockets. Because this book does not cover sockets, I have not included sock in this chapter.
The Socket Buffer: sk_buff Structure
This is probably the most important data structure in the Linux networking code, representing the headers for data that has been received or is about to be transmitted. Defined in the <include/linux/skbuff.h> include file, it consists of a tremendous heap of variables that try to be all things to all people.
The structure has changed many times in the history of the kernel, both to add new options and to reorganize existing fields into a cleaner layout. Its fields can be classified roughly into the following categories:
  • Layout
  • General
  • Feature-specific
  • Management functions
This structure is used by several different network layers (MAC or another link protocol on the L2 layer, IP on L3, TCP or UDP on L4), and various fields of the structure change as it is passed from one layer to another. L4 appends a header before passing it to L3, which in turn puts on its own header before passing it to L2. Appending headers is more efficient than copying the data from one layer to another. Since adding space to the beginning of a buffer—which means changing the variable that points to it—is a complicated operation, the kernel provides the skb_reserve function (described later in this chapter) to carry it out. Thus, one of the first things done by each protocol, as the buffer passes down through layers, is to call skb_reserve to reserve space for the protocol's header. In the later section "Data reservation and alignment: skb_reserve, skb_put, skb_push, and skb_pull," we will see an example of how the kernel makes sure enough space is reserved at the head of the buffer to allow each layer to add its own header while the buffer traverses the layers.
When the buffer passes up through the network layers, each header from the old layer is no longer of interest. The L2 header, for instance, is used only by the device drivers that handle the L2 protocol, so it is of no interest to L3. Instead of removing the L2 header from the buffer, the pointer to the beginning of the payload is moved ahead to the beginning of the L3 header, which requires fewer CPU cycles.
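The pointer manipulation described above can be illustrated with a small user-space model. This is a sketch under stated assumptions, not the kernel's sk_buff code: the toy_buf structure, its field names, and the toy_* helpers are hypothetical and only mimic the roles the text assigns to skb_reserve, skb_put, skb_push, and skb_pull. The model reserves all the headroom at once before any data is added.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_BUF_SIZE 256

/* Hypothetical stand-in for a packet buffer with four boundary pointers. */
struct toy_buf {
    unsigned char storage[TOY_BUF_SIZE];
    unsigned char *head;   /* start of the allocated area */
    unsigned char *data;   /* start of the bytes currently in use */
    unsigned char *tail;   /* end of the bytes currently in use */
    unsigned char *end;    /* end of the allocated area */
};

static void toy_init(struct toy_buf *b)
{
    b->head = b->data = b->tail = b->storage;
    b->end = b->storage + TOY_BUF_SIZE;
}

/* Bytes currently in use (headers plus payload). */
static size_t toy_len(const struct toy_buf *b)
{
    return (size_t)(b->tail - b->data);
}

/* Reserve headroom in an empty buffer by moving data and tail forward. */
static void toy_reserve(struct toy_buf *b, size_t len)
{
    assert(b->data == b->tail);
    assert((size_t)(b->end - b->tail) >= len);
    b->data += len;
    b->tail += len;
}

/* Grow the used area at the tail; returns where the new bytes start. */
static unsigned char *toy_put(struct toy_buf *b, size_t len)
{
    unsigned char *old_tail = b->tail;

    assert((size_t)(b->end - b->tail) >= len);
    b->tail += len;
    return old_tail;
}

/* Going down the stack: claim headroom in front of the data for a header. */
static unsigned char *toy_push(struct toy_buf *b, size_t len)
{
    assert((size_t)(b->data - b->head) >= len);
    b->data -= len;
    return b->data;
}

/* Going up the stack: skip a header by moving the data pointer ahead. */
static unsigned char *toy_pull(struct toy_buf *b, size_t len)
{
    assert(toy_len(b) >= len);
    b->data += len;
    return b->data;
}
```

A sender could call toy_reserve with 54 bytes (a 14-byte Ethernet header plus 20-byte IPv4 and TCP headers without options), add the payload with toy_put, and then let each layer call toy_push for its own header. On the receive side, toy_pull skips the L2 header without copying or erasing anything, which is the saving in CPU cycles the text refers to.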
net_device Structure
The net_device data structure stores all information specifically regarding a network device. There is one such structure for each device, both real ones (such as Ethernet NICs) and virtual ones (such as bonding or VLAN). In this section, I will use the words interface and device interchangeably, even though the difference between them is important in other contexts.
The net_device structures for all devices are put into a global list to which the global variable dev_base points. The data structure is defined in include/linux/netdevice.h. The registration of network devices is described in Chapter 8. In that chapter, you can find details on how and when most of the net_device fields are initialized.
Like sk_buff, this structure is quite big and includes many feature-specific parameters, along with parameters from many different layers. For this reason, the overall organization of the structure will probably see some changes soon for optimization reasons.
Network devices can be classified into types such as Ethernet cards and Token Ring cards. While certain fields of the net_device structure are set to the same value for all devices of the same type, some fields must be set differently by each model of device. Thus, for almost every type, Linux provides a general function that initializes the parameters whose values stay the same across all models. Each device driver invokes this function in addition to setting those fields that have unique values for its model. Drivers can also overwrite fields that were already initialized by the kernel (for instance, to improve performance). You can find more details in Chapter 8.
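The split between type-wide and model-specific initialization can be sketched as follows. Everything here is hypothetical: toy_netdev is a drastically reduced stand-in for net_device, and toy_ether_setup and toy_driver_probe are invented names. The queue length values are arbitrary and chosen only to show an override.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical, heavily reduced stand-in for struct net_device. */
struct toy_netdev {
    char name[16];
    unsigned int mtu;            /* maximum transmission unit */
    unsigned int addr_len;       /* link-layer address length in bytes */
    unsigned int tx_queue_len;   /* illustrative tunable */
};

/* Type-wide initializer: values shared by all Ethernet-like devices. */
static void toy_ether_setup(struct toy_netdev *dev)
{
    assert(dev != NULL);
    dev->mtu = 1500;             /* standard Ethernet MTU */
    dev->addr_len = 6;           /* Ethernet addresses are 6 bytes long */
    dev->tx_queue_len = 1000;    /* arbitrary default for this sketch */
}

/* Model-specific driver code: call the type-wide initializer first,
 * then set or override whatever differs for this particular model. */
static void toy_driver_probe(struct toy_netdev *dev, const char *name)
{
    memset(dev, 0, sizeof(*dev));
    strncpy(dev->name, name, sizeof(dev->name) - 1);
    toy_ether_setup(dev);
    dev->tx_queue_len = 100;     /* driver overrides the generic default */
}
```

The ordering matters: because the driver runs after the generic function, its assignments win, which is how a driver can overwrite fields already initialized on its behalf.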
The fields of the net_device structure can be classified into the following categories:
  • Configuration
  • Statistics
  • Device status
  • List management
  • Traffic management
  • Feature specific
  • Generic
  • Function pointers (or VFT)
The net_device structure includes three identifiers, which should not be confused with one another:
Files Mentioned in This Chapter
Figure 2-11 shows the main files referenced in this chapter. The missing ones will be introduced in upcoming chapters.
Figure 2-11: Files referenced in this chapter
Chapter 3: User-Space-to-Kernel Interface
In this chapter, I'll briefly introduce the main mechanisms that user-space applications can use to communicate with the kernel or read information exported by it. We will not look at the details of their implementations, because each mechanism would deserve a chapter of its own. The purpose of this chapter is to give you enough pointers to the code and to external documentation so that you can further investigate the topic if interested. For example, with this chapter, you have the information you need to find how and where a given directory is added to /proc, which kernel handler processes a given ioctl command, and what functions are provided by Netlink, currently the preferred interface for user-space network configuration.
This chapter focuses only on the mechanisms that I will often mention in the book when talking about the interface between the user-space configuration commands such as ifconfig and route and the kernel handlers that apply the requested configurations. For an analysis of the generic messaging systems available for intrakernel communication as well as kernel-to-user-space communication, please refer to Understanding the Linux Kernel (O'Reilly).
The discussion of each feature in this book ends with a set of sections that show how user-space configuration tools and the kernel communicate. The information in this chapter can help you understand those sections better.
Overview
The kernel exports internal information to user space via different interfaces. Besides the classic set of system calls the application programmer can use to ask for specific information, there are three special interfaces, two of which are virtual filesystems:
procfs (/proc filesystem)
This is a virtual filesystem, usually mounted in /proc, that allows the kernel to export internal information to user space in the form of files. The files don't actually exist on disk, but they can be read through cat or more and written to with the > shell redirector; they can even be assigned permissions like real files. The components of the kernel that create these files can therefore say who can read from or write to any file. Directories cannot be written to (i.e., no user can add or remove a file or a directory to or from any directory in /proc).
The default kernel that comes with most (if not all) Linux distributions includes support for procfs. It cannot be compiled as a module. The associated kernel option from the configuration menu is "Filesystems → Pseudo filesystems → /proc file system support."
sysctl (/proc/sys directory)
This interface allows user space to read and modify the value of kernel variables. You cannot use it for every kernel variable: the kernel has to explicitly say what variables are visible through this interface. From user space, you can access the variables exported by sysctl in two ways. One is the sysctl system call (see man sysctl) and the other one is procfs. When the kernel has support for procfs, it adds a special directory (/proc/sys) to /proc that includes a file for each kernel variable exported by sysctl.
The sysctl command that comes with the procps package can be used to configure variables exported with the sysctl interface. The command talks to the kernel by writing to
procfs Versus sysctl
Both procfs and sysctl export kernel-internal information, but procfs mainly exports read-only data, while most sysctl information is writable too (but only by the superuser).
When it comes to exporting read-only data, the choice between procfs and sysctl depends on how much information is supposed to be exported. Files associated with a simple kernel variable or data structure are exported with sysctl. The others, which are associated with more complex data structures and may need special formatting, are exported with procfs. Examples of the latter category are caches and statistics.
Most networking features register one or more files in /proc when they get initialized, either at boot time or at module load time. When a user reads the file, it causes the kernel to indirectly run a set of kernel functions that return some kind of output. The files registered by the networking code are located in /proc/net.
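From user space, a file under /proc/net is read with the same calls as any regular file; it is the read itself that triggers the kernel functions mentioned above. The following is a minimal sketch of that user-space side. The helper name read_first_line is hypothetical, and nothing here depends on what a particular /proc file contains.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Copy the first line of the file at path into buf, without the trailing
 * newline. Returns 0 on success, or -1 if the file cannot be opened or
 * read. Works on any file, including virtual ones exported under /proc. */
static int read_first_line(const char *path, char *buf, size_t len)
{
    FILE *fp;

    assert(buf != NULL && len > 0);
    fp = fopen(path, "r");
    if (fp == NULL)
        return -1;
    if (fgets(buf, (int)len, fp) == NULL) {
        fclose(fp);
        return -1;
    }
    fclose(fp);
    buf[strcspn(buf, "\n")] = '\0';   /* strip the newline, if any */
    return 0;
}
```

Calling read_first_line("/proc/net/arp", line, sizeof(line)) on a system where that file is registered would fetch whatever the kernel generates as its first line at that moment. On a system without it, the call simply returns -1.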
Directories in /proc can be created with proc_mkdir. Files in /proc/net can be registered and unregistered with proc_net_fops_create and proc_net_remove, defined in include/linux/proc_fs.h. These two routines are wrappers around the generic APIs create_proc_entry and remove_proc_entry. In particular, proc_net_fops_create takes care of creating the file (with proc_net_create) and initializing its file operation handlers. Let's look at an example.
This is how the ARP protocol registers its arp file in /proc/net:
static struct file_operations arp_seq_fops = {
    .owner      = THIS_MODULE,
    .open       = arp_seq_open,
    .read       = seq_read,
    .llseek     = seq_lseek,
    .release    = seq_release_private,
};

static int __init arp_proc_init(void)
{
    if (!proc_net_fops_create("arp", S_IRUGO, &arp_seq_fops))
        return -ENOMEM;
    return 0;
}
The three input parameters to proc_net_fops_create tell you that the filename is arp, it must be assigned read permission only, and the set of file operation handlers is
ioctl
At the top of Figure 3-4, you can see how an ioctl call is issued. Let's see an example involving ifconfig.
We said earlier that the ifconfig command uses ioctl to communicate with the kernel. For example, when the system administrator types a command like ifconfig eth0 mtu 1250 to change the MTU of the interface eth0, ifconfig opens a socket, initializes a local data structure with the information received from the system administrator (data in the example), and passes it to the kernel with an ioctl call. SIOCSIFMTU is the command identifier.
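    struct ifreq data;
    fd = socket(PF_INET, SOCK_DGRAM, 0);
    /* initialize "data": interface name and new MTU */
    memset(&data, 0, sizeof(data));
    strncpy(data.ifr_name, "eth0", IFNAMSIZ - 1);
    data.ifr_mtu = 1250;
    err = ioctl(fd, SIOCSIFMTU, &data);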
ioctl commands are processed by the kernel in different places. Figure 3-4 shows how the most common ioctl commands used by the networking code are dispatched by sock_ioctl and routed to the right function handler. We will not see how sock_ioctl is invoked or how transport protocols like UDP and TCP register their handlers. If you desire to dig into this part of the code, you can use the figure as a starting point. For the routines that we cover in this book, the figure provides a reference to the right chapter.
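The same socket-plus-ioctl sequence can be tried without superuser privileges by reading the MTU instead of setting it. The sketch below assumes a Linux system with glibc headers and uses SIOCGIFMTU, the "get" counterpart of the SIOCSIFMTU command shown earlier. The function name get_mtu is hypothetical, and this is not a description of how ifconfig itself is written.

```c
#define _DEFAULT_SOURCE   /* expose struct ifreq under strict -std modes */
#include <assert.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <net/if.h>

/* Return the MTU of the named interface, or -1 on any failure
 * (socket creation failed, unknown interface, and so on). */
static int get_mtu(const char *ifname)
{
    struct ifreq data;
    int fd, err;

    assert(ifname != NULL);
    fd = socket(PF_INET, SOCK_DGRAM, 0);
    if (fd < 0)
        return -1;

    /* Name the interface the request refers to. */
    memset(&data, 0, sizeof(data));
    strncpy(data.ifr_name, ifname, IFNAMSIZ - 1);

    /* On success, the kernel fills in data.ifr_mtu. */
    err = ioctl(fd, SIOCGIFMTU, &data);
    close(fd);
    return err < 0 ? -1 : data.ifr_mtu;
}
```

The socket is used only as a handle for reaching the networking code; no traffic is sent on it. Setting the MTU would follow the same steps with data.ifr_mtu filled in before the call and SIOCSIFMTU as the command.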
Figure 3-3: Creation of the core directories in /proc/sys/net
The names of the ioctl commands in the figure are parsed (split into components) for your convenience. For example, the command used to add a route to a routing table, SIOCADDRT, is shown as SIOC ADD RT to emphasize the two interesting components: ADD, which says you are adding something, and RT, which says a route is what you are adding. Most commands follow this syntax. Often, when a given object type can be both read and written, you have one more component in the command name: G for get or S for set. The two commands that add and remove an IP address from an interface,
Netlink
The Netlink socket, well described in RFC 3549, represents the preferred interface between user space and kernel for IP networking configuration. Netlink can also serve as an intrakernel messaging system and as a channel between multiple user-space processes.
With Netlink sockets you can use the standard socket APIs to open, close, transmit on, and receive from a socket. Let's quickly review the prototype of the socket system call:
int socket(int domain, int type, int protocol)
For details on what the three arguments are initialized to with TCP/IP sockets (i.e., domain PF_INET), you can use the man socket command.
As with any other socket, when you open a Netlink socket, you need to provide the domain, type, and protocol arguments. Netlink uses the new PF_NETLINK protocol family (domain), supports only the SOCK_DGRAM type, and defines several protocols, each one used for a different component (or a set of components) of the networking stack. For example, the NETLINK_ROUTE protocol is used for most networking features, such as routing and neighboring protocols, and NETLINK_FIREWALL is used for the firewall (Netfilter). The Netlink protocols are listed in the NETLINK_XXX enumeration list in include/linux/netlink.h.
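Opening a Netlink socket with the three arguments just described might look like the sketch below. It assumes a Linux system with the kernel's user-space headers installed. The function name open_route_netlink is hypothetical, and the groups parameter (a bitmask of Netlink multicast groups, 0 for none) is included only so the same helper could be used to subscribe to notifications.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <linux/netlink.h>

/* Open and bind a NETLINK_ROUTE socket. On success, returns the file
 * descriptor and stores the bound local address in *local; on failure,
 * returns -1. */
static int open_route_netlink(unsigned int groups, struct sockaddr_nl *local)
{
    socklen_t len = sizeof(*local);
    int fd;

    assert(local != NULL);
    fd = socket(PF_NETLINK, SOCK_DGRAM, NETLINK_ROUTE);
    if (fd < 0)
        return -1;

    memset(local, 0, sizeof(*local));
    local->nl_family = AF_NETLINK;
    local->nl_groups = groups;   /* multicast groups to join, if any */
    /* nl_pid is left at 0 so that the kernel assigns the port ID. */

    if (bind(fd, (struct sockaddr *)local, sizeof(*local)) < 0 ||
        getsockname(fd, (struct sockaddr *)local, &len) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}
```

After a successful call, local->nl_pid holds the identifier the kernel chose for this endpoint. Messages would then be exchanged with the usual send and receive socket calls, which this sketch leaves out.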
With Netlink sockets, endpoints are usually identified by the ID of the process that opened the sockets (PID), where the special value 0 identifies the kernel. Among Netlink's features is the ability to send both unicast and multicast messages: the destination endpoint address can be a PID, a multicast group ID, or a combination of the two. The kernel defines Netlink multicast groups for the purpose of sending out notifications about particular kinds of events, and user programs can register to those groups if they are interested in them. The groups are listed in the enumeration list RTMGRP_XXX in include/linux/rtnetlink.h. Among them are the RTMGRP_IPV4_ROUTE
Serializing Configuration Changes
Any time you apply a configuration change, the handler that takes care of it inside the kernel acquires a semaphore (rtnl_sem) that ensures exclusive access to the data structures that store the networking configuration. This is true regardless of whether the configuration is applied via ioctl or Netlink.
Chapter 4: Notification Chains
The kernel's many subsystems are heavily interdependent, so an event detected or generated by one of them could be of interest to others. To fulfill the need for interaction, Linux uses so-called notification chains.
In this chapter, we will see:
  • How notification chains are declared and what chains are defined by the networking code
  • How a kernel subsystem can register to a notification chain
  • How a kernel subsystem generates a notification on a chain
Note that notification chains are used only between kernel subsystems. Notifications between kernel and user space rely on other mechanisms, such as those introduced in Chapter 3.
Figure 4-1: Example of Linux router
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Reasons for Notification Chains
Suppose we had the Linux router in Figure 4-1 with four interfaces. The figure shows the relationship between the router and five networks, along with a simplified version of its routing table.
Let's look at some examples of the topology in Figure 4-1. Network A is directly connected to RT on interface eth0, and network F is not directly connected to RT, but RT's eth3 is directly connected to another router that has an interface with address IP1, and that second router knows how to reach network F. The other cases are similar. In short, some networks are directly connected and others require the help of one or more additional routers to be reached.
For a detailed description of how the routing code handles this situation, refer to Part VII. In this chapter, we will concentrate on the role of notification chains. Suppose that interface eth3 went down, due to a break in the network, an administrative command (such as ifconfig eth3 down), or a hardware failure. Networks D, E, and F would become unreachable by RT (and by systems in A, B, and C relying on RT for their connections) and should be removed from the routing table. Who is going to tell the routing subsystem about that interface failure? A notification chain.
Figure 4-1: Example of Linux router
Figure 4-2 shows a slightly more complicated example where the routing subsystem interacts with dynamic routing protocols—protocols that can adjust the routing table or tables to the network topology and therefore cope with interface failures when the topology allows it (i.e., when there are redundant paths).
Figure 4-2: Example of a Linux router with dynamic routing protocols
In Figure 4-2, network F could be reached by RT by passing through either network A or network E. E was chosen initially because of its smaller cost, but now that E is no longer reachable, the routing table should be updated so that the route for network F goes through network A. The basis for such a decision could include local host events, such as device registration and unregistration, as well as complex factors in router configuration and the routing protocols used. In any case, the routing subsystem that manages the tables must be informed of the relevant information by some other subsystem, demonstrating the need for notification chains.
Overview
A notification chain is simply a list of functions to execute when a given event occurs. Each function lets one other subsystem know about an event that occurred within, or was detected by, the subsystem calling the function.
Thus, for each notification chain there is a passive side (the notified) and an active side (the notifier), as in the so-called publish-and-subscribe model:
  • The notified are the subsystems that ask to be notified about the event and that provide a callback function to invoke.
  • The notifier is the subsystem that experiences an event and calls the callback function.
The functions executed are chosen by the notified subsystems. It is never up to the owner of the chain (the subsystem that generates the notifications) to decide what functions to execute. The owner simply defines the list; any kernel subsystem can register a callback function with that chain to receive the notification.
The use of notification chains makes the source code easier to write and maintain. Imagine how a generic routine might notify external subsystems about an event without using notification chains:
if (subsystem_X_enabled) {
    do_something_1();
}
if (subsystem_Y_enabled) {
    do_something_2();
}
if (subsystem_Z_enabled) {
    do_something_3();
}
... ... ...
In other words, a conditional clause would have to be included for every possible subsystem that might be interested in an event, and the maintainer of this subsystem would have to add a new clause every time somebody else added a subsystem to the kernel.
No subsystem maintainer is expected to keep track of every subsystem added to the kernel. However, each subsystem maintainer should know:
  • The kinds of events from other subsystems he is interested in
  • The kinds of events he knows about and that other subsystems may be interested in
Thus, notification chains allow each subsystem to share the occurrence of an event with others, without having to know what the others are and why they are interested.
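By contrast, a routine written against a chain needs no knowledge of its listeners. The following user-space sketch illustrates the idea only; it is not the kernel implementation, and the names listener, event_chain, chain_subscribe, chain_publish, and the two sample handlers are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* One subscriber: a callback plus the link to the next subscriber. */
struct listener {
    void (*callback)(int event);
    struct listener *next;
};

/* Head of the chain, owned by the subsystem that generates the events. */
static struct listener *event_chain = NULL;

/* Called by an interested subsystem; the owner never learns who it is. */
void chain_subscribe(struct listener *l)
{
    assert(l != NULL && l->callback != NULL);
    l->next = event_chain;
    event_chain = l;
}

/* Called by the owner: one loop replaces the per-subsystem if clauses.
 * Returns the number of callbacks that were invoked. */
int chain_publish(int event)
{
    struct listener *l;
    int called = 0;

    for (l = event_chain; l != NULL; l = l->next) {
        l->callback(event);
        called++;
    }
    return called;
}

/* Two hypothetical subsystems that record the last event they saw. */
int last_event_seen_by_x = -1;
int last_event_seen_by_y = -1;

static void subsystem_x_handler(int event) { last_event_seen_by_x = event; }
static void subsystem_y_handler(int event) { last_event_seen_by_y = event; }

struct listener subsystem_x = { .callback = subsystem_x_handler };
struct listener subsystem_y = { .callback = subsystem_y_handler };
```

Adding a third interested subsystem means one more call to chain_subscribe from that subsystem's own code; chain_publish does not change.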
Defining a Chain
The elements of the notification chain's list are of type notifier_block, whose definition is the following:
struct notifier_block
{
    int (*notifier_call)(struct notifier_block *self, unsigned long, void *);
    struct notifier_block *next;
    int priority;
};
notifier_call is the function to execute, next is used to link together the elements of the list, and priority represents the priority of the function. Functions with higher priority are executed first. But in practice, almost all registrations leave the priority out of the notifier_block definition, which means it gets the default value of 0 and execution order ends up depending only on the registration order (i.e., it is a semirandom order). The return values of notifier_call are listed in the upcoming section, "Notifying Events on a Chain."
Common names for notifier_block instances are xxx_chain, xxx_notifier_chain, and xxx_notifier_list.
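The ordering rule can be made concrete with a small user-space sketch. It reuses the notifier_block layout shown above, but demo_chain_insert is a hypothetical helper written for illustration, not the kernel's registration routine. It assumes that a new block is placed ahead of the first element with a strictly lower priority, so blocks of equal priority keep their registration order.

```c
#include <assert.h>
#include <stddef.h>

struct notifier_block {
    int (*notifier_call)(struct notifier_block *self, unsigned long, void *);
    struct notifier_block *next;
    int priority;
};

/* Hypothetical helper: keep the list sorted by descending priority. */
void demo_chain_insert(struct notifier_block **list, struct notifier_block *n)
{
    assert(list != NULL && n != NULL);
    /* Skip every element whose priority is >= the new block's priority. */
    while (*list != NULL && n->priority <= (*list)->priority)
        list = &(*list)->next;
    n->next = *list;
    *list = n;
}

/* Placeholder callback: ignores the event and returns 0. */
static int demo_event(struct notifier_block *self, unsigned long event, void *data)
{
    (void)self; (void)event; (void)data;
    return 0;
}

/* priority is left out for first and second, so it defaults to 0. */
struct notifier_block first  = { .notifier_call = demo_event };
struct notifier_block second = { .notifier_call = demo_event };
struct notifier_block urgent = { .notifier_call = demo_event, .priority = 10 };

/* Head of the hypothetical chain; empty until something is inserted. */
struct notifier_block *demo_chain = NULL;
```

Inserting first, second, and urgent in that order leaves urgent at the head of the list, followed by first and then second.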
Registering with a Chain
When a kernel component is interested in the events of a given notification chain, it can register with that chain by calling the general function notifier_chain_register. The kernel also provides a set of wrappers around notifier_chain_register, some of which are shown in Table 4-1.
Table 4-1 lists the main APIs and the associated wrappers used to register with and unregister from the three chains inetaddr_chain, inet6addr_chain, and netdev_chain.
Table 4-1: Main functions and wrappers for a few chains
Operation         Function prototype
Registration      int notifier_chain_register(struct notifier_block **list, struct notifier_block *n)
                  Wrappers:
                    inetaddr_chain     register_inetaddr_notifier
                    inet6addr_chain    register_inet6addr_notifier
                    netdev_chain       register_netdevice_notifier
Unregistration    int notifier_chain_unregister(struct notifier_block **nl, struct notifier_block *n)
                  Wrappers:
                    inetaddr_chain     unregister_inetaddr_notifier
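The relationship between the generic routines and their per-chain wrappers can be sketched in user space as follows. This is an illustration under stated assumptions rather than the kernel source: the demo_ functions, the demo_addr_chain list, and the choice of returning -ENOENT when a block is not found are assumptions made for the example, and locking is omitted entirely.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct notifier_block {
    int (*notifier_call)(struct notifier_block *self, unsigned long, void *);
    struct notifier_block *next;
    int priority;
};

/* Generic registration: insert n, keeping descending priority order. */
int demo_chain_register(struct notifier_block **list, struct notifier_block *n)
{
    assert(list != NULL && n != NULL);
    while (*list != NULL && n->priority <= (*list)->priority)
        list = &(*list)->next;
    n->next = *list;
    *list = n;
    return 0;
}

/* Generic unregistration: unlink n if present (assumed error: -ENOENT). */
int demo_chain_unregister(struct notifier_block **list, struct notifier_block *n)
{
    assert(list != NULL && n != NULL);
    for (; *list != NULL; list = &(*list)->next) {
        if (*list == n) {
            *list = n->next;
            n->next = NULL;
            return 0;
        }
    }
    return -ENOENT;
}

/* A hypothetical chain and its wrappers, which hide the list head. */
struct notifier_block *demo_addr_chain = NULL;

int register_demo_addr_notifier(struct notifier_block *nb)
{
    return demo_chain_register(&demo_addr_chain, nb);
}

int unregister_demo_addr_notifier(struct notifier_block *nb)
{
    return demo_chain_unregister(&demo_addr_chain, nb);
}

/* Two clients; their callbacks are left unset because nothing calls them. */
struct notifier_block demo_client_a = { .priority = 0 };
struct notifier_block demo_client_b = { .priority = 0 };
```

A client only ever names the wrapper for the chain it cares about; the pointer to the list head stays private to the code that owns the chain.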
Notifying Events on a Chain
Notifications are generated with notifier_call_chain, defined in kernel/sys.c. This function simply invokes, in order of priority, all the callback routines registered against the chain. Note that callback routines are executed in the context of the process that calls notifier_call_chain. A callback routine could, however, be implemented so that it queues the notification somewhere and wakes up a process that will look at it.
int notifier_call_chain(struct notifier_block **n, unsigned long val, void *v)
{
    int ret = NOTIFY_DONE;
    struct notifier_block *nb = *n;
 
    while (nb)
    {
        ret = nb->notifier_call(nb, val, v);
        if (ret & NOTIFY_STOP_MASK)
        {
            return ret;
        }
        nb = nb->next;
    }
    return ret;
}
This is the meaning of its three input parameters:
n
Notification chain.
val
Event type. The chain itself identifies a class of events; val unequivocally identifies an event type (e.g., NETDEV_REGISTER).
v
Input parameter that can be used by the handlers registered by the various clients. This can be used in different ways under different circumstances. For instance, when a new network device is registered with the kernel, the associated notification uses v to identify the net_device data structure.
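To see how the stop check affects a traversal, consider the following user-space sketch. The loop mirrors the listing above, but the numeric NOTIFY_ values are assumptions modeled on include/linux/notifier.h, and demo_call_chain, the two callbacks, and the three demo_ blocks are hypothetical. Here v points to a counter that each callback increments.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed values, modeled on include/linux/notifier.h. */
#define NOTIFY_DONE      0x0000
#define NOTIFY_OK        0x0001
#define NOTIFY_STOP_MASK 0x8000
#define NOTIFY_STOP      (NOTIFY_OK | NOTIFY_STOP_MASK)

struct notifier_block {
    int (*notifier_call)(struct notifier_block *self, unsigned long, void *);
    struct notifier_block *next;
    int priority;
};

/* Same traversal as notifier_call_chain, under a different name. */
int demo_call_chain(struct notifier_block **n, unsigned long val, void *v)
{
    int ret = NOTIFY_DONE;
    struct notifier_block *nb;

    assert(n != NULL);
    for (nb = *n; nb != NULL; nb = nb->next) {
        ret = nb->notifier_call(nb, val, v);
        if (ret & NOTIFY_STOP_MASK)
            break;      /* a callback asked to end the traversal */
    }
    return ret;
}

/* Each callback counts its own invocation through the v argument. */
static int cb_ok(struct notifier_block *self, unsigned long val, void *v)
{
    (void)self; (void)val;
    ++*(int *)v;
    return NOTIFY_OK;
}

static int cb_stop(struct notifier_block *self, unsigned long val, void *v)
{
    (void)self; (void)val;
    ++*(int *)v;
    return NOTIFY_STOP;
}

/* Chain layout: cb_ok -> cb_stop -> cb_ok (the last one is never reached). */
struct notifier_block demo_third  = { .notifier_call = cb_ok };
struct notifier_block demo_second = { .notifier_call = cb_stop, .next = &demo_third };
struct notifier_block demo_first  = { .notifier_call = cb_ok,   .next = &demo_second };
struct notifier_block *demo_chain = &demo_first;
```

Running demo_call_chain on this chain invokes only the first two callbacks and returns the value produced by cb_stop, because the function returns whatever the last invoked callback returned.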
The callback routines called by notifier_call_chain can return any of the NOTIFY_XXX values defined in include/linux/notifier.h:
NOTIFY_OK
Notification was processed correctly.
NOTIFY_DONE
Not interested in the notification.
NOTIFY_BAD
Something went wrong. Stop calling the callback routines for this event.
NOTIFY_STOP
Routine invoked correctly. However, no further callbacks need to be called for this event.
NOTIFY_STOP_MASK
This flag is checked by notifier_call_chain to see whether to stop invoking the callback routines, or keep going. Both
Notification Chains for the Networking Subsystems
The kernel defines at least 10 different notification chains. Here we are interested in the ones that are used to signal events of particular importance to the networking code. The main ones are:
inetaddr_chain
Sends notifications about the insertion, removal, and change of an Internet Protocol Version 4 (IPv4) address on a local interface. Chapter 23 describes when such notifications are generated. Internet Protocol Version 6 (IPv6) uses a similar chain (inet6addr_chain).
netdev_chain
Sends notifications about the registration status of network devices. Chapter 8 describes when such notifications are generated.
The purposes and uses of these chains, and of the others used by the networking subsystems, are described in the chapters about the relevant notifier subsystems.
The networking code can register to notifications generated by other kernel components, too. For example, some NIC device drivers register with the reboot_notifier_list chain, which is a chain that warns when the system is about to reboot.
Most notification chains come with a set of wrappers used to register to them and unregister from them. For example, this is the wrapper used to register to netdev_chain:
int register_netdevice_notifier(struct notifier_block *nb)
{
        return notifier_chain_register(&netdev_chain, nb);
}
Common names for wrappers include [un]register_xxx_notifier, xxx_[un]register_notifier, and xxx_[un]register.
Registrations to notification chains usually take place when the interested kernel component is initialized. For example, the following snapshot from net/ipv4/fib_frontend.c shows ip_fib_init, which is the initialization routine used by the routing code that is described in the section "Routing Subsystem Initialization" in Chapter 32:
static struct notifier_block fib_inetaddr_notifier = {
    .notifier_call = fib_inetaddr_event,
};
 
static struct notifier_block fib_netdev_notifier = {
    .notifier_call = fib_netdev_event,
};
 
void __init ip_fib_init(void)
{
    ... ... ...
    register_netdevice_notifier(&fib_netdev_notifier);
    register_inetaddr_notifier(&fib_inetaddr_notifier);
}
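What a callback such as fib_netdev_event might do with a notification can be sketched in user space. The following is not the routing code: the event constants, the demo_routes table, and demo_fib_netdev_event are hypothetical, the NOTIFY_ values are assumed, and the assignment of networks B and C to eth1 and eth2 is an assumption added to complete the Figure 4-1 scenario.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define NOTIFY_DONE 0x0000   /* assumed value: not interested in the event */
#define NOTIFY_OK   0x0001   /* assumed value: event processed */

/* Hypothetical event codes; the kernel defines its own NETDEV_ values. */
enum { DEMO_NETDEV_UP = 1, DEMO_NETDEV_DOWN = 2 };

struct notifier_block {
    int (*notifier_call)(struct notifier_block *self, unsigned long, void *);
    struct notifier_block *next;
    int priority;
};

/* Toy routing table: destination network, egress device, validity flag. */
struct demo_route {
    char network;
    const char *dev;
    int valid;
};

struct demo_route demo_routes[] = {
    { 'A', "eth0", 1 }, { 'B', "eth1", 1 }, { 'C', "eth2", 1 },
    { 'D', "eth3", 1 }, { 'E', "eth3", 1 }, { 'F', "eth3", 1 },
};

/* Callback: on a "device down" event, invalidate routes using that device. */
int demo_fib_netdev_event(struct notifier_block *self, unsigned long event, void *ptr)
{
    const char *dev = ptr;   /* in this sketch the payload is a device name */
    size_t i;

    assert(self != NULL);
    if (event != DEMO_NETDEV_DOWN || dev == NULL)
        return NOTIFY_DONE;

    for (i = 0; i < sizeof(demo_routes) / sizeof(demo_routes[0]); i++)
        if (strcmp(demo_routes[i].dev, dev) == 0)
            demo_routes[i].valid = 0;
    return NOTIFY_OK;
}

struct notifier_block demo_fib_notifier = {
    .notifier_call = demo_fib_netdev_event,
};

/* Helper for callers: count the routes that are still valid. */
int demo_valid_routes(void)
{
    int count = 0;
    size_t i;

    for (i = 0; i < sizeof(demo_routes) / sizeof(demo_routes[0]); i++)
        count += demo_routes[i].valid;
    return count;
}
```

Delivering DEMO_NETDEV_DOWN for eth3 through demo_fib_notifier.notifier_call invalidates the routes to D, E, and F and leaves the other three untouched; any other event is answered with NOTIFY_DONE.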
Tuning via /proc Filesystem
There is no file of interest in /proc as far as this chapter is concerned.
Functions and Variables Featured in This Chapter