Chapter 4. Container Isolation
This is the chapter in which you’ll find out how containers really work! This will be essential to understanding the extent to which containers are isolated from each other and from the host. You will be able to assess for yourself the strength of the security boundary that surrounds a container.
As you’ll know if you have ever run docker exec <container> bash
, a container looks a lot like a virtual machine from the inside. If you have shell access to a container and run ps
, you can see only the processes that are running inside it. The container has its own network stack, and it seems to have its own filesystem with a root directory that bears no relation to root on the host. You can run containers with limited resources, such as a restricted amount of memory or a fraction of the available CPUs. This all happens using the Linux features that we’re going to delve into in this chapter.
However much they might superficially resemble each other, it’s important to realize that containers aren’t virtual machines, and in Chapter 5 we’ll take a look at the differences between these two types of isolation. In my experience, really understanding and being able to contrast the two is absolutely key to grasping the extent to which traditional security measures can be effective in containers, and to identifying where container-specific tooling is necessary.
You’ll see how containers are built out of Linux constructs such as namespaces and chroot
, along with cgroups, which were covered in Chapter 3. With an understanding of these constructs under your belt, you’ll have a feeling for how well protected your applications are when they run inside containers.
Although the general concepts of these constructs are fairly straightforward, the way they work together with other features of the Linux kernel can be complex. Container escape vulnerabilities (for example, CVE-2019-5736, a serious vulnerability discovered in both runc
and LXC
) have been based on subtleties in the way that namespaces, capabilities, and filesystems interact.
Linux Namespaces
If cgroups control the resources that a process can use, namespaces control what it can see. By putting a process in a namespace, you can restrict the resources that are visible to that process.
The origins of namespaces date back to the Plan 9 operating system. At the time, most operating systems had a single “name space” of files. Unix systems allowed the mounting of filesystems, but they would all be mounted into the same system-wide view of all filenames. In Plan 9, each process was part of a process group that had its own “name space” abstraction, the hierarchy of files (and file-like objects) that this group of processes could see. Each process group could mount its own set of filesystems without seeing each other.
The first namespace was introduced to the Linux kernel in version 2.4.19 back in 2002. This was the mount namespace, which provided similar functionality to that of Plan 9. Nowadays there are several different kinds of namespace supported by Linux:
-
Unix Timesharing System (UTS)—this sounds complicated, but to all intents and purposes this namespace is really just about the hostname and domain names for the system that a process is aware of.
-
Process IDs
-
Mount points
-
Network
-
User and group IDs
-
Inter-process communications (IPC)
-
Control groups (cgroups)
It’s possible that more resources will be namespaced in future revisions of the Linux kernel. For example, there have been discussions about having a namespace for time.
A process is always in exactly one namespace of each type. When you start a Linux system it has a single namespace of each type, but as you’ll see, you can create additional namespaces and assign processes into them. You can easily see the namespaces on your machine using the lsns
command:
vagrant@myhost:~$ lsns
        NS TYPE   NPROCS   PID USER    COMMAND
4026531835 cgroup      3 28459 vagrant /lib/systemd/systemd --user
4026531836 pid         3 28459 vagrant /lib/systemd/systemd --user
4026531837 user        3 28459 vagrant /lib/systemd/systemd --user
4026531838 uts         3 28459 vagrant /lib/systemd/systemd --user
4026531839 ipc         3 28459 vagrant /lib/systemd/systemd --user
4026531840 mnt         3 28459 vagrant /lib/systemd/systemd --user
4026531992 net         3 28459 vagrant /lib/systemd/systemd --user
This looks nice and neat, and there is one namespace for each of the types I mentioned previously. Sadly, this is an incomplete picture! The man page for lsns
tells us that it “reads information directly from the /proc filesystem and for non-root users it may return incomplete information.” Let’s see what you get when you run as root:
vagrant@myhost:~$ sudo lsns
        NS TYPE   NPROCS   PID USER            COMMAND
4026531835 cgroup     93     1 root            /sbin/init
4026531836 pid        93     1 root            /sbin/init
4026531837 user       93     1 root            /sbin/init
4026531838 uts        93     1 root            /sbin/init
4026531839 ipc        93     1 root            /sbin/init
4026531840 mnt        89     1 root            /sbin/init
4026531860 mnt         1    15 root            kdevtmpfs
4026531992 net        93     1 root            /sbin/init
4026532170 mnt         1 14040 root            /lib/systemd/systemd-udevd
4026532171 mnt         1   451 systemd-network /lib/systemd/systemd-networkd
4026532190 mnt         1   617 systemd-resolve /lib/systemd/systemd-resolved
The root user can see some additional mount namespaces, and there are a lot more processes visible to root than were visible to the non-root user. The reason to show you this is to note that when we are using lsns
, we should run as root (or use sudo
) to get the complete picture.
Let’s explore how you can use namespaces to create something that behaves like what we call a “container.”
Note
The examples in this chapter use Linux shell commands to create a container. If you would like to try creating a container using the Go programming language, you will find instructions at https://github.com/lizrice/containers-from-scratch.
Isolating the Hostname
Let’s start with the namespace for the Unix Timesharing System (UTS). As mentioned previously, this covers the hostname and domain names. By putting a process in its own UTS namespace, you can change the hostname for this process independently of the hostname of the machine or virtual machine on which it’s running.
If you open a terminal on Linux, you can see the hostname:
vagrant@myhost:~$ hostname
myhost
Most (perhaps all?) container systems give each container a random ID. By default this ID is used as the hostname. You can see this by running a container and getting shell access. For example, in Docker you could do the following:
vagrant@myhost:~$ docker run --rm -it --name hello ubuntu bash
root@cdf75e7a6c50:/$ hostname
cdf75e7a6c50
Incidentally, you can see in this example that even if you give the container a name in Docker (here I specified --name hello
), that name isn’t used for the hostname of the container.
The container can have its own hostname because Docker created it with its own UTS namespace. You can explore the same thing by using the unshare
command to create a process that has a UTS namespace of its own.
As it’s described on the man page (seen by running man unshare
), unshare
lets you “run a program with some namespaces unshared from the parent.” Let’s dig a little deeper into that description. When you “run a program,” the kernel creates a new process and executes the program in it. This is done from the context of a running process—the parent—and the new process will be referred to as the child. The word “unshare” means that, rather than sharing namespaces of its parent, the child is going to be given its own.
Let’s give it a try. You need to have root privileges to do this, hence the sudo
at the start of the line:
vagrant@myhost:~$ sudo unshare --uts sh
$ hostname
myhost
$ hostname experiment
$ hostname
experiment
$ exit
vagrant@myhost:~$ hostname
myhost
This runs a sh
shell in a new process that has a new UTS namespace. Any programs you run inside the shell will inherit its namespaces. When you run the hostname
command, it executes in the new UTS namespace that has been isolated from that of the host machine.
If you were to open another terminal window to the same host before the exit
, you could confirm that the hostname hasn’t changed for the whole (virtual) machine. You can change the hostname on the host without affecting the hostname that the namespaced process is aware of, and vice versa.
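For example, you could run the following from that second terminal while the namespaced shell is still running. (This really does rename the host, so set the name back afterward if you try it; the new name here is just for illustration.)
vagrant@myhost:~$ hostname
myhost
vagrant@myhost:~$ sudo hostname renamed-host
vagrant@myhost:~$ hostname
renamed-host
The shell inside the UTS namespace carries on reporting whatever hostname you set there, unaffected by this change.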
This is a key component of the way containers work. Namespaces give them a set of resources (in this case the hostname) that are independent of the host machine, and of other containers. But we are still talking about a process that is being run by the same Linux kernel. This has security implications that I’ll discuss later in the chapter. For now, let’s look at another example of a namespace by seeing how you can give a container its own view of running processes.
Isolating Process IDs
If you run the ps
command inside a Docker container, you can see only the processes running inside that container and none of the processes running on the host:
vagrant@myhost:~$ docker run --rm -it --name hello ubuntu bash
root@cdf75e7a6c50:/$ ps -eaf
UID        PID  PPID  C STIME TTY          TIME CMD
root         1     0  0 18:41 pts/0    00:00:00 bash
root        10     1  0 18:42 pts/0    00:00:00 ps -eaf
root@cdf75e7a6c50:/$ exit
vagrant@myhost:~$
This is achieved with the process ID namespace, which restricts the set of process IDs that are visible. Try running unshare
again, but this time specifying that you want a new PID namespace with the --pid
flag:
vagrant@myhost:~$ sudo unshare --pid sh
$ whoami
root
$ whoami
sh: 2: Cannot fork
$ whoami
sh: 3: Cannot fork
$ ls
sh: 4: Cannot fork
$ exit
vagrant@myhost:~$
This doesn’t seem very successful—it’s not possible to run any commands after the first whoami
! But there are some interesting artifacts in this output.
The first process under sh
seems to have worked OK, but every command after that fails due to an inability to fork. The error is output in the form <command>: <process ID>: <message>
, and you can see that the process IDs are incrementing each time. Given the sequence, it would be reasonable to assume that the first whoami
ran as process ID 1. That is a clue that the PID namespace is working in some fashion, in that the process ID numbering has restarted. But it’s pretty much useless if you can’t run more than one process!
There are clues to what the problem is in the description of the --fork
flag in the man page for unshare
: “Fork the specified program as a child process of unshare rather than running it directly. This is useful when creating a new pid namespace.”
You can explore this by running ps
to view the process hierarchy from a second terminal window:
vagrant@myhost:~$ ps fa
  PID TTY      STAT   TIME COMMAND
...
30345 pts/0    Ss     0:00 -bash
30475 pts/0    S      0:00  \_ sudo unshare --pid sh
30476 pts/0    S      0:00      \_ sh
The sh
process is not a child of unshare
; it’s a child of the sudo
process.
Now try the same thing with the --fork
parameter:
vagrant@myhost:~$ sudo unshare --pid --fork sh
$ whoami
root
$ whoami
root
This is progress, in that you can now run more than one command before running into the “Cannot fork” error. If you look at the process hierarchy again from a second terminal, you’ll see an important difference:
vagrant@myhost:~$ ps fa
  PID TTY      STAT   TIME COMMAND
...
30345 pts/0    Ss     0:00 -bash
30470 pts/0    S      0:00  \_ sudo unshare --pid --fork sh
30471 pts/0    S      0:00      \_ unshare --pid --fork sh
30472 pts/0    S      0:00          \_ sh
...
With the --fork
parameter, the sh
shell is running as a child of the unshare
process, and you can successfully run as many different child commands as you choose within this shell.
Given that the shell is within its own process ID namespace, the results of running ps
inside it might be surprising:
vagrant@myhost:~$ sudo unshare --pid --fork sh
$ ps
  PID TTY          TIME CMD
14511 pts/0    00:00:00 sudo
14512 pts/0    00:00:00 unshare
14513 pts/0    00:00:00 sh
14515 pts/0    00:00:00 ps
$ ps -eaf
UID        PID  PPID  C STIME TTY          TIME CMD
root         1     0  0 Mar27 ?        00:00:02 /sbin/init
root         2     0  0 Mar27 ?        00:00:00 [kthreadd]
root         3     2  0 Mar27 ?        00:00:00 [ksoftirqd/0]
root         5     2  0 Mar27 ?        00:00:00 [kworker/0:0H]
...many more lines of output about processes...
$ exit
vagrant@myhost:~$
As you can see, ps
is still showing all the processes on the whole host, despite running inside a new process ID namespace. If you want the ps
behavior that you would see in a Docker container, it’s not sufficient just to use a new process ID namespace, and the reason for this is included in the man page for ps
: “This ps works by reading the virtual files in /proc.”
Let’s take a look at the /proc
directory to see what virtual files this is referring to. Your system will look similar, but not exactly the same, as it will be running a different set of processes:
vagrant@myhost:~$ ls /proc
1      14553  292    467        cmdline      modules
10     14585  3      5          consoles     mounts
1009   14586  30087  53         cpuinfo      mpt
1010   14664  30108  538        crypto       mtrr
1015   14725  30120  54         devices      net
1016   14749  30221  55         diskstats    pagetypeinfo
1017   15     30224  56         dma          partitions
1030   156    30256  57         driver       sched_debug
1034   157    30257  58         execdomains  schedstat
1037   158    30283  59         fb           scsi
1044   159    313    60         filesystems  self
1053   16     314    61         fs           slabinfo
1063   160    315    62         interrupts   softirqs
1076   161    34     63         iomem        stat
1082   17     35     64         ioports      swaps
11     18     3509   65         irq          sys
1104   19     3512   66         kallsyms     sysrq-trigger
1111   2      36     7          kcore        sysvipc
1175   20     37     72         keys         thread-self
1194   21     378    8          key-users    timer_list
12     22     385    85         kmsg         timer_stats
1207   23     392    86         kpagecgroup  tty
1211   24     399    894        kpagecount   uptime
1215   25     401    9          kpageflags   version
12426  26     403    966        loadavg      version_signature
125    263    407    acpi       locks        vmallocinfo
13     27     409    buddyinfo  mdstat       vmstat
14046  28     412    bus        meminfo      zoneinfo
14087  29     427    cgroups    misc
Every numbered directory in /proc
corresponds to a process ID, and there is a lot of interesting information about a process inside its directory. For example, /proc/<pid>/exe
is a symbolic link to the executable that’s being run inside this particular process, as you can see in the following example:
vagrant@myhost:~$ ps
  PID TTY          TIME CMD
28441 pts/1    00:00:00 bash
28558 pts/1    00:00:00 ps
vagrant@myhost:~$ ls /proc/28441
attr             fdinfo           numa_maps        smaps
autogroup        gid_map          oom_adj          smaps_rollup
auxv             io               oom_score        stack
cgroup           limits           oom_score_adj    stat
clear_refs       loginuid         pagemap          statm
cmdline          map_files        patch_state      status
comm             maps             personality      syscall
coredump_filter  mem              projid_map       task
cpuset           mountinfo        root             timers
cwd              mounts           sched            timerslack_ns
environ          mountstats       schedstat        uid_map
exe              net              sessionid        wchan
fd               ns               setgroups
vagrant@myhost:~$ ls -l /proc/28441/exe
lrwxrwxrwx 1 vagrant vagrant 0 Oct 10 13:32 /proc/28441/exe -> /bin/bash
Irrespective of the process ID namespace it’s running in, ps
is going to look in /proc
for information about running processes. In order to have ps
return only the information about the processes inside the new namespace, there needs to be a separate copy of the /proc
directory, where the kernel can write information about the namespaced processes. Given that /proc
is a directory directly under root, this means changing the root directory.
Changing the Root Directory
From within a container, you don’t see the host’s entire filesystem; instead, you see a subset, because the root directory gets changed as the container is created.
You can change the root directory in Linux with the chroot
command. This effectively moves the root directory for the current process to point to some other location within the filesystem. Once you have done a chroot
command, you lose access to anything that was higher in the file hierarchy than your current root directory, since there is no way to go any higher than root within the filesystem, as illustrated in Figure 4-1.
The description in chroot
’s man page reads as follows: “Run COMMAND with root directory set to NEWROOT. […] If no command is given, run ${SHELL} -i (default: /bin/sh -i).”
From this you can see that chroot
doesn’t just change the directory, but also runs a command, falling back to running a shell if you don’t specify a different command.
Create a new directory and try to chroot
into it:
vagrant@myhost:~$ mkdir new_root
vagrant@myhost:~$ sudo chroot new_root
chroot: failed to run command ‘/bin/bash’: No such file or directory
vagrant@myhost:~$ sudo chroot new_root ls
chroot: failed to run command ‘ls’: No such file or directory
This doesn’t work! The problem is that once you are inside the new root directory, there is no bin
directory inside this root, so it’s impossible to run the /bin/bash
shell. Similarly, if you try to run the ls
command, it’s not there. You’ll need the files for any commands you want to run to be available within the new root. This is exactly what happens in a “real” container: the container is instantiated from a container image, which encapsulates the filesystem that the container sees. If an executable isn’t present within that filesystem, the container won’t be able to find and run it.
Why not try running Alpine Linux within your container? Alpine is a fairly minimal Linux distribution designed for containers. You’ll need to start by downloading the filesystem:
vagrant@myhost:~$ mkdir alpine
vagrant@myhost:~$ cd alpine
vagrant@myhost:~/alpine$ curl -o alpine.tar.gz http://dl-cdn.alpinelinux.org/alpine/v3.10/releases/x86_64/alpine-minirootfs-3.10.0-x86_64.tar.gz
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 2647k  100 2647k    0     0  16.6M      0 --:--:-- --:--:-- --:--:-- 16.6M
vagrant@myhost:~/alpine$ tar xvf alpine.tar.gz
At this point you have a copy of the Alpine filesystem inside the alpine
directory you created. Remove the compressed version and move back to the parent directory:
vagrant@myhost:~/alpine$ rm alpine.tar.gz
vagrant@myhost:~/alpine$ cd ..
You can explore the contents of the filesystem with ls alpine
to see that it looks like the root of a Linux filesystem with directories such as bin
, lib
, var
, tmp
, and so on.
Now that you have the Alpine distribution unpacked, you can use chroot
to move into the alpine
directory, provided you supply a command that exists within that directory’s hierarchy.
It’s slightly more subtle than that, because the executable has to be in the new process’s path. This process inherits the parent’s environment, including the PATH
environment variable. The bin
directory within alpine
has become /bin
for the new process, and assuming that your regular path includes /bin
, you can pick up the ls
executable from that directory without specifying its path explicitly:
vagrant@myhost:~$ sudo chroot alpine ls
bin    etc    lib    mnt    proc   run    srv    tmp    var
dev    home   media  opt    root   sbin   sys    usr
vagrant@myhost:~$
Notice that it is only the child process (in this example, the process that ran ls
) that gets the new root directory. When that process finishes, control returns to the parent process. If you run a shell as the child process, it won’t complete immediately, so that makes it easier to see the effects of changing the root directory:
vagrant@myhost:~$ sudo chroot alpine sh
/ $ ls
bin    etc    lib    mnt    proc   run    srv    tmp    var
dev    home   media  opt    root   sbin   sys    usr
/ $ whoami
root
/ $ exit
vagrant@myhost:~$
If you try to run the bash
shell, it won’t work. This is because the Alpine distribution doesn’t include it, so it’s not present inside the new root directory. If you tried the same thing with the filesystem of a distribution like Ubuntu, which does include bash
, it would work.
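You can check this directly; the error is the same one you saw with the empty new_root directory:
vagrant@myhost:~$ sudo chroot alpine bash
chroot: failed to run command ‘bash’: No such file or directory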
To summarize, chroot
literally “changes the root” for a process. After changing the root, the process (and its children) will be able to access only the files and directories that are lower in the hierarchy than the new root directory.
Note
In addition to chroot
, there is a system call called pivot_root
. For the purposes of this chapter, whether chroot
or pivot_root
is used is an implementation detail; the key point is that a container needs to have its own root directory. I have used chroot
in these examples because it is slightly simpler and more familiar to many people.
There are security advantages to using pivot_root
over chroot
, so in practice you should find the former if you look at the source code of a container runtime implementation. The main difference is that pivot_root
takes advantage of the mount namespace; the old root is no longer mounted and is therefore no longer accessible within that mount namespace. The chroot
system call doesn’t take this approach, leaving the old root accessible via mount points.
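If you would like to see pivot_root for yourself, here is a minimal sketch using the pivot_root command from util-linux and the alpine directory from earlier in this chapter. Treat it as an illustration rather than what a production runtime does; the propagation step is unnecessary on recent versions of unshare, which make mounts private automatically, but it does no harm.
vagrant@myhost:~$ sudo unshare --mount sh
$ mount --make-rprivate /            # pivot_root refuses to work on a shared root mount
$ mount --bind alpine alpine         # the new root must be a mount point
$ cd alpine
$ mkdir -p old_root
$ pivot_root . old_root              # swap the root; the old root ends up under /old_root
$ PATH=/bin:/usr/bin ls /            # from here on, use Alpine's own binaries
$ umount -l /old_root                # detach the old root so it is no longer reachable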
You have now seen how a container can be given its own root filesystem. I’ll discuss this further in Chapter 6, but right now let’s see how having its own root filesystem allows the kernel to show a container just a restricted view of namespaced resources.
Combining Namespacing and Changing the Root
So far you have seen namespacing and changing the root as two separate things, but you can combine the two by running chroot
in a new namespace:
me@myhost:~$ sudo unshare --pid --fork chroot alpine sh
/ $ ls
bin    etc    lib    mnt    proc   run    srv    tmp    var
dev    home   media  opt    root   sbin   sys    usr
If you recall from earlier in this chapter (see “Isolating Process IDs”), giving the container its own root directory allows it to create a /proc
directory for the container that’s independent of /proc
on the host. For this to be populated with process information, you will need to mount it as a pseudofilesystem of type proc
. With the combination of a process ID namespace and an independent /proc
directory, ps
will now show just the processes that are inside the process ID namespace:
/ $ mount -t proc proc proc
/ $ ps
PID   USER     TIME  COMMAND
    1 root      0:00 sh
    6 root      0:00 ps
/ $ exit
vagrant@myhost:~$
Success! It has been more complex than isolating the container’s hostname, but through the combination of creating a process ID namespace, changing the root directory, and mounting a pseudofilesystem to handle process information, you can limit a container so that it has a view only of its own processes.
There are more namespaces left to explore. Let’s see the mount namespace next.
Mount Namespace
Typically you don’t want a container to have all the same filesystem mounts as its host. Giving the container its own mount namespace achieves this separation.
Here’s an example that creates a simple bind mount for a process with its own mount namespace:
vagrant@myhost:~$ sudo unshare --mount sh
$ mkdir source
$ touch source/HELLO
$ ls source
HELLO
$ mkdir target
$ ls target
$ mount --bind source target
$ ls target
HELLO
Once the bind mount is in place, the contents of the source
directory are also available in target
. If you look at all the mounts from within this process, there will probably be a lot of them, but the following command finds the target you created if you followed the preceding example:
$ findmnt target
TARGET                SOURCE                                               FSTYPE OPTIONS
/home/vagrant/target  /dev/mapper/vagrant--vg-root[/home/vagrant/source]  ext4   rw,relatime,errors=remount-ro,data=ordered
From the host’s perspective, this isn’t visible, which you can prove by running the same command from another terminal window and confirming that it doesn’t return anything.
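For example, from a second terminal on the host:
vagrant@myhost:~$ findmnt target
vagrant@myhost:~$
No output: the bind mount exists only inside the new mount namespace.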
Try running findmnt
from within the mount namespace again, but this time without any parameters, and you will get a long list. You might be thinking that it seems wrong for a container to be able to see all the mounts on the host. This is a very similar situation to what you saw with the process ID namespace: the kernel uses the /proc/<PID>/mounts
file to communicate information about mount points for each process. If you create a process with its own mount namespace but it is using the host’s /proc
directory, you’ll find that its /proc/<PID>/mounts file includes all the preexisting host mounts. (You can simply cat
this file to get a list of mounts.)
To get a fully isolated set of mounts for the containerized process, you will need to combine creating a new mount namespace with a new root filesystem and a new proc
mount, like this:
vagrant@myhost:~$ sudo unshare --mount chroot alpine sh
/ $ mount -t proc proc proc
/ $ mount
proc on /proc type proc (rw,relatime)
/ $ mkdir source
/ $ touch source/HELLO
/ $ mkdir target
/ $ mount --bind source target
/ $ mount
proc on /proc type proc (rw,relatime)
/dev/sda1 on /target type ext4 (rw,relatime,data=ordered)
Alpine Linux doesn’t come with the findmnt
command, so this example uses mount
with no parameters to generate the list of mounts. (If you are cynical about this change, try the earlier example with mount
instead of findmnt
to check that you get the same results.)
You may be familiar with the concept of mounting host directories into a container using docker run -v <host directory>:<container directory> ...
. To achieve this, after the root filesystem has been put in place for the container, the target container directory is created and then the source host directory gets bind mounted into that target. Because each container has its own mount namespace, host directories mounted like this are not visible from other containers.
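To make that concrete, here is a rough, hand-rolled version of those steps using the alpine directory and a made-up host directory. The paths are purely illustrative, and a real runtime does considerably more than this:
vagrant@myhost:~$ mkdir -p /tmp/hostdata && echo hello > /tmp/hostdata/file
vagrant@myhost:~$ sudo unshare --mount sh
$ mkdir -p alpine/data                     # the target directory inside the container's root filesystem
$ mount --bind /tmp/hostdata alpine/data   # bind the host directory onto it
$ chroot alpine /bin/cat /data/file
hello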
Note
If you create a mount that is visible to the host, it won’t automatically get cleaned up when your “container” process terminates. You will need to destroy it using umount
. This also applies to the /proc
pseudofilesystems. They won’t do any particular harm, but if you like to keep things tidy, you can remove them with umount proc
. The system won’t let you unmount the final /proc
used by the host.
Network Namespace
The network namespace allows a container to have its own view of network interfaces and routing tables. When you create a process with its own network namespace, you can see it with lsns
:
vagrant@myhost:~$ sudo lsns -t net
        NS TYPE NPROCS   PID USER NETNSID    NSFS COMMAND
4026531992 net      93     1 root unassigned      /sbin/init
vagrant@myhost:~$ sudo unshare --net bash
root@myhost:~$ lsns -t net
        NS TYPE NPROCS   PID USER NETNSID    NSFS COMMAND
4026531992 net      92     1 root unassigned      /sbin/init
4026532192 net       2 28586 root unassigned      bash
Note
You might come across the ip netns
command, but that is not much use to us here. Using unshare --net
creates an anonymous network namespace, and anonymous namespaces don’t appear in the output from ip netns list
.
When you put a process into its own network namespace, it starts with just the loopback interface:
vagrant@myhost:~$ sudo unshare --net bash
root@myhost:~$ ip a
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
With nothing but a loopback interface, your container won’t be able to communicate. To give it a path to the outside world, you create a virtual Ethernet interface—or more strictly, a pair of virtual Ethernet interfaces. These act as if they were the two ends of a metaphorical cable connecting your container namespace to the default network namespace.
In a second terminal window, as root, you can create a virtual Ethernet pair by specifying the anonymous namespaces associated with their process IDs, like this:
root@myhost:~$ ip link add ve1 netns 28586 type veth peer name ve2 netns 1
-
ip link add
indicates that you want to add a link. -
ve1
is the name of one “end” of the virtual Ethernet “cable.” -
netns 28586
says that this end is “plugged in” to the network namespace associated with process ID 28586 (which is shown in the output from lsns -t net
in the example at the start of this section). -
peer name ve2
gives the name of the other end of the “cable.” -
netns 1
specifies that this second end is “plugged in” to the network namespace associated with process ID 1.
The ve1
virtual Ethernet interface is now visible from inside the “container” process:
root@myhost:~$ ip a
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: ve1@if3: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group ...
    link/ether 7a:8a:3f:ba:61:2c brd ff:ff:ff:ff:ff:ff link-netnsid 0
The link is in “DOWN” state and needs to be brought up before it’s any use. Both ends of the connection need to be brought up.
Bring up the ve2
end on the host:
root@myhost:~$ ip link set ve2 up
And once you bring up the ve1
end in the container, the link should move to “UP” state:
root@myhost:~$ ip link set ve1 up
root@myhost:~$ ip a
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: ve1@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP ...
    link/ether 7a:8a:3f:ba:61:2c brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet6 fe80::788a:3fff:feba:612c/64 scope link
       valid_lft forever preferred_lft forever
For IP traffic to flow, there needs to be an IP address associated with the interface. In the container:
root@myhost:~$ ip addr add 192.168.1.100/24 dev ve1
And on the host:
root@myhost:~$ ip addr add 192.168.1.200/24 dev ve2
This will also have the effect of adding an IP route into the routing table in the container:
root@myhost:~$ ip route
192.168.1.0/24 dev ve1 proto kernel scope link src 192.168.1.100
As mentioned at the start of this section, the network namespace isolates both the interfaces and the routing table, so this routing information is independent of the IP routing table on the host. At this point the container can send traffic only to 192.168.1.0/24
addresses. You can test the connection with a ping across the virtual Ethernet pair, in this case to the address assigned at the container’s end:
root@myhost:~$ ping 192.168.1.100
PING 192.168.1.100 (192.168.1.100) 56(84) bytes of data.
64 bytes from 192.168.1.100: icmp_seq=1 ttl=64 time=0.355 ms
64 bytes from 192.168.1.100: icmp_seq=2 ttl=64 time=0.035 ms
^C
We will dig further into networking and container network security in Chapter 10.
User Namespace
The user namespace allows processes to have their own view of user and group IDs. Much like process IDs, the users and groups still exist on the host, but they can have different IDs. The main benefit of this is that you can map the root ID of 0 within a container to some other non-root identity on the host. This is a huge advantage from a security perspective, since it allows software to run as root inside a container, but an attacker who escapes from the container to the host will have only a non-root, unprivileged identity. As you’ll see in Chapter 9, it’s not hard to misconfigure a container in a way that makes it easy to escape to the host. With user namespaces, you’re not just one false move away from host takeover.
Note
As of this writing, user namespaces are not in particularly common use yet. This feature is not turned on by default in Docker (see “User Namespace Restrictions in Docker”), and it is not supported at all in Kubernetes, though it has been under discussion.
Generally speaking, you need to be root to create new namespaces, which is why the Docker daemon runs as root, but the user namespace is an exception:
vagrant@myhost:~$ unshare --user bash
nobody@myhost:~$ id
uid=65534(nobody) gid=65534(nogroup) groups=65534(nogroup)
nobody@myhost:~$ echo $$
31196
Inside the new user namespace the user has the nobody
ID. You need to put in place a mapping between user IDs inside and outside the namespace, as shown in Figure 4-2.
This mapping exists in /proc/<pid>/uid_map
, which you can edit as root (on the host). There are three fields in this file:
-
The lowest ID to map from the child process’s perspective
-
The lowest corresponding ID that this should map to on the host
-
The number of IDs to be mapped
As an example, on my machine, the vagrant
user has ID 1000. In order to have vagrant
get assigned the root ID of 0 inside the child process, the first two fields are 0 and 1000. The last field can be 1 if you want to map only one ID (which may well be the case if you want only one user inside the container). Here’s the command I used to set up that mapping:
vagrant@myhost:~$ sudo echo '0 1000 1' > /proc/31196/uid_map
Immediately, inside its user namespace, the process has taken on the root identity. Don’t be put off by the fact that the bash prompt still says “nobody”; this doesn’t get updated unless you rerun the scripts that get run when you start a new shell (e.g., ~/.bash_profile
):
nobody@myhost:~$ id
uid=0(root) gid=65534(nogroup) groups=65534(nogroup)
A similar mapping process is used to map the group(s) used inside the child process.
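For example, to give the same process a group mapping, you write to gid_map in the same three-field format. On recent kernels you may first need to write "deny" to the process's setgroups file before the kernel accepts the gid_map write (again using the PID from the earlier example):
vagrant@myhost:~$ sudo sh -c 'echo deny > /proc/31196/setgroups'
vagrant@myhost:~$ sudo sh -c 'echo "0 1000 1" > /proc/31196/gid_map'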
This process is now running with a large set of capabilities:
nobody@myhost:~$ capsh --print | grep Current
Current: = cap_chown,cap_dac_override,cap_dac_read_search,cap_fowner,cap_fsetid,
cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_linux_immutable,
cap_net_bind_service,cap_net_broadcast,cap_net_admin,cap_net_raw,cap_ipc_lock,
cap_ipc_owner,cap_sys_module,cap_sys_rawio,cap_sys_chroot,cap_sys_ptrace,
cap_sys_pacct,cap_sys_admin,cap_sys_boot,cap_sys_nice,cap_sys_resource,
cap_sys_time,cap_sys_tty_config,cap_mknod,cap_lease,cap_audit_write,
cap_audit_control,cap_setfcap,cap_mac_override,cap_mac_admin,cap_syslog,
cap_wake_alarm,cap_block_suspend,cap_audit_read+ep
As you saw in Chapter 2, capabilities grant the process various permissions. When you create a new user namespace, the kernel gives the process all these capabilities so that the pseudo root user inside the namespace is allowed to create other namespaces, set up networking, and so on, fulfilling everything else required to make it a real container.
In fact, if you simultaneously create a process with several new namespaces, the user namespace will be created first so that you have the full capability set that permits you to create other namespaces:
vagrant@myhost:~$ unshare --uts bash
unshare: unshare failed: Operation not permitted
vagrant@myhost:~$ unshare --uts --user bash
nobody@myhost:~$
User namespaces allow an unprivileged user to effectively become root within the containerized process. This allows a normal user to run containers using a concept called rootless containers, which we will cover in Chapter 9.
The general consensus is that user namespaces are a security benefit because fewer containers need to run as “real” root (that is, root from the host’s perspective). However, there have been a few vulnerabilities (for example, CVE-2018-18955) directly related to privileges being incorrectly transformed while transitioning to or from a user namespace. The Linux kernel is a complex piece of software, and you should expect that people will find problems in it from time to time.
User Namespace Restrictions in Docker
You can enable the use of user namespaces in Docker, but it’s not turned on by default because it is incompatible with a few things that Docker users might want to do.
The following will also affect you if you use user namespaces with other container runtimes:
-
User namespaces are incompatible with sharing a process ID or network namespace with the host.
-
Even if the process is running as root inside the container, it doesn’t really have full root privileges. It doesn’t, for example, have
CAP_NET_BIND_SERVICE
, so it can’t bind to a low-numbered port. (See Chapter 2 for more information about Linux capabilities.) -
When the containerized process interacts with a file, it will need appropriate permissions (for example, write access in order to modify the file). If the file is mounted from the host, it is the effective user ID on the host that matters.
This is a good thing in terms of protecting the host files from unauthorized access from within a container, but it can be confusing if, say, what appears to be root inside the container is not permitted to modify a file.
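A quick way to see this effective-ID behavior is to create a file from inside a user namespace where you appear to be root, and then look at its ownership from the host, continuing the earlier example where root in the namespace maps to the vagrant user (the file mode and timestamp shown here are illustrative):
nobody@myhost:~$ touch /tmp/created-in-userns
nobody@myhost:~$ exit
vagrant@myhost:~$ ls -l /tmp/created-in-userns
-rw-rw-r-- 1 vagrant vagrant 0 Oct 10 14:02 /tmp/created-in-userns
Even though the file was created by what looked like root, on the host it belongs to the unprivileged vagrant user, which is why host-mounted files keep their host-level protection.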
Inter-process Communications Namespace
In Linux it’s possible to communicate between different processes by giving them access to a shared range of memory, or by using a shared message queue. The two processes need to be members of the same inter-process communications (IPC) namespace for them to have access to the same set of identifiers for these mechanisms.
Generally speaking, you don’t want your containers to be able to access one another’s shared memory, so they are given their own IPC namespaces.
You can see this in action by creating a shared memory block and then viewing the current IPC status with ipcs
:
$ ipcmk -M 1000
Shared memory id: 98307
$ ipcs

------ Message Queues --------
key        msqid      owner      perms      used-bytes   messages

------ Shared Memory Segments --------
key        shmid      owner      perms      bytes      nattch     status
0x00000000 0          root       644        80         2
0x00000000 32769      root       644        16384      2
0x00000000 65538      root       644        280        2
0xad291bee 98307      ubuntu     644        1000       0

------ Semaphore Arrays --------
key        semid      owner      perms      nsems
0x000000a7 0          root       600        1
In this example, the newly created shared memory block (with its ID in the shmid
column) appears as the last item in the “Shared Memory Segments” block. There are also some preexisting IPC objects that had previously been created by root.
A process with its own IPC namespace does not see any of these IPC objects:
$ sudo unshare --ipc sh
$ ipcs

------ Message Queues --------
key        msqid      owner      perms      used-bytes   messages

------ Shared Memory Segments --------
key        shmid      owner      perms      bytes      nattch     status

------ Semaphore Arrays --------
key        semid      owner      perms      nsems
Cgroup Namespace
The last of the namespaces (at least, at the time of writing this book) is the cgroup namespace. This is a little bit like a chroot for the cgroup filesystem; it stops a process from seeing the cgroup configuration higher up in the hierarchy of cgroup directories than its own cgroup.
Note
Most namespaces were added by Linux kernel version 3.8, but the cgroup namespace was added later in version 4.6. If you’re using a relatively old distribution of Linux (such as Ubuntu 16.04), you won’t have support for this feature. You can check the kernel version on your Linux host by running uname -r
.
You can see the cgroup namespace in action by comparing the contents of /proc/self/cgroup
outside and then inside a cgroup namespace:
vagrant@myhost:~$ cat /proc/self/cgroup
12:cpu,cpuacct:/
11:cpuset:/
10:hugetlb:/
9:blkio:/
8:memory:/user.slice/user-1000.slice/session-51.scope
7:pids:/user.slice/user-1000.slice/session-51.scope
6:freezer:/
5:devices:/user.slice
4:net_cls,net_prio:/
3:rdma:/
2:perf_event:/
1:name=systemd:/user.slice/user-1000.slice/session-51.scope
0::/user.slice/user-1000.slice/session-51.scope
vagrant@myhost:~$
vagrant@myhost:~$ sudo unshare --cgroup bash
root@myhost:~# cat /proc/self/cgroup
12:cpu,cpuacct:/
11:cpuset:/
10:hugetlb:/
9:blkio:/
8:memory:/
7:pids:/
6:freezer:/
5:devices:/
4:net_cls,net_prio:/
3:rdma:/
2:perf_event:/
1:name=systemd:/
0::/
You have now explored all the different types of namespace and have seen how they are used along with chroot
to isolate a process’s view of its surroundings. Combine this with what you learned about cgroups in the previous chapter, and you should have a good understanding of everything that’s needed to make what we call a “container.”
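As a recap, here is a sketch that strings several of these pieces together in a single command line. It is only an approximation of what a real container runtime does, but it combines namespaces, a changed root, and a fresh proc mount:
vagrant@myhost:~$ sudo unshare --uts --pid --fork --mount --ipc --net \
    chroot alpine sh -c 'mount -t proc proc /proc && hostname container1 && exec sh'
/ $ hostname
container1
/ $ exit
vagrant@myhost:~$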
Before moving on to the next chapter, it’s worth taking a look at a container from the perspective of the host it’s running on.
Container Processes from the Host Perspective
Although they are called containers, it might be more accurate to use the term “containerized processes.” A container is still a Linux process running on the host machine, but it has a limited view of that host machine, and it has access to only a subtree of the filesystem and perhaps to a limited set of resources restricted by cgroups. Because it’s really just a process, it exists within the context of the host operating system, and it shares the host’s kernel as shown in Figure 4-3.
You’ll see how this compares to virtual machines in the next chapter, but before that, let’s examine in more detail the extent to which a containerized process is isolated from the host, and from other containerized processes on that host, by trying some experiments on a Docker container. Start a container process based on Ubuntu (or your favorite Linux distribution) and run a shell in it, and then run a long sleep
in it as follows:
$ docker run --rm -it ubuntu bash
root@1551d24a $ sleep 1000
This example runs the sleep
command for 1,000 seconds, but note that the sleep
command is running as a process inside the container. When you press Enter at the end of the sleep
command, this triggers Linux to clone a new process with a new process ID and to run the sleep
executable within that process.
You can put the sleep process into the background (Ctrl-Z
to pause the process, and bg %1
to background it). Now run ps
inside the container to see the same process from the container’s perspective:
me@myhost:~$ docker run --rm -it ubuntu bash
root@ab6ea36fce8e:/$ sleep 1000
^Z
[1]+  Stopped                 sleep 1000
root@ab6ea36fce8e:/$ bg %1
[1]+ sleep 1000 &
root@ab6ea36fce8e:/$ ps
  PID TTY          TIME CMD
    1 pts/0    00:00:00 bash
   10 pts/0    00:00:00 sleep
   11 pts/0    00:00:00 ps
root@ab6ea36fce8e:/$
While that sleep
command is still running, open a second terminal into the same host and look at the same sleep process from the host’s perspective:
me@myhost:~$ ps -C sleep
  PID TTY          TIME CMD
30591 pts/0    00:00:00 sleep
The -C sleep
parameter specifies that we are interested only in processes running the sleep
executable.
The container has its own process ID namespace, so it makes sense that its processes would have low numbers, and that is indeed what you see when running ps
in the container. From the host’s perspective, however, the sleep process has a different, high-numbered process ID. In the preceding example, there is just one process, and it has ID 30591 on the host and 10 in the container. (The actual number will vary according to what else is and has been running on the same machine, but it’s likely to be a much higher number.)
To get a good understanding of containers and the level of isolation they provide, it’s really key to get to grips with the fact that although there are two different process IDs, they both refer to the same process. It’s just that from the host’s perspective it has a higher process ID number.
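One way to convince yourself that the two IDs really do refer to the same process is to look at the NSpid field in the process’s status file on the host, which lists its ID in each nested PID namespace. This field is available on reasonably recent kernels; 30591 is the host-side PID from the example above:
me@myhost:~$ grep NSpid /proc/30591/status
NSpid:  30591   10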
The fact that container processes are visible from the host is one of the fundamental differences between containers and virtual machines. An attacker who gets access to the host can observe and affect all the containers running on that host, especially if they have root access. And as you’ll see in Chapter 9, there are some remarkably easy ways you can inadvertently make it possible for an attacker to move from a compromised container onto the host.
Container Host Machines
As you have seen, containers and their host share a kernel, and this has some consequences for what are considered best practices relating to the host machines for containers. If a host gets compromised, all the containers on that host are potential victims, especially if the attacker gains root or otherwise elevated privileges (such as being a member of the docker
group that can administer containers where Docker is used as the runtime).
It’s highly recommended to run container applications on dedicated host machines (whether they be VMs or bare metal), and the reasons mostly relate to security:
-
Using an orchestrator to run containers means that humans need little or no access to the hosts. If you don’t run any other applications, you will need a very small set of user identities on the host machines. These will be easier to manage, and attempts to log in as an unauthorized user will be easier to spot.
-
You can use any Linux distribution as the host OS for running Linux containers, but there are several “Thin OS” distros specifically designed for running containers. These reduce the host attack surface by including only the components required to run containers. Examples include RancherOS, Red Hat’s Fedora CoreOS, and VMware’s Photon OS. With fewer components included in the host machine, there is a smaller chance of vulnerabilities (see Chapter 7) in those components.
-
All the host machines in a cluster can share the same configuration, with no application-specific requirements. This makes it easy to automate the provisioning of host machines, and it means you can treat host machines as immutable. If a host machine needs an upgrade, you don’t patch it; instead, you remove it from the cluster and replace it with a freshly installed machine. Treating hosts as immutable makes intrusions easier to detect.
I’ll come back to the advantages of immutability in Chapter 6.
Using a Thin OS reduces the set of configuration options but doesn’t eliminate them completely. For example, you will have a container runtime (perhaps Docker) plus orchestrator code (perhaps the Kubernetes kubelet) running on every host. These components have numerous settings, some of which affect security. The Center for Internet Security (CIS) publishes benchmarks for best practices for configuring and running various software components, including Docker, Kubernetes, and Linux.
In an enterprise environment, look for a container security solution that also protects the hosts by reporting on vulnerabilities and worrisome configuration settings. You will also want logs and alerts for logins and login attempts at the host level.
Summary
Congratulations! Since you’ve reached the end of this chapter, you should now know what a container really is. You’ve seen the three essential Linux kernel mechanisms that are used to limit a process’s access to host resources:
-
Namespaces limit what the container process can see—for example, by giving the container an isolated set of process IDs.
-
Changing the root limits the set of files and directories that the container can see.
-
Control groups (cgroups) limit the resources, such as memory and CPU, that the container process can use.
As you saw in Chapter 1, isolating one workload from another is an important aspect of container security. You now should be fully aware that all the containers on a given host (whether it is a virtual machine or a bare-metal server) share the same kernel. Of course, the same is true in a multiuser system where different users can log in to the same machine and run applications directly. However, in a multiuser system, the administrators are likely to limit the permissions given to each user; they certainly won’t give them all root privileges. With containers—at least at the time of writing—they all run as root by default and are relying on the boundary provided by namespaces, changed root directories, and cgroups to prevent one container from interfering with another.
Note
Now that you know how containers work, you might want to explore Jess Frazelle’s contained.af site to see just how effective they are. Will you be the person who breaks the containment?
In Chapter 8 we’ll explore options for strengthening the security boundary around each container, but next let’s delve into how virtual machines work. This will allow you to consider the relative strengths of the isolation between containers and between VMs, especially through the lens of security.