Memory mapping is one of the most interesting features of modern Unix systems. As far as drivers are concerned, memory mapping can be used to provide user programs with direct access to device memory.
A definitive example of mmap usage can be seen by looking at a subset of the virtual memory areas for the X Window System server:
cat /proc/731/maps
08048000-08327000 r-xp 00000000 08:01 55505 /usr/X11R6/bin/XF86_SVGA
08327000-08369000 rw-p 002de000 08:01 55505 /usr/X11R6/bin/XF86_SVGA
40015000-40019000 rw-s fe2fc000 08:01 10778 /dev/mem
40131000-40141000 rw-s 000a0000 08:01 10778 /dev/mem
40141000-40941000 rw-s f4000000 08:01 10778 /dev/mem
...
The full list of the X server’s VMAs is lengthy, but most of the entries are not of interest here. We do see, however, three separate mappings of /dev/mem, which give some insight into how the X server works with the video card. The first mapping shows a 16 KB region mapped at fe2fc000. This address is far above the highest RAM address on the system; it is, instead, a region of memory on a PCI peripheral (the video card). It will be a control region for that card. The middle mapping is at a0000, which is the standard location for video RAM in the 640 KB ISA hole. The last /dev/mem mapping is a rather larger one at f4000000 and is the video memory itself. These regions can also be seen in /proc/iomem:
000a0000-000bffff : Video RAM area
f4000000-f4ffffff : Matrox Graphics, Inc. MGA G200 AGP
fe2fc000-fe2fffff : Matrox Graphics, Inc. MGA G200 AGP
Mapping a device means associating a range of user-space addresses to device memory. Whenever the program reads or writes in the assigned address range, it is actually accessing the device. In the X server example, using mmap allows quick and easy access to the video card’s memory. For a performance-critical application like this, direct access makes a large difference.
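Seen from user space, a mapping like the X server’s is created with the ordinary mmap(2) call. The sketch below exercises the same call pattern against a temporary regular file instead of /dev/mem, so that it can run without root privileges; the file path and the helper name are invented for illustration.

```c
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Create a scratch file standing in for a device node, map it with
 * mmap, and copy the first n mapped bytes into out.  Reads through
 * the returned pointer go straight to the file, just as reads of a
 * mapped device region go straight to device memory.
 * Returns 0 on success, -1 on failure. */
int demo_mmap(const char *path, char *out, size_t n)
{
    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;
    if (write(fd, "hello, mapping", 14) != 14) {
        close(fd);
        return -1;
    }

    /* Map one page, shared and read-only. */
    char *map = mmap(NULL, 4096, PROT_READ, MAP_SHARED, fd, 0);
    if (map == MAP_FAILED) {
        close(fd);
        return -1;
    }

    memcpy(out, map, n);   /* no read() needed: plain memory access */

    munmap(map, 4096);
    close(fd);
    unlink(path);
    return 0;
}
```

With MAP_SHARED, stores through the mapping (had we asked for PROT_WRITE) would propagate back to the underlying object, which is exactly the behavior a driver’s mmap method provides for device memory.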
As you might suspect, not every device lends itself to the mmap abstraction; it makes no sense, for instance, for serial ports and other stream-oriented devices. Another limitation of mmap is that mapping is PAGE_SIZE grained. The kernel can manage virtual addresses only at the level of page tables; therefore, the mapped area must be a multiple of PAGE_SIZE and must live in physical memory starting at an address that is a multiple of PAGE_SIZE. The kernel accommodates size granularity by making a region slightly bigger if its size isn’t a multiple of the page size.
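The rounding the kernel performs can be written out explicitly. This is a user-space sketch of the arithmetic, not kernel code; a page size of 4096 bytes is assumed here.

```c
#define DEMO_PAGE_SIZE 4096UL   /* stand-in for the kernel's PAGE_SIZE */

/* Round a requested mapping length up to a whole number of pages,
 * the way the kernel pads a region whose size isn't page aligned. */
unsigned long round_to_pages(unsigned long len)
{
    return (len + DEMO_PAGE_SIZE - 1) & ~(DEMO_PAGE_SIZE - 1);
}
```

A request for a single byte thus consumes a full page of address space, and a request of exactly one page is left untouched.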
These limits are not a big constraint for drivers, because the program accessing the device is device dependent anyway. It needs to know how to make sense of the memory region being mapped, so the PAGE_SIZE alignment is not a problem. A bigger constraint exists when ISA devices are used on some non-x86 platforms, because their hardware view of ISA may not be contiguous. For example, some Alpha computers see ISA memory as a scattered set of 8-bit, 16-bit, or 32-bit items, with no direct mapping. In such cases, you can’t use mmap at all. The inability to perform direct mapping of ISA addresses to Alpha addresses is due to the incompatible data transfer specifications of the two systems. Whereas early Alpha processors could issue only 32-bit and 64-bit memory accesses, ISA can do only 8-bit and 16-bit transfers, and there’s no way to transparently map one protocol onto the other.
There are sound advantages to using mmap when it’s feasible to do so. For instance, we have already looked at the X server, which transfers a lot of data to and from video memory; mapping the graphic display to user space dramatically improves the throughput, as opposed to an lseek/write implementation. Another typical example is a program controlling a PCI device. Most PCI peripherals map their control registers to a memory address, and a demanding application might prefer to have direct access to the registers instead of repeatedly having to call ioctl to get its work done.
The mmap method is part of the file_operations structure and is invoked when the mmap system call is issued. With mmap, the kernel performs a good deal of work before the actual method is invoked, and therefore the prototype of the method is quite different from that of the system call. This is unlike calls such as ioctl and poll, where the kernel does not do much before calling the method.
The system call is declared as follows (as described in the mmap(2) manual page):
mmap (caddr_t addr, size_t len, int prot, int flags, int fd, off_t offset)
On the other hand, the file operation is declared as
int (*mmap) (struct file *filp, struct vm_area_struct *vma);
The filp argument in the method is the same as that introduced in Chapter 3, while vma contains the information about the virtual address range that is used to access the device. Much of the work has thus been done by the kernel; to implement mmap, the driver only has to build suitable page tables for the address range and, if necessary, replace vma->vm_ops with a new set of operations.
There are two ways of building the page tables: doing it all at once with a function called remap_page_range, or doing it a page at a time via the nopage VMA method. Both methods have their advantages. We’ll start with the “all at once” approach, which is simpler. From there we will start adding the complications needed for a real-world implementation.
The job of building new page tables to map a range of physical addresses is handled by remap_page_range, which has the following prototype:
int remap_page_range(unsigned long virt_add, unsigned long phys_add, unsigned long size, pgprot_t prot);
The value returned by the function is the usual 0 or a negative error code. Let’s look at the exact meaning of the function’s arguments:
- virt_add: The user virtual address where remapping should begin. The function builds page tables for the virtual address range between virt_add and virt_add+size.
- phys_add: The physical address to which the virtual address should be mapped. The function affects physical addresses from phys_add to phys_add+size.
- size: The dimension, in bytes, of the area being remapped.
- prot: The “protection” requested for the new VMA. The driver can (and should) use the value found in vma->vm_page_prot.
The arguments to remap_page_range are fairly straightforward, and most of them are already provided to you in the VMA when your mmap method is called. The one complication has to do with caching: usually, references to device memory should not be cached by the processor. Often the system BIOS will set things up properly, but it is also possible to disable caching of specific VMAs via the protection field. Unfortunately, disabling caching at this level is highly processor dependent. The curious reader may wish to look at the function pgprot_noncached from drivers/char/mem.c to see what’s involved. We won’t discuss the topic further here.
If your driver needs to do a simple, linear mapping of device memory into a user address space, remap_page_range is almost all you really need to do the job. The following code comes from drivers/char/mem.c and shows how this task is performed in a typical module called simple (Simple Implementation Mapping Pages with Little Enthusiasm):
#include <linux/mm.h>

int simple_mmap(struct file *filp, struct vm_area_struct *vma)
{
    unsigned long offset = vma->vm_pgoff << PAGE_SHIFT;

    if (offset >= __pa(high_memory) || (filp->f_flags & O_SYNC))
        vma->vm_flags |= VM_IO;
    vma->vm_flags |= VM_RESERVED;

    if (remap_page_range(vma->vm_start, offset,
                         vma->vm_end - vma->vm_start,
                         vma->vm_page_prot))
        return -EAGAIN;
    return 0;
}
The /dev/mem code checks to see if the requested offset (stored in vma->vm_pgoff) is beyond physical memory; if so, the VM_IO VMA flag is set to mark the area as being I/O memory. The VM_RESERVED flag is always set to keep the system from trying to swap this area out. Then it is just a matter of calling remap_page_range to create the necessary page tables.
As we have seen, the vm_area_struct structure contains a set of operations that may be applied to the VMA. Now we’ll look at providing those operations in a simple way; a more detailed example will follow later on.
Here, we will provide open and close operations for our VMA. These operations will be called anytime a process opens or closes the VMA; in particular, the open method will be invoked anytime a process forks and creates a new reference to the VMA. The open and close VMA methods are called in addition to the processing performed by the kernel, so they need not reimplement any of the work done there. They exist as a way for drivers to do any additional processing that they may require.
We’ll use these methods to increment the module usage count whenever the VMA is opened, and to decrement it when it’s closed. In modern kernels, this work is not strictly necessary; the kernel will not call the driver’s release method as long as a VMA remains open, so the usage count will not drop to zero until all references to the VMA are closed. The 2.0 kernel, however, did not perform this tracking, so portable code will still want to be able to maintain the usage count.
So, we will override the default vma->vm_ops with operations that keep track of the usage count. The code is quite simple; a complete mmap implementation for a modularized /dev/mem looks like the following:
void simple_vma_open(struct vm_area_struct *vma)
{
    MOD_INC_USE_COUNT;
}

void simple_vma_close(struct vm_area_struct *vma)
{
    MOD_DEC_USE_COUNT;
}

static struct vm_operations_struct simple_remap_vm_ops = {
    open:  simple_vma_open,
    close: simple_vma_close,
};

int simple_remap_mmap(struct file *filp, struct vm_area_struct *vma)
{
    unsigned long offset = VMA_OFFSET(vma);

    if (offset >= __pa(high_memory) || (filp->f_flags & O_SYNC))
        vma->vm_flags |= VM_IO;
    vma->vm_flags |= VM_RESERVED;

    if (remap_page_range(vma->vm_start, offset,
                         vma->vm_end - vma->vm_start,
                         vma->vm_page_prot))
        return -EAGAIN;

    vma->vm_ops = &simple_remap_vm_ops;
    simple_vma_open(vma);
    return 0;
}
This code relies on the fact that the kernel initializes the vm_ops field in the newly created area to NULL before calling f_op->mmap, so the assignment simply replaces an empty pointer. A defensive implementation could check the current value of the pointer before overwriting it, should something change in future kernels.
The strange VMA_OFFSET macro that appears in this code is used to hide a difference in the vma structure across kernel versions. Since the offset is a number of pages in 2.4 and a number of bytes in 2.2 and earlier kernels, <sysdep.h> declares the macro to make the difference transparent (and the result is expressed in bytes).
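The arithmetic behind such a macro can be illustrated in plain user-space C. The structure, macro, and constant below are stand-ins rather than the kernel’s definitions; they show only the 2.4-style conversion from a page offset to a byte offset (on 2.2 and earlier the field was already in bytes and the macro would return it unchanged).

```c
#define DEMO_PAGE_SHIFT 12   /* assumes 4096-byte pages */

/* Toy stand-in for the 2.4 vm_area_struct field of interest. */
struct demo_vma {
    unsigned long vm_pgoff;  /* 2.4-style: offset counted in pages */
};

/* Convert the page count into a byte offset, as a 2.4-era
 * VMA_OFFSET would. */
#define DEMO_VMA_OFFSET(vma) ((vma)->vm_pgoff << DEMO_PAGE_SHIFT)
```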
Although remap_page_range works well for many, if not most, driver mmap implementations, sometimes it is necessary to be a little more flexible. In such situations, an implementation using the nopage VMA method may be called for.
The nopage method, remember, has the following prototype:
struct page *(*nopage)(struct vm_area_struct *vma, unsigned long address, int write_access);
When a user process attempts to access a page in a VMA that is not present in memory, the associated nopage function is called. The address parameter will contain the virtual address that caused the fault, rounded down to the beginning of the page. The nopage function must locate and return the struct page pointer that refers to the page the user wanted. This function must also take care to increment the usage count for the page it returns by calling the get_page macro:
get_page(struct page *pageptr);
This step is necessary to keep the reference counts correct on the mapped pages. The kernel maintains this count for every page; when the count goes to zero, the kernel knows that the page may be placed on the free list. When a VMA is unmapped, the kernel will decrement the usage count for every page in the area. If your driver does not increment the count when adding a page to the area, the usage count will become zero prematurely and the integrity of the system will be compromised.
One situation in which the nopage approach is useful can be brought about by the mremap system call, which is used by applications to change the bounding addresses of a mapped region. If the driver wants to be able to deal with mremap, the previous implementation won’t work correctly, because there’s no way for the driver to know that the mapped region has changed.
The Linux implementation of mremap doesn’t notify the driver of changes in the mapped area. Actually, it does notify the driver if the size of the area is reduced via the unmap method, but no callback is issued if the area increases in size.
The basic idea behind notifying the driver of a reduction is that the driver (or the filesystem mapping a regular file to memory) needs to know when a region is unmapped in order to take the proper action, such as flushing pages to disk. Growth of the mapped region, on the other hand, isn’t really meaningful for the driver until the program invoking mremap accesses the new virtual addresses. In real life, it’s quite common to map regions that are never used (unused sections of program code, for example). The Linux kernel, therefore, doesn’t notify the driver if the mapped region grows, because the nopage method will take care of pages one at a time as they are actually accessed.
In other words, the driver isn’t notified when a mapping grows because nopage will do it later, without having to use memory before it is actually needed. This optimization is mostly aimed at regular files, whose mapping uses real RAM.
The nopage method, therefore, must be implemented if you want to support the mremap system call. But once you have nopage, you can choose to use it extensively, with some limitations (described later). This method is shown in the next code fragment. In this implementation of mmap, the device method only replaces vma->vm_ops. The nopage method takes care of “remapping” one page at a time and returning the address of its struct page structure. Because we are just implementing a window onto physical memory here, the remapping step is simple: we need only locate and return a pointer to the struct page for the desired address. An implementation of /dev/mem using nopage looks like the following:
struct page *simple_vma_nopage(struct vm_area_struct *vma,
                unsigned long address, int write_access)
{
    struct page *pageptr;
    unsigned long physaddr = address - vma->vm_start + VMA_OFFSET(vma);

    pageptr = virt_to_page(__va(physaddr));
    get_page(pageptr);
    return pageptr;
}

static struct vm_operations_struct simple_nopage_vm_ops = {
    open:   simple_vma_open,
    close:  simple_vma_close,
    nopage: simple_vma_nopage,
};

int simple_nopage_mmap(struct file *filp, struct vm_area_struct *vma)
{
    unsigned long offset = VMA_OFFSET(vma);

    if (offset >= __pa(high_memory) || (filp->f_flags & O_SYNC))
        vma->vm_flags |= VM_IO;
    vma->vm_flags |= VM_RESERVED;

    vma->vm_ops = &simple_nopage_vm_ops;
    simple_vma_open(vma);
    return 0;
}
Since, once again, we are simply mapping main memory here, the nopage function need only find the correct struct page for the faulting address and increment its reference count. The required sequence of events is thus to calculate the desired physical address, turn it into a logical address with __va, and then finally to turn it into a struct page with virt_to_page. It would be possible, in general, to go directly from the physical address to the struct page, but such code would be difficult to make portable across architectures. Such code might be necessary, however, if one were trying to map high memory, which, remember, has no logical addresses. simple, being simple, does not worry about that (rare) case.
If the nopage method is left NULL, kernel code that handles page faults maps the zero page to the faulting virtual address. The zero page is a copy-on-write page that reads as zero and that is used, for example, to map the BSS segment. Therefore, if a process extends a mapped region by calling mremap, and the driver hasn’t implemented nopage, it will end up with zero pages instead of a segmentation fault.
The nopage method normally returns a pointer to a struct page. If, for some reason, a normal page cannot be returned (e.g., the requested address is beyond the device’s memory region), NOPAGE_SIGBUS can be returned to signal the error. nopage can also return NOPAGE_OOM to indicate failures caused by resource limitations.
Note that this implementation will work for ISA memory regions but not for those on the PCI bus. PCI memory is mapped above the highest system memory, and there are no entries in the system memory map for those addresses. Because there is thus no struct page to return a pointer to, nopage cannot be used in these situations; you must, instead, use remap_page_range.
All the examples we’ve seen so far are reimplementations of /dev/mem; they remap physical addresses into user space. The typical driver, however, wants to map only the small address range that applies to its peripheral device, not all of memory. In order to map to user space only a subset of the whole memory range, the driver needs only to play with the offsets. The following lines will do the trick for a driver mapping a region of simple_region_size bytes, beginning at physical address simple_region_start (which should be page aligned):
unsigned long off = vma->vm_pgoff << PAGE_SHIFT;
unsigned long physical = simple_region_start + off;
unsigned long vsize = vma->vm_end - vma->vm_start;
unsigned long psize = simple_region_size - off;

if (vsize > psize)
    return -EINVAL; /* spans too high */
remap_page_range(vma->vm_start, physical, vsize, vma->vm_page_prot);
In addition to calculating the offsets, this code introduces a check that reports an error when the program tries to map more memory than is available in the I/O region of the target device. In this code, psize is the physical I/O size that is left after the offset has been specified, and vsize is the requested size of virtual memory; the function refuses to map addresses that extend beyond the allowed memory range.
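The same bounds check can be exercised outside the kernel. The constants and the function name below are invented for illustration; the function mirrors the vsize/psize comparison for a hypothetical 64 KB device region, returning -1 where the driver would return -EINVAL.

```c
#define DEMO_REGION_SIZE 0x10000UL  /* hypothetical 64 KB I/O region */
#define DEMO_PAGE_SHIFT  12         /* assumes 4096-byte pages */

/* Return 0 if a mapping of vsize bytes starting at page offset pgoff
 * fits inside the region, or -1 (standing in for -EINVAL) if it would
 * span past the end of the device's memory. */
int check_window(unsigned long pgoff, unsigned long vsize)
{
    unsigned long off = pgoff << DEMO_PAGE_SHIFT;
    unsigned long psize;

    if (off >= DEMO_REGION_SIZE)
        return -1;                    /* offset already out of range */
    psize = DEMO_REGION_SIZE - off;   /* device memory left after off */
    return (vsize > psize) ? -1 : 0;
}
```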
Note that the user process can always use mremap to extend its mapping, possibly past the end of the physical device area. If your driver has no nopage method, it will never be notified of this extension, and the additional area will map to the zero page. As a driver writer, you may well want to prevent this sort of behavior; mapping the zero page onto the end of your region is not an explicitly bad thing to do, but it is highly unlikely that the programmer wanted that to happen.
The simplest way to prevent extension of the mapping is to implement a simple nopage method that always causes a bus signal to be sent to the faulting process. Such a method would look like this:
struct page *simple_nopage(struct vm_area_struct *vma,
                unsigned long address, int write_access)
{
    return NOPAGE_SIGBUS; /* send a SIGBUS */
}
Of course, a more thorough implementation could check to see if the faulting address is within the device area, and perform the remapping if that is the case. Once again, however, nopage will not work with PCI memory areas, so extension of PCI mappings is not possible.
In Linux, a page of physical addresses is marked as “reserved” in the memory map to indicate that it is not available for memory management. On the PC, for example, the range between 640 KB and 1 MB is marked as reserved, as are the pages that host the kernel code itself.
An interesting limitation of remap_page_range is that it gives access only to reserved pages and physical addresses above the top of physical memory. Reserved pages are locked in memory and are the only ones that can be safely mapped to user space; this limitation is a basic requirement for system stability.
Therefore, remap_page_range won’t allow you to remap conventional addresses—which include the ones you obtain by calling get_free_page. Instead, it will map in the zero page. Nonetheless, the function does everything that most hardware drivers need it to, because it can remap high PCI buffers and ISA memory.
The limitations of remap_page_range can be seen by running mapper, one of the sample programs in misc-progs in the files provided on the O’Reilly FTP site. mapper is a simple tool that can be used to quickly test the mmap system call; it maps read-only parts of a file based on the command-line options and dumps the mapped region to standard output. The following session, for instance, shows that /dev/mem doesn’t map the physical page located at address 64 KB; instead we see a page full of zeros (the host computer in this example is a PC, but the result would be the same on other platforms):
morgana.root# ./mapper /dev/mem 0x10000 0x1000 | od -Ax -t x1
mapped "/dev/mem" from 65536 to 69632
000000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
*
001000
The inability of remap_page_range to deal with RAM suggests that a device like scullp can’t easily implement mmap, because its device memory is conventional RAM, not I/O memory. Fortunately, a relatively easy workaround is available to any driver that needs to map RAM into user space; it uses the nopage method that we have seen earlier.
The way to map real RAM to user space is to use vm_ops->nopage to deal with page faults one at a time. A sample implementation is part of the scullp module, introduced in Chapter 7. scullp is the page-oriented char device. Because it is page oriented, it can implement mmap on its memory. The code implementing memory mapping uses some of the concepts introduced earlier in Section 13.1.
Before examining the code, let’s look at the design choices that affect the mmap implementation in scullp.
- scullp doesn’t release device memory as long as the device is mapped. This is a matter of policy rather than a requirement, and it is different from the behavior of scull and similar devices, which are truncated to a length of zero when opened for writing. Refusing to free a mapped scullp device allows a process to overwrite regions actively mapped by another process, so you can test and see how processes and device memory interact. To avoid releasing a mapped device, the driver must keep a count of active mappings; the vmas field in the device structure is used for this purpose.
- Memory mapping is performed only when the scullp order parameter is 0. The parameter controls how get_free_pages is invoked (see Chapter 7, Section 7.3). This choice is dictated by the internals of get_free_pages, the allocation engine exploited by scullp. To maximize allocation performance, the Linux kernel maintains a list of free pages for each allocation order, and only the page count of the first page in a cluster is incremented by get_free_pages and decremented by free_pages. The mmap method is disabled for a scullp device if the allocation order is greater than zero, because nopage deals with single pages rather than clusters of pages. (Return to Section 7.3.1 in Chapter 7 if you need a refresher on scullp and the memory allocation order value.)
The last choice is mostly intended to keep the code simple. It is possible to correctly implement mmap for multipage allocations by playing with the usage count of the pages, but it would only add to the complexity of the example without introducing any interesting information.
Code that is intended to map RAM according to the rules just outlined needs to implement open, close, and nopage; it also needs to access the memory map to adjust the page usage counts.
This implementation of scullp_mmap is very short, because it relies on the nopage function to do all the interesting work:
int scullp_mmap(struct file *filp, struct vm_area_struct *vma)
{
    struct inode *inode = INODE_FROM_F(filp);

    /* refuse to map if order is not 0 */
    if (scullp_devices[MINOR(inode->i_rdev)].order)
        return -ENODEV;

    /* don't do anything here: "nopage" will fill the holes */
    vma->vm_ops = &scullp_vm_ops;
    vma->vm_flags |= VM_RESERVED;
    vma->vm_private_data = scullp_devices + MINOR(inode->i_rdev);
    scullp_vma_open(vma);
    return 0;
}
The purpose of the leading conditional is to avoid mapping devices whose allocation order is not 0. scullp’s operations are stored in the vm_ops field, and a pointer to the device structure is stashed in the vm_private_data field. At the end, vm_ops->open is called to update the usage count for the module and the count of active mappings for the device.
open and close simply keep track of these counts and are defined as follows:
void scullp_vma_open(struct vm_area_struct *vma)
{
    ScullP_Dev *dev = scullp_vma_to_dev(vma);
    dev->vmas++;
    MOD_INC_USE_COUNT;
}

void scullp_vma_close(struct vm_area_struct *vma)
{
    ScullP_Dev *dev = scullp_vma_to_dev(vma);
    dev->vmas--;
    MOD_DEC_USE_COUNT;
}
The function scullp_vma_to_dev simply returns the contents of the vm_private_data field. It exists as a separate function because kernel versions prior to 2.4 lacked that field, requiring that other means be used to get that pointer. See Section 13.5 at the end of this chapter for details.
Most of the work is then performed by nopage. In the scullp implementation, the address parameter to nopage is used to calculate an offset into the device; the offset is then used to look up the correct page in the scullp memory tree:
struct page *scullp_vma_nopage(struct vm_area_struct *vma,
                unsigned long address, int write)
{
    unsigned long offset;
    ScullP_Dev *ptr, *dev = scullp_vma_to_dev(vma);
    struct page *page = NOPAGE_SIGBUS;
    void *pageptr = NULL; /* default to "missing" */

    down(&dev->sem);
    offset = (address - vma->vm_start) + VMA_OFFSET(vma);
    if (offset >= dev->size) goto out; /* out of range */

    /*
     * Now retrieve the scullp device from the list, then the page.
     * If the device has holes, the process receives a SIGBUS when
     * accessing the hole.
     */
    offset >>= PAGE_SHIFT; /* offset is a number of pages */
    for (ptr = dev; ptr && offset >= dev->qset;) {
        ptr = ptr->next;
        offset -= dev->qset;
    }
    if (ptr && ptr->data) pageptr = ptr->data[offset];
    if (!pageptr) goto out; /* hole or end-of-file */
    page = virt_to_page(pageptr);

    /* got it, now increment the count */
    get_page(page);
out:
    up(&dev->sem);
    return page;
}
scullp uses memory obtained with get_free_pages. That memory is addressed using logical addresses, so all scullp_nopage has to do to get a struct page pointer is to call virt_to_page.
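The list walk at the heart of this nopage method can be tried as ordinary user-space code. The toy structure below is an invented stand-in for the scullp device list, with each node holding qset pages; it demonstrates only how the page offset is consumed node by node until it falls inside one node (or runs past the end of the list, the "hole or end-of-file" case above).

```c
#include <stddef.h>

/* Toy stand-in for the scullp device list. */
struct demo_dev {
    struct demo_dev *next;
    unsigned long qset;       /* number of pages held by this node */
};

/* Walk the list the way scullp_vma_nopage does: skip whole nodes
 * until the page offset falls inside one.  On return, *offset is the
 * index within the returned node; NULL means the offset ran past the
 * end of the list. */
struct demo_dev *find_node(struct demo_dev *dev, unsigned long *offset)
{
    struct demo_dev *ptr;

    for (ptr = dev; ptr && *offset >= ptr->qset; ) {
        *offset -= ptr->qset;
        ptr = ptr->next;
    }
    return ptr;
}
```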
The scullp device now works as expected, as you can see in this sample output from the mapper utility. Here we send a directory listing of /dev (which is long) to the scullp device, and then use the mapper utility to look at pieces of that listing with mmap:
morgana% ls -l /dev > /dev/scullp
morgana% ./mapper /dev/scullp 0 140
mapped "/dev/scullp" from 0 to 140
total 77
-rwxr-xr-x    1 root     root        26689 Mar  2  2000 MAKEDEV
crw-rw-rw-    1 root     root      14,  14 Aug 10 20:55 admmidi0
morgana% ./mapper /dev/scullp 8192 200
mapped "/dev/scullp" from 8192 to 8392
0
crw-------    1 root     root     113,   1 Mar 26  1999 cum1
crw-------    1 root     root     113,   2 Mar 26  1999 cum2
crw-------    1 root     root     113,   3 Mar 26  1999 cum3
Although it’s rarely necessary, it’s interesting to see how a driver can map a virtual address to user space using mmap. A true virtual address, remember, is an address returned by a function like vmalloc or kmap—that is, a virtual address mapped in the kernel page tables. The code in this section is taken from scullv, which is the module that works like scullp but allocates its storage through vmalloc.
Most of the scullv implementation is like the one we’ve just seen for scullp, except that there is no need to check the order parameter that controls memory allocation. The reason for this is that vmalloc allocates its pages one at a time, because single-page allocations are far more likely to succeed than multipage allocations. Therefore, the allocation order problem doesn’t apply to vmalloced space.
Most of the work of vmalloc is building page tables to access allocated pages as a continuous address range. The nopage method, instead, must pull the page tables back apart in order to return a struct page pointer to the caller. Therefore, the nopage implementation for scullv must scan the page tables to retrieve the page map entry associated with the page. The function is similar to the one we saw for scullp, except at the end. This code excerpt only includes the part of nopage that differs from scullp:
pgd_t *pgd; pmd_t *pmd; pte_t *pte;
unsigned long lpage;

/*
 * After scullv lookup, "page" is now the address of the page
 * needed by the current process. Since it's a vmalloc address,
 * first retrieve the unsigned long value to be looked up
 * in page tables.
 */
lpage = VMALLOC_VMADDR(pageptr);
spin_lock(&init_mm.page_table_lock);
pgd = pgd_offset(&init_mm, lpage);
pmd = pmd_offset(pgd, lpage);
pte = pte_offset(pmd, lpage);
page = pte_page(*pte);
spin_unlock(&init_mm.page_table_lock);

/* got it, now increment the count */
get_page(page);
out:
up(&dev->sem);
return page;
The page tables are looked up using the functions introduced at the beginning of this chapter. The page directory used for this purpose is stored in the memory structure for kernel space, init_mm. Note that scullv obtains the page_table_lock prior to traversing the page tables. If that lock were not held, another processor could make a change to the page table while scullv was halfway through the lookup process, leading to erroneous results.
The macro VMALLOC_VMADDR(pageptr) returns the correct unsigned long value to be used in a page-table lookup from a vmalloc address. A simple cast of the value wouldn’t work on the x86 with kernels older than 2.1, because of a glitch in memory management. Memory management for the x86 changed in version 2.1.1, and VMALLOC_VMADDR is now defined as the identity function, as it has always been for the other platforms. Its use is still suggested, however, as a way of writing portable code.
Based on this discussion, you might also want to map addresses returned by ioremap to user space. This mapping is easily accomplished because you can use remap_page_range directly, without implementing methods for virtual memory areas. In other words, remap_page_range is already usable for building new page tables that map I/O memory to user space; there’s no need to look in the kernel page tables built by vremap as we did in scullv.