Documentation/mm/page_tables.rst

.. SPDX-License-Identifier: GPL-2.0

===========
Page Tables
===========

Paged virtual memory was invented along with virtual memory as a concept in
1962 on the Ferranti Atlas Computer which was the first computer with paged
virtual memory. The feature migrated to newer computers and became a de facto
feature of all Unix-like systems as time went by. In 1985 the feature was
included in the Intel 80386, which was the CPU Linux 1.0 was developed on.

Page tables map virtual addresses as seen by the CPU into physical addresses
as seen on the external memory bus.

Linux defines page tables as a hierarchy which is currently five levels in
height. The architecture code for each supported architecture will then
map this to the restrictions of the hardware.

The physical address corresponding to the virtual address is often referenced
by the underlying physical page frame. The **page frame number** or **pfn**
is the physical address of the page (as seen on the external memory bus)
divided by `PAGE_SIZE`.

Physical memory address 0 will be *pfn 0* and the highest pfn will be
the last page of physical memory the external address bus of the CPU can
address.

With a page granularity of 4KB and an address range of 32 bits, pfn 0 is at
address 0x00000000, pfn 1 is at address 0x00001000, pfn 2 is at 0x00002000
and so on until we reach pfn 0xfffff at 0xfffff000. With 16KB pages pfns are
at 0x00004000, 0x00008000 ... 0xffffc000 and pfn goes from 0 to 0x3ffff.

As you can see, with 4KB pages the page base address uses bits 12-31 of the
address, and this is why `PAGE_SHIFT` in this case is defined as 12 and
`PAGE_SIZE` is usually defined in terms of the page shift as `(1 << PAGE_SHIFT)`.
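
The pfn arithmetic described above can be sketched in plain C. This is only an
illustration that assumes 4KB pages; the `ex_` helper names are made up for the
example and are not kernel interfaces.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: 4KB pages, as in the example in the text. */
#define PAGE_SHIFT 12
#define PAGE_SIZE  (1UL << PAGE_SHIFT)

/* Physical address -> page frame number (hypothetical helper). */
static inline uint64_t ex_phys_to_pfn(uint64_t paddr)
{
	return paddr >> PAGE_SHIFT;
}

/* Page frame number -> physical base address of that frame. */
static inline uint64_t ex_pfn_to_phys(uint64_t pfn)
{
	return pfn << PAGE_SHIFT;
}

/* Offset of an address inside its page. */
static inline uint64_t ex_page_offset(uint64_t paddr)
{
	return paddr & (PAGE_SIZE - 1);
}
```

With these definitions `ex_phys_to_pfn(0xfffff000)` evaluates to 0xfffff and
`ex_pfn_to_phys(2)` to 0x00002000, matching the addresses listed above.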

Over time a deeper hierarchy has been developed in response to increasing memory
sizes. When Linux was created, 4KB pages and a single page table called
`swapper_pg_dir` with 1024 entries were used, covering 4MB which coincided
with the fact that Torvalds' first computer had 4MB of physical memory. Entries
in this single table were referred to as *PTEs* - page table entries.

The software page table hierarchy reflects the fact that page table hardware has
become hierarchical and that in turn is done to save page table memory and
speed up mapping.

One could of course imagine a single, linear page table with an enormous number
of entries, breaking down the whole memory into single pages. Such a page table
would be very sparse, because large portions of the virtual memory usually
remain unused. By using hierarchical page tables large holes in the virtual
address space do not waste valuable page table memory, because it will suffice
to mark large areas as unmapped at a higher level in the page table hierarchy.
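
To put a number on the cost of such a linear table, the following sketch
computes its size. The function name and its parameters are assumptions made
for this example only.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Bytes needed by a hypothetical single-level (linear) page table that
 * holds one entry for every page of a virtual address space.
 */
static inline uint64_t ex_linear_table_bytes(unsigned int va_bits,
					     unsigned int page_shift,
					     uint64_t entry_size)
{
	uint64_t entries = 1ULL << (va_bits - page_shift);

	return entries * entry_size;
}
```

For a 32-bit address space with 4KB pages and 4-byte entries this gives 4MB
per address space. For a 48-bit address space with 8-byte entries it gives
512GB, which is far more than most machines have as physical memory.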

Additionally, on modern CPUs, a higher level page table entry can point directly
to a physical memory range, which allows mapping a contiguous range of several
megabytes or even gigabytes in a single high-level page table entry, taking
shortcuts in mapping virtual memory to physical memory: there is no need to
traverse deeper in the hierarchy when you find a large mapped range like this.

The page table hierarchy has now developed into this::

  +-----+
  | PGD |
  +-----+
     |
     |   +-----+
     +-->| P4D |
         +-----+
            |
            |   +-----+
            +-->| PUD |
                +-----+
                   |
                   |   +-----+
                   +-->| PMD |
                       +-----+
                          |
                          |   +-----+
                          +-->| PTE |
                              +-----+


Symbols on the different levels of the page table hierarchy have the following
meaning beginning from the bottom:

- **pte**, `pte_t`, `pteval_t` = **Page Table Entry** - mentioned earlier.
  The *pte* is an array of `PTRS_PER_PTE` elements of the `pteval_t` type, each
  mapping a single page of virtual memory to a single page of physical memory.
  The architecture defines the size and contents of `pteval_t`.

  A typical example is that the `pteval_t` is a 32- or 64-bit value with the
  upper bits being a **pfn** (page frame number), and the lower bits being some
  architecture-specific bits such as memory protection.

  The **entry** part of the name is a bit confusing because while in Linux 1.0
  this did refer to a single page table entry in the single top level page
  table, it was retrofitted to be an array of mapping elements when two-level
  page tables were first introduced, so the *pte* is the lowermost page
  *table*, not a page table *entry*.

- **pmd**, `pmd_t`, `pmdval_t` = **Page Middle Directory**, the level right
  above the *pte*, with `PTRS_PER_PMD` references to *pte* tables.

- **pud**, `pud_t`, `pudval_t` = **Page Upper Directory** was introduced after
  the other levels to handle 4-level page tables. It is potentially unused,
  or *folded* as we will discuss later.

- **p4d**, `p4d_t`, `p4dval_t` = **Page Level 4 Directory** was introduced to
  handle 5-level page tables after the *pud* was introduced. At that point it
  was clear that we needed to replace *pgd*, *pmd*, *pud* etc. with a figure
  indicating the directory level and that we could not go on with ad hoc names
  any more. This is only used on systems which actually have 5 levels of page
  tables, otherwise it is folded.

- **pgd**, `pgd_t`, `pgdval_t` = **Page Global Directory** - the Linux kernel
  main page table handling the PGD for the kernel memory is still found in
  `swapper_pg_dir`, but each userspace process in the system also has its own
  memory context and thus its own *pgd*, found in `struct mm_struct` which
  in turn is referenced from each `struct task_struct`. So tasks have memory
  context in the form of a `struct mm_struct` and this in turn has a
  `pgd_t *pgd` pointer to the corresponding page global directory.
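
The "typical example" of a `pteval_t` given in the *pte* item above (pfn in the
upper bits, architecture-specific bits in the lower bits) can be sketched as
follows. The flag names and bit positions are hypothetical and 4KB pages are
assumed; real layouts differ between architectures.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t pteval_t;

#define PAGE_SHIFT		12	/* assumption: 4KB pages */
/* Hypothetical flag bits; real layouts are architecture-specific. */
#define EX_PTE_PRESENT		(1ULL << 0)
#define EX_PTE_WRITE		(1ULL << 1)
#define EX_PTE_FLAGS_MASK	((1ULL << PAGE_SHIFT) - 1)

/* Build an entry: pfn in the upper bits, flags in the low PAGE_SHIFT bits. */
static inline pteval_t ex_mk_pte(uint64_t pfn, pteval_t flags)
{
	return (pfn << PAGE_SHIFT) | (flags & EX_PTE_FLAGS_MASK);
}

/* Extract the page frame number from an entry. */
static inline uint64_t ex_pte_pfn(pteval_t pte)
{
	return pte >> PAGE_SHIFT;
}

/* Extract the low flag bits from an entry. */
static inline pteval_t ex_pte_flags(pteval_t pte)
{
	return pte & EX_PTE_FLAGS_MASK;
}
```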

To repeat: each level in the page table hierarchy is an *array of pointers*, so
the **pgd** contains `PTRS_PER_PGD` pointers to the next level below, **p4d**
contains `PTRS_PER_P4D` pointers to **pud** items and so on. The number of
pointers on each level is architecture-defined::

        PMD
  --> +-----+           PTE
      | ptr |-------> +-----+
      | ptr |-        | ptr |-------> PAGE
      | ptr | \       | ptr |
      | ptr |  \        ...
      | ... |   \
      | ptr |    \         PTE
      +-----+     +----> +-----+
                         | ptr |-------> PAGE
                         | ptr |
                           ...
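
Which pointer is followed at each level is decided by a slice of the virtual
address. The sketch below assumes the layout used by x86-64 with 5-level
paging (4KB pages, 512 pointers per level); other architectures use different
shifts and table sizes. `PTRS_PER_LEVEL` and `ex_level_index()` are names
invented for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout: 4KB pages and 9 index bits (512 entries) per level. */
#define PAGE_SHIFT	12
#define PMD_SHIFT	21
#define PUD_SHIFT	30
#define P4D_SHIFT	39
#define PGDIR_SHIFT	48
#define PTRS_PER_LEVEL	512	/* hypothetical common value for all levels */

/*
 * Index into the table of one level. "shift" is the number of address bits
 * covered by a single entry of that level: PAGE_SHIFT for the pte,
 * PMD_SHIFT for the pmd, and so on.
 */
static inline unsigned int ex_level_index(uint64_t vaddr, unsigned int shift)
{
	return (vaddr >> shift) & (PTRS_PER_LEVEL - 1);
}
```

The bits below `PAGE_SHIFT` are not used as an index at all; they are the
offset inside the page that the final pointer refers to.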


Page Table Folding
==================

If the architecture does not use all the page table levels, they can be *folded*
which means skipped, and all operations performed on page tables will be
compile-time augmented to just skip a level when accessing the next lower
level.

Page table handling code that wishes to be architecture-neutral, such as the
virtual memory manager, will need to be written so that it traverses all of the
currently five levels. This style should also be preferred for
architecture-specific code, so as to be robust to future changes.
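
A rough sketch of what folding a level can look like, loosely modelled on the
approach of the generic "no p4d" header. All `ex_` names are hypothetical
stand-ins and the real kernel definitions differ in detail.

```c
#include <assert.h>
#include <stdint.h>

typedef struct { uint64_t pgd; } ex_pgd_t;
/* A folded level has exactly one "entry": the upper-level entry itself. */
typedef struct { ex_pgd_t pgd; } ex_p4d_t;

#define EX_PTRS_PER_P4D 1

/* Folded p4d: there is no table to index, so hand back the pgd entry. */
static inline ex_p4d_t *ex_p4d_offset(ex_pgd_t *pgd, uint64_t addr)
{
	(void)addr;	/* the address selects nothing at a folded level */
	return (ex_p4d_t *)pgd;
}
```

Generic code can still call the p4d step on every walk. On a configuration
like this one the call compiles down to a pointer cast, so the extra level
costs nothing at run time.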


MMU, TLB, and Page Faults
=========================

The `Memory Management Unit (MMU)` is a hardware component that handles virtual
to physical address translations. It may use relatively small caches in hardware
called `Translation Lookaside Buffers (TLBs)` and `Page Walk Caches` to speed up
these translations.

When the CPU accesses a memory location, it provides a virtual address to the
MMU, which checks whether there is an existing translation in the TLB or in the
Page Walk Caches (on architectures that support them). If no translation is
found, the MMU walks the page tables to determine the physical address and
creates the mapping.
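
The lookup order described above (TLB first, page walk on a miss) can be
modelled with a toy simulation. Everything here is an assumption made for
illustration: a tiny direct-mapped TLB, a flat array standing in for the page
tables, and `ex_` names that do not exist in the kernel. Real MMUs do this in
hardware and are far more elaborate.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SHIFT	12	/* assumption: 4KB pages */
#define EX_TLB_SLOTS	4	/* hypothetical, tiny direct-mapped TLB */
#define EX_NR_PAGES	16	/* hypothetical size of the toy address space */

struct ex_tlb_entry {
	bool valid;
	uint64_t vpn;	/* virtual page number */
	uint64_t pfn;	/* cached translation */
};

struct ex_mmu {
	struct ex_tlb_entry tlb[EX_TLB_SLOTS];
	uint64_t table[EX_NR_PAGES];	/* flat stand-in for the page tables */
	unsigned int walks;		/* number of TLB misses so far */
};

/* Translate vaddr: try the TLB, fall back to a "walk" and cache the result. */
static uint64_t ex_translate(struct ex_mmu *mmu, uint64_t vaddr)
{
	uint64_t vpn = vaddr >> PAGE_SHIFT;
	uint64_t offset = vaddr & ((1ULL << PAGE_SHIFT) - 1);
	struct ex_tlb_entry *e = &mmu->tlb[vpn % EX_TLB_SLOTS];

	assert(vpn < EX_NR_PAGES);	/* the toy model has no page faults */

	if (!e->valid || e->vpn != vpn) {	/* TLB miss */
		mmu->walks++;
		e->valid = true;
		e->vpn = vpn;
		e->pfn = mmu->table[vpn];
	}
	return (e->pfn << PAGE_SHIFT) | offset;
}
```

Two accesses to the same page cause one walk, while an access to a page that
maps to the same TLB slot evicts the older translation and forces a new walk
later.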

Each page of memory has associated permission and dirty bits. The dirty bit for
a page is set (i.e., turned on) when the page is written to, which indicates
that the page has been modified since it was loaded into memory.

If nothing prevents it, eventually the physical memory can be accessed and the
requested operation on the physical frame is performed.

There are several reasons why the MMU can't find certain translations. It could
happen because the CPU is trying to access memory that the current task is not
permitted to access, or because the data is not present in physical memory.

When these conditions happen, the MMU triggers page faults, which are types of
exceptions that signal the CPU to pause the current execution and run a special
function to handle the mentioned exceptions.

There are common and expected causes of page faults. These are triggered by
process management optimization techniques called "Lazy Allocation" and
"Copy-on-Write". Page faults may also happen when frames have been swapped out
to persistent storage (swap partition or file) and evicted from their physical
locations.

These techniques improve memory efficiency, reduce latency, and minimize space
occupation. This document won't go deeper into the details of "Lazy Allocation"
and "Copy-on-Write" because these subjects are out of scope as they belong to
Process Address Management.

Swapping differentiates itself from the other mentioned techniques because it's
undesirable since it's performed as a means to reduce memory usage under heavy
memory pressure.

Swapping can't work for memory mapped by kernel logical addresses. These are a
subset of the kernel virtual space that directly maps a contiguous range of
physical memory. Given any logical address, its physical address is determined
with simple arithmetic on an offset. Accesses to logical addresses are fast
because they avoid the need for complex page table lookups at the expense of
frames not being evictable or pageable.
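
The "simple arithmetic on an offset" can be sketched as below. The base
address `EX_PAGE_OFFSET` is a made-up value for illustration; the real start of
the linear mapping depends on the architecture and the kernel configuration,
and the helper names are not kernel interfaces.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical base of the linear mapping. The real value is
 * architecture- and configuration-specific.
 */
#define EX_PAGE_OFFSET	0xffff888000000000ULL

/* Kernel logical address -> physical address: subtract a constant. */
static inline uint64_t ex_virt_to_phys(uint64_t vaddr)
{
	return vaddr - EX_PAGE_OFFSET;
}

/* Physical address -> kernel logical address: add the same constant. */
static inline uint64_t ex_phys_to_virt(uint64_t paddr)
{
	return paddr + EX_PAGE_OFFSET;
}
```

No table lookup is involved in either direction, which is what makes these
addresses cheap to translate.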

If the kernel fails to make room for the data that must be present in the
physical frames, the kernel invokes the out-of-memory (OOM) killer to make room
by terminating lower priority processes until pressure drops below a safe
threshold.

Additionally, page faults may also be caused by code bugs or by maliciously
crafted addresses that the CPU is instructed to access. A thread of a process
could use instructions to address (non-shared) memory which does not belong to
its own address space, or could try to execute an instruction that wants to
write to a read-only location.

If the above-mentioned conditions happen in user-space, the kernel sends a
`Segmentation Fault` (SIGSEGV) signal to the current thread. That signal usually
causes the termination of the thread and of the process it belongs to.

This document is going to simplify and show a high-altitude view of how the
Linux kernel handles these page faults, creates tables and table entries,
checks if memory is present and, if not, requests to load data from persistent
storage or from other devices, and updates the MMU and its caches.

The first steps are architecture dependent. Most architectures jump to
`do_page_fault()`, whereas the x86 interrupt handler is defined by the
`DEFINE_IDTENTRY_RAW_ERRORCODE()` macro which calls `handle_page_fault()`.

Whatever the route, all architectures end up invoking `handle_mm_fault()`
which, in turn, (likely) ends up calling `__handle_mm_fault()` to carry out
the actual work of allocating the page tables.

The unfortunate case of not being able to call `__handle_mm_fault()` means
that the virtual address is pointing to areas of physical memory which are not
permitted to be accessed (at least from the current context). This
condition resolves to the kernel sending the above-mentioned SIGSEGV signal
to the process and leads to the consequences already explained.

`__handle_mm_fault()` carries out its work by calling several functions to
find the entry offsets in the upper layers of the page tables and allocate
the tables that it may need.

The functions that look for the offset have names like `*_offset()`, where the
"*" is for pgd, p4d, pud, pmd, pte; the functions that allocate the
corresponding tables, layer by layer, are called `*_alloc`, using the
above-mentioned convention to name them after the corresponding types of tables
in the hierarchy.
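
The offset-then-allocate pattern can be illustrated with a userspace toy. It
is not the kernel implementation: the `ex_` helpers are hypothetical, every
level is assumed to have 512 slots, tables come from `calloc()`, and the real
`*_alloc` helpers have different signatures and also deal with locking and
accounting.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define EX_PAGE_SHIFT	12
#define EX_BITS		9		/* index bits per level (assumed) */
#define EX_PTRS		(1 << EX_BITS)
#define EX_LEVELS	5		/* pgd, p4d, pud, pmd, pte */

struct ex_table {
	void *slot[EX_PTRS];	/* next-level table, or the page at level 0 */
};

/* Like *_offset(): pick the slot for vaddr at a given level (0 = pte). */
static void **ex_offset(struct ex_table *t, uint64_t vaddr, int level)
{
	return &t->slot[(vaddr >> (EX_PAGE_SHIFT + level * EX_BITS)) &
			(EX_PTRS - 1)];
}

/* Like *_alloc(): return the next table, creating it if the slot is empty. */
static struct ex_table *ex_alloc(struct ex_table *t, uint64_t vaddr, int level)
{
	void **slot = ex_offset(t, vaddr, level);

	if (!*slot)
		*slot = calloc(1, sizeof(struct ex_table));
	return *slot;	/* NULL if the allocation failed */
}

/* Walk from the top level down, allocating missing tables on the way. */
static void **ex_walk_alloc(struct ex_table *pgd, uint64_t vaddr)
{
	struct ex_table *t = pgd;

	for (int level = EX_LEVELS - 1; level > 0 && t; level--)
		t = ex_alloc(t, vaddr, level);
	return t ? ex_offset(t, vaddr, 0) : NULL;
}
```

The first walk for an address allocates the four intermediate tables. A later
walk for an address in the same 2MB region finds them in place and only
computes offsets.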

The page table walk may end at one of the middle or upper layers (PMD, PUD).

Linux supports larger page sizes than the usual 4KB (i.e., the so-called
`huge pages`). When using these kinds of larger pages, higher level page table
entries can directly map them, with no need to use lower level page entries
(PTE). Huge pages contain large contiguous physical regions that usually span
from 2MB to 1GB. They are respectively mapped by the PMD and PUD page entries.
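
The 2MB and 1GB figures follow from the table geometry. The sketch below
assumes 4KB base pages and 512 entries per table, which is a common layout but
not a universal one; `ex_entry_span()` is a name invented for this example.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT	12	/* assumption: 4KB base pages */
#define EX_LEVEL_BITS	9	/* assumption: 512 entries per table */

/* Bytes mapped by one entry at a given level (0 = pte, 1 = pmd, 2 = pud). */
static inline uint64_t ex_entry_span(unsigned int level)
{
	return 1ULL << (PAGE_SHIFT + level * EX_LEVEL_BITS);
}
```

One PMD entry covers 512 pages of 4KB, that is 2MB, and one PUD entry covers
512 such ranges, that is 1GB.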

The huge pages bring with them several benefits like reduced TLB pressure,
reduced page table overhead, memory allocation efficiency, and performance
improvement for certain workloads. However, these benefits come with
trade-offs, like wasted memory and allocation challenges.

At the very end of the walk with allocations, if it didn't return errors,
`__handle_mm_fault()` finally calls `handle_pte_fault()`, which via `do_fault()`
performs one of `do_read_fault()`, `do_cow_fault()`, `do_shared_fault()`.
"read", "cow", "shared" give hints about the reasons and the kind of fault it's
handling.
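
How the three cases relate to each other can be sketched as a small decision
function. It is a simplification that assumes the choice depends only on
whether the access was a write and whether the mapping is shared; the enum and
function names are hypothetical and the kernel code handles more cases.

```c
#include <assert.h>
#include <stdbool.h>

enum ex_fault_kind { EX_READ_FAULT, EX_COW_FAULT, EX_SHARED_FAULT };

/*
 * Classify a fault on a page that is not yet populated.
 * write_access:   the faulting access was a write.
 * shared_mapping: the mapping is shared (as opposed to private).
 */
static enum ex_fault_kind ex_classify_fault(bool write_access,
					    bool shared_mapping)
{
	if (!write_access)
		return EX_READ_FAULT;	/* reads never need a private copy */
	if (!shared_mapping)
		return EX_COW_FAULT;	/* private write: copy the page first */
	return EX_SHARED_FAULT;		/* shared write: write in place */
}
```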

The actual implementation of the workflow is very complex. Its design allows
Linux to handle page faults in a way that is tailored to the specific
characteristics of each architecture, while still sharing a common overall
structure.

To conclude this high-altitude view of how Linux handles page faults, let's
add that the page fault handler can be disabled and enabled respectively with
`pagefault_disable()` and `pagefault_enable()`.

Several code paths make use of the latter two functions because they need to
disable traps into the page fault handler, mostly to prevent deadlocks.