How does x86 paging work?

Cir*_* 78 · paging · x86 · virtual-memory

This question is meant to fill the vacuum of good free information on the subject.

I believe a good answer will fit into one big answer, or at least into a few answers.

The main goal is to give complete beginners enough information so that they can study the manuals by themselves, and be able to understand the basic OS concepts related to paging.

Suggested guidelines:

  • answers should be beginner friendly:
    • concrete, but possibly simplified, examples are very important
    • applications of the concepts shown are welcome
  • citing useful resources is good
  • how OSes use paging features is welcome
  • explanations of PAE and PSE are welcome
  • small digressions into x86_64 are welcome

Related questions, and why I don't think they are duplicates:

Cir*_* 128

Version of this answer with a nice TOC and more content.

I will correct any error reported. If you want to make large modifications or add a missing aspect, make them on your own answers to get well deserved rep. Minor edits can be merged directly in.

Sample code

Minimal example: https://github.com/cirosantilli/x86-bare-metal-examples/blob/5c672f73884a487414b3e21bd9e579c67cd77621/paging.S

Like everything else in programming, the only way to really understand this is to play with minimal examples.

What makes this a "hard" subject is that the minimal example is large because you need to make your own small OS.

Intel manual

Although it is impossible to understand without examples in mind, try to get familiar with the manuals as soon as possible.

Intel describes paging in the Intel Manual Volume 3 System Programming Guide - 325384-056US September 2015, Chapter 4 "Paging".

Specially interesting is Figure 4-4 "Formats of CR3 and Paging-Structure Entries with 32-Bit Paging", which gives the key data structures.

MMU

Paging is done by the Memory Management Unit (MMU) part of the CPU. Like many others (e.g. the x87 co-processor, or the APIC), this used to be a separate chip in early days, which was later integrated into the CPU. But the term is still used.

General facts

Logical addresses are the memory addresses used in "regular" user-land code (e.g. the contents of rsi in mov eax, [rsi]).

First segmentation translates them into linear addresses, and then paging translates linear addresses into physical addresses.

(logical) ------------------> (linear) ------------> (physical)
             segmentation                 paging

Most of the time we can think of physical addresses as indexing actual RAM hardware memory cells, but this is not 100% true because of things such as memory-mapped IO.

Paging is only available in protected mode. Use of paging in protected mode is optional. Paging is on if the PG bit of the cr0 register is set.
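As a small illustration, the PG check can be written out as bit arithmetic. The cr0 value below is made up, since reading the real cr0 requires ring 0 code:

```python
# Bit positions in cr0 (x86 control register 0).
CR0_PE = 1 << 0   # protection enable (protected mode)
CR0_PG = 1 << 31  # paging enable

cr0 = 0x80000011  # hypothetical cr0 value with both PE and PG set

# Paging is on only if PG is set, and PG only has effect in protected mode (PE).
paging_on = bool(cr0 & CR0_PG) and bool(cr0 & CR0_PE)
print(paging_on)
```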

Paging vs segmentation

One major difference between paging and segmentation is that:

  • paging splits RAM into equally sized chunks called pages
  • segmentation splits memory into chunks of arbitrary sizes

This is the main advantage of paging, since equally sized chunks make things easier to manage.

Paging has become so much more popular that support for segmentation was dropped in x86-64 in 64-bit mode, the main mode of operation for new software, where it only exists in compatibility mode, which emulates IA32.

Application

Paging is used to implement process virtual address spaces on modern OSes. With virtual addresses the OS can fit two or more concurrent processes in a single RAM in a way that:

  • neither program needs to know anything about the other
  • the memory of both programs can grow and shrink as needed
  • the switch between programs is very fast
  • one program can never access the memory of another process

Paging historically came after segmentation, and largely replaced it for the implementation of virtual memory in modern OSes such as Linux, since it is easier to manage the fixed-size chunks of memory of pages than variable-length segments.

Hardware implementation

Like segmentation in protected mode (where modifying a segment register triggers a load from the GDT or LDT), paging hardware uses data structures in memory to do its job (page tables, page directories, etc.).

The format of those data structures is fixed by the hardware, but it is up to the OS to set up and manage those data structures on RAM correctly, and to tell the hardware where to find them (via cr3).

Some other architectures leave paging almost completely in the hands of software, so a TLB miss runs an OS-supplied function to walk the page tables and insert the new mapping into the TLB. This leaves page table formats to be chosen by the OS, but makes it unlikely for the hardware to be able to overlap page-walks with out-of-order execution of other instructions, the way x86 can.

Example: simplified single-level paging scheme

This is an example of how paging operates on a simplified version of the x86 architecture to implement a virtual memory space.

Page tables

The OS could give them the following page tables:

Page table given to process 1 by the OS:

RAM location        physical address   present
-----------------   -----------------  --------
PT1 + 0       * L   0x00001            1
PT1 + 1       * L   0x00000            1
PT1 + 2       * L   0x00003            1
PT1 + 3       * L                      0
...                                    ...
PT1 + 0xFFFFF * L   0x00005            1

Page table given to process 2 by the OS:

RAM location       physical address   present
-----------------  -----------------  --------
PT2 + 0       * L  0x0000A            1
PT2 + 1       * L  0x0000B            1
PT2 + 2       * L                     0
PT2 + 3       * L  0x00003            1
...                ...                ...
PT2 + 0xFFFFF * L  0x00004            1

Where:

  • PT1 and PT2: initial position of table 1 and 2 on RAM.

    Sample values: 0x00000000, 0x12345678, etc.

    It is the OS that decides those values.

  • L: length of a page table entry.

  • present: indicates that the page is present in memory.

Page tables are located on RAM. They could for example be located as:

--------------> 0xFFFFFFFF


--------------> PT1 + 0xFFFFF * L
Page Table 1
--------------> PT1


--------------> PT2 + 0xFFFFF * L
Page Table 2
--------------> PT2

--------------> 0x0

The initial locations on RAM for both page tables are arbitrary and controlled by the OS. It is up to the OS to ensure that they don't overlap!

Each process cannot touch any page tables directly, although it can make requests to the OS that cause the page tables to be modified, for example asking for larger stack or heap segments.

A page is a chunk of 4KB (12 bits), and since addresses have 32 bits, only 20 bits (20 + 12 = 32, thus 5 characters in hexadecimal notation) are required to identify each page. This value is fixed by the hardware.
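This 20/12 split can be sketched with a couple of shifts. A toy simulation of the arithmetic, not real MMU code:

```python
PAGE_BITS = 12               # 4KB pages
PAGE_SIZE = 1 << PAGE_BITS   # 4096 bytes

def split(linear):
    """Split a 32-bit linear address into (page number, offset)."""
    page = linear >> PAGE_BITS          # top 20 bits
    offset = linear & (PAGE_SIZE - 1)   # bottom 12 bits
    return page, offset

# 0x00001FFF is the last byte of page 0x00001.
page, offset = split(0x00001FFF)
```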

Page table entries

A page table is... a table of page table entries!

The exact format of table entries is fixed by the hardware.

On this simplified example, the page table entries contain only two fields:

bits   function
-----  -----------------------------------------
20     physical address of the start of the page
1      present flag

so in this example the hardware designers could have chosen L = 21.

Most real page table entries have other fields.

It would be impractical to align things at 21 bits since memory is addressable by bytes and not bits. Therefore, even if only 21 bits are needed in this case, hardware designers would probably choose L = 32 to make access faster, and just reserve the remaining bits for later usage. The actual value for L on x86 is 32 bits.

Address translation in single-level scheme

Once the page tables have been set up by the OS, the address translation between linear and physical addresses is done by the hardware.

When the OS wants to activate process 1, it sets cr3 to PT1, the start of the table for process 1.

If Process 1 wants to access linear address 0x00000001, the paging hardware circuit automatically does the following for the OS:

  • split the linear address into two parts:

    | page (20 bits) | offset (12 bits) |
    

    So in this case we would have:

    • page = 0x00000
    • offset = 0x001
  • look into Page table 1 because cr3 points to it.

  • look at entry 0x00000 because that is the page part.

    The hardware knows that this entry is located at RAM address PT1 + 0 * L = PT1.

  • since it is present, the access is valid

  • according to the page table, the location of page number 0x00000 is 0x00001 * 4K = 0x00001000.

  • to find the final physical address we just need to add the offset:

      00001 000
    + 00000 001
      -----------
      00001 001
    

    because 00001 is the physical address of the page looked up on the table and 001 is the offset.

    As the name indicates, the offset is always simply added to the physical address of the page.

  • the hardware then gets the memory at that physical location.

In the same way, the following translations would happen for process 1:

linear     physical
---------  ---------
00000 002  00001 002
00000 003  00001 003
00000 FFF  00001 FFF
00001 000  00000 000
00001 001  00000 001
00001 FFF  00000 FFF
00002 000  00003 000
FFFFF 000  00005 000

For example, when accessing address 00001 000, the page part is 00001, so the hardware knows that its page table entry is located at RAM address PT1 + 1 * L (1 because of the page part), and that is where it will look for it.
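The whole single-level walk can be simulated in a few lines. This is a toy model of the tables above, not real OS or hardware code; missing dict keys play the role of present = 0:

```python
PAGE_BITS = 12

# Page table of process 1 from the example: page number -> physical page.
page_table_1 = {
    0x00000: 0x00001,
    0x00001: 0x00000,
    0x00002: 0x00003,
    # 0x00003 absent: present = 0
    0xFFFFF: 0x00005,
}

def translate(page_table, linear):
    """Return the physical address, or None on page fault."""
    page = linear >> PAGE_BITS
    offset = linear & ((1 << PAGE_BITS) - 1)
    if page not in page_table:
        return None  # page fault
    return (page_table[page] << PAGE_BITS) | offset

assert translate(page_table_1, 0x00000001) == 0x00001001
assert translate(page_table_1, 0x00001FFF) == 0x00000FFF
```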

When the OS wants to switch to process 2, all it needs to do is to make cr3 point to PT2. It is that simple!

Now the following translations would happen for process 2:

linear     physical
---------  ---------
00000 002  00001 002
00000 003  00001 003
00000 FFF  00001 FFF
00001 000  00000 000
00001 001  00000 001
00001 FFF  00000 FFF
00003 000  00003 000
FFFFF 000  00004 000

The same linear address translates to different physical addresses for different processes, depending only on the value inside cr3.

In this way every program can expect its data to start at 0 and end at FFFFFFFF, without worrying about exact physical addresses.

Page fault

What if Process 1 tries to access an address inside a page that is not present?

The hardware notifies the software via a Page Fault Exception.

It is then usually up to the OS to register an exception handler to decide what has to be done.

It is possible that accessing a page that is not on the table is a programming error:

int is[1];
is[2] = 1; /* out-of-bounds write: may land on a page not in the table */

but there may be cases in which it is acceptable, for example in Linux when:

  • the program wants to increase its stack.

    It just tries to access a certain byte in a given possible range, and if the OS is happy it adds that page to the process address space.

  • the page was swapped to disk.

    The OS will need to do some work behind the process's back to get the page back into RAM.

    The OS can discover that this is the case based on the contents of the rest of the page table entry: if the present flag is clear, the other fields of the page table entry are completely left for the OS to do what it wants with.

    On Linux for example, when present = 0:

    • if all the other fields of the page table entry are 0, the address is invalid.

    • else, the page has been swapped to disk, and the actual values of those fields encode the position of the page on the disk.

In any case, the OS needs to know which address generated the Page Fault to be able to deal with the problem. This is why the nice IA32 developers set the value of cr2 to that address whenever a Page Fault occurs. The exception handler can then just look into cr2 to get the address.
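The Linux-style decision described above can be sketched as follows; the field encoding here is invented for illustration (real Linux swap entries are more involved):

```python
def on_page_fault(faulting_address, entry):
    """Handle a fault on a page table entry whose present bit (bit 0) is clear."""
    assert entry & 1 == 0  # the handler only runs here when present = 0
    if entry == 0:
        return ("invalid", faulting_address)  # e.g. deliver SIGSEGV
    # Hypothetical encoding: the remaining bits give the page's position on disk.
    return ("swap_in", entry >> 1)

# faulting_address plays the role of cr2.
result = on_page_fault(0x00003000, 0)
```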

Simplifications

Simplifications to reality that make this example easier to understand:

  • all real paging circuits use multi-level paging to save space, but here we showed a simple single-level scheme.

  • page table entries contained only two fields: a 20 bit address and a 1 bit present flag.

    Real page table entries contain a total of 12 fields, and therefore other features have been omitted.

Example: multi-level paging scheme

The problem with a single-level paging scheme is that it would take up too much RAM: 4G/4K = 1M entries per process. If each entry is 4 bytes long, that would make 4M per process, which is too much even for a desktop computer: ps -A | wc -l says that I am running 244 processes right now, so that would take around 1GB of my RAM!
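The arithmetic behind that claim, spelled out:

```python
# Single-level scheme: one entry per 4KiB page of a 4GiB address space.
entries_per_process = (4 * 2**30) // (4 * 2**10)  # 4G / 4K = 1M entries
bytes_per_process = entries_per_process * 4       # 4 bytes per entry = 4MiB
total_bytes = 244 * bytes_per_process             # for 244 processes
print(total_bytes // 2**20, "MiB")                # close to 1 GiB
```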

For this reason, x86 developers decided to use a multi-level scheme that reduces RAM usage.

The downside of this scheme is that it has a slightly higher access time.

In the simple 2-level paging scheme used for 32-bit processors without PAE, the 32 address bits are divided as follows:

| directory (10 bits) | table (10 bits) | offset (12 bits) |

Each process must have one and only one page directory associated with it, so it will contain at least 2^10 = 1K page directory entries, much better than the minimum 1M required by a single-level scheme.

Page tables are only allocated as needed by the OS. Each page table has 2^10 = 1K page table entries.

Page directories contain... page directory entries! Page directory entries are the same as page table entries, except that they point to the RAM addresses of page tables instead of the physical addresses of pages. Since those addresses are only 20 bits wide, page tables must be aligned at the beginning of 4KB pages.

cr3 now points to the location on RAM of the page directory of the current process instead of page tables.

Page table entries don't change at all from the single-level scheme.

Page tables change from the single-level scheme because:

  • each process may have up to 1K page tables, one per page directory entry.
  • each page table contains exactly 1K entries instead of 1M entries.

The reason for using 10 bits on the first two levels (and not, say, 12 | 8 | 12) is that each page table entry is 4 bytes long. Then the 2^10 entries of page directories and page tables will fit nicely into 4KB pages. This means that it is faster and simpler to allocate and deallocate pages for that purpose.

Address translation in multi-level scheme

Page directory given to process 1 by the OS:

RAM location     physical address   present
---------------  -----------------  --------
PD1 + 0     * L  0x10000            1
PD1 + 1     * L                     0
PD1 + 2     * L  0x80000            1
PD1 + 3     * L                     0
...                                 ...
PD1 + 0x3FF * L                     0

Page tables given to process 1 by the OS at PT1 = 0x10000000 (0x10000*4K):

RAM location      physical address   present
---------------   -----------------  --------
PT1 + 0     * L   0x00001            1
PT1 + 1     * L                      0
PT1 + 2     * L   0x0000D            1
...                                  ...
PT1 + 0x3FF * L   0x00005            1

Page tables given to process 1 by the OS at PT2 = 0x80000000 (0x80000*4K):

RAM location      physical address   present
---------------   -----------------  --------
PT2 + 0     * L   0x0000A            1
PT2 + 1     * L   0x0000C            1
PT2 + 2     * L                      0
...                                  ...
PT2 + 0x3FF * L   0x00003            1

where:

  • PD1: initial position of page directory of process 1 on RAM.
  • PT1 and PT2: initial position of page table 1 and page table 2 for process 1 on RAM.

So in this example the page directory and the page table could be stored in RAM something like:

----------------> 0xFFFFFFFF


----------------> PT2 + 0x3FF * L
Page Table 2
----------------> PT2

----------------> PD1 + 0x3FF * L
Page Directory 1
----------------> PD1


----------------> PT1 + 0x3FF * L
Page Table 1
----------------> PT1

----------------> 0x0

Let's translate the linear address 0x00801004 step by step.

We suppose that cr3 = PD1, that is, it points to the page directory just described.

In binary the linear address is:

0    0    8    0    1    0    0    4
0000 0000 1000 0000 0001 0000 0000 0100

Grouping as 10 | 10 | 12 gives:

0000000010 0000000001 000000000100
0x2        0x1        0x4

which gives:

  • page directory entry = 0x2
  • page table entry = 0x1
  • offset = 0x4

So the hardware looks for entry 2 of the page directory.

The page directory entry says that the page table is located at 0x80000 * 4K = 0x80000000. This is the first RAM access of the translation.

Since the page table entry is 0x1, the hardware looks at entry 1 of the page table at 0x80000000, which tells it that the physical page is located at address 0x0000C * 4K = 0x0000C000. This is the second RAM access of the translation.

Finally, the paging hardware adds the offset, and the final address is 0x0000C004.

Other examples of translated addresses are:

linear    10 10 12 split   physical
--------  ---------------  ----------
00000001  000 000 001      00001001
00001001  000 001 001      page fault
003FF001  000 3FF 001      00005001
00400000  001 000 000      page fault
00800001  002 000 001      0000A001
00801008  002 001 008      0000C008
00802008  002 002 008      page fault
00B00001  002 300 001      page fault

Page faults occur if either a page directory entry or a page table entry is not present.
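The two-level walk with the sample directory and tables above can be simulated like this. Again a toy model, with missing dict keys standing for present = 0:

```python
# Page directory of process 1: directory index -> physical page of a page table.
page_directory = {0x000: 0x10000, 0x002: 0x80000}

# Page tables, keyed by their physical page: table index -> physical page.
page_tables = {
    0x10000: {0x000: 0x00001, 0x002: 0x0000D, 0x3FF: 0x00005},  # at PT1
    0x80000: {0x000: 0x0000A, 0x001: 0x0000C, 0x3FF: 0x00003},  # at PT2
}

def translate(linear):
    """10 | 10 | 12 walk; returns the physical address or None on page fault."""
    dir_i = (linear >> 22) & 0x3FF
    tab_i = (linear >> 12) & 0x3FF
    offset = linear & 0xFFF
    if dir_i not in page_directory:
        return None  # fault at the directory level: no page table at all
    table = page_tables[page_directory[dir_i]]
    if tab_i not in table:
        return None  # fault at the table level
    return (table[tab_i] << 12) | offset

assert translate(0x00801004) == 0x0000C004  # the step-by-step example above
```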

If the OS wants to run another process concurrently, it would give the second process a separate page directory, and link that directory to separate page tables.

64-bit architectures

64 bits are still too many address bits for current RAM sizes, so most architectures use fewer bits.

x86_64 uses 48 bits (256 TiB) of virtual address, and legacy mode's PAE already allows 52-bit physical addresses (4 PiB).

12 of those 48 bits are already reserved for the offset, which leaves 36 bits.

If a 2 level approach is taken, the best split would be two 18 bit levels.

But that would mean that the page directory would have 2^18 = 256K entries, which would take too much RAM: close to a single-level paging for 32 bit architectures!

Therefore, 64 bit architectures create even further page levels, commonly 3 or 4.

x86_64 uses 4 levels in a 9 | 9 | 9 | 9 | 12 scheme, so that the top level only takes up 2^9 entries.
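A quick sanity check of those numbers:

```python
# x86_64: four 9-bit levels plus a 12-bit offset cover the 48 used bits,
# and 2**9 entries of 8 bytes each fill exactly one 4KB page.
levels = [9, 9, 9, 9]
OFFSET_BITS = 12
virtual_bits = sum(levels) + OFFSET_BITS
table_bytes = (1 << 9) * 8  # entries per level * 8-byte entries
```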

PAE

Physical address extension.

With 32 bits, only 4GB RAM can be addressed.

This started to become a limitation for large servers, so Intel introduced the PAE mechanism with the Pentium Pro.

To relieve the problem, Intel added 4 new address lines, so that 64GB could be addressed.

The page table structure is also altered if PAE is on. The exact way in which it is altered depends on whether PSE is on or off.

PAE is turned on and off via the PAE bit of cr4.

Even if the total addressable memory is 64GB, individual processes are still only able to use up to 4GB. The OS can however put different processes in different 4GB chunks.

PSE

Page size extension.

Allows pages to be 4M (or 2M if PAE is on) in length instead of 4K.

PSE is turned on and off via the PSE bit of cr4.

PAE and PSE page table schemes

If either PAE or PSE is active, different paging level schemes are used:

  • no PAE and no PSE: 10 | 10 | 12

  • no PAE and PSE: 10 | 22.

    22 is the offset within the 4MB page, since 22 bits address 4MB.

  • PAE and no PSE: 2 | 9 | 9 | 12

    The design reason why 9 is used twice instead of 10 is that entries cannot fit into 32 bits anymore, which were all filled up by 20 address bits and 12 meaningful or reserved flag bits.

    The reason is that 20 bits are no longer enough to represent the addresses of page tables: 24 bits are now needed because of the 4 extra address wires added to the processor.

    Therefore, the designers decided to increase the entry size to 64 bits, and to make them fit into a single page table it is necessary to reduce the number of entries to 2^9 instead of 2^10.

    The starting 2 bits index a new page level called the Page Directory Pointer Table (PDPT), which points to page directories and fills in the rest of the 32-bit linear address. PDPT entries are also 64 bits wide.

    cr3 now points to the PDPT, which must be in the first 4GB of memory and aligned on 32-byte multiples for addressing efficiency. This means that cr3 now has 27 significant bits instead of 20: the low 5 bits are implied by the 32-byte alignment, and 2^27 * 2^5 = 2^32 covers the whole first 4GB.

  • PAE and PSE: 2 | 9 | 21

    Designers decided to keep a 9 bit wide field to make it fit into a single page.

    This leaves 23 bits. Leaving 2 for the PDPT, to keep things uniform with the PAE case without PSE, leaves 21 for the offset, meaning that pages are 2M wide instead of 4M.
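The page size in each scheme follows directly from the offset width: 2^offset bytes. A small check of the four cases:

```python
# Offset bits per scheme -> page size in bytes.
offset_bits = {
    "no PAE, no PSE": 12,  # 4KB pages
    "no PAE, PSE":    22,  # 4MB pages
    "PAE, no PSE":    12,  # 4KB pages
    "PAE, PSE":       21,  # 2MB pages
}
page_sizes = {scheme: 1 << bits for scheme, bits in offset_bits.items()}
```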

TLB

The Translation Lookaside Buffer (TLB) is a cache for paging addresses.

Since it is a cache, it shares many of the design issues of the CPU cache, such as associativity level.

This section describes a simplified fully associative TLB with 4 single-address entries. Note that, like other caches, real TLBs are not usually fully associative.

Basic operation

After a translation between a linear and a physical address happens, it is stored in the TLB. For example, a 4-entry TLB starts in the following state:

  valid   linear   physical
  ------  -------  ---------
> 0       00000    00000
  0       00000    00000
  0       00000    00000
  0       00000    00000

The > indicates the current entry to be replaced.

and after the linear page address 00003 is translated to the physical address 00005, the TLB becomes:

  valid   linear   physical
  ------  -------  ---------
  1       00003    00005
> 0       00000    00000
  0       00000    00000
  0       00000    00000

and after a second translation of 00007 to 00009 it becomes:

  valid   linear   physical
  ------  -------  ---------
  1       00003    00005
  1       00007    00009
> 0       00000    00000
  0       00000    00000
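The toy TLB above can be modeled with round-robin replacement, where the ">" marker becomes a next-slot index. This is only a sketch: real TLBs use other replacement policies and are not fully associative:

```python
class ToyTLB:
    def __init__(self, size=4):
        self.entries = [None] * size  # (linear page, physical page), or invalid
        self.next = 0                 # the ">" marker: next slot to replace

    def lookup(self, linear_page):
        for entry in self.entries:
            if entry is not None and entry[0] == linear_page:
                return entry[1]
        return None  # TLB miss: a page table walk would be needed

    def insert(self, linear_page, physical_page):
        self.entries[self.next] = (linear_page, physical_page)
        self.next = (self.next + 1) % len(self.entries)

tlb = ToyTLB()
tlb.insert(0x00003, 0x00005)  # first state shown above
tlb.insert(0x00007, 0x00009)  # second state
```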

Ker*_* SB 15

Here's a very brief high-level answer:

An x86 processor runs in one of several possible modes (roughly: real, protected, 64-bit). Each mode can use one of several possible memory addressing models (but not every mode can use every model), namely: real-mode addressing, segmented addressing, and flat linear addressing.

In the modern world, only flat linear addressing in protected or 64-bit mode is relevant, and the two modes are essentially the same, with the main difference being the size of the machine word and the amount of addressable memory.

Now, the memory addressing model gives meaning to the memory operands of machine instructions (e.g. mov DWORD PTR [eax], 25, which stores a 32-bit (aka dword) integer with value 25 into memory, at the address stored in the 32-bit register eax). In flat linear addressing, the number in eax is allowed to run over a single, contiguous range, from zero up to the maximum value (in our case 2^32 - 1).

However, flat linear addressing can be either paged or non-paged. Without paging, the address directly refers to physical memory. With paging, the processor's memory management unit (MMU) transparently feeds the desired address (now called a virtual address) into a lookup mechanism, the so-called page tables, and obtains a new value, which is interpreted as a physical address. The original operation now acts on this new, translated address in physical memory, even though the user only ever sees the virtual address.

The main benefit of paging is that the page tables are managed by the operating system. So the OS can modify and replace the page tables arbitrarily, for example when "switching tasks". It can keep a whole collection of page tables, one per "process", and whenever it decides that a particular process is going to run on a given CPU, it loads that process's page tables into that CPU's MMU (each CPU has its own set of page tables). The result is that each process sees its own virtual address space, which looks the same no matter which physical pages were free when the OS had to allocate memory for it. It never knows about any other process's memory, since it cannot access physical memory directly.

Page tables are nested, tree-like data structures stored in normal memory, written by the OS but read directly by the hardware, so the format is fixed. They are "loaded" into the MMU by setting a special CPU control register to point to the top-level table. The CPU uses a cache called the TLB to remember lookups, so repeated accesses to the same few pages are much faster than scattered accesses, both because of TLB misses and for the usual data-cache reasons. It is common to see the term "TLB entry" used to refer to page table entries, even when they are not cached in the TLB.

And if you are worried that a process might simply disable paging or try to modify the page tables: this is not allowed, since x86 implements privilege levels (called "rings"), and user code executes at a privilege level that is too low to allow it to modify the CPU's page tables.