Merge pull request #8 from 0xAX/master

Merge Author
慕冬亮 2015-08-20 16:05:05 +08:00
commit 7c85e46571
4 changed files with 71 additions and 71 deletions


@ -18,9 +18,9 @@ Preparation before the kernel compilation
---------------------------------------------------------------------------------
There are many things to prepare before the kernel compilation can be started. The main point here is to find and configure
The type of compilation, to parse command line arguments that are passed to `make`, etc... So let's dive into the top `Makefile` of the Linux kernel.
the type of compilation, to parse command line arguments that are passed to `make`, etc... So let's dive into the top `Makefile` of the Linux kernel.
The Linux kernel top `Makefile` is responsible for building two major products: [vmlinux](https://en.wikipedia.org/wiki/Vmlinux) (the resident kernel image) and the modules (any module files). The [Makefile](https://github.com/torvalds/linux/blob/master/Makefile) of the Linux kernel starts with the definition of the following variables:
The top `Makefile` of the Linux kernel is responsible for building two major products: [vmlinux](https://en.wikipedia.org/wiki/Vmlinux) (the resident kernel image) and the modules (any module files). The [Makefile](https://github.com/torvalds/linux/blob/master/Makefile) of the Linux kernel starts with the definition of the following variables:
```Makefile
VERSION = 4
@ -30,13 +30,13 @@ EXTRAVERSION = -rc3
NAME = Hurr durr I'ma sheep
```
These variables determine the current version of the Linux kernel and are used in the different places, for example in the forming of the `KERNELVERSION` variable:
These variables determine the current version of the Linux kernel and are used in different places, for example in the forming of the `KERNELVERSION` variable in the same `Makefile`:
```Makefile
KERNELVERSION = $(VERSION)$(if $(PATCHLEVEL),.$(PATCHLEVEL)$(if $(SUBLEVEL),.$(SUBLEVEL)))$(EXTRAVERSION)
```
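The nested `$(if ...)` calls in `KERNELVERSION` can be hard to read at first. As a rough sketch only, the same logic could be written in C as below. `build_kernelversion` is a hypothetical helper, not something from the kernel tree, and an empty string stands in for an unset `make` variable.

```C
#include <assert.h>
#include <stdio.h>

/*
 * Hypothetical helper (not part of the kernel): mirrors the nested $(if ...)
 * logic of KERNELVERSION. An empty string stands for an unset make variable.
 * Returns the number of characters written, as snprintf does.
 */
int build_kernelversion(char *out, size_t len,
                        const char *version, const char *patchlevel,
                        const char *sublevel, const char *extraversion)
{
    assert(out != NULL && len > 0);

    /* SUBLEVEL is only considered when PATCHLEVEL is non-empty. */
    int has_patch = patchlevel[0] != '\0';
    int has_sub = has_patch && sublevel[0] != '\0';

    return snprintf(out, len, "%s%s%s%s%s%s",
                    version,
                    has_patch ? "." : "", has_patch ? patchlevel : "",
                    has_sub ? "." : "", has_sub ? sublevel : "",
                    extraversion);
}
```

With `VERSION = 4` and `EXTRAVERSION = -rc3` from the excerpt, and assuming `2` and `0` for the patch and sub levels (those lines are not shown here), the helper produces `4.2.0-rc3`.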
After this we can see a couple of `ifeq` conditionals that check some of the parameters passed to `make`. The Linux kernel `makefiles` provides a special `make help` target that prints all available targets and some of the command line arguments that can be passed to `make`. For example: `make V=1` - provides verbose builds. The first `ifeq` checks if the `V=n` option is passed to make:
After this we can see a couple of `ifeq` conditions that check some of the parameters passed to `make`. The Linux kernel `makefiles` provide a special `make help` target that prints all available targets and some of the command line arguments that can be passed to `make`. For example, `make V=1` gives a verbose build. The first `ifeq` checks whether the `V=n` option is passed to `make`:
```Makefile
ifeq ("$(origin V)", "command line")
@ -57,7 +57,7 @@ endif
export quiet Q KBUILD_VERBOSE
```
If this option is passed to `make` we set the `KBUILD_VERBOSE` variable to the value of the `V` option. Otherwise we set the `KBUILD_VERBOSE` variable to zero. After this we check value of the `KBUILD_VERBOSE` variable and set values of the `quiet` and `Q` variables depends on the `KBUILD_VERBOSE` value. The `@` symbols suppress the output of the command and if it is present before a command the output will be something like this: `CC scripts/mod/empty.o` instead of `Compiling .... scripts/mod/empty.o`. In the end we just export all of these variables. The next `ifeq` statement checks that `O=/dir` option was passed to the `make`. This option allows to locate all output files in the given `dir`:
If this option is passed to `make`, we set the `KBUILD_VERBOSE` variable to the value of the `V` option. Otherwise we set the `KBUILD_VERBOSE` variable to zero. After this we check the value of the `KBUILD_VERBOSE` variable and set the values of the `quiet` and `Q` variables depending on it. The `@` symbol suppresses the output of a command. If it is present before a command, the output will be something like this: `CC scripts/mod/empty.o` instead of `Compiling .... scripts/mod/empty.o`. In the end we just export all of these variables. The next `ifeq` statement checks whether the `O=/dir` option was passed to `make`. This option places all output files in the given `dir`:
```Makefile
ifeq ($(KBUILD_SRC),)
@ -82,14 +82,14 @@ endif # ifneq ($(KBUILD_OUTPUT),)
endif # ifeq ($(KBUILD_SRC),)
```
We check the `KBUILD_SRC` that represents the top directory of the kernel source code and if it is empty (it is empty when the makefile is executed for the first timea.) We then set the `KBUILD_OUTPUT` variable to the value that passed with the `O` option (if this option was passed). In the next step we check this `KBUILD_OUTPUT` variable and if it is set, we do following things:
We check the `KBUILD_SRC` variable, which represents the top directory of the kernel source code, to see whether it is empty (it is empty when the makefile is executed for the first time). We then set the `KBUILD_OUTPUT` variable to the value passed with the `O` option (if this option was passed). In the next step we check this `KBUILD_OUTPUT` variable and if it is set, we do the following things:
* Store value of the `KBUILD_OUTPUT` in the temp `saved-output` variable;
* Try to create given output directory;
* Check that directory created, in other way print error;
* Store the value of `KBUILD_OUTPUT` in the temporary `saved-output` variable;
* Try to create the given output directory;
* Check that the directory was created, otherwise print an error message;
* If the custom output directory was created successfully, execute `make` again with the new directory (see the `-C` option).
The next `ifeq` statements checks that the `C` or `M` options were passed to `make`:
The next `ifeq` statements check whether the `C` or `M` options were passed to `make`:
```Makefile
ifeq ("$(origin C)", "command line")
@ -104,7 +104,7 @@ ifeq ("$(origin M)", "command line")
endif
```
The `C` option tells the `makefile` that we need to check all `c` source code with a tool provided by the `$CHECK` environment variable, by default it is [sparse](https://en.wikipedia.org/wiki/Sparse). The second `M` option provides build for the external modules (will not see this case in this part). We also check if the `KBUILD_SRC` variable is set, and if it isn't we set the `srctree` variable to `.`:
The `C` option tells the `makefile` that we need to check all `c` source code with a tool provided by the `$CHECK` environment variable; by default it is [sparse](https://en.wikipedia.org/wiki/Sparse). The second option, `M`, builds external modules (we will not see this case in this part). We also check whether the `KBUILD_SRC` variable is set, and if it isn't, we set the `srctree` variable to `.`:
```Makefile
ifeq ($(KBUILD_SRC),)
@ -118,7 +118,7 @@ obj := $(objtree)
export srctree objtree VPATH
```
That tells to `Makefile` that the kernel source tree will be in the current directory where `make` was executed. We then set `objtree` and other variables to this directory and export them. The next step is the getting value for the `SUBARCH` variable that represents what the underlying architecture is:
That tells the `Makefile` that the kernel source tree will be in the current directory where `make` was executed. We then set `objtree` and other variables to this directory and export them. The next step is to get the value of the `SUBARCH` variable that represents what the underlying architecture is:
```Makefile
SUBARCH := $(shell uname -m | sed -e s/i.86/x86/ -e s/x86_64/x86/ \
@ -129,7 +129,7 @@ SUBARCH := $(shell uname -m | sed -e s/i.86/x86/ -e s/x86_64/x86/ \
-e s/sh[234].*/sh/ -e s/aarch64.*/arm64/ )
```
As you can see it executes the [uname](https://en.wikipedia.org/wiki/Uname) util that prints information about machine, operating system and architecture. As it gets the output of `uname`, it parses it and assigns the result to the `SUBARCH` variable. Now that we have `SUBARCH`, we set the `SRCARCH` variable that provides the directory of the certain architecture and `hfr-arch` that provides directory for the header files:
As you can see, it executes the [uname](https://en.wikipedia.org/wiki/Uname) util that prints information about the machine, operating system and architecture. As it gets the output of `uname`, it parses the output and assigns the result to the `SUBARCH` variable. Now that we have `SUBARCH`, we set the `SRCARCH` variable that provides the directory of the given architecture and `hdr-arch` that provides the directory for the header files:
```Makefile
ifeq ($(ARCH),i386)
@ -142,7 +142,7 @@ endif
hdr-arch := $(SRCARCH)
```
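To make the `uname` and `sed` normalisation step above more concrete, here is a minimal C sketch of the same idea. `subarch_from_machine` is a hypothetical name, and it covers only the three `sed` rules visible in the excerpt. Unlike `sed`, it matches from the start of the string, which is a simplification.

```C
#include <assert.h>
#include <string.h>

/*
 * Hypothetical helper: maps a `uname -m` machine string to a SUBARCH name.
 * Only three of the sed rules are handled, and matching is anchored at the
 * start of the string (a simplification compared to sed).
 */
const char *subarch_from_machine(const char *machine)
{
    assert(machine != NULL);

    // s/i.86/x86/ : "i", any single character, then "86"
    if (strlen(machine) == 4 && machine[0] == 'i' &&
        machine[2] == '8' && machine[3] == '6')
        return "x86";

    // s/x86_64/x86/
    if (strcmp(machine, "x86_64") == 0)
        return "x86";

    // s/aarch64.*/arm64/ : anything starting with "aarch64"
    if (strncmp(machine, "aarch64", 7) == 0)
        return "arm64";

    // No rule matched: the name is passed through unchanged.
    return machine;
}
```

In a real program the input string could come from the `machine` field filled in by `uname(2)`.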
Note that `ARCH` is an alias for `SUBARCH`. In the next step we set the `KCONFIG_CONFIG` variable that represents path to the kernel configuration file and if it was not set before, it is set to `.config` by default:
Note that `ARCH` is an alias for `SUBARCH`. In the next step we set the `KCONFIG_CONFIG` variable that represents the path to the kernel configuration file; if it was not set before, it is set to `.config` by default:
```Makefile
KCONFIG_CONFIG ?= .config
@ -166,7 +166,7 @@ HOSTCFLAGS = -Wall -Wmissing-prototypes -Wstrict-prototypes -O2 -fomit-frame-p
HOSTCXXFLAGS = -O2
```
Next we get to the `CC` variable that represents compiler too, so why do we need the `HOST*` variables? `CC` is the target compiler that will be used during kernel compilation, but `HOSTCC` will be used during compilation of the set of the `host` programs (we will see it soon). After this we can see definition of the `KBUILD_MODULES` and `KBUILD_BUILTIN` variables that are used to determine what to compile (kernel, modules or both):
Next we get to the `CC` variable that represents a compiler too, so why do we need the `HOST*` variables? `CC` is the target compiler that will be used during kernel compilation, but `HOSTCC` will be used during compilation of the set of `host` programs (we will see it soon). After this we can see the definition of the `KBUILD_MODULES` and `KBUILD_BUILTIN` variables that are used to determine what to compile (kernel, modules or both):
```Makefile
KBUILD_MODULES :=
@ -177,13 +177,13 @@ ifeq ($(MAKECMDGOALS),modules)
endif
```
Here we can see definition of these variables and the value of the `KBUILD_BUILTIN` will depend on the `CONFIG_MODVERSIONS` kernel configuration parameter if we pass only `modules` to `make`. The next step is including of the:
Here we can see the definition of these variables. The value of the `KBUILD_BUILTIN` variable will depend on the `CONFIG_MODVERSIONS` kernel configuration parameter if we pass only `modules` to `make`. The next step is to include the `kbuild` file:
```Makefile
include scripts/Kbuild.include
```
`kbuild` file. The [Kbuild](https://github.com/torvalds/linux/blob/master/Documentation/kbuild/kbuild.txt) or `Kernel Build System` is the special infrastructure to manage the build of the kernel and its modules. The `kbuild` files has the same syntax that makefiles do. The [scripts/Kbuild.include](https://github.com/torvalds/linux/blob/master/scripts/Kbuild.include) file provides some generic definitions for the `kbuild` system. As we included this `kbuild` files we can see definition of the variables that are related to the different tools that will be used during kernel and modules compilation (like linker, compilers, utils from the [binutils](http://www.gnu.org/software/binutils/), etc...):
The [Kbuild](https://github.com/torvalds/linux/blob/master/Documentation/kbuild/kbuild.txt) or `Kernel Build System` is the special infrastructure to manage the build of the kernel and its modules. The `kbuild` files have the same syntax that makefiles do. The [scripts/Kbuild.include](https://github.com/torvalds/linux/blob/master/scripts/Kbuild.include) file provides some generic definitions for the `kbuild` system. After including this `kbuild` file, we can see the definitions of the variables that are related to the different tools that will be used during kernel and module compilation (like the linker, compilers, utils from the [binutils](http://www.gnu.org/software/binutils/), etc...):
```Makefile
AS = $(CROSS_COMPILE)as


@ -1,29 +1,29 @@
Executable and Linkable Format
================================================================================
ELF (Executable and Linkable Format) is a standard file format for executable files and shared libraries. Linux, as well as, many UNIX-like operating systems uses this format. Let's look on structure of the ELF-64 Object File Format and some defintions in the linux kernel source code related with it.
ELF (Executable and Linkable Format) is a standard file format for executable files, object code, shared libraries, and core dumps. Linux, as well as many other UNIX-like operating systems, uses this format. Let's look at the structure of the ELF-64 file format and some definitions in the linux kernel source code related to it.
An ELF object file consists of the following parts:
An ELF file consists of the following parts:
* ELF header - describes the main characteristics of the object file: type, CPU architecture, the virtual address of the entry point, the size and offset the remaining parts, etc...;
* Program header table - listing the available segments and their attributes. Program header table need loaders for placing sections of the file as virtual memory segments;
* Section header table - contains description of the sections.
* ELF header - describes the main characteristics of the object file: type, CPU architecture, virtual address of the entry point, size and offset of the remaining parts, etc...;
* Program header table - lists the available segments and their attributes. Loaders need the program header table to place sections of the file as virtual memory segments;
* Section header table - contains the description of sections.
Now let's look closer on these components.
**ELF header**
It's located in the beginning of the object file. It's main point is to locate all other parts of the object file. File header contains following fields:
It's located at the beginning of the object file. Its main purpose is to locate all other parts of the object file. The ELF header contains the following fields:
* ELF identification - array of bytes which helps to identify the file as an ELF object file and also provides information about general object file characteristic;
* Object file type - identifies the object file type. This field can describe that ELF file is a relocatable object file, executable file, etc...;
* ELF identification - array of bytes which helps identify this file as an ELF file and also provides information about general object file characteristics;
* Object file type - identifies the object file type. This field can describe whether this file is a relocatable file or executable file, etc...;
* Target architecture;
* Version of the object file format;
* Virtual address of the program entry point;
* File offset of the program header table;
* File offset of the section header table;
* Size of an ELF header;
* Size of a program header table entry;
* Size of the ELF header;
* Size of the program header table entry;
* and other fields...
You can find `elf64_hdr` structure which presents ELF64 header in the linux kernel source code:
@ -47,11 +47,11 @@ typedef struct elf64_hdr {
} Elf64_Ehdr;
```
This structure defined in the [elf.h](https://github.com/torvalds/linux/blob/master/include/uapi/linux/elf.h)
This structure is defined in the [elf.h](https://github.com/torvalds/linux/blob/master/include/uapi/linux/elf.h)
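As a small illustration of the ELF identification bytes described above, the following sketch checks whether a buffer starts like an ELF64 file. `looks_like_elf64` and the `SKETCH_*` macros are hypothetical names introduced only for this example. The constants follow the ELF specification and are redefined locally so that the sketch does not depend on any particular header.

```C
#include <assert.h>
#include <stddef.h>

/* Values from the ELF specification, redefined so the sketch is self-contained. */
#define SKETCH_EI_CLASS   4   /* index of the class byte in the identification array */
#define SKETCH_ELFCLASS64 2   /* class value for 64-bit objects */

/* Hypothetical helper: returns 1 if buf starts like an ELF64 file, else 0. */
int looks_like_elf64(const unsigned char *buf, size_t len)
{
    assert(buf != NULL);

    /* We need the four magic bytes plus the class byte. */
    if (len < 5)
        return 0;

    /* Magic number: 0x7f followed by the ASCII letters 'E', 'L', 'F'. */
    if (buf[0] != 0x7f || buf[1] != 'E' || buf[2] != 'L' || buf[3] != 'F')
        return 0;

    return buf[SKETCH_EI_CLASS] == SKETCH_ELFCLASS64;
}
```

A caller would typically read the first bytes of a file such as `vmlinux` into a buffer and pass them to this function before interpreting the rest of the header.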
**Sections**
All data is stored in sections in an Elf object file. Sections identified by index in the section header table. Section header contains following fields:
All data is stored in sections in an ELF file. Sections are identified by index in the section header table. A section header contains the following fields:
* Section name;
* Section type;
@ -64,7 +64,7 @@ All data is stored in sections in an Elf object file. Sections identified by ind
* Address alignment boundary;
* Size of entries, if section has table;
And presented with the following `elf64_shdr` structure in the linux kernel:
It is represented by the following `elf64_shdr` structure in the linux kernel source code:
```C
typedef struct elf64_shdr {
@ -83,7 +83,7 @@ typedef struct elf64_shdr {
**Program header table**
All sections are grouped into segments in an executable or shared object file. Program header is an array of structures which describe every segment. It looks like:
All sections are grouped into segments in an executable file or shared library. Program header table is an array of structures which describe every segment. It looks like:
```C
typedef struct elf64_phdr {
@ -98,16 +98,14 @@ typedef struct elf64_phdr {
} Elf64_Phdr;
```
in the linux kernel source code.
The `elf64_phdr` structure is defined in the same [elf.h](https://github.com/torvalds/linux/blob/master/include/uapi/linux/elf.h).
`elf64_phdr` defined in the same [elf.h](https://github.com/torvalds/linux/blob/master/include/uapi/linux/elf.h).
And ELF object file also contains other fields/structures which you can find in the [Documentation](http://www.uclibc.org/docs/elf-64-gen.pdf). Now let's look on the `vmlinux`.
An ELF file also contains other fields/structures which you can find in the [Documentation](http://www.uclibc.org/docs/elf-64-gen.pdf). Now let's look at `vmlinux`.
vmlinux
--------------------------------------------------------------------------------
`vmlinux` is relocatable ELF object file too. So we can look at it with the `readelf` util. First of all let's look on a header:
`vmlinux` is an ELF file too. So we can look at it with the `readelf` util. First of all, let's look at the ELF header of `vmlinux`:
```
$ readelf -h vmlinux
@ -144,15 +142,15 @@ ffffffff80000000 - ffffffffa0000000 (=512 MB) kernel text mapping, from phys 0
So we can find it in the `vmlinux` with:
```
readelf -s vmlinux | grep ffffffff81000000
$ readelf -s vmlinux | grep ffffffff81000000
1: ffffffff81000000 0 SECTION LOCAL DEFAULT 1
65099: ffffffff81000000 0 NOTYPE GLOBAL DEFAULT 1 _text
90766: ffffffff81000000 0 NOTYPE GLOBAL DEFAULT 1 startup_64
```
Note that here is address of the `startup_64` routine is not `ffffffff80000000`, but `ffffffff81000000` and now i'll explain why.
Note that the address of the `startup_64` routine is not `ffffffff80000000`, but `ffffffff81000000`. Now I'll explain why.
We can see following definition in the [arch/x86/kernel/vmlinux.lds.S](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/vmlinux.lds.S):
We can see the following definition in the [arch/x86/kernel/vmlinux.lds.S](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/vmlinux.lds.S):
```
. = __START_KERNEL;
@ -176,10 +174,11 @@ Where `__START_KERNEL` is:
`__START_KERNEL_map` is the value from documentation - `ffffffff80000000` and `__PHYSICAL_START` is `0x1000000`. That's why address of the `startup_64` is `ffffffff81000000`.
And the last we can get program headers from `vmlinux` with the following command:
Finally, we can get the program headers from `vmlinux` with the following command:
```
readelf -l vmlinux
$ readelf -l vmlinux
Elf file type is EXEC (Executable file)
Entry point 0x1000000


@ -4,19 +4,19 @@ Paging
Introduction
--------------------------------------------------------------------------------
In the fifth [part](http://0xax.gitbooks.io/linux-insides/content/Booting/linux-bootstrap-5.html) of the series `Linux kernel booting process` we finished to learn what and how kernel does on the earliest stage. In the next step kernel will initialize different things like `initrd` mounting, lockdep initialization, and many many different things, before we can see how the kernel will run the first init process.
In the fifth [part](http://0xax.gitbooks.io/linux-insides/content/Booting/linux-bootstrap-5.html) of the series `Linux kernel booting process` we learned about what the kernel does in its earliest stage. In the next step the kernel will initialize different things like `initrd` mounting, lockdep initialization, and many other things, before we can see how the kernel runs the first init process.
Yes, there will be many different things, but a great many of them involve working with **memory**.
In my view, memory management is one of the most complex part of the linux kernel and in system programming generally. So before we will proceed with the kernel initialization stuff, we will get acquainted with the paging.
In my view, memory management is one of the most complex parts of the linux kernel and of system programming in general. This is why before we proceed with the kernel initialization stuff, we need to get acquainted with paging.
`Paging` is a process of translation a linear memory address to a physical address. If you have read previous parts, you can remember that we saw segmentation in the real mode when physical address calculated by shifting a segment register on four and adding offset. Or also we saw segmentation in the protected mode, where we used the tables of descriptors and base addresses from descriptors with offsets to calculate physical addresses. Now we are in 64-bit mode and that we will see paging.
`Paging` is a mechanism that translates a linear memory address to a physical address. If you have read the previous parts of this book, you may remember that we saw segmentation in real mode, where physical addresses are calculated by shifting a segment register left by four bits and adding an offset. We also saw segmentation in protected mode, where we used the descriptor tables and base addresses from descriptors with offsets to calculate the physical addresses. Now that we are in 64-bit mode, we will see paging.
As Intel manual says:
As the Intel manual says:
> Paging provides a mechanism for implementing a conventional demand-paged, virtual-memory system where sections of a program's execution environment are mapped into physical memory as needed.
So... I will try to explain how paging works in theory in this post. Of course it will be closely related with the linux kernel for `x86_64`, but we will not go into deep details (at least in this post).
So... In this post I will try to explain the theory behind paging. Of course it will be closely related to the `x86_64` version of the linux kernel, but we will not go into too much detail (at least in this post).
Enabling paging
--------------------------------------------------------------------------------
@ -27,13 +27,13 @@ There are three paging modes:
* PAE paging;
* IA-32e paging.
We will see explanation only last mode here. To enable `IA-32e paging` paging mode need to do following things:
We will only explain the last mode here. To enable the `IA-32e paging` mode we need to do the following things:
* set `CR0.PG` bit;
* set `CR4.PAE` bit;
* set `IA32_EFER.LME` bit.
* set the `CR0.PG` bit;
* set the `CR4.PAE` bit;
* set the `IA32_EFER.LME` bit.
We already saw setting of this bits in the [arch/x86/boot/compressed/head_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/boot/compressed/head_64.S):
We already saw where these bits were set in [arch/x86/boot/compressed/head_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/boot/compressed/head_64.S):
```assembly
movl $(X86_CR0_PG | X86_CR0_PE), %eax
@ -52,14 +52,14 @@ wrmsr
Paging structures
--------------------------------------------------------------------------------
Paging divides the linear address space into fixed-size pages. Pages can be mapped into the physical address space or even external storage. This fixed size is `4096` bytes for the `x86_64` linux kernel. For a linear address translation to a physical address used special structures. Every structure is `4096` bytes size and contains `512` entries (this only for `PAE` and `IA32_EFER.LME` modes). Paging structures are hierarchical and linux kernel uses 4 level paging for `x86_64`. CPU uses a part of the linear address to identify entry of the another paging structure which is at the lower level or physical memory region (`page frame`) or physical address in this region (`page offset`). The address of the top level paging structure located in the `cr3` register. We already saw this in the [arch/x86/boot/compressed/head_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/boot/compressed/head_64.S):
Paging divides the linear address space into fixed-size pages. Pages can be mapped into the physical address space or even external storage. This fixed size is `4096` bytes for the `x86_64` linux kernel. Special structures are used to translate a linear address to a physical address. Every structure is `4096` bytes in size and contains `512` entries (this is only for `PAE` and `IA32_EFER.LME` modes). Paging structures are hierarchical and the linux kernel uses 4 levels of paging in the `x86_64` architecture. The CPU uses a part of the linear address to identify the entry in another paging structure which is at the lower level, or a physical memory region (`page frame`), or a physical address in this region (`page offset`). The address of the top level paging structure is located in the `cr3` register. We already saw this in [arch/x86/boot/compressed/head_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/boot/compressed/head_64.S):
```assembly
leal pgtable(%ebx), %eax
movl %eax, %cr3
```
We built page table structures and put the address of the top-level structure to the `cr3` register. Here `cr3` is used to store the address of the top-level `PML4` structure or `Page Global Directory` as it calls in linux kernel. `cr3` is 64-bit register and has the following structure:
We built the page table structures and put the address of the top-level structure in the `cr3` register. Here `cr3` is used to store the address of the top-level structure, the `PML4` or `Page Global Directory` as it is called in the linux kernel. `cr3` is a 64-bit register and has the following structure:
```
63 52 51 32
@ -86,7 +86,7 @@ These fields have the following meanings:
The linear address translation process is as follows:
* Given linear address arrives to the [MMU](http://en.wikipedia.org/wiki/Memory_management_unit) instead of memory bus.
* A given linear address arrives at the [MMU](http://en.wikipedia.org/wiki/Memory_management_unit) instead of the memory bus.
* The 64-bit linear address is split into several parts. Only the low 48 bits are significant, which means that `2^48` bytes or 256 TBytes of linear-address space may be accessed at any given time.
* The `cr3` register stores the address of the top-level (level-4) paging structure.
* Bits `47:39` of the given linear address store an index into the level-4 paging structure, bits `38:30` store an index into the level-3 paging structure, bits `29:21` store an index into the level-2 paging structure, bits `20:12` store an index into the level-1 paging structure and bits `11:0` provide the byte offset into the physical page.
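The bit ranges listed above can be extracted with shifts and masks. The sketch below is illustrative only. `struct linear_addr_parts` and `split_linear_addr` are hypothetical names, not kernel APIs.

```C
#include <assert.h>
#include <stdint.h>

/* Hypothetical container for the five fields of a 48-bit linear address. */
struct linear_addr_parts {
    unsigned level4;   /* bits 47:39, index into the level-4 structure */
    unsigned level3;   /* bits 38:30, index into the level-3 structure */
    unsigned level2;   /* bits 29:21, index into the level-2 structure */
    unsigned level1;   /* bits 20:12, index into the level-1 structure */
    unsigned offset;   /* bits 11:0, byte offset into the 4096-byte page */
};

struct linear_addr_parts split_linear_addr(uint64_t addr)
{
    struct linear_addr_parts p;

    /* Each index is 9 bits wide because every structure has 512 entries. */
    p.level4 = (unsigned)((addr >> 39) & 0x1ff);
    p.level3 = (unsigned)((addr >> 30) & 0x1ff);
    p.level2 = (unsigned)((addr >> 21) & 0x1ff);
    p.level1 = (unsigned)((addr >> 12) & 0x1ff);
    /* The offset is 12 bits wide because a page is 4096 bytes. */
    p.offset = (unsigned)(addr & 0xfff);
    return p;
}
```

For example, the address `0xffffffff81000000` splits into the indexes 511, 510, 8 and 0, with a page offset of 0.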
@ -95,7 +95,7 @@ schematically, we can imagine it like this:
![4-level paging](http://oi58.tinypic.com/207mb0x.jpg)
Every access to a linear address is either a supervisor-mode access or a user-mode access. This access determined by the `CPL` (current privilege level). If `CPL < 3` it is a supervisor mode access level and user mode access level in other ways. For example top level page table entry contains access bits and has the following structure:
Every access to a linear address is either a supervisor-mode access or a user-mode access. This access is determined by the `CPL` (current privilege level). If `CPL < 3` it is a supervisor mode access level, otherwise it is a user mode access level. For example, the top level page table entry contains access bits and has the following structure:
```
63 62 52 51 32
@ -126,19 +126,19 @@ Where:
* R/W - read/write bit controls read/write access to all physical pages mapped by this table entry;
* P - present bit. This bit indicates whether the page table or physical page is loaded into primary memory.
Ok, now we know about paging structures and it's entries. Let's see some details about 4-level paging in linux kernel.
Ok, we know about the paging structures and their entries. Now let's see some details about 4-level paging in the linux kernel.
Paging structures in linux kernel
Paging structures in the linux kernel
--------------------------------------------------------------------------------
As i wrote about linux kernel for `x86_64` uses 4-level page tables. Their names are:
As we've seen, the linux kernel on `x86_64` uses 4-level page tables. Their names are:
* Page Global Directory
* Page Upper Directory
* Page Middle Directory
* Page Table Entry
After that you compiled and installed linux kernel, you can note `System.map` file which stores address of the functions that are used by the kernel. Note that addresses are virtual. For example:
After you've compiled and installed the linux kernel, you can see the `System.map` file which stores the virtual addresses of the functions that are used by the kernel. For example:
```
$ grep "start_kernel" System.map
@ -146,7 +146,7 @@ ffffffff81efe497 T x86_64_start_kernel
ffffffff81efeaa2 T start_kernel
```
We can see `0xffffffff81efe497` here. I'm not sure that you have so big RAM. But anyway `start_kernel` and `x86_64_start_kernel` will be executed. The address space in `x86_64` is `2^64` size, but it's too large, that's why used smaller address space, only 48-bits wide. So we have situation when physical address limited with 48 bits, but addressing still performed with 64 bit pointers. How to solve this problem? Ok, look on the diagram:
We can see `0xffffffff81efe497` here. I doubt you really have that much RAM installed. But anyway, `start_kernel` and `x86_64_start_kernel` will be executed. The address space in `x86_64` is `2^64` bytes in size, but that is too large, which is why a smaller address space is used, only 48 bits wide. So we have a situation where the virtual address space is limited to 48 bits, but addressing is still performed with 64-bit pointers. How is this problem solved? Look at this diagram:
```
0xffffffffffffffff +-----------+
@ -166,12 +166,12 @@ We can see `0xffffffff81efe497` here. I'm not sure that you have so big RAM. But
0x0000000000000000+-----------+
```
This solution is `sign extension`. Here we can see that low 48 bits of a virtual address can be used for addressing. Bits `63:48` can be or 0 or 1. Note that all virtual address space is spliten on 2 parts:
This solution is `sign extension`. Here we can see that the lower 48 bits of a virtual address can be used for addressing. Bits `63:48` can be either only zeroes or only ones. Note that the virtual address space is split into two parts:
* Kernel space
* Userspace
Userspace occupies the lower part of the virtual address space, from `0x000000000000000` to `0x00007fffffffffff` and kernel space occupies the highest part from the `0xffff8000000000` to `0xffffffffffffffff`. Note that bits `63:48` is 0 for userspace and 1 for kernel space. All addresses which are in kernel space and in userspace or in another words which higher `63:48` bits zero or one calls `canonical` addresses. There is `non-canonical` area between these memory regions. Together this two memory regions (kernel space and user space) are exactly `2^48` bits. We can find virtual memory map with 4 level page tables in the [Documentation/x86/x86_64/mm.txt](https://github.com/torvalds/linux/blob/master/Documentation/x86/x86_64/mm.txt):
Userspace occupies the lower part of the virtual address space, from `0x0000000000000000` to `0x00007fffffffffff`, and kernel space occupies the highest part, from `0xffff800000000000` to `0xffffffffffffffff`. Note that bits `63:48` are 0 for userspace and 1 for kernel space. All addresses in kernel space and in userspace, or in other words all addresses whose bits `63:48` are all zeroes or all ones, are called `canonical` addresses. There is a `non-canonical` area between these memory regions. Together these two memory regions (kernel space and user space) cover exactly `2^48` bytes. We can find the virtual memory map with 4 level page tables in the [Documentation/x86/x86_64/mm.txt](https://github.com/torvalds/linux/blob/master/Documentation/x86/x86_64/mm.txt):
```
0000000000000000 - 00007fffffffffff (=47 bits) user space, different per mm
@ -193,15 +193,15 @@ ffffffffff600000 - ffffffffffdfffff (=8 MB) vsyscalls
ffffffffffe00000 - ffffffffffffffff (=2 MB) unused hole
```
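The canonical-address rule described above can be checked in a few lines of C. This is only a minimal sketch, assuming 48 implemented address bits as in the 4-level layout; the names `is_canonical` and `is_kernel_address` are hypothetical helpers written for this illustration, not kernel functions.

```C
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: an address is canonical when bits 63:47 are
 * either all zeroes or all ones, i.e. bits 63:48 repeat bit 47.
 */
static bool is_canonical(uint64_t addr)
{
    uint64_t top = addr >> 47;          /* the upper 17 bits, 63:47 */

    return top == 0 || top == 0x1ffff;  /* all zeroes or all ones */
}

/* Hypothetical helper: canonical addresses with bit 63 set are kernel space. */
static bool is_kernel_address(uint64_t addr)
{
    return is_canonical(addr) && (addr >> 63) != 0;
}
```

With these definitions, `0x00007fffffffffff` and `0xffff800000000000` are canonical, while `0x0000800000000000` falls into the non-canonical area.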
We can see here memory map for user space, kernel space and non-canonical area between. User space memory map is simple. Let's take a closer look on the kernel space. We can see that it starts from the guard hole which reserved for hypervisor. We can find definition of this guard hole in the [arch/x86/include/asm/page_64_types.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/page_64_types.h):
We can see here the memory map for user space, kernel space and the non-canonical area in-between them. The user space memory map is simple. Let's take a closer look at the kernel space. We can see that it starts from the guard hole which is reserved for the hypervisor. We can find the definition of this guard hole in [arch/x86/include/asm/page_64_types.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/page_64_types.h):
```C
#define __PAGE_OFFSET _AC(0xffff880000000000, UL)
```
Previously this guard hole and `__PAGE_OFFSET` was from `0xffff800000000000` to `0xffff80ffffffffff` for preventing of access to non-canonical area, but later was added 3 bits for hypervisor.
Previously this guard hole and `__PAGE_OFFSET` were from `0xffff800000000000` to `0xffff80ffffffffff` to prevent access to the non-canonical area, but the hole was later extended by 3 bits for the hypervisor.
Next is the lowest usable address in kernel space - `ffff880000000000`. This virtual memory region is for direct mapping of the all physical memory. After the memory space which mapped all physical address - guard hole, it needs to be between direct mapping of the all physical memory and vmalloc area. After the virtual memory map for the first terabyte and unused hole after it, we can see `kasan` shadow memory. It was added by the [commit](https://github.com/torvalds/linux/commit/ef7f0d6a6ca8c9e4b27d78895af86c2fbfaeedb2) and provides kernel address sanitizer. After next unused hole we can se `esp` fixup stacks (we will talk about it in the other parts) and the start of the kernel text mapping from the physical address - `0`. We can find definition of this address in the same file as the `__PAGE_OFFSET`:
Next is the lowest usable address in kernel space - `ffff880000000000`. This virtual memory region is for the direct mapping of all physical memory. After the memory space which maps all physical addresses comes another guard hole; it needs to sit between the direct mapping of all physical memory and the vmalloc area. After the virtual memory map for the first terabyte and the unused hole after it, we can see the `kasan` shadow memory. It was added by this [commit](https://github.com/torvalds/linux/commit/ef7f0d6a6ca8c9e4b27d78895af86c2fbfaeedb2) and provides the kernel address sanitizer. After the next unused hole we can see the `esp` fixup stacks (we will talk about them in other parts of this book) and the start of the kernel text mapping from the physical address - `0`. We can find the definition of this address in the same file as `__PAGE_OFFSET`:
```C
#define __START_KERNEL_map _AC(0xffffffff80000000, UL)
@ -218,9 +218,9 @@ readelf -s vmlinux | grep ffffffff81000000
Here I checked `vmlinux` with `CONFIG_PHYSICAL_START` set to `0x1000000`. So we have the start point of the kernel `.text` - `0xffffffff80000000` and the offset - `0x1000000`; the resulting virtual address will be `0xffffffff80000000 + 0x1000000 = 0xffffffff81000000`.
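The same arithmetic can be written as a small C sketch. The macro names `START_KERNEL_MAP` and `PHYSICAL_START` and the function `kernel_text_vaddr` are hypothetical stand-ins for this illustration; they only mirror the values of `__START_KERNEL_map` and `CONFIG_PHYSICAL_START` quoted above.

```C
#include <stdint.h>

/* Illustrative copies of the values discussed in the text. */
#define START_KERNEL_MAP 0xffffffff80000000ULL /* value of __START_KERNEL_map */
#define PHYSICAL_START   0x1000000ULL          /* CONFIG_PHYSICAL_START of the checked build */

/* Hypothetical helper: base of the kernel text mapping plus the load offset. */
static uint64_t kernel_text_vaddr(uint64_t map_base, uint64_t phys_start)
{
    return map_base + phys_start;
}
```

Calling `kernel_text_vaddr(START_KERNEL_MAP, PHYSICAL_START)` gives `0xffffffff81000000`, the address that was searched for with `readelf`.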
After the kernel `.text` region, we can see virtual memory region for kernel modules, `vsyscalls` and 2 megabytes unused hole.
After the kernel `.text` region there is the virtual memory region for kernel modules, `vsyscalls` and an unused hole of 2 megabytes.
We know how looks kernel's virtual memory map and now we can see how a virtual address translates into physical. Let's take for example following address:
We've seen how the kernel's virtual memory map is laid out; now we can see how a virtual address is translated into a physical one. Let's take the following address as an example:
```
0xffffffff81000000
@ -233,7 +233,7 @@ In binary it will be:
63:48 47:39 38:30 29:21 20:12 11:0
```
The given virtual address split on some parts as i wrote above:
This virtual address is split in parts as described above:
* `63:48` - bits not used for translation (sign extension of bit `47`);
* `47:39` - bits store an index into the paging structure level-4;
@ -242,14 +242,14 @@ The given virtual address split on some parts as i wrote above:
* `20:12` - bits store an index into the paging structure level-1;
* `11:0` - bits provide the byte offset into the physical page.
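The split listed above can be sketched in C with shifts and masks. This is an illustration under the 4-level layout described in this part, with 9-bit indexes and a 12-bit page offset; the struct `vaddr_parts` and the function `split_vaddr` are hypothetical names, not kernel code.

```C
#include <stdint.h>

/* Hypothetical container for the fields of a 48-bit virtual address. */
struct vaddr_parts {
    unsigned level4; /* bits 47:39 */
    unsigned level3; /* bits 38:30 */
    unsigned level2; /* bits 29:21 */
    unsigned level1; /* bits 20:12 */
    unsigned offset; /* bits 11:0  */
};

/* Extract each 9-bit table index and the 12-bit byte offset. */
static struct vaddr_parts split_vaddr(uint64_t addr)
{
    struct vaddr_parts p;

    p.level4 = (unsigned)((addr >> 39) & 0x1ff);
    p.level3 = (unsigned)((addr >> 30) & 0x1ff);
    p.level2 = (unsigned)((addr >> 21) & 0x1ff);
    p.level1 = (unsigned)((addr >> 12) & 0x1ff);
    p.offset = (unsigned)(addr & 0xfff);

    return p;
}
```

For `0xffffffff81000000` this yields the indexes 511, 510, 8 and 0 for levels 4 down to 1, with a byte offset of 0.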
That is all. Now you know a little about `paging` theory and we can go ahead in the kernel source code and see first initialization steps.
That is all. Now you know a little about the theory of `paging` and we can go ahead in the kernel source code and see the first initialization steps.
Conclusion
--------------------------------------------------------------------------------
It's the end of this short part about paging theory. Of course this post doesn't cover all details about paging, but soon we will see it on practice how linux kernel builds paging structures and work with it.
This is the end of this short part about paging theory. Of course this post doesn't cover every detail of paging, but soon we'll see in practice how the Linux kernel builds paging structures and works with them.
**Please note that English is not my first language and I am really sorry for any inconvenience. If you found any mistakes please send me PR to [linux-internals](https://github.com/0xAX/linux-internals).**
**Please note that English is not my first language and I am really sorry for any inconvenience. If you've found any mistakes please send me a PR to [linux-internals](https://github.com/0xAX/linux-internals).**
Links

View File

@ -66,3 +66,4 @@ Thank you to all contributors:
* [Waqar Ahmed](https://github.com/Waqar144)
* [Ian Miell](https://github.com/ianmiell)
* [DongLiang Mu](https://github.com/mudongliang)
* [Johan Manuel](https://github.com/29jm)