Kernel initialization. Part 1. ================================================================================ First steps in the kernel code -------------------------------------------------------------------------------- In the previous post (`Kernel booting process. Part 5.`) - [Kernel decompression](http://0xax.gitbooks.io/linux-insides/content/Booting/linux-bootstrap-5.html) we stopped at the [jump](https://github.com/torvalds/linux/blob/master/arch/x86/boot/compressed/head_64.S) on the decompressed kernel: ```assembly jmp *%rax ``` and now we are in the kernel. There are many things to do before the kernel will start first `init` process. Hope we will see all of the preparations before kernel will start in this big chapter. We will start from the kernel entry point, which is in the [arch/x86/kernel/head_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/head_64.S). We will see first preparations like early page tables initialization, switch to a new descriptor in kernel space and many many more, before we will see the `start_kernel` function from the [init/main.c](https://github.com/torvalds/linux/blob/master/init/main.c#L489) will be called. So let's start. First steps in the kernel -------------------------------------------------------------------------------- Okay, we got address of the kernel from the `decompress_kernel` function into `rax` register and just jumped there. Decompressed kernel code starts in the [arch/x86/kernel/head_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/head_64.S): ```assembly __HEAD .code64 .globl startup_64 startup_64: ... ... ... ``` We can see definition of the `startup_64` routine and it defined in the `__HEAD` section, which is just: ```C #define __HEAD .section ".head.text","ax" ``` We can see definition of this section in the [arch/x86/kernel/vmlinux.lds.S](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/vmlinux.lds.S#L93) linker script: ``` .text : AT(ADDR(.text) - LOAD_OFFSET) { _text = .; ... ... ... } :text = 0x9090 ``` We can understand default virtual and physical addresses from the linker script. Note that address of the `_text` is location counter which is defined as: ``` . = __START_KERNEL; ``` for `x86_64`. We can find definition of the `__START_KERNEL` macro in the [arch/x86/include/asm/page_types.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/page_types.h): ```C #define __START_KERNEL (__START_KERNEL_map + __PHYSICAL_START) #define __PHYSICAL_START ALIGN(CONFIG_PHYSICAL_START, CONFIG_PHYSICAL_ALIGN) ``` Here we can see that `__START_KERNEL` is the sum of the `__START_KERNEL_map` (which is `0xffffffff80000000`, see post about [paging](http://0xax.gitbooks.io/linux-insides/content/Theory/Paging.html)) and `__PHYSICAL_START`. Where `__PHYSICAL_START` is aligned value of the `CONFIG_PHYSICAL_START`. So if you will not use [kASLR](http://en.wikipedia.org/wiki/Address_space_layout_randomization) and will not change `CONFIG_PHYSICAL_START` in the configuration addresses will be following: * Physical address - `0x1000000`; * Virtual address - `0xffffffff81000000`. Now we know default physical and virtual addresses of the `startup_64` routine, but to know actual addresses we must to calculate it with the following code: ```assembly leaq _text(%rip), %rbp subq $_text - __START_KERNEL_map, %rbp ``` Here we just put the `rip-relative` address to the `rbp` register and then subtract `$_text - __START_KERNEL_map` from it. We know that compiled address of the `_text` is `0xffffffff81000000` and `__START_KERNEL_map` contains `0xffffffff81000000`, so `rbp` will contain physical address of the `text` - `0x1000000` after this calculation. We need to calculate it because kernel can't be run on the default address, but now we know the actual physical address. In the next step we checks that this address is aligned with: ```assembly movq %rbp, %rax andl $~PMD_PAGE_MASK, %eax testl %eax, %eax jnz bad_address ``` Here we just put address to the `%rax` and test first bit. `PMD_PAGE_MASK` indicates the mask for `Page middle directory` (read [paging](http://0xax.gitbooks.io/linux-insides/content/Theory/Paging.html) about it) and defined as: ```C #define PMD_PAGE_MASK (~(PMD_PAGE_SIZE-1)) #define PMD_PAGE_SIZE (_AC(1, UL) << PMD_SHIFT) #define PMD_SHIFT 21 ``` As we can easily calculate, `PMD_PAGE_SIZE` is 2 megabytes. Here we use standard formula for checking alignment and if `text` address is not aligned for 2 megabytes, we jump to `bad_address` label. After this we check address that it is not too large: ```assembly leaq _text(%rip), %rax shrq $MAX_PHYSMEM_BITS, %rax jnz bad_address ``` Address most not be greater than 46-bits: ```C #define MAX_PHYSMEM_BITS 46 ``` Okay, we did some early checks and now we can move on. Fix base addresses of page tables -------------------------------------------------------------------------------- The first step before we started to setup identity paging, need to correct following addresses: ```assembly addq %rbp, early_level4_pgt + (L4_START_KERNEL*8)(%rip) addq %rbp, level3_kernel_pgt + (510*8)(%rip) addq %rbp, level3_kernel_pgt + (511*8)(%rip) addq %rbp, level2_fixmap_pgt + (506*8)(%rip) ``` Here we need to correct `early_level4_pgt` and other addresses of the page table directories, because as I wrote above, kernel can't be run at the default `0x1000000` address. `rbp` register contains actual address so we add to the `early_level4_pgt`, `level3_kernel_pgt` and `level2_fixmap_pgt`. Let's try to understand what these labels means. First of all let's look on their definition: ```assembly NEXT_PAGE(early_level4_pgt) .fill 511,8,0 .quad level3_kernel_pgt - __START_KERNEL_map + _PAGE_TABLE NEXT_PAGE(level3_kernel_pgt) .fill L3_START_KERNEL,8,0 .quad level2_kernel_pgt - __START_KERNEL_map + _KERNPG_TABLE .quad level2_fixmap_pgt - __START_KERNEL_map + _PAGE_TABLE NEXT_PAGE(level2_kernel_pgt) PMDS(0, __PAGE_KERNEL_LARGE_EXEC, KERNEL_IMAGE_SIZE/PMD_SIZE) NEXT_PAGE(level2_fixmap_pgt) .fill 506,8,0 .quad level1_fixmap_pgt - __START_KERNEL_map + _PAGE_TABLE .fill 5,8,0 NEXT_PAGE(level1_fixmap_pgt) .fill 512,8,0 ``` Looks hard, but it is not true. First of all let's look on the `early_level4_pgt`. It starts with the (4096 - 8) bytes of zeros, it means that we don't use first 511 `early_level4_pgt` entries. And after this we can see `level3_kernel_pgt` entry. Note that we subtract `__START_KERNEL_map + _PAGE_TABLE` from it. As we know `__START_KERNEL_map` is a base virtual address of the kernel text, so if we subtract `__START_KERNEL_map`, we will get physical address of the `level3_kernel_pgt`. Now let's look on `_PAGE_TABLE`, it is just page entry access rights: ```C #define _PAGE_TABLE (_PAGE_PRESENT | _PAGE_RW | _PAGE_USER | \ _PAGE_ACCESSED | _PAGE_DIRTY) ``` more about it, you can read in the [paging](http://0xax.gitbooks.io/linux-insides/content/Theory/Paging.html) post. `level3_kernel_pgt` - stores entries which map kernel space. At the start of it's definition, we can see that it filled with zeros `L3_START_KERNEL` times. Here `L3_START_KERNEL` is the index in the page upper directory which contains `__START_KERNEL_map` address and it equals `510`. After it we can see definition of two `level3_kernel_pgt` entries: `level2_kernel_pgt` and `level2_fixmap_pgt`. First is simple, it is page table entry which contains pointer to the page middle directory which maps kernel space and it has: ```C #define _KERNPG_TABLE (_PAGE_PRESENT | _PAGE_RW | _PAGE_ACCESSED | \ _PAGE_DIRTY) ``` access rights. The second - `level2_fixmap_pgt` is a virtual addresses which can refer to any physical addresses even under kernel space. The next `level2_kernel_pgt` calls `PDMS` macro which creates 512 megabytes from the `__START_KERNEL_map` for kernel text (after these 512 megabytes will be modules memory space). Now we know Let's back to our code which is in the beginning of the section. Remember that `rbp` contains actual physical address of the `_text` section. We just add this address to the base address of the page tables, that they'll have correct addresses: ```assembly addq %rbp, early_level4_pgt + (L4_START_KERNEL*8)(%rip) addq %rbp, level3_kernel_pgt + (510*8)(%rip) addq %rbp, level3_kernel_pgt + (511*8)(%rip) addq %rbp, level2_fixmap_pgt + (506*8)(%rip) ``` At the first line we add `rbp` to the `early_level4_pgt`, at the second line we add `rbp` to the `level2_kernel_pgt`, at the third line we add `rbp` to the `level2_fixmap_pgt` and add `rbp` to the `level1_fixmap_pgt`. After all of this we will have: ``` early_level4_pgt[511] -> level3_kernel_pgt[0] level3_kernel_pgt[510] -> level2_kernel_pgt[0] level3_kernel_pgt[511] -> level2_fixmap_pgt[0] level2_kernel_pgt[0] -> 512 MB kernel mapping level2_fixmap_pgt[506] -> level1_fixmap_pgt ``` As we corrected base addresses of the page tables, we can start to build it. Identity mapping setup -------------------------------------------------------------------------------- Now we can see set up the identity mapping early page tables. Identity Mapped Paging is a virtual addresses which are mapped to physical addresses that have the same value, `1 : 1`. Let's look on it in details. First of all we get the `rip-relative` address of the `_text` and `_early_level4_pgt` and put they into `rdi` and `rbx` registers: ```assembly leaq _text(%rip), %rdi leaq early_level4_pgt(%rip), %rbx ``` After this we store physical address of the `_text` in the `rax` and get the index of the page global directory entry which stores `_text` address, by shifting `_text` address on the `PGDIR_SHIFT`: ```assembly movq %rdi, %rax shrq $PGDIR_SHIFT, %rax leaq (4096 + _KERNPG_TABLE)(%rbx), %rdx movq %rdx, 0(%rbx,%rax,8) movq %rdx, 8(%rbx,%rax,8) ``` where `PGDIR_SHIFT` is `39`. `PGDIR_SHFT` indicates the mask for page global directory bits in a virtual address. There are macro for all types of page directories: ```C #define PGDIR_SHIFT 39 #define PUD_SHIFT 30 #define PMD_SHIFT 21 ``` After this we put the address of the first `level3_kernel_pgt` to the `rdx` with the `_KERNPG_TABLE` access rights (see above) and fill the `early_level4_pgt` with the 2 `level3_kernel_pgt` entries. After this we add `4096` (size of the `early_level4_pgt`) to the `rdx` (it now contains the address of the first entry of the `level3_kernel_pgt`) and put `rdi` (it now contains physical address of the `_text`) to the `rax`. And after this we write addresses of the two page upper directory entries to the `level3_kernel_pgt`: ```assembly addq $4096, %rdx movq %rdi, %rax shrq $PUD_SHIFT, %rax andl $(PTRS_PER_PUD-1), %eax movq %rdx, 4096(%rbx,%rax,8) incl %eax andl $(PTRS_PER_PUD-1), %eax movq %rdx, 4096(%rbx,%rax,8) ``` In the next step we write addresses of the page middle directory entries to the `level2_kernel_pgt` and the last step is correcting of the kernel text+data virtual addresses: ```assembly leaq level2_kernel_pgt(%rip), %rdi leaq 4096(%rdi), %r8 1: testq $1, 0(%rdi) jz 2f addq %rbp, 0(%rdi) 2: addq $8, %rdi cmp %r8, %rdi jne 1b ``` Here we put the address of the `level2_kernel_pgt` to the `rdi` and address of the page table entry to the `r8` register. Next we check the present bit in the `level2_kernel_pgt` and if it is zero we're moving to the next page by adding 8 bytes to `rdi` which contaitns address of the `level2_kernel_pgt`. After this we compare it with `r8` (contains address of the page table entry) and go back to label `1` or move forward. In the next step we correct `phys_base` physical address with `rbp` (contains physical address of the `_text`), put physical address of the `early_level4_pgt` and jump to label `1`: ```assembly addq %rbp, phys_base(%rip) movq $(early_level4_pgt - __START_KERNEL_map), %rax jmp 1f ``` where `phys_base` mathes the first entry of the `level2_kernel_pgt` which is 512 MB kernel mapping. Last preparations -------------------------------------------------------------------------------- After that we jumped to the label `1` we enable `PAE`, `PGE` (Paging Global Extension) and put the physical address of the `phys_base` (see above) to the `rax` register and fill `cr3` register with it: ```assembly 1: movl $(X86_CR4_PAE | X86_CR4_PGE), %ecx movq %rcx, %cr4 addq phys_base(%rip), %rax movq %rax, %cr3 ``` In the next step we check that CPU support [NX](http://en.wikipedia.org/wiki/NX_bit) bit with: ```assembly movl $0x80000001, %eax cpuid movl %edx,%edi ``` We put `0x80000001` value to the `eax` and execute `cpuid` instruction for getting extended processor info and feature bits. The result will be in the `edx` register which we put to the `edi`. Now we put `0xc0000080` or `MSR_EFER` to the `ecx` and call `rdmsr` instruction for the reading model specific register. ```assembly movl $MSR_EFER, %ecx rdmsr ``` The result will be in the `edx:eax`. General view of the `EFER` is following: ``` 63 32 -------------------------------------------------------------------------------- | | | Reserved MBZ | | | -------------------------------------------------------------------------------- 31 16 15 14 13 12 11 10 9 8 7 1 0 -------------------------------------------------------------------------------- | | T | | | | | | | | | | | Reserved MBZ | C | FFXSR | LMSLE |SVME|NXE|LMA|MBZ|LME|RAZ|SCE| | | E | | | | | | | | | | -------------------------------------------------------------------------------- ``` We will not see all fields in details here, but we will learn about this and other `MSRs` in the special part about. As we read `EFER` to the `edx:eax`, we checks `_EFER_SCE` or zero bit which is `System Call Extensions` with `btsl` instruction and set it to one. By the setting `SCE` bit we enable `SYSCALL` and `SYSRET` instructions. In the next step we check 20th bit in the `edi`, remember that this register stores result of the `cpuid` (see above). If `20` bit is set (`NX` bit) we just write `EFER_SCE` to the model specific register. ```assembly btsl $_EFER_SCE, %eax btl $20,%edi jnc 1f btsl $_EFER_NX, %eax btsq $_PAGE_BIT_NX,early_pmd_flags(%rip) 1: wrmsr ``` If `NX` bit is supported we enable `_EFER_NX` and write it too, with the `wrmsr` instruction. In the next step we need to update Global Descriptor table with `lgdt` instruction: ```assembly lgdt early_gdt_descr(%rip) ``` where Global Descriptor table defined as: ```assembly early_gdt_descr: .word GDT_ENTRIES*8-1 early_gdt_descr_base: .quad INIT_PER_CPU_VAR(gdt_page) ``` We need to reload Global Descriptor Table because now kernel works in the userspace addresses, but soon kernel will work in it's own space. Now let's look on `early_gdt_descr` definition. Global Descriptor Table contains 32 entries: ```C #define GDT_ENTRIES 32 ``` for kernel code, data, thread local storage segments and etc... it's simple. Now let's look on the `early_gdt_descr_base`. First of `gdt_page` defined as: ```C struct gdt_page { struct desc_struct gdt[GDT_ENTRIES]; } __attribute__((aligned(PAGE_SIZE))); ``` in the [arch/x86/include/asm/desc.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/desc.h). It contains one field `gdt` which is array of the `desc_struct` structures which defined as: ```C struct desc_struct { union { struct { unsigned int a; unsigned int b; }; struct { u16 limit0; u16 base0; unsigned base1: 8, type: 4, s: 1, dpl: 2, p: 1; unsigned limit: 4, avl: 1, l: 1, d: 1, g: 1, base2: 8; }; }; } __attribute__((packed)); ``` and presents familiar to us GDT descriptor. Also we can note that `gdt_page` structure aligned to `PAGE_SIZE` which is 4096 bytes. It means that `gdt` will occupy one page. Now let's try to understand what is it `INIT_PER_CPU_VAR`. `INIT_PER_CPU_VAR` is a macro which defined in the [arch/x86/include/asm/percpu.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/percpu.h) and just concats `init_per_cpu__` with the given parameter: ```C #define INIT_PER_CPU_VAR(var) init_per_cpu__##var ``` After this we have `init_per_cpu__gdt_page`. We can see in the [linker script](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/vmlinux.lds.S): ``` #define INIT_PER_CPU(x) init_per_cpu__##x = x + __per_cpu_load INIT_PER_CPU(gdt_page); ``` As we got `init_per_cpu__gdt_page` in `INIT_PER_CPU_VAR` and `INIT_PER_CPU` macro from linker script will be expanded we will get offset from the `__per_cpu_load`. After this calculations, we will have correct base address of the new GDT. Generally per-CPU variables is a 2.6 kernel feature. You can understand what is it from it's name. When we create `per-CPU` variable, each CPU will have will have it's own copy of this variable. Here we creating `gdt_page` per-CPU variable. There are many advantages for variables of this type, like there are no locks, because each CPU works with it's own copy of variable and etc... So every core on multiprocessor will have it's own `GDT` table and every entry in the table will represent a memory segment which can be accessed from the thread which ran on the core. You can read in details about `per-CPU` variables in the [Theory/per-cpu](http://0xax.gitbooks.io/linux-insides/content/Concepts/per-cpu.html) post. As we loaded new Global Descriptor Table, we reload segments as we did it every time: ```assembly xorl %eax,%eax movl %eax,%ds movl %eax,%ss movl %eax,%es movl %eax,%fs movl %eax,%gs ``` After all of these steps we set up `gs` register that it post to the `irqstack` (we will see information about it in the next parts): ```assembly movl $MSR_GS_BASE,%ecx movl initial_gs(%rip),%eax movl initial_gs+4(%rip),%edx wrmsr ``` where `MSR_GS_BASE` is: ```C #define MSR_GS_BASE 0xc0000101 ``` We need to put `MSR_GS_BASE` to the `ecx` register and load data from the `eax` and `edx` (which are point to the `initial_gs`) with `wrmsr` instruction. We don't use `cs`, `fs`, `ds` and `ss` segment registers for addressation in the 64-bit mode, but `fs` and `gs` registers can be used. `fs` and `gs` have a hidden part (as we saw it in the real mode for `cs`) and this part contains descriptor which mapped to Model specific registers. So we can see above `0xc0000101` is a `gs.base` MSR address. In the next step we put the address of the real mode bootparam structure to the `rdi` (remember `rsi` holds pointer to this structure from the start) and jump to the C code with: ```assembly movq initial_code(%rip),%rax pushq $0 pushq $__KERNEL_CS pushq %rax lretq ``` Here we put the address of the `initial_code` to the `rax` and push fake address, `__KERNEL_CS` and the address of the `initial_code` to the stack. After this we can see `lretq` instruction which means that after it return address will be extracted from stack (now there is address of the `initial_code`) and jump there. `initial_code` defined in the same source code file and looks: ```assembly __REFDATA .balign 8 GLOBAL(initial_code) .quad x86_64_start_kernel ... ... ... ``` As we can see `initial_code` contains address of the `x86_64_start_kernel`, which defined in the [arch/x86/kerne/head64.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/head64.c) and looks like this: ```C asmlinkage __visible void __init x86_64_start_kernel(char * real_mode_data) { ... ... ... } ``` It has one argument is a `real_mode_data` (remember that we passed address of the real mode data to the `rdi` register previously). This is first C code in the kernel! Next to start_kernel -------------------------------------------------------------------------------- We need to see last preparations before we can see "kernel entry point" - start_kernel function from the [init/main.c](https://github.com/torvalds/linux/blob/master/init/main.c#L489). First of all we can see some checks in the `x86_64_start_kernel` function: ```C BUILD_BUG_ON(MODULES_VADDR < __START_KERNEL_map); BUILD_BUG_ON(MODULES_VADDR - __START_KERNEL_map < KERNEL_IMAGE_SIZE); BUILD_BUG_ON(MODULES_LEN + KERNEL_IMAGE_SIZE > 2*PUD_SIZE); BUILD_BUG_ON((__START_KERNEL_map & ~PMD_MASK) != 0); BUILD_BUG_ON((MODULES_VADDR & ~PMD_MASK) != 0); BUILD_BUG_ON(!(MODULES_VADDR > __START_KERNEL)); BUILD_BUG_ON(!(((MODULES_END - 1) & PGDIR_MASK) == (__START_KERNEL & PGDIR_MASK))); BUILD_BUG_ON(__fix_to_virt(__end_of_fixed_addresses) <= MODULES_END); ``` There are checks for different things like virtual addresses of modules space is not fewer than base address of the kernel text - `__STAT_KERNEL_map`, that kernel text with modules is not less than image of the kernel and etc... `BUILD_BUG_ON` is a macro which looks as: ```C #define BUILD_BUG_ON(condition) ((void)sizeof(char[1 - 2*!!(condition)])) ``` Let's try to understand this trick works. Let's take for example first condition: `MODULES_VADDR < __START_KERNEL_map`. `!!conditions` is the same that `condition != 0`. So it means if `MODULES_VADDR < __START_KERNEL_map` is true, we will get `1` in the `!!(condition)` or zero if not. After `2*!!(condition)` we will get or `2` or `0`. In the end of calculations we can get two different behaviors: * We will have compilation error, because try to get size of the char array with negative index (as can be in our case, because `MODULES_VADDR` can't be less than `__START_KERNEL_map` will be in our case); * No compilation errors. That's all. So interesting C trick for getting compile error which depends on some constants. In the next step we can see call of the `cr4_init_shadow` function which stores shadow copy of the `cr4` per cpu. Context switches can change bits in the `cr4` so we need to store `cr4` for each CPU. And after this we can see call of the `reset_early_page_tables` function where we resets all page global directory entries and write new pointer to the PGT in `cr3`: ```C for (i = 0; i < PTRS_PER_PGD-1; i++) early_level4_pgt[i].pgd = 0; next_early_pgt = 0; write_cr3(__pa_nodebug(early_level4_pgt)); ``` soon we will build new page tables. Here we can see that we go through all Page Global Directory Entries (`PTRS_PER_PGD` is `512`) in the loop and make it zero. After this we set `next_early_pgt` to zero (we will see details about it in the next post) and write physical address of the `early_level4_pgt` to the `cr3`. `__pa_nodebug` is a macro which will be expanded to: ```C ((unsigned long)(x) - __START_KERNEL_map + phys_base) ``` After this we clear `_bss` from the `__bss_stop` to `__bss_start` and the next step will be setup of the early `IDT` handlers, but it's big theme so we will see it in the next part. Conclusion -------------------------------------------------------------------------------- This is the end of the first part about linux kernel initialization. If you have questions or suggestions, feel free to ping me in twitter [0xAX](https://twitter.com/0xAX), drop me [email](anotherworldofworld@gmail.com) or just create [issue](https://github.com/0xAX/linux-internals/issues/new). In the next part we will see initialization of the early interruption handlers, kernel space memory mapping and many many more. **Please note that English is not my first language and I am really sorry for any inconvenience. If you found any mistakes please send me PR to [linux-internals](https://github.com/0xAX/linux-insides).** Links -------------------------------------------------------------------------------- * [Model Specific Register](http://en.wikipedia.org/wiki/Model-specific_register) * [Paging](http://0xax.gitbooks.io/linux-insides/content/Theory/Paging.html) * [Previous part - Kernel decompression](http://0xax.gitbooks.io/linux-insides/content/Booting/linux-bootstrap-5.html) * [NX](http://en.wikipedia.org/wiki/NX_bit) * [ASLR](http://en.wikipedia.org/wiki/Address_space_layout_randomization)