Kernel initialization. Part 3. ================================================================================ Last preparations before the kernel entry point -------------------------------------------------------------------------------- This is the third part of the Linux kernel initialization process series. In the previous [part](https://github.com/0xAX/linux-insides/blob/master/Initialization/linux-initialization-2.md) we saw early interrupt and exception handling and will continue to dive into the linux kernel initialization process in the current part. Our next point is 'kernel entry point' - `start_kernel` function from the [init/main.c](https://github.com/torvalds/linux/blob/master/init/main.c) source code file. Yes, technically it is not kernel's entry point but the start of the generic kernel code which does not depend on certain architecture. But before we call the `start_kernel` function, we must do some preparations. So let's continue. boot_params again -------------------------------------------------------------------------------- In the previous part we stopped at setting Interrupt Descriptor Table and loading it in the `IDTR` register. At the next step after this we can see a call of the `copy_bootdata` function: ```C copy_bootdata(__va(real_mode_data)); ``` This function takes one argument - virtual address of the `real_mode_data`. Remember that we passed the address of the `boot_params` structure from [arch/x86/include/uapi/asm/bootparam.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/uapi/asm/bootparam.h#L114) to the `x86_64_start_kernel` function as first argument in [arch/x86/kernel/head_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/head_64.S): ``` /* rsi is pointer to real mode structure with interesting info. pass it to C */ movq %rsi, %rdi ``` Now let's look at `__va` macro. This macro defined in [init/main.c](https://github.com/torvalds/linux/blob/master/init/main.c): ```C #define __va(x) ((void *)((unsigned long)(x)+PAGE_OFFSET)) ``` where `PAGE_OFFSET` is `__PAGE_OFFSET` which is `0xffff880000000000` and the base virtual address of the direct mapping of all physical memory. So we're getting virtual address of the `boot_params` structure and pass it to the `copy_bootdata` function, where we copy `real_mod_data` to the `boot_params` which is declared in the [arch/x86/kernel/setup.h](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/setup.h) ```C extern struct boot_params boot_params; ``` Let's look at the `copy_boot_data` implementation: ```C static void __init copy_bootdata(char *real_mode_data) { char * command_line; unsigned long cmd_line_ptr; memcpy(&boot_params, real_mode_data, sizeof boot_params); sanitize_boot_params(&boot_params); cmd_line_ptr = get_cmd_line_ptr(); if (cmd_line_ptr) { command_line = __va(cmd_line_ptr); memcpy(boot_command_line, command_line, COMMAND_LINE_SIZE); } } ``` First of all, note that this function is declared with `__init` prefix. It means that this function will be used only during the initialization and used memory will be freed. We can see declaration of two variables for the kernel command line and copying `real_mode_data` to the `boot_params` with the `memcpy` function. The next call of the `sanitize_boot_params` function which fills some fields of the `boot_params` structure like `ext_ramdisk_image` and etc... if bootloaders which fail to initialize unknown fields in `boot_params` to zero. After this we're getting address of the command line with the call of the `get_cmd_line_ptr` function: ```C unsigned long cmd_line_ptr = boot_params.hdr.cmd_line_ptr; cmd_line_ptr |= (u64)boot_params.ext_cmd_line_ptr << 32; return cmd_line_ptr; ``` which gets the 64-bit address of the command line from the kernel boot header and returns it. In the last step we check `cmd_line_ptr`, getting its virtual address and copy it to the `boot_command_line` which is just an array of bytes: ```C extern char __initdata boot_command_line[]; ``` After this we will have copied kernel command line and `boot_params` structure. In the next step we can see call of the `load_ucode_bsp` function which loads processor microcode, but we will not see it here. After microcode was loaded we can see the check of the `console_loglevel` and the `early_printk` function which prints `Kernel Alive` string. But you'll never see this output because `early_printk` is not initialized yet. It is a minor bug in the kernel and i sent the patch - [commit](http://git.kernel.org/cgit/linux/kernel/git/tip/tip.git/commit/?id=91d8f0416f3989e248d3a3d3efb821eda10a85d2) and you will see it in the mainline soon. So you can skip this code. Move on init pages -------------------------------------------------------------------------------- In the next step, as we have copied `boot_params` structure, we need to move from the early page tables to the page tables for initialization process. We already set early page tables for switchover, you can read about it in the previous [part](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-1.html) and dropped all it in the `reset_early_page_tables` function (you can read about it in the previous part too) and kept only kernel high mapping. After this we call: ```C clear_page(init_level4_pgt); ``` function and pass `init_level4_pgt` which also defined in the [arch/x86/kernel/head_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/head_64.S) and looks: ```assembly NEXT_PAGE(init_level4_pgt) .quad level3_ident_pgt - __START_KERNEL_map + _KERNPG_TABLE .org init_level4_pgt + L4_PAGE_OFFSET*8, 0 .quad level3_ident_pgt - __START_KERNEL_map + _KERNPG_TABLE .org init_level4_pgt + L4_START_KERNEL*8, 0 .quad level3_kernel_pgt - __START_KERNEL_map + _PAGE_TABLE ``` which maps first 2 gigabytes and 512 megabytes for the kernel code, data and bss. `clear_page` function defined in the [arch/x86/lib/clear_page_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/lib/clear_page_64.S) let's look on this function: ```assembly ENTRY(clear_page) CFI_STARTPROC xorl %eax,%eax movl $4096/64,%ecx .p2align 4 .Lloop: decl %ecx #define PUT(x) movq %rax,x*8(%rdi) movq %rax,(%rdi) PUT(1) PUT(2) PUT(3) PUT(4) PUT(5) PUT(6) PUT(7) leaq 64(%rdi),%rdi jnz .Lloop nop ret CFI_ENDPROC .Lclear_page_end: ENDPROC(clear_page) ``` As you can understand from the function name it clears or fills with zeros page tables. First of all note that this function starts with the `CFI_STARTPROC` and `CFI_ENDPROC` which are expands to GNU assembly directives: ```C #define CFI_STARTPROC .cfi_startproc #define CFI_ENDPROC .cfi_endproc ``` and used for debugging. After `CFI_STARTPROC` macro we zero out `eax` register and put 64 to the `ecx` (it will be a counter). Next we can see loop which starts with the `.Lloop` label and it starts from the `ecx` decrement. After it we put zero from the `rax` register to the `rdi` which contains the base address of the `init_level4_pgt` now and do the same procedure seven times but every time move `rdi` offset on 8. After this we will have first 64 bytes of the `init_level4_pgt` filled with zeros. In the next step we put the address of the `init_level4_pgt` with 64-bytes offset to the `rdi` again and repeat all operations until `ecx` reaches zero. In the end we will have `init_level4_pgt` filled with zeros. As we have `init_level4_pgt` filled with zeros, we set the last `init_level4_pgt` entry to kernel high mapping with the: ```C init_level4_pgt[511] = early_level4_pgt[511]; ``` Remember that we dropped all `early_level4_pgt` entries in the `reset_early_page_table` function and kept only kernel high mapping there. The last step in the `x86_64_start_kernel` function is the call of the: ```C x86_64_start_reservations(real_mode_data); ``` function with the `real_mode_data` as argument. The `x86_64_start_reservations` function defined in the same source code file as the `x86_64_start_kernel` function and looks: ```C void __init x86_64_start_reservations(char *real_mode_data) { if (!boot_params.hdr.version) copy_bootdata(__va(real_mode_data)); reserve_ebda_region(); start_kernel(); } ``` You can see that it is the last function before we are in the kernel entry point - `start_kernel` function. Let's look what it does and how it works. Last step before kernel entry point -------------------------------------------------------------------------------- First of all we can see in the `x86_64_start_reservations` function the check for `boot_params.hdr.version`: ```C if (!boot_params.hdr.version) copy_bootdata(__va(real_mode_data)); ``` and if it is zero we call `copy_bootdata` function again with the virtual address of the `real_mode_data` (read about its implementation). In the next step we can see the call of the `reserve_ebda_region` function which defined in the [arch/x86/kernel/head.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/head.c). This function reserves memory block for the `EBDA` or Extended BIOS Data Area. The Extended BIOS Data Area located in the top of conventional memory and contains data about ports, disk parameters and etc... Let's look on the `reserve_ebda_region` function. It starts from the checking is paravirtualization enabled or not: ```C if (paravirt_enabled()) return; ``` we exit from the `reserve_ebda_region` function if paravirtualization is enabled because if it enabled the extended bios data area is absent. In the next step we need to get the end of the low memory: ```C lowmem = *(unsigned short *)__va(BIOS_LOWMEM_KILOBYTES); lowmem <<= 10; ``` We're getting the virtual address of the BIOS low memory in kilobytes and convert it to bytes with shifting it on 10 (multiply on 1024 in other words). After this we need to get the address of the extended BIOS data are with the: ```C ebda_addr = get_bios_ebda(); ``` where `get_bios_ebda` function defined in the [arch/x86/include/asm/bios_ebda.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/bios_ebda.h) and looks like: ```C static inline unsigned int get_bios_ebda(void) { unsigned int address = *(unsigned short *)phys_to_virt(0x40E); address <<= 4; return address; } ``` Let's try to understand how it works. Here we can see that we converting physical address `0x40E` to the virtual, where `0x0040:0x000e` is the segment which contains base address of the extended BIOS data area. Don't worry that we are using `phys_to_virt` function for converting a physical address to virtual address. You can note that previously we have used `__va` macro for the same point, but `phys_to_virt` is the same: ```C static inline void *phys_to_virt(phys_addr_t address) { return __va(address); } ``` only with one difference: we pass argument with the `phys_addr_t` which depends on `CONFIG_PHYS_ADDR_T_64BIT`: ```C #ifdef CONFIG_PHYS_ADDR_T_64BIT typedef u64 phys_addr_t; #else typedef u32 phys_addr_t; #endif ``` This configuration option is enabled by `CONFIG_PHYS_ADDR_T_64BIT`. After that we got virtual address of the segment which stores the base address of the extended BIOS data area, we shift it on 4 and return. After this `ebda_addr` variables contains the base address of the extended BIOS data area. In the next step we check that address of the extended BIOS data area and low memory is not less than `INSANE_CUTOFF` macro ```C if (ebda_addr < INSANE_CUTOFF) ebda_addr = LOWMEM_CAP; if (lowmem < INSANE_CUTOFF) lowmem = LOWMEM_CAP; ``` which is: ```C #define INSANE_CUTOFF 0x20000U ``` or 128 kilobytes. In the last step we get lower part in the low memory and extended bios data area and call `memblock_reserve` function which will reserve memory region for extended bios data between low memory and one megabyte mark: ```C lowmem = min(lowmem, ebda_addr); lowmem = min(lowmem, LOWMEM_CAP); memblock_reserve(lowmem, 0x100000 - lowmem); ``` `memblock_reserve` function is defined at [mm/block.c](https://github.com/torvalds/linux/blob/master/mm/block.c) and takes two parameters: * base physical address; * region size. and reserves memory region for the given base address and size. `memblock_reserve` is the first function in this book from linux kernel memory manager framework. We will take a closer look on memory manager soon, but now let's look at its implementation. First touch of the linux kernel memory manager framework -------------------------------------------------------------------------------- In the previous paragraph we stopped at the call of the `memblock_reserve` function and as i sad before it is the first function from the memory manager framework. Let's try to understand how it works. `memblock_reserve` function just calls: ```C memblock_reserve_region(base, size, MAX_NUMNODES, 0); ``` function and passes 4 parameters there: * physical base address of the memory region; * size of the memory region; * maximum number of numa nodes; * flags. At the start of the `memblock_reserve_region` body we can see definition of the `memblock_type` structure: ```C struct memblock_type *_rgn = &memblock.reserved; ``` which presents the type of the memory block and looks: ```C struct memblock_type { unsigned long cnt; unsigned long max; phys_addr_t total_size; struct memblock_region *regions; }; ``` As we need to reserve memory block for extended bios data area, the type of the current memory region is reserved where `memblock` structure is: ```C struct memblock { bool bottom_up; phys_addr_t current_limit; struct memblock_type memory; struct memblock_type reserved; #ifdef CONFIG_HAVE_MEMBLOCK_PHYS_MAP struct memblock_type physmem; #endif }; ``` and describes generic memory block. You can see that we initialize `_rgn` by assigning it to the address of the `memblock.reserved`. `memblock` is the global variable which looks: ```C struct memblock memblock __initdata_memblock = { .memory.regions = memblock_memory_init_regions, .memory.cnt = 1, .memory.max = INIT_MEMBLOCK_REGIONS, .reserved.regions = memblock_reserved_init_regions, .reserved.cnt = 1, .reserved.max = INIT_MEMBLOCK_REGIONS, #ifdef CONFIG_HAVE_MEMBLOCK_PHYS_MAP .physmem.regions = memblock_physmem_init_regions, .physmem.cnt = 1, .physmem.max = INIT_PHYSMEM_REGIONS, #endif .bottom_up = false, .current_limit = MEMBLOCK_ALLOC_ANYWHERE, }; ``` We will not dive into detail of this variable, but we will see all details about it in the parts about memory manager. Just note that `memblock` variable defined with the `__initdata_memblock` which is: ```C #define __initdata_memblock __meminitdata ``` and `__meminit_data` is: ```C #define __meminitdata __section(.meminit.data) ``` From this we can conclude that all memory blocks will be in the `.meminit.data` section. After we defined `_rgn` we print information about it with `memblock_dbg` macros. You can enable it by passing `memblock=debug` to the kernel command line. After debugging lines were printed next is the call of the following function: ```C memblock_add_range(_rgn, base, size, nid, flags); ``` which adds new memory block region into the `.meminit.data` section. As we do not initialize `_rgn` but it just contains `&memblock.reserved`, we just fill passed `_rgn` with the base address of the extended BIOS data area region, size of this region and flags: ```C if (type->regions[0].size == 0) { WARN_ON(type->cnt != 1 || type->total_size); type->regions[0].base = base; type->regions[0].size = size; type->regions[0].flags = flags; memblock_set_region_node(&type->regions[0], nid); type->total_size = size; return 0; } ``` After we filled our region we can see the call of the `memblock_set_region_node` function with two parameters: * address of the filled memory region; * NUMA node id. where our regions represented by the `memblock_region` structure: ```C struct memblock_region { phys_addr_t base; phys_addr_t size; unsigned long flags; #ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP int nid; #endif }; ``` NUMA node id depends on `MAX_NUMNODES` macro which is defined in the [include/linux/numa.h](https://github.com/torvalds/linux/blob/master/include/linux/numa.h): ```C #define MAX_NUMNODES (1 << NODES_SHIFT) ``` where `NODES_SHIFT` depends on `CONFIG_NODES_SHIFT` configuration parameter and defined as: ```C #ifdef CONFIG_NODES_SHIFT #define NODES_SHIFT CONFIG_NODES_SHIFT #else #define NODES_SHIFT 0 #endif ``` `memblick_set_region_node` function just fills `nid` field from `memblock_region` with the given value: ```C static inline void memblock_set_region_node(struct memblock_region *r, int nid) { r->nid = nid; } ``` After this we will have first reserved `memblock` for the extended bios data area in the `.meminit.data` section. `reserve_ebda_region` function finished its work on this step and we can go back to the [arch/x86/kernel/head64.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/head64.c). We finished all preparations before the kernel entry point! The last step in the `x86_64_start_reservations` function is the call of the: ```C start_kernel() ``` function from [init/main.c](https://github.com/torvalds/linux/blob/master/init/main.c) file. That's all for this part. Conclusion -------------------------------------------------------------------------------- It is the end of the third part about linux kernel insides. In next part we will see the first initialization steps in the kernel entry point - `start_kernel` function. It will be the first step before we will see launch of the first `init` process. If you have any questions or suggestions write me a comment or ping me at [twitter](https://twitter.com/0xAX). **Please note that English is not my first language, And I am really sorry for any inconvenience. If you find any mistakes please send me PR to [linux-insides](https://github.com/0xAX/linux-insides).** Links -------------------------------------------------------------------------------- * [BIOS data area](http://stanislavs.org/helppc/bios_data_area.html) * [What is in the extended BIOS data area on a PC?](http://www.kryslix.com/nsfaq/Q.6.html) * [Previous part](https://github.com/0xAX/linux-insides/blob/master/Initialization/linux-initialization-2.md)