Kernel booting process. Part 5.

Kernel decompression

This is the fifth part of the Kernel booting process series. We saw the transition to 64-bit mode in the previous part and we will continue from that point here. We will see the last steps before we jump to the kernel code: preparation for kernel decompression, relocation and the kernel decompression itself. So... let's dive into the kernel code again.

Preparation before kernel decompression

We stopped right before the jump to the 64-bit entry point, startup_64, which is located in the arch/x86/boot/compressed/head_64.S source code file. We already saw the jump to startup_64 in startup_32:

	pushl	$__KERNEL_CS
	leal	startup_64(%ebp), %eax
	...
	...
	...
	pushl	%eax
	...
	...
	...
	lret

That was in the previous part. Now startup_64 starts to work. Since we loaded a new Global Descriptor Table and the CPU switched to another mode (64-bit mode in our case), the first thing we see is the setup of the data segments:

	.code64
	.org 0x200
ENTRY(startup_64)
	xorl	%eax, %eax
	movl	%eax, %ds
	movl	%eax, %es
	movl	%eax, %ss
	movl	%eax, %fs
	movl	%eax, %gs

at the beginning of startup_64. All segment registers besides cs are loaded with zero, the null selector. This is fine in 64-bit mode, where the CPU treats the bases of ds, es and ss as zero and ignores their limits.

The next step is the computation of the difference between where the kernel was compiled to run and where it was actually loaded:

#ifdef CONFIG_RELOCATABLE
	leaq	startup_32(%rip), %rbp
	movl	BP_kernel_alignment(%rsi), %eax
	decl	%eax
	addq	%rax, %rbp
	notq	%rax
	andq	%rax, %rbp
	cmpq	$LOAD_PHYSICAL_ADDR, %rbp
	jge	1f
#endif
	movq	$LOAD_PHYSICAL_ADDR, %rbp
1:
	leaq	z_extract_offset(%rbp), %rbx

After this code runs, rbp contains the start address of the decompressed kernel and rbx contains the address to which the kernel code is relocated for decompression. We already saw code like this in startup_32 (see the Calculate relocation address section of the previous part), but we need to do this calculation again because the bootloader can use the 64-bit boot protocol, in which case startup_32 is not executed at all.
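
The arithmetic in the CONFIG_RELOCATABLE branch can be easier to follow in C. The sketch below is only an illustration of what the assembly computes for rbp: the function names and the sample addresses are mine and do not appear in the kernel, and it uses unsigned comparisons where the assembly uses jge.

```c
#include <stdint.h>

/* Round addr up to the next multiple of align (align must be a power of two). */
uint64_t align_up(uint64_t addr, uint64_t align)
{
    uint64_t mask = align - 1;      /* decl %eax                         */
    return (addr + mask) & ~mask;   /* addq %rax, %rbp; notq; andq       */
}

/*
 * Hypothetical helper: the value that ends up in rbp, i.e. the start
 * address of the decompressed kernel.
 */
uint64_t decompressed_start(uint64_t loaded_at, uint64_t kernel_alignment,
                            uint64_t load_physical_addr)
{
    uint64_t base = align_up(loaded_at, kernel_alignment);

    /* Never go below the address the kernel was compiled for. */
    return base >= load_physical_addr ? base : load_physical_addr;
}
```

For example, with an alignment of 0x200000, a load address of 0x1234567 is rounded up to 0x1400000. After that, rbx is simply this value plus z_extract_offset.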

In the next step we can see the setup of the stack pointer and the resetting of the flags register:

	leaq	boot_stack_end(%rbx), %rsp

 	pushq	$0
	popfq

As you can see above, the rbx register contains the start address of the kernel decompressor code, so we just load this address plus the boot_stack_end offset into rsp, the pointer to the top of the stack. After this step the stack is valid. You can find the definition of boot_stack_end at the end of the arch/x86/boot/compressed/head_64.S assembly source code file:

	.bss
	.balign 4
boot_heap:
	.fill BOOT_HEAP_SIZE, 1, 0
boot_stack:
	.fill BOOT_STACK_SIZE, 1, 0
boot_stack_end:

It is located at the end of the .bss section, right before .pgtable. If you look into the arch/x86/boot/compressed/vmlinux.lds.S linker script, you will find the definitions of .bss and .pgtable there.

Now that the stack is set up, we can copy the compressed kernel to the address that we got above, when we calculated the relocation address of the decompressed kernel. Before the details, let's look at this assembly code:

	pushq	%rsi
	leaq	(_bss-8)(%rip), %rsi
	leaq	(_bss-8)(%rbx), %rdi
	movq	$_bss, %rcx
	shrq	$3, %rcx
	std
	rep	movsq
	cld
	popq	%rsi

First of all we push rsi onto the stack. We need to preserve the value of rsi, because this register now stores a pointer to boot_params, the real mode structure that contains booting related data (you should remember this structure, we filled it at the start of the kernel setup). At the end of this code we restore the pointer to boot_params into rsi.

The next two leaq instructions calculate the effective addresses of _bss - 8 relative to rip and to rbx and put them into rsi and rdi. Why do we calculate these addresses? The compressed kernel image is located between this copying code (from startup_32 to the current code) and the decompression code. You can verify this by looking at the linker script, arch/x86/boot/compressed/vmlinux.lds.S:

	. = 0;
	.head.text : {
		_head = . ;
		HEAD_TEXT
		_ehead = . ;
	}
	.rodata..compressed : {
		*(.rodata..compressed)
	}
	.text :	{
		_text = .; 	/* Text */
		*(.text)
		*(.text.*)
		_etext = . ;
	}

Note that the .head.text section contains startup_32. You may remember it from the previous part:

	__HEAD
	.code32
ENTRY(startup_32)
...
...
...

The .text section contains decompression code:

	.text
relocated:
...
...
...
/*
 * Do the decompression, and jump to the new kernel..
 */
...

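And .rodata..compressed contains the compressed kernel image. So rsi will contain the rip-relative address of _bss - 8, which is the last 8 bytes of the image where it currently sits, and rdi will contain the address of _bss - 8 in the relocated copy. After storing these addresses in registers, we put the value of _bss into the rcx register. As you can see in the vmlinux.lds.S linker script, _bss is located after all sections with the setup/kernel code, and because the image is linked at address 0 its value is also the number of bytes to copy. The shrq instruction converts that byte count into a count of 8-byte words, and now we can start to copy data from rsi to rdi, 8 bytes at a time, with the movsq instruction.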

Note that there is an std instruction before the data copying. It sets the DF flag, which means that rsi and rdi will be decremented; in other words, we copy the words backwards, from the end of the image to its start. At the end we clear the DF flag with the cld instruction and restore the pointer to the boot_params structure into rsi.
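
Why copy backwards at all? The destination lies at a higher address than the source and the two ranges may overlap, so a forward copy could overwrite words that have not been copied yet. The following C sketch shows the same idea on a small buffer. The function name is hypothetical; the kernel does this with std and rep movsq, not with a C loop.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Copy nbytes (rounded down to whole 8-byte words) from src to dst,
 * starting at the last word and walking towards the first one.
 * This is safe when dst is above src and the two ranges overlap.
 */
void copy_qwords_backward(uint64_t *dst, const uint64_t *src, size_t nbytes)
{
    size_t count = nbytes >> 3;      /* shrq $3, %rcx */

    while (count-- > 0)
        dst[count] = src[count];     /* one movsq with DF set */
}
```

If we take a buffer of six words holding 1, 2, 3, 4, 0, 0 and copy the first four words two positions up, the buffer ends up as 1, 2, 1, 2, 3, 4. A forward copy would have destroyed the 3 and the 4 before reading them.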

Now we have the address of the .text section after relocation and we can jump to it:

	leaq	relocated(%rbx), %rax
	jmp	*%rax

Last preparation before kernel decompression

In the previous paragraph we saw that the .text section starts with the relocated label. The first thing it does is clear the bss section:

	xorl	%eax, %eax
	leaq    _bss(%rip), %rdi
	leaq    _ebss(%rip), %rcx
	subq	%rdi, %rcx
	shrq	$3, %rcx
	rep	stosq

We need to initialize the .bss section, because soon we will jump to C code. Here we just clear eax, put the rip-relative address of _bss into rdi and the size of the section (_ebss minus _bss, divided by 8) into rcx, and fill the section with zeros with the rep stosq instruction.
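
In C terms the clearing loop above is nothing more than the following sketch. The function name is made up for illustration, and the two pointers stand in for the _bss and _ebss symbols.

```c
#include <stddef.h>
#include <stdint.h>

/* Zero the range [start, end) eight bytes at a time, like rep stosq with rax = 0. */
void clear_qwords(uint64_t *start, uint64_t *end)
{
    size_t count = (size_t)(end - start);   /* (_ebss - _bss) >> 3 */

    for (size_t i = 0; i < count; i++)
        start[i] = 0;
}
```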

At the end we can see the call of the decompress_kernel routine:

	pushq	%rsi
	movq	$z_run_size, %r9
	pushq	%r9
	movq	%rsi, %rdi
	leaq	boot_heap(%rip), %rsi
	leaq	input_data(%rip), %rdx
	movl	$z_input_len, %ecx
	movq	%rbp, %r8
	movq	$z_output_len, %r9
	call	decompress_kernel
	popq	%r9
	popq	%rsi

Again we save rsi with the pointer to the boot_params structure and call decompress_kernel from arch/x86/boot/compressed/misc.c with seven arguments:

  • boot_param - pointer to the boot_params structure which is filled by the bootloader or during early kernel initialization;
  • heap - pointer to the boot_heap which represents start address of the early boot heap;
  • input_data - pointer to the start of the compressed kernel or in other words pointer to the arch/x86/boot/compressed/vmlinux.bin.bz2;
  • input_len - size of the compressed kernel;
  • output - start address of the future decompressed kernel;
  • output_len - size of decompressed kernel;
  • run_size - amount of space needed to run the kernel including .bss and .brk sections.

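The first six arguments are passed in registers according to the System V Application Binary Interface, and the seventh, run_size, is passed on the stack, which is why z_run_size is pushed before the call. We have finished all the preparation and can now look at the kernel decompression.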
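
To connect the assembly with the argument list, here is a stub with the same seven parameters, annotated with where the x86_64 System V calling convention places each one. The stub name and the parameter types are my simplification and not the kernel's prototype; the stub only returns the output address.

```c
#include <stdint.h>

/* Hypothetical stand-in for decompress_kernel: same argument order, no real work. */
uint8_t *decompress_stub(void *boot_params,    /* 1st argument: rdi             */
                         void *heap,           /* 2nd argument: rsi             */
                         uint8_t *input_data,  /* 3rd argument: rdx             */
                         uint64_t input_len,   /* 4th argument: rcx             */
                         uint8_t *output,      /* 5th argument: r8              */
                         uint64_t output_len,  /* 6th argument: r9              */
                         uint64_t run_size)    /* 7th argument: on the stack    */
{
    (void)boot_params; (void)heap; (void)input_data;
    (void)input_len; (void)output_len; (void)run_size;

    return output;  /* the caller finds the result in rax */
}
```

This matches the assembly: r9 is first used to push z_run_size onto the stack and only then loaded with z_output_len for the sixth argument.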

Kernel decompression

As we saw in the previous paragraph, the decompress_kernel function is defined in the arch/x86/boot/compressed/misc.c source code file and takes seven arguments. This function starts with the video/console initialization that we already saw in the previous parts. We need to do this again because we don't know whether we started in real mode or whether the bootloader used the 32-bit or 64-bit boot protocol.

After the first initialization steps, we store pointers to the start of the free memory and to the end of it:

free_mem_ptr     = heap;
free_mem_end_ptr = heap + BOOT_HEAP_SIZE;

where the heap is the second parameter of the decompress_kernel function which we got in the arch/x86/boot/compressed/head_64.S:

leaq	boot_heap(%rip), %rsi

As you saw above, the boot_heap is defined as:

boot_heap:
	.fill BOOT_HEAP_SIZE, 1, 0

where BOOT_HEAP_SIZE is a macro which expands to 0x400000 for a bzip2 kernel and to 0x8000 in other cases, and represents the size of the heap.
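
What are these two pointers used for? The decompressor has no real memory manager, so allocations are served by moving a pointer forward through this heap. The sketch below shows such a bump allocator under my own assumptions: all names are hypothetical, the 4-byte alignment is an illustrative choice, and the array stands in for boot_heap.

```c
#include <stddef.h>
#include <stdint.h>

#define HEAP_SIZE 0x8000                     /* non-bzip2 value quoted above */

static unsigned char heap_area[HEAP_SIZE];   /* stands in for boot_heap          */
uintptr_t heap_cur;                          /* plays the role of free_mem_ptr   */
uintptr_t heap_end;                          /* plays the role of free_mem_end_ptr */

void heap_init(void)
{
    heap_cur = (uintptr_t)heap_area;
    heap_end = heap_cur + HEAP_SIZE;
}

/* Hand out size bytes from the heap, or NULL when the heap is exhausted. */
void *heap_alloc(size_t size)
{
    uintptr_t p = (heap_cur + 3) & ~(uintptr_t)3;   /* align to 4 bytes */

    if (p > heap_end || size > heap_end - p)
        return NULL;

    heap_cur = p + size;    /* nothing is ever freed individually */
    return (void *)p;
}
```

Two consecutive 10-byte allocations end up 12 bytes apart because of the alignment, and a request for the whole HEAP_SIZE after that fails with NULL.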

After the heap pointers are initialized, the next step is the call of the choose_kernel_location function from the arch/x86/boot/compressed/aslr.c source code file. As we can understand from the function name, it chooses the memory location where the kernel image will be decompressed. I know, it may look weird that we need to find or even choose a location where to decompress the compressed kernel image. But actually the Linux kernel supports the kASLR feature, which in simple words allows the kernel to be decompressed to a random address for security reasons. Let's open the arch/x86/boot/compressed/aslr.c source code file and look at the choose_kernel_location implementation.

At the start, choose_kernel_location looks at the Linux kernel command line: it searches for the kaslr option if CONFIG_HIBERNATION is set and for the nokaslr option otherwise:

#ifdef CONFIG_HIBERNATION
	if (!cmdline_find_option_bool("kaslr")) {
		debug_putstr("KASLR disabled by default...\n");
		goto out;
	}
#else
	if (cmdline_find_option_bool("nokaslr")) {
		debug_putstr("KASLR disabled by cmdline...\n");
		goto out;
	}
#endif

If the CONFIG_HIBERNATION kernel configuration option is enabled and there is no kaslr option in the Linux kernel command line, we will see the KASLR disabled by default... output and jump to the out label:

out:
	return (unsigned char *)choice;

which just returns the output parameter that we passed to choose_kernel_location, without any changes. In the other case, if the CONFIG_HIBERNATION kernel configuration option is disabled and the nokaslr option is in the kernel command line, we do the same as in the previous condition.
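
The decision described above can be sketched in plain C. Note that this is not the kernel's cmdline_find_option_bool: has_bool_option and kaslr_enabled are hypothetical helpers that only illustrate matching a whole word on the command line and the two configuration cases.

```c
#include <stdbool.h>
#include <string.h>

/* Return true if option appears in cmdline as a whole, space-separated word. */
bool has_bool_option(const char *cmdline, const char *option)
{
    size_t len = strlen(option);
    const char *p = cmdline;

    while (*p) {
        while (*p == ' ')               /* skip separators */
            p++;
        const char *word = p;
        while (*p && *p != ' ')         /* find the end of this word */
            p++;
        if (len > 0 && (size_t)(p - word) == len &&
            strncmp(word, option, len) == 0)
            return true;
    }
    return false;
}

/* With hibernation configured kASLR is opt-in, otherwise it is opt-out. */
bool kaslr_enabled(const char *cmdline, bool hibernation_configured)
{
    if (hibernation_configured)
        return has_bool_option(cmdline, "kaslr");
    return !has_bool_option(cmdline, "nokaslr");
}
```

Matching whole words matters here, because kaslr is a substring of nokaslr and a plain substring search would confuse the two options.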

For now, let's suppose that the kernel was configured with randomization enabled and try to understand what kASLR is. We can find information about it in the documentation:

kaslr/nokaslr [X86]

Enable/disable kernel and module base offset ASLR
(Address Space Layout Randomization) if built into
the kernel. When CONFIG_HIBERNATION is selected,
kASLR is disabled by default. When kASLR is enabled,
hibernation will be disabled.

It means that we can pass the kaslr option on the kernel's command line and get a random address for the decompressed kernel (you can read more about aslr here). So, our current goal is to find a random address where we can safely decompress the Linux kernel. I wrote safely for a reason. What does it mean in this context? You may remember that besides the decompressor code and the kernel image itself, there are some other regions in memory that must not be overwritten. For example, the initrd image is in memory too and the decompressed kernel must not overlap it.

The next function will help us to find a safe place where we can decompress the kernel. This function is mem_avoid_init. It is defined in the same source code file and takes four arguments that we already saw in the decompress_kernel function:

  • input_data - pointer to the start of the compressed kernel or in other words pointer to the arch/x86/boot/compressed/vmlinux.bin.bz2;
  • input_len - size of the compressed kernel;
  • output - start address of the future decompressed kernel;
  • output_len - size of decompressed kernel.

The main point of this function is to fill an array of mem_vector structures:

#define MEM_AVOID_MAX 5

static struct mem_vector mem_avoid[MEM_AVOID_MAX];

where the mem_vector structure contains information about unsafe memory regions:

struct mem_vector {
	unsigned long start;
	unsigned long size;
};

The implementation of mem_avoid_init is pretty simple. Let's look at a part of this function:

    ...
    ...
    ...
	initrd_start  = (u64)real_mode->ext_ramdisk_image << 32;
	initrd_start |= real_mode->hdr.ramdisk_image;
	initrd_size  = (u64)real_mode->ext_ramdisk_size << 32;
	initrd_size |= real_mode->hdr.ramdisk_size;
	mem_avoid[1].start = initrd_start;
	mem_avoid[1].size = initrd_size;
    ...
    ...
    ...

Here we can see the calculation of the initrd start address and size. ext_ramdisk_image holds the high 32 bits of the initrd load address, whose low 32 bits are in the ramdisk_image field of the setup header, and ext_ramdisk_size holds the high 32 bits of the initrd size, whose low 32 bits are in the ramdisk_size field of the boot protocol:

Offset	Proto	Name		Meaning
/Size
...
...
...
0218/4	2.00+	ramdisk_image	initrd load address (set by boot loader)
021C/4	2.00+	ramdisk_size	initrd size (set by boot loader)
...

You can find ext_ramdisk_image and ext_ramdisk_size in Documentation/x86/zero-page.txt:

Offset	Proto	Name		Meaning
/Size
...
...
...
0C0/004	ALL	ext_ramdisk_image ramdisk_image high 32bits
0C4/004	ALL	ext_ramdisk_size  ramdisk_size high 32bits
...

So we take ext_ramdisk_image and ext_ramdisk_size, shift them left by 32 bits so that they occupy the high 32 bits of a 64-bit value, and OR in the low 32 bits from ramdisk_image and ramdisk_size. This gives us the start address of the initrd and its size. After this we store these values in the mem_avoid array.
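
A tiny worked example may help here. The helper below is hypothetical and the values are made up; it only shows how two 32-bit halves form one 64-bit value, exactly as in the snippet from mem_avoid_init.

```c
#include <stdint.h>

/* Build a 64-bit value from its high and low 32-bit halves. */
uint64_t combine_hi_lo(uint32_t hi, uint32_t lo)
{
    return ((uint64_t)hi << 32) | lo;
}
```

With a high half of 0x1 and a low half of 0x80000000 the result is 0x180000000, an address just above 6 GiB. The cast to a 64-bit type before the shift is important: shifting a 32-bit value left by 32 bits would not give the intended result.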

After we have collected all unsafe memory regions in the mem_avoid array, the next step is to search for a random address which does not overlap the unsafe regions, using the find_random_addr function. First of all we can see the alignment of the output address in the find_random_addr function:

minimum = ALIGN(minimum, CONFIG_PHYSICAL_ALIGN);

You may remember the CONFIG_PHYSICAL_ALIGN configuration option from the previous part. This option provides the value to which the kernel should be aligned and it is 0x200000 by default. Once we have the aligned output address, we go through the memory regions which we got with the help of the BIOS e820 service and collect the regions which are suitable for the decompressed kernel image:

for (i = 0; i < real_mode->e820_entries; i++) {
	process_e820_entry(&real_mode->e820_map[i], minimum, size);
}

Recall that we collected e820_entries in the second part of the Kernel booting process series. The process_e820_entry function first checks that the e820 memory region is RAM, that the start address of the memory region is below the maximum allowed aslr offset and that the memory region does not end below the minimum address:

struct mem_vector region, img;

if (entry->type != E820_RAM)
	return;

if (entry->addr >= CONFIG_RANDOMIZE_BASE_MAX_OFFSET)
	return;

if (entry->addr + entry->size < minimum)
	return;

After this, we store the start address and the size of the e820 memory region in the mem_vector structure (we saw the definition of this structure above):

region.start = entry->addr;
region.size = entry->size;

After storing these values, we align region.start as we did in the find_random_addr function and check that we didn't get an address that is beyond the end of the original memory region:

region.start = ALIGN(region.start, CONFIG_PHYSICAL_ALIGN);

if (region.start > entry->addr + entry->size)
	return;

In the next step we subtract the difference between the aligned and the original start address from the region size. Then, if the last address in the memory region is bigger than CONFIG_RANDOMIZE_BASE_MAX_OFFSET, we reduce the memory region size so that the end of the kernel image will be below the maximum aslr offset:

region.size -= region.start - entry->addr;

if (region.start + region.size > CONFIG_RANDOMIZE_BASE_MAX_OFFSET)
		region.size = CONFIG_RANDOMIZE_BASE_MAX_OFFSET - region.start;
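
To see the checks and the trimming with concrete numbers, here is a self-contained sketch of the steps shown so far. It is my own condensed version and not the kernel function: the names are hypothetical, the configuration constants are passed as parameters, and the E820_RAM type check is left out.

```c
#include <stdbool.h>
#include <stdint.h>

struct range {
    uint64_t start;
    uint64_t size;
};

/* Hypothetical helper: returns false when the entry cannot hold the kernel. */
bool usable_region(uint64_t addr, uint64_t size, uint64_t minimum,
                   uint64_t align, uint64_t max_offset, struct range *out)
{
    if (addr >= max_offset)         /* starts above the allowed window */
        return false;
    if (addr + size < minimum)      /* ends below the minimum address  */
        return false;

    uint64_t start = (addr + align - 1) & ~(align - 1);
    if (start > addr + size)        /* alignment pushed us past the end */
        return false;

    uint64_t left = size - (start - addr);  /* drop the bytes skipped by aligning */
    if (start + left > max_offset)          /* clip to the allowed window */
        left = max_offset - start;

    out->start = start;
    out->size = left;
    return true;
}
```

For an entry that starts at 0x100000 and ends at 0x80000000, with an alignment of 0x200000 and a maximum offset of 0x40000000, the usable region starts at 0x200000 and is 0x3fe00000 bytes long.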

In the end we walk through the region in steps of CONFIG_PHYSICAL_ALIGN and check that each candidate position of the image does not overlap the unsafe areas, such as the kernel command line, the initrd and so on:

for (img.start = region.start, img.size = image_size ;
	     mem_contains(&region, &img) ;
	     img.start += CONFIG_PHYSICAL_ALIGN) {
		if (mem_avoid_overlap(&img))
			continue;
		slots_append(img.start);
	}

If a candidate position does not overlap the unsafe regions, we call the slots_append function with its start address. The slots_append function just collects these start addresses in the slots array:

slots[slot_max++] = addr;

which is defined as:

static unsigned long slots[CONFIG_RANDOMIZE_BASE_MAX_OFFSET /
			   CONFIG_PHYSICAL_ALIGN];
static unsigned long slot_max;
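
The slot collection can be tried outside of the kernel with a small sketch like the one below. It follows the same idea as the loop above, but the names are mine, the avoid list and the slot array are passed in explicitly, and the two predicates are my reading of what mem_contains and mem_avoid_overlap have to answer.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct span {
    uint64_t start;
    uint64_t size;
};

/* True if item lies entirely inside region. */
bool span_contains(const struct span *region, const struct span *item)
{
    return item->start >= region->start &&
           item->start + item->size <= region->start + region->size;
}

/* True if the two spans share at least one byte. */
bool span_overlaps(const struct span *a, const struct span *b)
{
    return a->start < b->start + b->size && b->start < a->start + a->size;
}

/* Slide an image-sized window through region and record every safe start. */
size_t collect_slots(struct span region, uint64_t image_size, uint64_t step,
                     const struct span *avoid, size_t n_avoid,
                     uint64_t *slots, size_t max_slots)
{
    size_t n = 0;
    struct span img = { region.start, image_size };

    for (; span_contains(&region, &img); img.start += step) {
        bool blocked = false;

        for (size_t i = 0; i < n_avoid; i++) {
            if (span_overlaps(&img, &avoid[i])) {
                blocked = true;
                break;
            }
        }
        if (!blocked && n < max_slots)
            slots[n++] = img.start;
    }
    return n;
}
```

Take a 16 MiB region starting at 0x1000000, a 4 MiB image, a step of 0x200000 and one unsafe area at 0x1400000 that is 0x200000 bytes long. The sketch finds five slots: 0x1000000, 0x1600000, 0x1800000, 0x1a00000 and 0x1c00000. The positions 0x1200000 and 0x1400000 are rejected because the image would touch the unsafe area.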

After process_e820_entry has been executed for every entry, we have an array of addresses which are safe for the decompressed kernel. Next we call the slots_fetch_random function to get a random item from this array:

if (slot_max == 0)
	return 0;

return slots[get_random_long() % slot_max];

where the get_random_long function checks different CPU flags, such as X86_FEATURE_RDRAND or X86_FEATURE_TSC, and chooses a method for getting a random number (it can be obtained with the RDRAND instruction, the time stamp counter, the programmable interval timer and so on). After retrieving the random address, the execution of choose_kernel_location is finished.

Now let's get back to misc.c. After getting the address for the kernel image, there are some checks to make sure that the retrieved random address is correctly aligned and that the address is otherwise valid.

After all these checks we will see the familiar message:

Decompressing Linux... 

and call the __decompress function, which will decompress the kernel. The implementation of __decompress depends on which decompression algorithm was chosen during kernel compilation:

#ifdef CONFIG_KERNEL_GZIP
#include "../../../../lib/decompress_inflate.c"
#endif

#ifdef CONFIG_KERNEL_BZIP2
#include "../../../../lib/decompress_bunzip2.c"
#endif

#ifdef CONFIG_KERNEL_LZMA
#include "../../../../lib/decompress_unlzma.c"
#endif

#ifdef CONFIG_KERNEL_XZ
#include "../../../../lib/decompress_unxz.c"
#endif

#ifdef CONFIG_KERNEL_LZO
#include "../../../../lib/decompress_unlzo.c"
#endif

#ifdef CONFIG_KERNEL_LZ4
#include "../../../../lib/decompress_unlz4.c"
#endif

After the kernel has been decompressed, the last two functions are parse_elf and handle_relocations. The main point of these functions is to move the uncompressed kernel image to the correct place in memory. The fact is that the kernel is decompressed in-place and we still need to move it to the correct address. As we already know, the kernel image is an ELF executable, so the main goal of the parse_elf function is to move the loadable segments to the correct address. We can see the loadable segments in the output of the readelf util:

readelf -l vmlinux

Elf file type is EXEC (Executable file)
Entry point 0x1000000
There are 5 program headers, starting at offset 64

Program Headers:
  Type           Offset             VirtAddr           PhysAddr
                 FileSiz            MemSiz              Flags  Align
  LOAD           0x0000000000200000 0xffffffff81000000 0x0000000001000000
                 0x0000000000893000 0x0000000000893000  R E    200000
  LOAD           0x0000000000a93000 0xffffffff81893000 0x0000000001893000
                 0x000000000016d000 0x000000000016d000  RW     200000
  LOAD           0x0000000000c00000 0x0000000000000000 0x0000000001a00000
                 0x00000000000152d8 0x00000000000152d8  RW     200000
  LOAD           0x0000000000c16000 0xffffffff81a16000 0x0000000001a16000
                 0x0000000000138000 0x000000000029b000  RWE    200000

The goal of the parse_elf function is to load these segments to the output address that we got from the choose_kernel_location function. This function starts by checking the ELF signature:

Elf64_Ehdr ehdr;
Elf64_Phdr *phdrs, *phdr;

memcpy(&ehdr, output, sizeof(ehdr));

if (ehdr.e_ident[EI_MAG0] != ELFMAG0 ||
   ehdr.e_ident[EI_MAG1] != ELFMAG1 ||
   ehdr.e_ident[EI_MAG2] != ELFMAG2 ||
   ehdr.e_ident[EI_MAG3] != ELFMAG3) {
   error("Kernel is not a valid ELF file");
   return;
}

and if it is not valid, it prints an error message and halts. If we got a valid ELF file, we go through all the program headers of the given ELF file and copy all loadable segments to the correct address in the output buffer:

	for (i = 0; i < ehdr.e_phnum; i++) {
		phdr = &phdrs[i];

		switch (phdr->p_type) {
		case PT_LOAD:
#ifdef CONFIG_RELOCATABLE
			dest = output;
			dest += (phdr->p_paddr - LOAD_PHYSICAL_ADDR);
#else
			dest = (void *)(phdr->p_paddr);
#endif
			memcpy(dest,
			       output + phdr->p_offset,
			       phdr->p_filesz);
			break;
		default: /* Ignore other PT_* */ break;
		}
	}
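
Two small pieces of the code above are easy to check by hand: the signature test and the destination address in the CONFIG_RELOCATABLE case. The sketch below uses hypothetical helper names and an output address that I picked for illustration; the p_paddr value is the second LOAD segment from the readelf output.

```c
#include <stdbool.h>
#include <stdint.h>

/* True if buf starts with the four ELF magic bytes: 0x7f 'E' 'L' 'F'. */
bool has_elf_magic(const unsigned char *buf)
{
    return buf[0] == 0x7f && buf[1] == 'E' && buf[2] == 'L' && buf[3] == 'F';
}

/*
 * Destination of a PT_LOAD segment for a relocatable kernel: keep the
 * segment's distance from the compiled load address, but measure it
 * from the chosen output address instead.
 */
uint64_t segment_dest(uint64_t output, uint64_t p_paddr,
                      uint64_t load_physical_addr)
{
    return output + (p_paddr - load_physical_addr);
}
```

If the kernel was compiled for 0x1000000 and the chosen output address is 0x4000000, the segment with a physical address of 0x1893000 is copied to 0x4893000.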

That's all. From now on all loadable segments are in the correct place. The last function, handle_relocations, adjusts addresses in the kernel image and is called only if kASLR was enabled during kernel configuration.

After the kernel is relocated, we return from decompress_kernel to arch/x86/boot/compressed/head_64.S. The address of the kernel will be in the rax register and we jump to it:

jmp	*%rax

That's all. Now we are in the kernel!

Conclusion

This is the end of the fifth and last part about the Linux kernel booting process. We will not see any more posts about kernel booting (maybe only updates to this and the previous posts), but there will be many posts about other kernel internals.

The next chapter will be about kernel initialization and we will see the first steps in the Linux kernel initialization code.

If you have any questions or suggestions, write me a comment or ping me on twitter.

Please note that English is not my first language, and I am really sorry for any inconvenience. If you find any mistakes, please send me a PR to linux-insides.