From 4830c64aeb62b9f71357c1d6f5f5accbd69e57c2 Mon Sep 17 00:00:00 2001 From: 0xAX <0xAX@users.noreply.github.com> Date: Mon, 28 Dec 2015 17:15:16 +0600 Subject: [PATCH 1/5] Update linux-bootstrap-1.md --- Booting/linux-bootstrap-1.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/Booting/linux-bootstrap-1.md b/Booting/linux-bootstrap-1.md index c9f36bd..20bebfa 100644 --- a/Booting/linux-bootstrap-1.md +++ b/Booting/linux-bootstrap-1.md @@ -388,7 +388,7 @@ This can lead to 3 different scenarios: Let's look at all three of these scenarios: -1. `ss` has a correct address (0x10000). In this case we go to label [2](https://github.com/torvalds/linux/blob/master/arch/x86/boot/header.S#L481): +* `ss` has a correct address (0x10000). In this case we go to label [2](https://github.com/torvalds/linux/blob/master/arch/x86/boot/header.S#L481): ``` 2: andw $~3, %dx @@ -403,7 +403,7 @@ Here we can see the alignment of `dx` (contains `sp` given by bootloader) to 4 b ![stack](http://oi58.tinypic.com/16iwcis.jpg) -2. In the second scenario, (`ss` != `ds`). First of all put the [_end](https://github.com/torvalds/linux/blob/master/arch/x86/boot/setup.ld#L52) (address of end of setup code) value in `dx` and check the `loadflags` header field with the `testb` instruction to see whether we can use the heap or not. [loadflags](https://github.com/torvalds/linux/blob/master/arch/x86/boot/header.S#L321) is a bitmask header which is defined as: +* In the second scenario, (`ss` != `ds`). First of all put the [_end](https://github.com/torvalds/linux/blob/master/arch/x86/boot/setup.ld#L52) (address of end of setup code) value in `dx` and check the `loadflags` header field with the `testb` instruction to see whether we can use the heap or not. [loadflags](https://github.com/torvalds/linux/blob/master/arch/x86/boot/header.S#L321) is a bitmask header which is defined as: ```C #define LOADED_HIGH (1<<0) @@ -429,7 +429,7 @@ If the `CAN_USE_HEAP` bit is set, put `heap_end_ptr` in `dx` which points to `_e ![stack](http://oi62.tinypic.com/dr7b5w.jpg) -3. When `CAN_USE_HEAP` is not set, we just use a minimal stack from `_end` to `_end + STACK_SIZE`: +* When `CAN_USE_HEAP` is not set, we just use a minimal stack from `_end` to `_end + STACK_SIZE`: ![minimal stack](http://oi60.tinypic.com/28w051y.jpg) From 17351ed45a64e7c5caa595dc012297cea8690137 Mon Sep 17 00:00:00 2001 From: 0xAX <0xAX@users.noreply.github.com> Date: Tue, 29 Dec 2015 13:24:37 +0600 Subject: [PATCH 2/5] Update linux-bootstrap-4.md --- Booting/linux-bootstrap-4.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/Booting/linux-bootstrap-4.md b/Booting/linux-bootstrap-4.md index 519c468..03c98cf 100644 --- a/Booting/linux-bootstrap-4.md +++ b/Booting/linux-bootstrap-4.md @@ -389,7 +389,7 @@ Now we can build the top level page table - `PML4` - with: Here we get the address stored in the `ebx` with `pgtable` offset and put it in `edi`. Next we put this address with offset `0x1007` in the `eax` register. `0x1007` is 4096 bytes (size of the PML4) + 7 (PML4 entry flags - `PRESENT+RW+USER`) and puts `eax` in `edi`. After this manipulation `edi` will contain the address of the first Page Directory Pointer Entry with flags - `PRESENT+RW+USER`. -In the next step we build 4 Page Directory entries in the Page Directory Pointer table, where the first entry will be with `0x7` flags and the others with `0x8`: +In the next step we build 4 Page Directory entries in the Page Directory Pointer table, where the first entry will be with `0x7` flags or present, write, userspace (`PRESENT WRITE | USER`): ```assembly leal pgtable + 0x1000(%ebx), %edi @@ -404,7 +404,7 @@ In the next step we build 4 Page Directory entries in the Page Directory Pointer We put the base address of the page directory pointer table in `edi` and the address of the first page directory pointer entry in `eax`. Put `4` in the `ecx` register, it will be a counter in the following loop and write the address of the first page directory pointer table entry to the `edi` register. -After this `edi` will contain the address of the first page directory pointer entry with flags `0x7`. Next we just calculate the address of following page directory pointer entries with flags `0x8` and write their addresses to `edi`. +After this `edi` will contain the address of the first page directory pointer entry with flags `0x7`. Next we just calculate the address of following page directory pointer entries where each entry is 8 bytes, and write their addresses to `eax`. The next step is building the `2048` page table entries by 2 megabytes: @@ -419,7 +419,7 @@ The next step is building the `2048` page table entries by 2 megabytes: jnz 1b ``` -Here we do almost the same as in the previous example, except the first entry will be with flags - `$0x00000183` - `PRESENT + WRITE + MBZ` and all other entries with `0x8`. In the end we will have 2048 pages by 2 megabytes. +Here we do almost the same as in the previous example, except the first entry will be with flags - `$0x00000183` - `PRESENT + WRITE + MBZ` and all other entries. In the end we will have 2048 pages by 2 megabytes. Our early page table structure are done, it maps 4 gigabytes of memory and now we can put the address of the high-level page table - `PML4` - in `cr3` control register: From 8deab213d99850997daeb6372230137a6569f31d Mon Sep 17 00:00:00 2001 From: zhaoxiaoqiang Date: Wed, 30 Dec 2015 15:36:39 +0800 Subject: [PATCH 3/5] refine statements about early page table build --- Booting/linux-bootstrap-4.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/Booting/linux-bootstrap-4.md b/Booting/linux-bootstrap-4.md index 03c98cf..6dbb5c3 100644 --- a/Booting/linux-bootstrap-4.md +++ b/Booting/linux-bootstrap-4.md @@ -389,7 +389,7 @@ Now we can build the top level page table - `PML4` - with: Here we get the address stored in the `ebx` with `pgtable` offset and put it in `edi`. Next we put this address with offset `0x1007` in the `eax` register. `0x1007` is 4096 bytes (size of the PML4) + 7 (PML4 entry flags - `PRESENT+RW+USER`) and puts `eax` in `edi`. After this manipulation `edi` will contain the address of the first Page Directory Pointer Entry with flags - `PRESENT+RW+USER`. -In the next step we build 4 Page Directory entries in the Page Directory Pointer table, where the first entry will be with `0x7` flags or present, write, userspace (`PRESENT WRITE | USER`): +In the next step we build 4 Page Directory entries in the Page Directory Pointer table with `0x7` flags or present, write, userspace (`PRESENT WRITE | USER`): ```assembly leal pgtable + 0x1000(%ebx), %edi @@ -406,7 +406,7 @@ We put the base address of the page directory pointer table in `edi` and the add After this `edi` will contain the address of the first page directory pointer entry with flags `0x7`. Next we just calculate the address of following page directory pointer entries where each entry is 8 bytes, and write their addresses to `eax`. -The next step is building the `2048` page table entries by 2 megabytes: +The next step is building the `2048` page table entries with 2-MByte page: ```assembly leal pgtable + 0x2000(%ebx), %edi @@ -419,7 +419,7 @@ The next step is building the `2048` page table entries by 2 megabytes: jnz 1b ``` -Here we do almost the same as in the previous example, except the first entry will be with flags - `$0x00000183` - `PRESENT + WRITE + MBZ` and all other entries. In the end we will have 2048 pages by 2 megabytes. +Here we do almost the same as in the previous example, all entries will be with flags - `$0x00000183` - `PRESENT + WRITE + MBZ`. In the end we will have 2048 pages with 2-MByte page. Our early page table structure are done, it maps 4 gigabytes of memory and now we can put the address of the high-level page table - `PML4` - in `cr3` control register: From 9283eb3ab1efcac1e6663ddf84ae826245fa86b2 Mon Sep 17 00:00:00 2001 From: spacewander Date: Wed, 6 Jan 2016 21:46:56 +0800 Subject: [PATCH 4/5] fix typo from interrupts-2 --- interrupts/interrupts-2.md | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/interrupts/interrupts-2.md b/interrupts/interrupts-2.md index 9f58ea1..ec2996d 100644 --- a/interrupts/interrupts-2.md +++ b/interrupts/interrupts-2.md @@ -129,7 +129,7 @@ We pass `irq_stack_union` symbol to the `INIT_PER_CPU_VAR` macro which just conc INIT_PER_CPU(irq_stack_union); ``` -It tells us that the address of the `init_per_cpu__irq_stack_union` will be `irq_stack_union + __per_cpu_load`. Now we need to understand where `init_per_cpu__irq_stack_union` and `__per_cpu_load` are and what they mean. The first `irq_stack_union` is defined in the [arch/x86/include/asm/processor.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/processor.h) with the `DECLARE_INIT_PER_CPU` macro which expands to call the `init_per_cpu_var` macro: +It tells us that the address of the `init_per_cpu__irq_stack_union` will be `irq_stack_union + __per_cpu_load`. Now we need to understand where `init_per_cpu__irq_stack_union` and `__per_cpu_load` are what they mean. The first `irq_stack_union` is defined in the [arch/x86/include/asm/processor.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/processor.h) with the `DECLARE_INIT_PER_CPU` macro which expands to call the `init_per_cpu_var` macro: ```C DECLARE_INIT_PER_CPU(irq_stack_union); @@ -266,7 +266,7 @@ union irq_stack_union { }; ``` -which defined in the [arch/x86/include/asm/processor.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/processor.h). We know that [unioun](http://en.wikipedia.org/wiki/Union_type) in the [C](http://en.wikipedia.org/wiki/C_%28programming_language%29) programming language is a data structure which stores only one field in a memory. We can see here that structure has first field - `gs_base` which is 40 bytes size and represents bottom of the `irq_stack`. So, after this our check with the `BUILD_BUG_ON` macro should end successfully. (you can read the first part about Linux kernel initialization [process](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-1.html) if you're interesting about the `BUILD_BUG_ON` macro). +which defined in the [arch/x86/include/asm/processor.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/processor.h). We know that [union](http://en.wikipedia.org/wiki/Union_type) in the [C](http://en.wikipedia.org/wiki/C_%28programming_language%29) programming language is a data structure which stores only one field in a memory. We can see here that structure has first field - `gs_base` which is 40 bytes size and represents bottom of the `irq_stack`. So, after this our check with the `BUILD_BUG_ON` macro should end successfully. (you can read the first part about Linux kernel initialization [process](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-1.html) if you're interesting about the `BUILD_BUG_ON` macro). After this we calculate new `canary` value based on the random number and [Time Stamp Counter](http://en.wikipedia.org/wiki/Time_Stamp_Counter): @@ -304,7 +304,7 @@ This macro defined in the [include/linux/irqflags.h](https://github.com/torvalds #endif ``` -They are both similar and as you can see have only one difference: the `local_irq_disable` macro contains call of the `trace_hardirqs_off` when `CONFIG_TRACE_IRQFLAGS_SUPPORT` is enabled. There is special feature in the [lockdep](http://lwn.net/Articles/321663/) subsystem - `irq-flags tracing` for tracing `hardirq` and `stoftirq` state. In ourcase `lockdep` subsytem can give us interesting information about hard/soft irqs on/off events which are occurs in the system. The `trace_hardirqs_off` function defined in the [kernel/locking/lockdep.c](https://github.com/torvalds/linux/blob/master/kernel/locking/lockdep.c): +They are both similar and as you can see have only one difference: the `local_irq_disable` macro contains call of the `trace_hardirqs_off` when `CONFIG_TRACE_IRQFLAGS_SUPPORT` is enabled. There is special feature in the [lockdep](http://lwn.net/Articles/321663/) subsystem - `irq-flags tracing` for tracing `hardirq` and `stoftirq` state. In our case `lockdep` subsytem can give us interesting information about hard/soft irqs on/off events which are occurs in the system. The `trace_hardirqs_off` function defined in the [kernel/locking/lockdep.c](https://github.com/torvalds/linux/blob/master/kernel/locking/lockdep.c): ```C void trace_hardirqs_off(void) @@ -314,7 +314,7 @@ void trace_hardirqs_off(void) EXPORT_SYMBOL(trace_hardirqs_off); ``` -and just calls `trace_hardirqs_off_caller` function. The `trace_hardirqs_off_caller` checks the `hardirqs_enabled` filed of the current process increment the `redundant_hardirqs_off` if call of the `local_irq_disable` was redundant or the `hardirqs_off_events` if it was not. These two fields and other `lockdep` statistic related fields are defined in the [kernel/locking/lockdep_insides.h](https://github.com/torvalds/linux/blob/master/kernel/locking/lockdep_insides.h) and located in the `lockdep_stats` structure: +and just calls `trace_hardirqs_off_caller` function. The `trace_hardirqs_off_caller` checks the `hardirqs_enabled` field of the current process and increases the `redundant_hardirqs_off` if call of the `local_irq_disable` was redundant or the `hardirqs_off_events` if it was not. These two fields and other `lockdep` statistic related fields are defined in the [kernel/locking/lockdep_insides.h](https://github.com/torvalds/linux/blob/master/kernel/locking/lockdep_insides.h) and located in the `lockdep_stats` structure: ```C struct lockdep_stats { @@ -371,7 +371,7 @@ static inline void native_irq_disable(void) } ``` -And you already must remember that `cli` instruction clears the [IF](http://en.wikipedia.org/wiki/Interrupt_flag) flag which determines ability of a processor to handle and interrupt or an exception. Besides the `local_irq_disable`, as you already can know there is an inverse macr - `local_irq_enable`. This macro has the same tracing mechanism and very similar on the `local_irq_enable`, but as you can understand from its name, it enables interrupts with the `sti` instruction: +And you already must remember that `cli` instruction clears the [IF](http://en.wikipedia.org/wiki/Interrupt_flag) flag which determines ability of a processor to handle an interrupt or an exception. Besides the `local_irq_disable`, as you already can know there is an inverse macro - `local_irq_enable`. This macro has the same tracing mechanism and very similar on the `local_irq_enable`, but as you can understand from its name, it enables interrupts with the `sti` instruction: ```C static inline void native_irq_enable(void) From 90c764c91495f1a43957b0d68108e29394c3a896 Mon Sep 17 00:00:00 2001 From: spacewander Date: Thu, 7 Jan 2016 18:17:15 +0800 Subject: [PATCH 5/5] fix interrupts/interrupts-3 --- interrupts/interrupts-3.md | 18 +++++++++--------- 1 file changed, 9 insertions(+), 9 deletions(-) diff --git a/interrupts/interrupts-3.md b/interrupts/interrupts-3.md index afde576..db2eb61 100644 --- a/interrupts/interrupts-3.md +++ b/interrupts/interrupts-3.md @@ -4,7 +4,7 @@ Interrupts and Interrupt Handling. Part 3. Interrupt handlers -------------------------------------------------------------------------------- -This is the third part of the [chapter](http://0xax.gitbooks.io/linux-insides/content/interrupts/index.html) about an interrupts and an exceptions handling and in the previous [part](http://0xax.gitbooks.io/linux-insides/content/interrupts/index.html) we stoped in the `setup_arch` function from the [arch/x86/kernel/setup.c](https://github.com/torvalds/linux/blame/master/arch/x86/kernel/setup.c) on the setting of the two exceptions handlers for the two following exceptions: +This is the third part of the [chapter](http://0xax.gitbooks.io/linux-insides/content/interrupts/index.html) about an interrupts and an exceptions handling and in the previous [part](http://0xax.gitbooks.io/linux-insides/content/interrupts/index.html) we stopped in the `setup_arch` function from the [arch/x86/kernel/setup.c](https://github.com/torvalds/linux/blame/master/arch/x86/kernel/setup.c) on the setting of the two exceptions handlers for the two following exceptions: * `#DB` - debug exception, transfers control from the interrupted process to the debug handler; * `#BP` - breakpoint exception, caused by the `int 3` instruction. @@ -104,14 +104,14 @@ As you can note, the `set_intr_gate_ist` and `set_system_intr_gate_ist` function * `&debug`; * `&int3`. -You will not find these functions in the C code. All that can be found in in the `*.c/*.h` files only definition of this functions in the [arch/x86/include/asm/traps.h](https://github.com/torvalds/linux/tree/master/arch/x86/include/asm/traps.h): +You will not find these functions in the C code. All that can be found in the `*.c/*.h` files only definition of this functions in the [arch/x86/include/asm/traps.h](https://github.com/torvalds/linux/tree/master/arch/x86/include/asm/traps.h): ```C asmlinkage void debug(void); asmlinkage void int3(void); ``` -But we can see `asmlinkage` descriptor here. The `asmlinkage` is the special specificator of the [gcc](http://en.wikipedia.org/wiki/GNU_Compiler_Collection). Actually for a `C` functions which are will be called from assembly, we need in explicit declaration of the function calling convention. In our case, if function maked with `asmlinkage` descriptor, then `gcc` will compile the function to retrieve parameters from stack. So, both handlers are defined in the [arch/x86/kernel/entry_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/entry_64.S) assembly source code file with the `idtentry` macro: +But we can see `asmlinkage` descriptor here. The `asmlinkage` is the special specificator of the [gcc](http://en.wikipedia.org/wiki/GNU_Compiler_Collection). Actually for a `C` functions which are called from assembly, we need in explicit declaration of the function calling convention. In our case, if function maked with `asmlinkage` descriptor, then `gcc` will compile the function to retrieve parameters from stack. So, both handlers are defined in the [arch/x86/kernel/entry_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/entry_64.S) assembly source code file with the `idtentry` macro: ```assembly idtentry debug do_debug has_error_code=0 paranoid=1 shift_ist=DEBUG_STACK @@ -199,7 +199,7 @@ The `pushq_cfi` macro defined in the [arch/x86/include/asm/dwarf2.h](https://git .endm ``` -Pay attention on the `$-1`. We already know that when an exception occrus, the processor pushes `ss`, `rsp`, `rflags`, `cs` and `rip` on the stack: +Pay attention on the `$-1`. We already know that when an exception occurs, the processor pushes `ss`, `rsp`, `rflags`, `cs` and `rip` on the stack: ```C #define RIP 16*8 @@ -239,14 +239,14 @@ After this we check `paranoid` and if it is set we check first three `CPL` bits. .endif ``` -If we came from userspace we jump on the label `1` which starts from the `call error_entry` instruction. The `error_entry` saves all registers in the `pt_regs` structure which presetens an interrupt/exception stack frame and defined in the [arch/x86/include/uapi/asm/ptrace.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/uapi/asm/ptrace.h). It saves common and extra registers on the stack with the: +If we came from userspace we jump on the label `1` which starts from the `call error_entry` instruction. The `error_entry` saves all registers in the `pt_regs` structure which presents an interrupt/exception stack frame and defined in the [arch/x86/include/uapi/asm/ptrace.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/uapi/asm/ptrace.h). It saves common and extra registers on the stack with the: ```assembly SAVE_C_REGS 8 SAVE_EXTRA_REGS 8 ``` -from `rdi` to `r15` and executes [swapgs](http://www.felixcloutier.com/x86/SWAPGS.html) instruction. This instruction provides a method to for the Linux kernel to obtain a pointer to the kernel data structures and save the user's `gsbase`. After this we will exit from the `error_entry` with the `ret` instruction. After the `error_entry` finished to execute, since we came from userspace we need to switch on kernel interrupt stack: +from `rdi` to `r15` and executes [swapgs](http://www.felixcloutier.com/x86/SWAPGS.html) instruction. This instruction provides a method for the Linux kernel to obtain a pointer to the kernel data structures and save the user's `gsbase`. After this we will exit from the `error_entry` with the `ret` instruction. After the `error_entry` finished to execute, since we came from userspace we need to switch on kernel interrupt stack: ```assembly movq %rsp,%rdi @@ -277,7 +277,7 @@ and put pointer of the `pt_regs` again in the `rdi`, and in the last step we cal call \do_sym ``` -So, realy exceptions handlers are `do_debug` and `do_int3` functions. We will see these function in this part, but little later. First of all let's look on the preparations before a processor will transfer control to an interrupt handler. In another way if `paranoid` is set, but it is not 1, we call `paranoid_entry` which makes almost the same that `error_entry`, but it checks current mode with more slow but accurate way: +So, real exceptions handlers are `do_debug` and `do_int3` functions. We will see these function in this part, but little later. First of all let's look on the preparations before a processor will transfer control to an interrupt handler. In another way if `paranoid` is set, but it is not 1, we call `paranoid_entry` which makes almost the same that `error_entry`, but it checks current mode with more slow but accurate way: ```assembly ENTRY(paranoid_entry) @@ -434,9 +434,9 @@ exit: ist_exit(regs, prev_state); ``` -In the end we disabled `irqs`, decrement value of the `debug_stack_usage` and exit from the exception handler with the `ist_exit` function. +In the end we disable `irqs`, decrease value of the `debug_stack_usage` and exit from the exception handler with the `ist_exit` function. -The second exception handler is `do_int3` defined in the same source code file - [arch/x86/kernel/traps.c](https://github.com/torvalds/linux/tree/master/arch/x86/kernel/traps.c). In the `do_int3` we makes almost the same that in the `do_debug` handler. We get the previous state with the `ist_enter`, increment and decrement the `debug_stack_usage` per-cpu variable, enabled and disable local interrupts. But of course there is one difference between these two handlers. We need to lock and than sync processor cores during breakpoint patching. +The second exception handler is `do_int3` defined in the same source code file - [arch/x86/kernel/traps.c](https://github.com/torvalds/linux/tree/master/arch/x86/kernel/traps.c). In the `do_int3` we make almost the same that in the `do_debug` handler. We get the previous state with the `ist_enter`, increase and decrease the `debug_stack_usage` per-cpu variable, enable and disable local interrupts. But of course there is one difference between these two handlers. We need to lock and then sync processor cores during breakpoint patching. That's all.