# Kernel Boot Process

This chapter describes the Linux kernel boot process. Here you will see a series of posts which describe the full cycle of the kernel loading process:

* [From the bootloader to kernel](linux-bootstrap-1.md) - describes all stages from turning on the computer to running the first instruction of the kernel.
* [First steps in the kernel setup code](linux-bootstrap-2.md) - describes the first steps in the kernel setup code. You will see heap initialization and the querying of different parameters like EDD, IST, etc.
* [Video mode initialization and transition to protected mode](linux-bootstrap-3.md) - describes video mode initialization in the kernel setup code and the transition to protected mode.
* [Transition to 64-bit mode](linux-bootstrap-4.md) - describes the preparation for the transition into 64-bit mode and the details of the transition.
* [Kernel Decompression](linux-bootstrap-5.md) - describes the preparation before kernel decompression and the details of direct decompression.
* [Kernel load address randomization](linux-bootstrap-6.md) - describes randomization of the Linux kernel load address.

This chapter coincides with `Linux kernel v4.17`.
Kernel booting process. Part 6.
================================================================================

Introduction
--------------------------------------------------------------------------------

This is the sixth part of the `Kernel booting process` series. In the [previous part](linux-bootstrap-5.md) we saw the end of the kernel boot process, but we skipped some important advanced topics.

As you may remember, the entry point of the Linux kernel is the `start_kernel` function from the [main.c](https://github.com/torvalds/linux/blob/v4.16/init/main.c) source code file, which starts executing at the `LOAD_PHYSICAL_ADDR` address. This address depends on the `CONFIG_PHYSICAL_START` kernel configuration option, which is `0x1000000` by default:
```
config PHYSICAL_START
	hex "Physical address where the kernel is loaded" if (EXPERT || CRASH_DUMP)
	default "0x1000000"
	---help---
	  This gives the physical address where the kernel is loaded.
	  ...
	  ...
	  ...
```
This value may be changed during kernel configuration, but the load address can also be selected to be a random value. For this purpose, the `CONFIG_RANDOMIZE_BASE` kernel configuration option should be enabled during kernel configuration.

In this case, the physical address at which the Linux kernel image will be decompressed and loaded will be randomized. This part considers the case when this option is enabled and the load address of the kernel image is randomized for [security reasons](https://en.wikipedia.org/wiki/Address_space_layout_randomization).

Initialization of page tables
--------------------------------------------------------------------------------

Before the kernel decompressor starts to look for a random memory range where the kernel will be decompressed and loaded, the identity mapped page tables should be initialized. If a [bootloader](https://en.wikipedia.org/wiki/Booting) used the [16-bit or 32-bit boot protocol](https://github.com/torvalds/linux/blob/v4.16/Documentation/x86/boot.txt), we already have page tables. But in any case, we may need new pages on demand if the kernel decompressor selects a memory range outside of them. That's why we need to build new identity mapped page tables.

Yes, building identity mapped page tables is one of the first steps during randomization of the load address. But before we consider it, let's try to remember how we came to this point.

In the [previous part](linux-bootstrap-5.md), we saw the transition to [long mode](https://en.wikipedia.org/wiki/Long_mode) and the jump to the kernel decompressor entry point - the `extract_kernel` function. The randomization starts here with the call of the:
```C
void choose_random_location(unsigned long input,
			    unsigned long input_size,
			    unsigned long *output,
			    unsigned long output_size,
			    unsigned long *virt_addr)
{}
```

function. As you may see, this function takes the five following parameters:

* `input`;
* `input_size`;
* `output`;
* `output_size`;
* `virt_addr`.

Let's try to understand what these parameters are. The first parameter, `input`, comes from the parameters of the `extract_kernel` function in the [arch/x86/boot/compressed/misc.c](https://github.com/torvalds/linux/blob/v4.16/arch/x86/boot/compressed/misc.c) source code file:
```C
asmlinkage __visible void *extract_kernel(void *rmode, memptr heap,
				  unsigned char *input_data,
				  unsigned long input_len,
				  unsigned char *output,
				  unsigned long output_len)
{
	...
	...
	...
	choose_random_location((unsigned long)input_data, input_len,
			       (unsigned long *)&output,
			       max(output_len, kernel_total_size),
			       &virt_addr);
	...
	...
	...
}
```

This parameter is passed from the assembler code:

```assembly
leaq	input_data(%rip), %rdx
```

in [arch/x86/boot/compressed/head_64.S](https://github.com/torvalds/linux/blob/v4.16/arch/x86/boot/compressed/head_64.S). The `input_data` symbol is generated by the little [mkpiggy](https://github.com/torvalds/linux/blob/v4.16/arch/x86/boot/compressed/mkpiggy.c) program. If you have compiled the Linux kernel source code yourself, you may find the file generated by this program at `linux/arch/x86/boot/compressed/piggy.S`. In my case this file looks like:
```assembly
.section ".rodata..compressed","a",@progbits
.globl z_input_len
z_input_len = 6988196
.globl z_output_len
z_output_len = 29207032
.globl input_data, input_data_end
input_data:
.incbin "arch/x86/boot/compressed/vmlinux.bin.gz"
input_data_end:
```

As you may see, it contains four global symbols. The first two, `z_input_len` and `z_output_len`, are the sizes of the compressed and uncompressed `vmlinux.bin.gz`. The third is our `input_data`, and as you may see it points to the Linux kernel image in raw binary format (with all debugging symbols, comments and relocation information stripped). The last, `input_data_end`, points to the end of the compressed Linux image.

So, the first parameter of the `choose_random_location` function is the pointer to the compressed kernel image that is embedded into the `piggy.o` object file.

The second parameter of the `choose_random_location` function is the `z_input_len` that we have just seen.

The third and fourth parameters of the `choose_random_location` function are the address where to place the decompressed kernel image and the length of the decompressed kernel image respectively. The address where to put the decompressed kernel comes from [arch/x86/boot/compressed/head_64.S](https://github.com/torvalds/linux/blob/v4.16/arch/x86/boot/compressed/head_64.S) and is the address of `startup_32` aligned to a 2 megabyte boundary. The size of the decompressed kernel comes from the same `piggy.S` and is `z_output_len`.
The last parameter of the `choose_random_location` function is the virtual address of the kernel load address. As we may see, by default it coincides with the default physical load address:

```C
unsigned long virt_addr = LOAD_PHYSICAL_ADDR;
```

which depends on the kernel configuration:

```C
#define LOAD_PHYSICAL_ADDR ((CONFIG_PHYSICAL_START \
				+ (CONFIG_PHYSICAL_ALIGN - 1)) \
				& ~(CONFIG_PHYSICAL_ALIGN - 1))
```
Now that we have considered the parameters of the `choose_random_location` function, let's look at its implementation. This function starts by checking for the `nokaslr` option in the kernel command line:
```C
if (cmdline_find_option_bool("nokaslr")) {
	warn("KASLR disabled: 'nokaslr' on cmdline.");
	return;
}
```

and if the option was given, we exit from the `choose_random_location` function and the kernel load address will not be randomized. The related command line options can be found in the [kernel documentation](https://github.com/torvalds/linux/blob/v4.16/Documentation/admin-guide/kernel-parameters.rst):

```
kaslr/nokaslr [X86]

Enable/disable kernel and module base offset ASLR
(Address Space Layout Randomization) if built into
the kernel. When CONFIG_HIBERNATION is selected,
kASLR is disabled by default. When kASLR is enabled,
hibernation will be disabled.
```

Let's assume that we didn't pass `nokaslr` to the kernel command line and the `CONFIG_RANDOMIZE_BASE` kernel configuration option is enabled. In this case we add the `kASLR` flag to the kernel load flags:
```C
boot_params->hdr.loadflags |= KASLR_FLAG;
```

and the next step is the call of the:

```C
initialize_identity_maps();
```

function, which is defined in the [arch/x86/boot/compressed/kaslr_64.c](https://github.com/torvalds/linux/blob/v4.16/arch/x86/boot/compressed/kaslr_64.c) source code file. This function starts with the initialization of `mapping_info`, an instance of the `x86_mapping_info` structure:
```C
mapping_info.alloc_pgt_page = alloc_pgt_page;
mapping_info.context = &pgt_data;
mapping_info.page_flag = __PAGE_KERNEL_LARGE_EXEC | sev_me_mask;
mapping_info.kernpg_flag = _KERNPG_TABLE;
```

The `x86_mapping_info` structure is defined in the [arch/x86/include/asm/init.h](https://github.com/torvalds/linux/blob/v4.16/arch/x86/include/asm/init.h) header file and looks like:

```C
struct x86_mapping_info {
	void *(*alloc_pgt_page)(void *);
	void *context;
	unsigned long page_flag;
	unsigned long offset;
	bool direct_gbpages;
	unsigned long kernpg_flag;
};
```

This structure provides information about memory mappings. As you may remember from the previous part, we have already set up initial page tables covering memory from 0 up to `4G`. Now we may need to access memory above `4G` to load the kernel at a random position, so the `initialize_identity_maps` function initializes a memory region for possibly needed new page tables. First of all, let's look at the fields of the `x86_mapping_info` structure.

The `alloc_pgt_page` field is a callback function that will be called to allocate space for a page table entry. The `context` field, an instance of the `alloc_pgt_data` structure in our case, will be used to track allocated page tables. The `page_flag` and `kernpg_flag` fields are page flags. The first represents flags for `PMD` or `PUD` entries. The second, `kernpg_flag`, represents flags for kernel pages, which can be overridden later. The `direct_gbpages` field represents support for huge pages, and the last field, `offset`, represents the offset between kernel virtual addresses and physical addresses up to the `PMD` level.
The `alloc_pgt_page` callback just validates that there is space for a new page, allocates a new page:

```C
entry = pages->pgt_buf + pages->pgt_buf_offset;
pages->pgt_buf_offset += PAGE_SIZE;
```

in the buffer from the:

```C
struct alloc_pgt_data {
	unsigned char *pgt_buf;
	unsigned long pgt_buf_size;
	unsigned long pgt_buf_offset;
};
```

structure and returns the address of the new page. The last goal of the `initialize_identity_maps` function is to initialize `pgt_buf_size` and `pgt_buf_offset`. As we are only in the initialization phase, the `initialize_identity_maps` function sets `pgt_buf_offset` to zero:

```C
pgt_data.pgt_buf_offset = 0;
```

The `pgt_data.pgt_buf_size` will be set to `77824` or `69632` depending on which boot protocol was used by the bootloader (64-bit or 32-bit). The same applies to `pgt_data.pgt_buf`. If a bootloader loaded the kernel at `startup_32`, `pgt_data.pgt_buf` will point to the end of the page table which was already initialized in [arch/x86/boot/compressed/head_64.S](https://github.com/torvalds/linux/blob/v4.16/arch/x86/boot/compressed/head_64.S):

```C
pgt_data.pgt_buf = _pgtable + BOOT_INIT_PGT_SIZE;
```

where `_pgtable` points to the beginning of this page table [_pgtable](https://github.com/torvalds/linux/blob/v4.16/arch/x86/boot/compressed/vmlinux.lds.S). On the other hand, if a bootloader used the 64-bit boot protocol and loaded the kernel at `startup_64`, the early page tables should have been built by the bootloader itself, and `_pgtable` will just be overwritten:

```C
pgt_data.pgt_buf = _pgtable
```

As the buffer for new page tables is initialized, we may return back to the `choose_random_location` function.
Avoid reserved memory ranges
--------------------------------------------------------------------------------

After the stuff related to identity page tables is initialized, we may start to choose a random location where to put the decompressed kernel image. But as you may guess, we can't choose just any address. There are some reserved address ranges in memory, occupied by important things like the [initrd](https://en.wikipedia.org/wiki/Initial_ramdisk), the kernel command line, etc. The

```C
mem_avoid_init(input, input_size, *output);
```

function will help us with this. All unsafe memory regions will be collected in the:

```C
struct mem_vector {
	unsigned long long start;
	unsigned long long size;
};

static struct mem_vector mem_avoid[MEM_AVOID_MAX];
```

array, where `MEM_AVOID_MAX` is from the `mem_avoid_index` [enum](https://en.wikipedia.org/wiki/Enumerated_type#C) which represents the different types of reserved memory regions:
```C
enum mem_avoid_index {
	MEM_AVOID_ZO_RANGE = 0,
	MEM_AVOID_INITRD,
	MEM_AVOID_CMDLINE,
	MEM_AVOID_BOOTPARAMS,
	MEM_AVOID_MEMMAP_BEGIN,
	MEM_AVOID_MEMMAP_END = MEM_AVOID_MEMMAP_BEGIN + MAX_MEMMAP_REGIONS - 1,
	MEM_AVOID_MAX,
};
```

Both are defined in the [arch/x86/boot/compressed/kaslr.c](https://github.com/torvalds/linux/blob/v4.16/arch/x86/boot/compressed/kaslr.c) source code file.
Let's look at the implementation of the `mem_avoid_init` function. The main goal of this function is to store information about the reserved memory regions described by the `mem_avoid_index` enum in the `mem_avoid` array and to create new pages for such regions in our new identity mapped buffer. The numerous parts of the `mem_avoid_init` function are similar, so let's take a look at one of them:

```C
mem_avoid[MEM_AVOID_ZO_RANGE].start = input;
mem_avoid[MEM_AVOID_ZO_RANGE].size = (output + init_size) - input;
add_identity_map(mem_avoid[MEM_AVOID_ZO_RANGE].start,
		 mem_avoid[MEM_AVOID_ZO_RANGE].size);
```

At the beginning, the `mem_avoid_init` function tries to avoid the memory region that is used for the current kernel decompression. We fill an entry of the `mem_avoid` array with the start and size of such a region and call the `add_identity_map` function, which builds identity mapped pages for this region. The `add_identity_map` function is defined in the [arch/x86/boot/compressed/kaslr_64.c](https://github.com/torvalds/linux/blob/v4.16/arch/x86/boot/compressed/kaslr_64.c) source code file and looks like:
```C
void add_identity_map(unsigned long start, unsigned long size)
{
	unsigned long end = start + size;

	start = round_down(start, PMD_SIZE);
	end = round_up(end, PMD_SIZE);
	if (start >= end)
		return;

	kernel_ident_mapping_init(&mapping_info, (pgd_t *)top_level_pgt,
				  start, end);
}
```

As you may see, it aligns the memory region to a 2 megabyte boundary and checks the given start and end addresses.

In the end it just calls the `kernel_ident_mapping_init` function from the [arch/x86/mm/ident_map.c](https://github.com/torvalds/linux/blob/v4.16/arch/x86/mm/ident_map.c) source code file and passes the `mapping_info` instance that was initialized above, the address of the top level page table and the addresses of the memory region for which the new identity mapping should be built.
The `kernel_ident_mapping_init` function sets default flags for new pages if they were not given:

```C
if (!info->kernpg_flag)
	info->kernpg_flag = _KERNPG_TABLE;
```

and starts to build new 2 megabyte (because of the `PSE` bit in `mapping_info.page_flag`) page entries (`PGD -> P4D -> PUD -> PMD` in the case of [five-level page tables](https://lwn.net/Articles/717293/) or `PGD -> PUD -> PMD` in the case of [four-level page tables](https://lwn.net/Articles/117749/)) related to the given addresses.
```C
for (; addr < end; addr = next) {
	p4d_t *p4d;

	next = (addr & PGDIR_MASK) + PGDIR_SIZE;
	if (next > end)
		next = end;

	p4d = (p4d_t *)info->alloc_pgt_page(info->context);
	result = ident_p4d_init(info, p4d, addr, next);
	if (result)
		return result;
}
```

First of all, here we find the next entry of the `Page Global Directory` for the given address, and if it is greater than the `end` of the given memory region, we set it to `end`. After this we allocate a new page with our `x86_mapping_info` callback that we considered above and call the `ident_p4d_init` function. The `ident_p4d_init` function does the same, but for the lower-level page directories (`p4d` -> `pud` -> `pmd`).

That's all.

New page entries related to the reserved addresses are now in our page tables. This is not the end of the `mem_avoid_init` function, but the other parts are similar. It just builds pages for the [initrd](https://en.wikipedia.org/wiki/Initial_ramdisk), the kernel command line, etc.

Now we may return back to the `choose_random_location` function.
Physical address randomization
--------------------------------------------------------------------------------

After the reserved memory regions have been stored in the `mem_avoid` array and identity mapping pages have been built for them, we select the minimal available address of the random memory region to decompress the kernel to:

```C
min_addr = min(*output, 512UL << 20);
```

As you may see, it should be smaller than `512` megabytes. This `512` megabyte value was selected just to avoid unknown things in lower memory.

The next step is to select random physical and virtual addresses where the kernel will be loaded. The first is the physical address:

```C
random_addr = find_random_phys_addr(min_addr, output_size);
```

The `find_random_phys_addr` function is defined in the [same](https://github.com/torvalds/linux/blob/v4.16/arch/x86/boot/compressed/kaslr.c) source code file:
```C
static unsigned long find_random_phys_addr(unsigned long minimum,
					   unsigned long image_size)
{
	minimum = ALIGN(minimum, CONFIG_PHYSICAL_ALIGN);

	if (process_efi_entries(minimum, image_size))
		return slots_fetch_random();

	process_e820_entries(minimum, image_size);
	return slots_fetch_random();
}
```

The main goal of the `process_efi_entries` function is to find all suitable memory ranges in fully accessible memory where the kernel can be loaded. If the kernel was compiled and run on a system without [EFI](https://en.wikipedia.org/wiki/Unified_Extensible_Firmware_Interface) support, we continue to search for such memory regions in the [e820](https://en.wikipedia.org/wiki/E820) regions. All found memory regions will be stored in the

```C
struct slot_area {
	unsigned long addr;
	int num;
};

#define MAX_SLOT_AREA 100

static struct slot_area slot_areas[MAX_SLOT_AREA];
```

array. The kernel decompressor will select a random slot from this array for the kernel to be decompressed to. The selection is done by the `slots_fetch_random` function, whose main goal is to select a random memory range from the `slot_areas` array via the `kaslr_get_random_long` function:

```C
slot = kaslr_get_random_long("Physical") % slot_max;
```

The `kaslr_get_random_long` function is defined in the [arch/x86/lib/kaslr.c](https://github.com/torvalds/linux/blob/v4.16/arch/x86/lib/kaslr.c) source code file and just returns a random number. Note that the random number is obtained in different ways depending on the kernel configuration and system capabilities (based on the [time stamp counter](https://en.wikipedia.org/wiki/Time_Stamp_Counter), [rdrand](https://en.wikipedia.org/wiki/RdRand), and so on).

That's all: from this point, a random memory range is selected.
Virtual address randomization
--------------------------------------------------------------------------------

After a random memory region has been selected by the kernel decompressor, new identity mapped pages are built for this region on demand:

```C
random_addr = find_random_phys_addr(min_addr, output_size);

if (*output != random_addr) {
	add_identity_map(random_addr, output_size);
	*output = random_addr;
}
```

From this time, `output` stores the base address of the memory region where the kernel will be decompressed. But at this moment, as you may remember, we have randomized only the physical address. The virtual address should be randomized too in the case of the [x86_64](https://en.wikipedia.org/wiki/X86-64) architecture:

```C
if (IS_ENABLED(CONFIG_X86_64))
	random_addr = find_random_virt_addr(LOAD_PHYSICAL_ADDR, output_size);

*virt_addr = random_addr;
```

As you may see, in the case of a non-`x86_64` architecture, the randomized virtual address will coincide with the randomized physical address. The `find_random_virt_addr` function calculates the number of virtual memory ranges that may hold the kernel image and calls the `kaslr_get_random_long` function that we already saw when we were trying to find a random `physical` address.

From this moment we have both a randomized base physical (`*output`) and virtual (`*virt_addr`) address for the decompressed kernel.
That's all.

Conclusion
--------------------------------------------------------------------------------

This is the end of the sixth and last part about the Linux kernel booting process. There will be no more posts about kernel booting (though perhaps updates to this and previous posts), but there will be many posts about other kernel internals.

The next chapter will be about kernel initialization and we will see the first steps in the Linux kernel initialization code.

If you have any questions or suggestions, write me a comment or ping me on [twitter](https://twitter.com/0xAX).

**Please note that English is not my first language, and I am really sorry for any inconvenience. If you find any mistakes, please send me a PR to [linux-insides](https://github.com/0xAX/linux-internals).**
Links
--------------------------------------------------------------------------------

* [Address space layout randomization](https://en.wikipedia.org/wiki/Address_space_layout_randomization)
* [Linux kernel boot protocol](https://github.com/torvalds/linux/blob/v4.16/Documentation/x86/boot.txt)
* [long mode](https://en.wikipedia.org/wiki/Long_mode)
* [initrd](https://en.wikipedia.org/wiki/Initial_ramdisk)
* [Enumerated type](https://en.wikipedia.org/wiki/Enumerated_type#C)
* [four-level page tables](https://lwn.net/Articles/117749/)
* [five-level page tables](https://lwn.net/Articles/717293/)
* [EFI](https://en.wikipedia.org/wiki/Unified_Extensible_Firmware_Interface)
* [e820](https://en.wikipedia.org/wiki/E820)
* [time stamp counter](https://en.wikipedia.org/wiki/Time_Stamp_Counter)
* [rdrand](https://en.wikipedia.org/wiki/RdRand)
* [x86_64](https://en.wikipedia.org/wiki/X86-64)
* [Previous part](linux-bootstrap-5.md)
Notification Chains in Linux Kernel
================================================================================

Introduction
--------------------------------------------------------------------------------

The Linux kernel is a huge piece of [C](https://en.wikipedia.org/wiki/C_(programming_language)) code which consists of many different subsystems. Each subsystem has its own purpose, independent of the other subsystems. But often one subsystem wants to know something from other subsystem(s). There is a special mechanism in the Linux kernel which partly solves this problem. The name of this mechanism is `notification chains`, and its main purpose is to provide a way for different subsystems to subscribe to asynchronous events from other subsystems. Note that this mechanism is only for communication inside the kernel; there are other mechanisms for communication between the kernel and userspace.

Before we consider the `notification chains` [API](https://en.wikipedia.org/wiki/Application_programming_interface) and the implementation of this API, let's look at the `notification chains` mechanism from the theoretical side, as we did in other parts of this book. Everything related to the `notification chains` mechanism is located in the [include/linux/notifier.h](https://github.com/torvalds/linux/blob/master/include/linux/notifier.h) header file and the [kernel/notifier.c](https://github.com/torvalds/linux/blob/master/kernel/notifier.c) source code file. So let's open them and start to dive in.

Notification Chains related data structures
--------------------------------------------------------------------------------

Let's start to consider the `notification chains` mechanism with its related data structures. As I wrote above, the main data structures are located in the [include/linux/notifier.h](https://github.com/torvalds/linux/blob/master/include/linux/notifier.h) header file, so the Linux kernel provides a generic API which does not depend on a certain architecture. In general, the `notification chains` mechanism represents a list (that's why it is named `chains`) of [callback](https://en.wikipedia.org/wiki/Callback_(computer_programming)) functions which will be executed when an event occurs.
All of these callback functions are represented by the `notifier_fn_t` type in the Linux kernel:

```C
typedef	int (*notifier_fn_t)(struct notifier_block *nb, unsigned long action, void *data);
```

So we may see that it takes the three following arguments:

* `nb` - a pointer into the linked list of registered callbacks (we will see it shortly);
* `action` - the type of the event. A notification chain may support multiple events, so we need this parameter to distinguish one event from the others;
* `data` - storage for private information. Actually, it allows providing additional information about the event.

Additionally, we may see that `notifier_fn_t` returns an integer value, which may be one of:

* `NOTIFY_DONE` - the subscriber is not interested in the notification;
* `NOTIFY_OK` - the notification was processed correctly;
* `NOTIFY_BAD` - something went wrong;
* `NOTIFY_STOP` - the notification is done, but no further callbacks should be called for this event.

All of these results are defined as macros in the [include/linux/notifier.h](https://github.com/torvalds/linux/blob/master/include/linux/notifier.h) header file:
```C
#define NOTIFY_DONE		0x0000
#define NOTIFY_OK		0x0001
#define NOTIFY_BAD		(NOTIFY_STOP_MASK|0x0002)
#define NOTIFY_STOP		(NOTIFY_OK|NOTIFY_STOP_MASK)
```

Where `NOTIFY_STOP_MASK` is represented by the:

```C
#define NOTIFY_STOP_MASK	0x8000
```

macro and means that no further callbacks will be called for this notification.

Each part of the Linux kernel which wants to be notified of a certain event should provide its own `notifier_fn_t` callback function. The main role of the `notification chains` mechanism is to call certain callbacks when an asynchronous event occurs.
The main building block of the `notification chains` mechanism is the `notifier_block` structure:

```C
struct notifier_block {
	notifier_fn_t notifier_call;
	struct notifier_block __rcu *next;
	int priority;
};
```

which is defined in the [include/linux/notifier.h](https://github.com/torvalds/linux/blob/master/include/linux/notifier.h) file. This struct contains a pointer to the callback function (`notifier_call`), a link to the next notification callback and the `priority` of the callback function, as functions with higher priority are executed first.
The Linux kernel provides the four following types of notification chains:

* Blocking notifier chains;
* SRCU notifier chains;
* Atomic notifier chains;
* Raw notifier chains.

Let's consider all of these types of notification chains in order.

In the first case, for the `blocking notifier chains`, callbacks are called/executed in process context. This means that a call in a notification chain may block.

The second type, `SRCU notifier chains`, represents an alternative form of `blocking notifier chains`. While blocking notifier chains use the `rw_semaphore` synchronization primitive to protect the chain links, `SRCU` notifier chains also run in process context, but use a special form of the [RCU](https://en.wikipedia.org/wiki/Read-copy-update) mechanism which permits blocking in a read-side critical section.

In the third case, the `atomic notifier chains` run in interrupt or atomic context and are protected by the [spinlock](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/linux-sync-1.html) synchronization primitive. The last type, the `raw notifier chains`, provides a kind of notifier chain without any locking restrictions on callbacks. This means that protection rests on the shoulders of the caller. It is very useful when we want to protect our chain with a very specific locking mechanism.

If we look at the implementation of the `notifier_block` structure, we see that it contains a pointer to the `next` element of a notification chain list, but we have no head. Actually, the head of such a list is in a separate structure which depends on the type of the notification chain. For example, for the `blocking notifier chains`:
```C
struct blocking_notifier_head {
	struct rw_semaphore rwsem;
	struct notifier_block __rcu *head;
};
```

or for `atomic notification chains`:

```C
struct atomic_notifier_head {
	spinlock_t lock;
	struct notifier_block __rcu *head;
};
```
Now that we know a little about the `notification chains` mechanism, let's consider the implementation of its API.

Notification Chains
--------------------------------------------------------------------------------

Usually there are two sides in a publish/subscribe mechanism: one side that wants to get notifications and the other side(s) that generate them. We will consider the notification chains mechanism from both sides, looking at `blocking notification chains` in this part, because the other types of notification chains are similar to it and differ mostly in their protection mechanisms.

Before a notification producer is able to produce notifications, it should first of all initialize the head of a notification chain. For example, let's consider the notification chain related to kernel [loadable modules](https://en.wikipedia.org/wiki/Loadable_kernel_module). If we look in the [kernel/module.c](https://github.com/torvalds/linux/blob/master/kernel/module.c) source code file, we will see the following definition:
```C
static BLOCKING_NOTIFIER_HEAD(module_notify_list);
```
which defines the head of the loadable modules blocking notifier chain. The `BLOCKING_NOTIFIER_HEAD` macro is defined in the [include/linux/notifier.h](https://github.com/torvalds/linux/blob/master/include/linux/notifier.h) header file and statically initializes such a head. A head may also be initialized at runtime with the related `BLOCKING_INIT_NOTIFIER_HEAD` macro, which expands to the following code:

```C
#define BLOCKING_INIT_NOTIFIER_HEAD(name) do {	\
		init_rwsem(&(name)->rwsem);	\
		(name)->head = NULL;		\
	} while (0)
```
So we may see that it takes the name of a head of a blocking notifier chain, initializes the read/write [semaphore](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/linux-sync-3.html) and sets the head to `NULL`. Besides the `BLOCKING_INIT_NOTIFIER_HEAD` macro, the Linux kernel additionally provides the `ATOMIC_INIT_NOTIFIER_HEAD` and `RAW_INIT_NOTIFIER_HEAD` macros and the `srcu_init_notifier_head` function for the initialization of atomic and other types of notification chains.
After the initialization of the head of a notification chain, a subsystem which wants to receive notifications from the given notification chain should register with a certain function which depends on the type of notification. If you look in the [include/linux/notifier.h](https://github.com/torvalds/linux/blob/master/include/linux/notifier.h) header file, you will see the following four functions for this:
```C
extern int atomic_notifier_chain_register(struct atomic_notifier_head *nh,
		struct notifier_block *nb);

extern int blocking_notifier_chain_register(struct blocking_notifier_head *nh,
		struct notifier_block *nb);

extern int raw_notifier_chain_register(struct raw_notifier_head *nh,
		struct notifier_block *nb);

extern int srcu_notifier_chain_register(struct srcu_notifier_head *nh,
		struct notifier_block *nb);
```
As I already wrote above, we will cover only blocking notification chains in this part, so let's consider the implementation of the `blocking_notifier_chain_register` function. The implementation of this function is located in the [kernel/notifier.c](https://github.com/torvalds/linux/blob/master/kernel/notifier.c) source code file and, as we may see, `blocking_notifier_chain_register` takes two parameters:

* `nh` - head of a notification chain;
* `nb` - notification descriptor.

Its implementation looks like this:

```C
int blocking_notifier_chain_register(struct blocking_notifier_head *nh,
		struct notifier_block *n)
{
	int ret;

	if (unlikely(system_state == SYSTEM_BOOTING))
		return notifier_chain_register(&nh->head, n);

	down_write(&nh->rwsem);
	ret = notifier_chain_register(&nh->head, n);
	up_write(&nh->rwsem);
	return ret;
}
```

For comparison, the `raw_notifier_chain_register` function from the same source code file just returns the result of the `notifier_chain_register` function directly, without any locking:

```C
int raw_notifier_chain_register(struct raw_notifier_head *nh,
		struct notifier_block *n)
{
	return notifier_chain_register(&nh->head, n);
}
```
As we may see, the implementation of `blocking_notifier_chain_register` is pretty simple. First of all there is a check of the current system state: if the system is still booting, we just call `notifier_chain_register` directly, since only one CPU is running and no protection is needed. Otherwise we make the same call to `notifier_chain_register`, but protected by the read/write semaphore. In both cases the `notifier_chain_register` function does all the job for us. Now let's look at its implementation:
```C
static int notifier_chain_register(struct notifier_block **nl,
		struct notifier_block *n)
{
	while ((*nl) != NULL) {
		if (n->priority > (*nl)->priority)
			break;
		nl = &((*nl)->next);
	}
	n->next = *nl;
	rcu_assign_pointer(*nl, n);
	return 0;
}
```
This function just inserts the new `notifier_block` (given by a subsystem which wants to get notifications) into the notification chain list, keeping the list sorted by descending priority. Besides subscribing to events, a subscriber may unsubscribe from certain events with the corresponding set of `unregister` functions:
```C
extern int atomic_notifier_chain_unregister(struct atomic_notifier_head *nh,
		struct notifier_block *nb);

extern int blocking_notifier_chain_unregister(struct blocking_notifier_head *nh,
		struct notifier_block *nb);

extern int raw_notifier_chain_unregister(struct raw_notifier_head *nh,
		struct notifier_block *nb);

extern int srcu_notifier_chain_unregister(struct srcu_notifier_head *nh,
		struct notifier_block *nb);
```
When a producer of notifications wants to notify subscribers about an event, the `*_notifier_call_chain` function will be called. As you may already guess, each type of notification chain provides its own function to produce notifications:
```C
extern int atomic_notifier_call_chain(struct atomic_notifier_head *nh,
		unsigned long val, void *v);

extern int blocking_notifier_call_chain(struct blocking_notifier_head *nh,
		unsigned long val, void *v);

extern int raw_notifier_call_chain(struct raw_notifier_head *nh,
		unsigned long val, void *v);

extern int srcu_notifier_call_chain(struct srcu_notifier_head *nh,
		unsigned long val, void *v);
```
Let's consider the implementation of the `blocking_notifier_call_chain` function, which is defined in the [kernel/notifier.c](https://github.com/torvalds/linux/blob/master/kernel/notifier.c) source code file:
```C
int blocking_notifier_call_chain(struct blocking_notifier_head *nh,
		unsigned long val, void *v)
{
	return __blocking_notifier_call_chain(nh, val, v, -1, NULL);
}
```
and as we may see it just returns the result of the `__blocking_notifier_call_chain` function. The `blocking_notifier_call_chain` function itself takes three parameters:

* `nh` - head of the notification chain list;
* `val` - type of the notification;
* `v` - input parameter which may be used by the handlers.

But the `__blocking_notifier_call_chain` function takes five parameters:
```C
int __blocking_notifier_call_chain(struct blocking_notifier_head *nh,
		unsigned long val, void *v,
		int nr_to_call, int *nr_calls)
{
	...
	...
	...
}
```
Where `nr_to_call` and `nr_calls` are the number of notifier functions to be called and the number of sent notifications, respectively. As you may guess, the main goal of the `__blocking_notifier_call_chain` function, and of the analogous functions for the other notification types, is to call the callback functions when an event occurs. The implementation of `__blocking_notifier_call_chain` is pretty simple: it just calls the `notifier_call_chain` function from the same source code file, protected by the read/write semaphore:
```C
int __blocking_notifier_call_chain(struct blocking_notifier_head *nh,
		unsigned long val, void *v,
		int nr_to_call, int *nr_calls)
{
	int ret = NOTIFY_DONE;

	if (rcu_access_pointer(nh->head)) {
		down_read(&nh->rwsem);
		ret = notifier_call_chain(&nh->head, val, v, nr_to_call,
					  nr_calls);
		up_read(&nh->rwsem);
	}
	return ret;
}
```
and returns its result. In this case all the job is done by the `notifier_call_chain` function, whose main purpose is to inform the registered notifiers about an asynchronous event:
```C
static int notifier_call_chain(struct notifier_block **nl,
			       unsigned long val, void *v,
			       int nr_to_call, int *nr_calls)
{
	...
	...
	...
	ret = nb->notifier_call(nb, val, v);
	...
	...
	...
	return ret;
}
```
That's all. In general it all looks pretty simple.
Now let's look at a simple example related to [loadable modules](https://en.wikipedia.org/wiki/Loadable_kernel_module). As we already saw in this part, the [kernel/module.c](https://github.com/torvalds/linux/blob/master/kernel/module.c) source code file contains:
```C
static BLOCKING_NOTIFIER_HEAD(module_notify_list);
```
the definition of `module_notify_list`, which determines the head of the blocking notifier chain related to kernel modules. There are at least the three following events:

* MODULE_STATE_LIVE
* MODULE_STATE_COMING
* MODULE_STATE_GOING

in which some subsystems of the Linux kernel may be interested, for example for tracing the state of kernel modules. Instead of direct calls to `atomic_notifier_chain_register`, `blocking_notifier_chain_register` etc., most notification chains come with a set of wrappers used to register with them. Registration for these module events is done with the help of such a wrapper:
```C
int register_module_notifier(struct notifier_block *nb)
{
	return blocking_notifier_chain_register(&module_notify_list, nb);
}
```
If we look in the [kernel/tracepoint.c](https://github.com/torvalds/linux/blob/master/kernel/tracepoint.c) source code file, we will see such a registration during the initialization of [tracepoints](https://www.kernel.org/doc/Documentation/trace/tracepoints.txt):
```C
static __init int init_tracepoints(void)
{
	int ret;

	ret = register_module_notifier(&tracepoint_module_nb);
	if (ret)
		pr_warn("Failed to register tracepoint module enter notifier\n");

	return ret;
}
```
Where `tracepoint_module_nb` provides the callback function:
```C
static struct notifier_block tracepoint_module_nb = {
	.notifier_call = tracepoint_module_notify,
	.priority = 0,
};
```
The `tracepoint_module_notify` callback function will be called when one of the `MODULE_STATE_LIVE`, `MODULE_STATE_COMING` or `MODULE_STATE_GOING` events occurs. For example, the `MODULE_STATE_LIVE` and `MODULE_STATE_COMING` notifications will be sent during execution of the [init_module](http://man7.org/linux/man-pages/man2/init_module.2.html) [system call](https://0xax.gitbooks.io/linux-insides/content/SysCall/linux-syscall-1.html), and `MODULE_STATE_GOING` will be sent during execution of the [delete_module](http://man7.org/linux/man-pages/man2/delete_module.2.html) `system call`:
```C
SYSCALL_DEFINE2(delete_module, const char __user *, name_user,
		unsigned int, flags)
{
	...
	...
	...
	blocking_notifier_call_chain(&module_notify_list,
				     MODULE_STATE_GOING, mod);
	...
	...
	...
}
```
Thus, when one of these system calls is invoked from userspace, the Linux kernel will send the corresponding notification and the `tracepoint_module_notify` callback function will be called.

That's all.
Links
--------------------------------------------------------------------------------

* [C programming language](https://en.wikipedia.org/wiki/C_(programming_language))
* [API](https://en.wikipedia.org/wiki/Application_programming_interface)
* [callback](https://en.wikipedia.org/wiki/Callback_(computer_programming))
* [RCU](https://en.wikipedia.org/wiki/Read-copy-update)
* [spinlock](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/linux-sync-1.html)
* [loadable modules](https://en.wikipedia.org/wiki/Loadable_kernel_module)
* [semaphore](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/linux-sync-3.html)
* [tracepoints](https://www.kernel.org/doc/Documentation/trace/tracepoints.txt)
* [system call](https://0xax.gitbooks.io/linux-insides/content/SysCall/linux-syscall-1.html)
* [init_module system call](http://man7.org/linux/man-pages/man2/init_module.2.html)
* [delete_module](http://man7.org/linux/man-pages/man2/delete_module.2.html)
* [previous part](https://0xax.gitbooks.io/linux-insides/content/Concepts/linux-cpu-3.html)
FROM lrx0014/gitbook:3.2.3
COPY ./ /srv/gitbook/
EXPOSE 4000
# Interrupts and Interrupt Handling

In the following posts, we will cover interrupts and exception handling in the Linux kernel.

* [Interrupts and Interrupt Handling. Part 1.](linux-interrupts-1.md) - describes interrupts and interrupt handling theory.
* [Interrupts in the Linux Kernel](linux-interrupts-2.md) - describes interrupt and exception handling from the early stage.
* [Early interrupt handlers](linux-interrupts-3.md) - describes early interrupt handlers.
* [Interrupt handlers](linux-interrupts-4.md) - describes the first non-early interrupt handlers.
* [Implementation of exception handlers](linux-interrupts-5.md) - describes the implementation of some exception handlers such as double fault, divide by zero etc.
* [Handling non-maskable interrupts](linux-interrupts-6.md) - describes handling of non-maskable interrupts and the remaining interrupt handlers from the architecture-specific part.
* [External hardware interrupts](linux-interrupts-7.md) - describes early initialization of code which is related to handling external hardware interrupts.
* [Non-early initialization of the IRQs](linux-interrupts-8.md) - describes non-early initialization of code which is related to handling external hardware interrupts.
* [Softirq, Tasklets and Workqueues](linux-interrupts-9.md) - describes the softirq, tasklet and workqueue concepts.
* [Last part](linux-interrupts-10.md) - this is the last part of the `Interrupts and Interrupt Handling` chapter; here we will look at a real hardware driver and some related interrupt material.
# Scripts

## Description

`get_all_links.py` : checks whether each link in the book is live or dead, using a network connection

`latex.sh` : a script for converting the Markdown files in each of the subdirectories into a unified PDF typeset in LaTeX

## Usage

`get_all_links.py` :

```
./get_all_links.py ../
```

`latex.sh` :

```
./latex.sh
```
#!/usr/bin/env python

from __future__ import print_function
from socket import timeout

import os
import sys
import codecs
import re

import markdown

try:
    # compatible with python2
    from urllib2 import urlopen
    from urllib2 import HTTPError
    from urllib2 import URLError
except ImportError:
    # compatible with python3
    from urllib.request import urlopen
    from urllib.error import HTTPError
    from urllib.error import URLError


def check_live_url(url):
    result = False
    try:
        ret = urlopen(url, timeout=2)
        result = (ret.code == 200)
    except HTTPError as e:
        print(e, file=sys.stderr)
    except URLError as e:
        print(e, file=sys.stderr)
    except timeout as e:
        print(e, file=sys.stderr)
    except Exception as e:
        print(e, file=sys.stderr)

    return result


def main(path):
    # collect all markdown files under the given path
    filenames = []
    for (dirpath, dnames, fnames) in os.walk(path):
        for fname in fnames:
            if fname.endswith('.md'):
                filenames.append(os.sep.join([dirpath, fname]))

    # extract every unique link from the rendered markdown
    urls = []
    for filename in filenames:
        with codecs.open(filename, mode="r", encoding="utf-8") as fd:
            for line in fd.readlines():
                refs = re.findall(r'(?<=<a href=")[^"]*', markdown.markdown(line))
                for ref in refs:
                    if ref not in urls:
                        urls.append(ref)

    # live links go to stdout, dead links to stderr
    for url in urls:
        if not url.startswith("http"):
            print("markdown file name: " + url)
            continue
        if check_live_url(url):
            print(url)
        else:
            print(url, file=sys.stderr)


if __name__ == '__main__':
    if len(sys.argv) == 2:
        main(sys.argv[1])
    else:
        print("Choose one path as argument one")
#!/bin/bash
# latex.sh
# A script for converting Markdown files in each of the subdirectories into a unified PDF typeset in LaTeX.
# Requires TeX Live, Pandoc templates and pdfunite. Not necessary if you just want to read the PDF, only if you're compiling it yourself.

rm -r build
mkdir build
for D in $(ls ../); do
    if [ -d "../${D}" ]
    then
        echo "Converting $D . . ."
        pandoc ../$D/README.md ../$D/linux-*.md -o build/$D.tex --template default
    fi
done

cd ./build
for f in *.tex
do
    pdflatex -interaction=nonstopmode $f
done

cd ../
pandoc ../README.md ../SUMMARY.md ../CONTRIBUTING.md ../contributors.md \
    -o ./build/Preface.tex --template default

pdfunite ./build/*.pdf LinuxKernelInsides.pdf
Synchronization primitives in the Linux kernel. Part 2.
================================================================================

Queued Spinlocks
--------------------------------------------------------------------------------

This is the second part of the [chapter](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/index.html) which describes synchronization primitives in the Linux kernel. In the first [part](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/linux-sync-1.html) of this chapter we met the first of them - the [spinlock](https://en.wikipedia.org/wiki/Spinlock). We will continue to learn about this synchronization primitive here. If you have read the previous part, you may remember that besides normal spinlocks, the Linux kernel provides a special type of `spinlocks` - `queued spinlocks`. In this part we will try to understand what this concept represents.
We saw the [API](https://en.wikipedia.org/wiki/Application_programming_interface) of `spinlock` in the previous [part](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/linux-sync-1.html):

* `spin_lock_init` - produces initialization of the given `spinlock`;
* `spin_lock` - acquires the given `spinlock`;
* `spin_lock_bh` - disables software [interrupts](https://en.wikipedia.org/wiki/Interrupt) and acquires the given `spinlock`;
* `spin_lock_irqsave` and `spin_lock_irq` - disable interrupts on the local processor and preserve/do not preserve the previous interrupt state in `flags`;
* `spin_unlock` - releases the given `spinlock`;
* `spin_unlock_bh` - releases the given `spinlock` and enables software interrupts;
* `spin_is_locked` - returns the state of the given `spinlock`;
* etc.

And we know that all of these macros, which are defined in the [include/linux/spinlock.h](https://github.com/torvalds/linux/blob/master/include/linux/spinlock.h) header file, expand to calls of functions with the `arch_*` prefix from [include/asm-generic/qspinlock.h](https://github.com/torvalds/linux/blob/master/include/asm-generic/qspinlock.h):
```C
#define arch_spin_is_locked(l)		queued_spin_is_locked(l)
#define arch_spin_is_contended(l)	queued_spin_is_contended(l)
#define arch_spin_value_unlocked(l)	queued_spin_value_unlocked(l)
#define arch_spin_lock(l)		queued_spin_lock(l)
#define arch_spin_trylock(l)		queued_spin_trylock(l)
#define arch_spin_unlock(l)		queued_spin_unlock(l)
```
Before we consider how queued spinlocks and their [API](https://en.wikipedia.org/wiki/Application_programming_interface) are implemented, let's take a look at the theory first.

Introduction to queued spinlocks
-------------------------------------------------------------------------------

Queued spinlocks are a [locking mechanism](https://en.wikipedia.org/wiki/Lock_%28computer_science%29) in the Linux kernel which is a replacement for the standard `spinlocks`, at least on the [x86_64](https://en.wikipedia.org/wiki/X86-64) architecture. If we look at the kernel configuration file [kernel/Kconfig.locks](https://github.com/torvalds/linux/blob/master/kernel/Kconfig.locks), we will see the following configuration entries:
```
config ARCH_USE_QUEUED_SPINLOCKS
	bool

config QUEUED_SPINLOCKS
	def_bool y if ARCH_USE_QUEUED_SPINLOCKS
	depends on SMP
```
This means that the `CONFIG_QUEUED_SPINLOCKS` kernel configuration option will be enabled by default if `ARCH_USE_QUEUED_SPINLOCKS` is enabled. And we may see that `ARCH_USE_QUEUED_SPINLOCKS` is enabled by default in the `x86_64` specific kernel configuration file - [arch/x86/Kconfig](https://github.com/torvalds/linux/blob/master/arch/x86/Kconfig):
```
config X86
	...
	...
	...
	select ARCH_USE_QUEUED_SPINLOCKS
	...
	...
	...
```
Before we start to consider what the queued spinlock concept is, let's look at other types of `spinlocks`. To start, let's consider how a `normal` spinlock is implemented. Usually, the implementation of a `normal` spinlock is based on the [test and set](https://en.wikipedia.org/wiki/Test-and-set) instruction. The principle of this instruction is pretty simple: it writes a value to a memory location and returns the old value from there. Together these two operations are atomic, i.e. non-interruptible. So if the first thread starts to execute this instruction, the second thread will wait until the first processor finishes. A basic lock can be built on top of this mechanism. Schematically it may look like this:
```C
int lock(lock)
{
	while (test_and_set(lock) == 1)
		;
	return 0;
}

int unlock(lock)
{
	lock = 0;

	return lock;
}
```
The first thread will execute `test_and_set`, which sets the `lock` to `1`. When the second thread calls the `lock` function, it will spin in the `while` loop until the first thread calls the `unlock` function and `lock` becomes `0`. This implementation is not very good for performance, because it has at least two problems. The first problem is that it may be unfair: a thread on one processor may have a long waiting time, even if it called `lock` before other threads which are also waiting. The second problem is that all threads which want to acquire the lock must execute many `atomic` operations like `test_and_set` on a variable in shared memory. This leads to cache invalidation, as the processor's cache will store `lock=1` while the value of `lock` in memory may no longer be `1` after a thread releases the lock.
The topic of this part is `queued spinlocks`, and this approach may help to solve both of these problems. `Queued spinlocks` allow each processor to spin on its own memory location. The basic principle of a queue-based spinlock is best understood by studying a classic queue-based spinlock implementation called the [MCS](http://www.cs.rochester.edu/~scott/papers/1991_TOCS_synch.pdf) lock. Before we look at the implementation of `queued spinlocks` in the Linux kernel, we will try to understand how an `MCS` lock works.

The basic idea of the `MCS` lock is, as I already wrote in the previous paragraph, that a thread spins on a local variable, and each processor in the system has its own copy of this variable. In other words, this concept is built on top of the [per-cpu](https://0xax.gitbooks.io/linux-insides/content/Concepts/linux-cpu-1.html) variables concept in the Linux kernel.
When the first thread wants to acquire the lock, it registers itself in the `queue`, i.e. it is added to the special `queue`, and acquires the lock, because the lock is free for now. When the second thread wants to acquire the same lock before the first thread releases it, this thread adds its own copy of the lock variable into this `queue`. In this case the `next` field of the first thread will point to the second thread. From this moment, the second thread will wait until the first thread releases its lock and notifies the `next` thread about this event. The first thread is then deleted from the `queue` and the second thread becomes the owner of the lock.
Schematically we can represent it like this:

Empty queue:

```
+---------+
|         |
|  Queue  |
|         |
+---------+
```

First thread tries to acquire a lock:

```
+---------+     +----------------------------+
|         |     |                            |
|  Queue  |---->| First thread acquired lock |
|         |     |                            |
+---------+     +----------------------------+
```

Second thread tries to acquire a lock:

```
+---------+     +--------------------------------------+     +-------------------------+
|         |     |                                      |     |                         |
|  Queue  |---->| Second thread waits for first thread |<----| First thread holds lock |
|         |     |                                      |     |                         |
+---------+     +--------------------------------------+     +-------------------------+
```

Or the pseudocode:
```C
void lock(...)
{
	lock.next = NULL;
	ancestor = put_lock_to_queue_and_return_ancestor(queue, lock);

	// if we have an ancestor, the lock is already acquired and we
	// need to wait until it is released
	if (ancestor)
	{
		lock.is_locked = 1;
		ancestor.next = lock;

		while (lock.is_locked == true)
			;
	}

	// otherwise we are the owner of the lock and may exit
}

void unlock(...)
{
	// do we need to notify somebody or are we alone in the
	// queue?
	if (lock.next != NULL) {
		// the while loop from the lock() function will
		// finish
		lock.next.is_locked = false;
	}

	// So, we have no next threads in the queue to notify about
	// the lock releasing event. Let's just put `0` into the lock,
	// delete ourselves from the queue and exit.
}
```
That's all for the theory of `queued spinlocks`; now let's consider how this mechanism is implemented in the Linux kernel. Unlike the pseudocode above, the real implementation of `queued spinlocks` looks complex and tangled, but studying it attentively will lead to success.
API of queued spinlocks
-------------------------------------------------------------------------------

Now that we know a little about `queued spinlocks` from the theoretical side, it is time to see the implementation of this mechanism in the Linux kernel. As we saw above, the [include/asm-generic/qspinlock.h](https://github.com/torvalds/linux/blob/master/include/asm-generic/qspinlock.h) header file provides a set of macros which represent the API for spinlock acquiring, releasing etc.:
```C
#define arch_spin_is_locked(l)		queued_spin_is_locked(l)
#define arch_spin_is_contended(l)	queued_spin_is_contended(l)
#define arch_spin_value_unlocked(l)	queued_spin_value_unlocked(l)
#define arch_spin_lock(l)		queued_spin_lock(l)
#define arch_spin_trylock(l)		queued_spin_trylock(l)
#define arch_spin_unlock(l)		queued_spin_unlock(l)
```
All of these macros expand to calls of functions from the same header file. Additionally, the [include/asm-generic/qspinlock_types.h](https://github.com/torvalds/linux/blob/master/include/asm-generic/qspinlock_types.h) header file provides the `qspinlock` structure which represents a queued spinlock in the Linux kernel:
```C
typedef struct qspinlock {
	union {
		atomic_t val;

		struct {
			u8	locked;
			u8	pending;
		};
		struct {
			u16	locked_pending;
			u16	tail;
		};
	};
} arch_spinlock_t;
```
The `val` field represents the state of a given `spinlock`. This `4` byte field consists of the following parts:

* `0-7` - locked byte;
* `8` - pending bit;
* `9-15` - not used;
* `16-17` - two-bit index which represents the entry of the `per-cpu` array of `MCS` locks (we will see it soon);
* `18-31` - contains the processor number which indicates the tail of the queue.
Before we move on to the `API` of `queued spinlocks`, notice that the `val` field of the `qspinlock` structure has the type `atomic_t`, which represents an atomic variable, i.e. one operation at a time. So, all operations on this field will be [atomic](https://en.wikipedia.org/wiki/Linearizability). For example, let's look at the API for reading the value of `val`:
```C
static __always_inline int queued_spin_is_locked(struct qspinlock *lock)
{
	return atomic_read(&lock->val);
}
```
Ok, now we know the data structures which represent a queued spinlock in the Linux kernel, and it is time to look at the implementation of the main function of the `queued spinlocks` [API](https://en.wikipedia.org/wiki/Application_programming_interface):
```C
#define arch_spin_lock(l)		queued_spin_lock(l)
```
Yes, this function is - `queued_spin_lock`. As we may understand from the function's name, it allows to acquire lock by the thread. This function is defined in the [include/asm-generic/qspinlock_types.h](https://github.com/torvalds/linux/blob/master/include/asm-generic/qspinlock_types.h) header file and its implementation looks:
|
||||
|
||||
```C
|
||||
static __always_inline void queued_spin_lock(struct qspinlock *lock)
|
||||
{
|
||||
u32 val;
|
||||
|
||||
val = atomic_cmpxchg_acquire(&lock->val, 0, _Q_LOCKED_VAL);
|
||||
if (likely(val == 0))
|
||||
return;
|
||||
queued_spin_lock_slowpath(lock, val);
|
||||
}
|
||||
```
|
||||
|
||||
Looks pretty easy, except the `queued_spin_lock_slowpath` function. We may see that it takes only one parameter. In our case this parameter will represent `queued spinlock` which will be locked. Let's consider the situation that `queue` with locks is empty for now and the first thread wanted to acquire lock. As we may see the `queued_spin_lock` function starts from the call of the `atomic_cmpxchg_acquire` macro. As you may guess from its name, it executes atomic [CMPXCHG](http://x86.renejeschke.de/html/file_module_x86_id_41.html) instruction. Ultimately, the `atomic_cmpxchg_acquire` macro expands to the call of the `__raw_cmpxchg` macro almost like the following:
|
||||
|
||||
```C
|
||||
#define __raw_cmpxchg(ptr, old, new, size, lock) \
|
||||
({ \
|
||||
__typeof__(*(ptr)) __ret; \
|
||||
__typeof__(*(ptr)) __old = (old); \
|
||||
__typeof__(*(ptr)) __new = (new); \
|
||||
\
|
||||
volatile u32 *__ptr = (volatile u32 *)(ptr); \
|
||||
asm volatile(lock "cmpxchgl %2,%1" \
|
||||
: "=a" (__ret), "+m" (*__ptr) \
|
||||
: "r" (__new), "0" (__old) \
|
||||
: "memory"); \
|
||||
\
|
||||
__ret; \
|
||||
})
|
||||
```

The `__raw_cmpxchg` macro compares `old` with the value which `ptr` points to and, if they are identical, stores `new` in the memory location pointed to by `ptr` and returns the initial value of this memory location. In our case, `old` is zero (the unlocked state) and `new` is `_Q_LOCKED_VAL`.

Let's get back to the `queued_spin_lock` function. Assuming that we are the first one who tried to acquire the lock, `val` will be zero and we will return from the `queued_spin_lock` function:

```C
val = atomic_cmpxchg_acquire(&lock->val, 0, _Q_LOCKED_VAL);
if (likely(val == 0))
	return;
```

So far, we have considered the uncontended case (i.e. the fast path). Now let's consider the contended case (i.e. the slow path). Suppose that a thread tried to acquire a lock, but the lock is already held; then `queued_spin_lock_slowpath` will be called. The `queued_spin_lock_slowpath` function is defined in the [kernel/locking/qspinlock.c](https://github.com/torvalds/linux/blob/master/kernel/locking/qspinlock.c) source code file:

```C
void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
{
	...
	...
	...
	if (val == _Q_PENDING_VAL) {
		int cnt = _Q_PENDING_LOOPS;
		val = atomic_cond_read_relaxed(&lock->val,
					       (VAL != _Q_PENDING_VAL) || !cnt--);
	}
	...
	...
	...
}
```

This code waits for an in-progress lock acquisition to finish, with a bounded number of spins, so that forward progress is guaranteed. Above, we saw that the lock contains a pending bit. This bit represents a thread which wants to acquire the lock while it is already held by another thread and the `queue` is empty at the same time. In this case the pending bit will be set and the `queue` will not be touched. This is done as an optimization: there is no need for the unnecessary latency which would be caused by cache invalidation when touching our own `mcs_spinlock` array.

If we observe contention, then we have no choice other than queueing, so we jump to the `queue` label which we will see later:

```C
if (val & ~_Q_LOCKED_MASK)
	goto queue;
```

Otherwise, at most the locked bit is set, so we try to set the pending bit of the lock:

```C
val = queued_fetch_set_pending_acquire(lock);
```

If we observe contention again, we undo the pending bit (unless somebody else has already set it) and queue:

```C
if (unlikely(val & ~_Q_LOCKED_MASK)) {
	if (!(val & _Q_PENDING_MASK))
		clear_pending(lock);
	goto queue;
}
```

Now we are pending; wait for the lock owner to release the lock:

```C
if (val & _Q_LOCKED_MASK)
	atomic_cond_read_acquire(&lock->val, !(VAL & _Q_LOCKED_MASK));
```

After that we are allowed to take the lock. So we clear the pending bit and set the locked bit. Now we have nothing more to do in the `queued_spin_lock_slowpath` function and may return from it:

```C
clear_pending_set_locked(lock);
return;
```

Before diving into queueing, let's look at the `MCS` lock mechanism first. Each processor in the system has its own copy of the lock. The lock is represented by the following structure:

```C
struct mcs_spinlock {
	struct mcs_spinlock *next;
	int locked;
	int count;
};
```

This structure is from the [kernel/locking/mcs_spinlock.h](https://github.com/torvalds/linux/blob/master/kernel/locking/mcs_spinlock.h) header file. The first field is a pointer to the next thread in the `queue`. The second field represents the state of the current thread in the `queue`, where `1` means the `lock` is already acquired and `0` otherwise. The last field of the `mcs_spinlock` structure represents nested locks. To understand what a nested lock is, imagine a situation when a thread has acquired a lock, but was interrupted by a hardware [interrupt](https://en.wikipedia.org/wiki/Interrupt) and an [interrupt handler](https://en.wikipedia.org/wiki/Interrupt_handler) tries to take a lock too. For this case, each processor has not just a copy of the `mcs_spinlock` structure but an array of these structures:

```C
static DEFINE_PER_CPU_ALIGNED(struct qnode, qnodes[MAX_NODES]);
```

This array allows making four attempts at lock acquisition for the four events in the following contexts:

* normal task context;
* hardware interrupt context;
* software interrupt context;
* non-maskable interrupt context.

Notice that we have not touched the `queue` yet. We did not need it, because for two competing threads it would just lead to unnecessary latency of memory access. If the owner does not release the lock in time and a third thread arrives, `lock->val` will contain `_Q_LOCKED_VAL | _Q_PENDING_VAL` and we will start to build the `queue`. We start building the `queue` by getting the local copy of the `qnodes` array of the processor which executes the thread, and by calculating `tail`, which will indicate the tail of the `queue`, and `idx`, which represents an index in the `qnodes` array:

```C
queue:
	node = this_cpu_ptr(&qnodes[0].mcs);
	idx = node->count++;
	tail = encode_tail(smp_processor_id(), idx);

	node = grab_mcs_node(node, idx);
```

After this, we set `locked` to zero because this thread has not acquired the lock yet, and `next` to `NULL` because we do not know anything about other `queue` entries:

```C
node->locked = 0;
node->next = NULL;
```

We have already touched the `per-cpu` copy of the queue for the processor which executes the current thread, and this means that the owner of the lock may have released it in the meantime. So we may try to acquire the lock again by calling the `queued_spin_trylock` function:

```C
if (queued_spin_trylock(lock))
	goto release;
```

It does almost the same thing the `queued_spin_lock` function does.

If the lock was successfully acquired, we jump to the `release` label to release a node of the `queue`:

```C
release:
	__this_cpu_dec(qnodes[0].mcs.count);
```

because we do not need it anymore as the lock is acquired. If `queued_spin_trylock` was unsuccessful, we update the tail of the queue:

```C
old = xchg_tail(lock, tail);
next = NULL;
```

and retrieve the previous tail. The next step is to check whether the `queue` is empty. If it is not, we need to link the previous entry with the new one. While we were waiting for the `MCS` lock, the next pointer may have been set by another lock waiter. We optimistically load the next pointer and prefetch the cache line for writing, to reduce latency in the upcoming `MCS` unlock operation:

```C
if (old & _Q_TAIL_MASK) {
	prev = decode_tail(old);
	WRITE_ONCE(prev->next, node);

	arch_mcs_spin_lock_contended(&node->locked);

	next = READ_ONCE(node->next);
	if (next)
		prefetchw(next);
}
```

If a new node was added, we prefetch the cache line from the memory pointed to by the next queue entry with the [PREFETCHW](http://www.felixcloutier.com/x86/PREFETCHW.html) instruction. We preload this pointer now for optimization purposes: we have just become the head of the queue, which means that there is an upcoming `MCS` unlock operation and the next entry will be touched.

Yes, from this moment we are at the head of the `queue`. But before we are able to acquire the lock, we need to wait for at least two events: the current owner of the lock releasing it, and the second thread with the `pending` bit acquiring the lock too:

```C
val = atomic_cond_read_acquire(&lock->val, !(VAL & _Q_LOCKED_PENDING_MASK));
```

After both threads release the lock, the head of the `queue` will hold the lock. In the end we just need to update the tail of the `queue` and remove the current head from it.

That's all.

Conclusion
--------------------------------------------------------------------------------

This is the end of the second part of the [synchronization primitives](https://en.wikipedia.org/wiki/Synchronization_%28computer_science%29) chapter in the Linux kernel. In the previous [part](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/linux-sync-1.html) we met the first synchronization primitive provided by the Linux kernel - the `spinlock`, implemented as a `ticket spinlock`. In this part we saw another implementation of the `spinlock` mechanism - the `queued spinlock`. In the next part we will continue to dive into synchronization primitives in the Linux kernel.

If you have questions or suggestions, feel free to ping me on twitter [0xAX](https://twitter.com/0xAX), drop me an [email](anotherworldofworld@gmail.com) or just create an [issue](https://github.com/0xAX/linux-insides/issues/new).

**Please note that English is not my first language and I am really sorry for any inconvenience. If you find any mistakes please send me a PR to [linux-insides](https://github.com/0xAX/linux-insides).**

Links
--------------------------------------------------------------------------------

* [spinlock](https://en.wikipedia.org/wiki/Spinlock)
* [interrupt](https://en.wikipedia.org/wiki/Interrupt)
* [interrupt handler](https://en.wikipedia.org/wiki/Interrupt_handler)
* [API](https://en.wikipedia.org/wiki/Application_programming_interface)
* [Test and Set](https://en.wikipedia.org/wiki/Test-and-set)
* [MCS](http://www.cs.rochester.edu/~scott/papers/1991_TOCS_synch.pdf)
* [per-cpu variables](https://0xax.gitbooks.io/linux-insides/content/Concepts/linux-cpu-1.html)
* [atomic instruction](https://en.wikipedia.org/wiki/Linearizability)
* [CMPXCHG instruction](http://x86.renejeschke.de/html/file_module_x86_id_41.html)
* [LOCK instruction](http://x86.renejeschke.de/html/file_module_x86_id_159.html)
* [NOP instruction](https://en.wikipedia.org/wiki/NOP)
* [PREFETCHW instruction](http://www.felixcloutier.com/x86/PREFETCHW.html)
* [x86_64](https://en.wikipedia.org/wiki/X86-64)
* [Previous part](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/linux-sync-1.html)
@ -1,487 +0,0 @@
Synchronization primitives in the Linux kernel. Part 2.
================================================================================

Queued Spinlocks
--------------------------------------------------------------------------------

This is the second part of the [chapter](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/index.html) which describes synchronization primitives in the Linux kernel. In the first [part](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/sync-1.html) of this chapter we met the first one - the [spinlock](https://en.wikipedia.org/wiki/Spinlock). We will continue to learn about this synchronization primitive in this part. If you have read the previous part, you may remember that besides normal spinlocks, the Linux kernel provides a special type of `spinlocks` - `queued spinlocks`. In this part we will try to understand what this concept represents.

We saw the [API](https://en.wikipedia.org/wiki/Application_programming_interface) of `spinlock` in the previous [part](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/sync-1.html):

* `spin_lock_init` - produces initialization of the given `spinlock`;
* `spin_lock` - acquires the given `spinlock`;
* `spin_lock_bh` - disables software [interrupts](https://en.wikipedia.org/wiki/Interrupt) and acquires the given `spinlock`;
* `spin_lock_irqsave` and `spin_lock_irq` - disable interrupts on the local processor and preserve/do not preserve the previous interrupt state in `flags`;
* `spin_unlock` - releases the given `spinlock`;
* `spin_unlock_bh` - releases the given `spinlock` and enables software interrupts;
* `spin_is_locked` - returns the state of the given `spinlock`;
* etc.

And we know that all of these macros, which are defined in the [include/linux/spinlock.h](https://github.com/torvalds/linux/blob/master/include/linux/spinlock.h) header file, expand to calls of functions with the `arch_spin_.*` prefix from [arch/x86/include/asm/spinlock.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/spinlock.h) for the [x86_64](https://en.wikipedia.org/wiki/X86-64) architecture. If we look at this header file attentively, we will see that these functions (`arch_spin_is_locked`, `arch_spin_lock`, `arch_spin_unlock`, etc.) are defined only if the `CONFIG_QUEUED_SPINLOCKS` kernel configuration option is disabled:

```C
#ifdef CONFIG_QUEUED_SPINLOCKS
#include <asm/qspinlock.h>
#else
static __always_inline void arch_spin_lock(arch_spinlock_t *lock)
{
	...
	...
	...
}
...
...
...
#endif
```

This means that the [arch/x86/include/asm/qspinlock.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/qspinlock.h) header file provides its own implementation of these functions. Actually, they are macros, and they are located in another header file - [include/asm-generic/qspinlock.h](https://github.com/torvalds/linux/blob/master/include/asm-generic/qspinlock.h#L126). If we look into this header file, we will find the definitions of these macros:

```C
#define arch_spin_is_locked(l)		queued_spin_is_locked(l)
#define arch_spin_is_contended(l)	queued_spin_is_contended(l)
#define arch_spin_value_unlocked(l)	queued_spin_value_unlocked(l)
#define arch_spin_lock(l)		queued_spin_lock(l)
#define arch_spin_trylock(l)		queued_spin_trylock(l)
#define arch_spin_unlock(l)		queued_spin_unlock(l)
#define arch_spin_lock_flags(l, f)	queued_spin_lock(l)
#define arch_spin_unlock_wait(l)	queued_spin_unlock_wait(l)
```

Before we consider how queued spinlocks and their [API](https://en.wikipedia.org/wiki/Application_programming_interface) are implemented, let's take a look at the theoretical part first.

Introduction to queued spinlocks
-------------------------------------------------------------------------------

Queued spinlocks is a [locking mechanism](https://en.wikipedia.org/wiki/Lock_%28computer_science%29) in the Linux kernel which is a replacement for the standard `spinlocks`. At least this is true for the [x86_64](https://en.wikipedia.org/wiki/X86-64) architecture. If we look at the following kernel configuration file - [kernel/Kconfig.locks](https://github.com/torvalds/linux/blob/master/kernel/Kconfig.locks) - we will see the following configuration entries:

```
config ARCH_USE_QUEUED_SPINLOCKS
	bool

config QUEUED_SPINLOCKS
	def_bool y if ARCH_USE_QUEUED_SPINLOCKS
	depends on SMP
```

This means that the `CONFIG_QUEUED_SPINLOCKS` kernel configuration option will be enabled by default if the `ARCH_USE_QUEUED_SPINLOCKS` option is enabled. We may see that `ARCH_USE_QUEUED_SPINLOCKS` is enabled by default in the `x86_64` specific kernel configuration file - [arch/x86/Kconfig](https://github.com/torvalds/linux/blob/master/arch/x86/Kconfig):

```
config X86
	...
	...
	...
	select ARCH_USE_QUEUED_SPINLOCKS
	...
	...
	...
```

Before we start to consider what the queued spinlock concept is, let's look at other types of `spinlocks`. For a start, let's consider how a `normal` spinlock is implemented. Usually, the implementation of a `normal` spinlock is based on the [test and set](https://en.wikipedia.org/wiki/Test-and-set) instruction. The principle of this instruction is pretty simple: it writes a value to a memory location and returns the old value from this memory location. Both of these operations happen atomically, i.e. the instruction is non-interruptible. So if the first thread started to execute this instruction, the second thread will wait until the first one has finished. A basic lock can be built on top of this mechanism. Schematically it may look like this:

```C
int lock(lock)
{
	while (test_and_set(lock) == 1)
		;
	return 0;
}

int unlock(lock)
{
	lock = 0;

	return lock;
}
```

The first thread executes `test_and_set`, which sets the `lock` to `1`. When the second thread calls the `lock` function, it will spin in the `while` loop until the first thread calls the `unlock` function and the `lock` becomes `0`. This implementation is not very good for performance, because it has at least two problems. The first problem is that this implementation may be unfair: a thread from one processor may have a long waiting time, even if it called `lock` before other threads which are also waiting for the free lock. The second problem is that all threads which want to acquire the lock must execute many `atomic` operations like `test_and_set` on a variable in shared memory. This leads to cache invalidation, as each processor's cache holds a copy of the `lock` variable that is invalidated every time another processor writes to it.

In the previous [part](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/sync-1.html) we saw the second type of spinlock implementation - the `ticket spinlock`. This approach solves the first problem and guarantees the order of threads which want to acquire the lock, but it still has the second problem.

The topic of this part is `queued spinlocks`. This approach helps to solve both of these problems. `Queued spinlocks` allow each processor to spin on its own memory location. The basic principle of a queue-based spinlock can best be understood by studying a classic queue-based spinlock implementation called the [MCS](http://www.cs.rochester.edu/~scott/papers/1991_TOCS_synch.pdf) lock. Before we look at the implementation of `queued spinlocks` in the Linux kernel, we will try to understand what the `MCS` lock is.

The basic idea of the `MCS` lock is, as I already wrote in the previous paragraph, that a thread spins on a local variable, and each processor in the system has its own copy of this variable. In other words, this concept is built on top of the [per-cpu](https://0xax.gitbooks.io/linux-insides/content/Concepts/per-cpu.html) variables concept in the Linux kernel.

When the first thread wants to acquire the lock, it registers itself in the `queue`, or in other words it is added to the special `queue`, and acquires the lock, because it is free for now. When the second thread wants to acquire the same lock before the first thread releases it, this thread adds its own copy of the lock variable into this `queue`. In this case the first thread's `next` field will point to the second thread. From this moment, the second thread will wait until the first thread releases its lock and notifies the `next` thread about this event. The first thread is then deleted from the `queue` and the second thread becomes the owner of the lock.

Schematically we can represent it like this:

Empty queue:

```
+---------+
|         |
|  Queue  |
|         |
+---------+
```

First thread tries to acquire a lock:

```
+---------+     +----------------------------+
|         |     |                            |
|  Queue  |---->| First thread acquired lock |
|         |     |                            |
+---------+     +----------------------------+
```

Second thread tries to acquire a lock:

```
+---------+     +----------------------------------------+     +-------------------------+
|         |     |                                        |     |                         |
|  Queue  |---->|  Second thread waits for first thread  |<----| First thread holds lock |
|         |     |                                        |     |                         |
+---------+     +----------------------------------------+     +-------------------------+
```

Or the pseudocode:

```C
void lock(...)
{
	lock.next = NULL;
	ancestor = put_lock_to_queue_and_return_ancestor(queue, lock);

	// if we have an ancestor, the lock is already acquired
	// and we need to wait until it is released
	if (ancestor)
	{
		lock.locked = 1;
		ancestor.next = lock;

		while (lock.locked == true)
			;
	}

	// otherwise we are the owner of the lock and may exit
}

void unlock(...)
{
	// do we need to notify somebody, or are we alone in the
	// queue?
	if (lock.next != NULL) {
		// the while loop from the lock() function will be
		// finished
		lock.next.locked = false;
		// delete ourselves from the queue and exit
		...
		...
		...
		return;
	}

	// So, we have no next threads in the queue to notify about
	// the lock releasing event. Let's just put `0` to the lock,
	// delete ourselves from the queue and exit.
}
```

The idea is simple, but the implementation of `queued spinlocks` is much more complex than this pseudocode. As I already wrote above, the `queued spinlock` mechanism is planned as a replacement for `ticket spinlocks` in the Linux kernel. But as you may remember, the usual `spinlock` fits into a `32-bit` [word](https://en.wikipedia.org/wiki/Word_%28computer_architecture%29), while the `MCS`-based lock does not fit into this size. As you may know, the `spinlock_t` type is [widely](http://lxr.free-electrons.com/ident?i=spinlock_t) used in the Linux kernel, so a significant part of the Linux kernel would have to be rewritten, which is unacceptable. Besides this, some kernel structures which contain a spinlock for protection cannot grow. But anyway, the implementation of `queued spinlocks` in the Linux kernel is based on this concept, with some modifications which allow it to fit into `32` bits.

That's all about the theory of `queued spinlocks`; now let's consider how this mechanism is implemented in the Linux kernel. The implementation of `queued spinlocks` looks more complex and tangled than the implementation of `ticket spinlocks`, but studying it attentively will lead to success.

API of queued spinlocks
-------------------------------------------------------------------------------

Now we know a little about `queued spinlocks` from the theoretical side; time to see the implementation of this mechanism in the Linux kernel. As we saw above, the [include/asm-generic/qspinlock.h](https://github.com/torvalds/linux/blob/master/include/asm-generic/qspinlock.h#L126) header file provides a set of macros which represent the API for spinlock acquiring, releasing, etc.:

```C
#define arch_spin_is_locked(l)		queued_spin_is_locked(l)
#define arch_spin_is_contended(l)	queued_spin_is_contended(l)
#define arch_spin_value_unlocked(l)	queued_spin_value_unlocked(l)
#define arch_spin_lock(l)		queued_spin_lock(l)
#define arch_spin_trylock(l)		queued_spin_trylock(l)
#define arch_spin_unlock(l)		queued_spin_unlock(l)
#define arch_spin_lock_flags(l, f)	queued_spin_lock(l)
#define arch_spin_unlock_wait(l)	queued_spin_unlock_wait(l)
```

All of these macros expand to calls of functions from the same header file. Additionally, we saw the `qspinlock` structure from the [include/asm-generic/qspinlock_types.h](https://github.com/torvalds/linux/blob/master/include/asm-generic/qspinlock_types.h) header file which represents a queued spinlock in the Linux kernel:

```C
typedef struct qspinlock {
	atomic_t	val;
} arch_spinlock_t;
```

As we may see, the `qspinlock` structure contains only one field - `val`. This field represents the state of a given `spinlock`. This `4`-byte field consists of the following four parts:

* `0-7` - locked byte;
* `8` - pending bit;
* `16-17` - two-bit index which represents an entry of the `per-cpu` array of `MCS` locks (we will see it soon);
* `18-31` - number of the processor which indicates the tail of the queue;

and bits `9-15` are not used.

As we already know, each processor in the system has its own copy of the lock. The lock is represented by the following structure:

```C
struct mcs_spinlock {
	struct mcs_spinlock *next;
	int locked;
	int count;
};
```

from the [kernel/locking/mcs_spinlock.h](https://github.com/torvalds/linux/blob/master/kernel/locking/mcs_spinlock.h) header file. The first field is a pointer to the next thread in the `queue`. The second field represents the state of the current thread in the `queue`, where `1` means the `lock` is already acquired and `0` otherwise. The last field of the `mcs_spinlock` structure represents nested locks. To understand what a nested lock is, imagine a situation when a thread has acquired a lock, but was interrupted by a hardware [interrupt](https://en.wikipedia.org/wiki/Interrupt) and an [interrupt handler](https://en.wikipedia.org/wiki/Interrupt_handler) tries to take a lock too. For this case, each processor has not just a copy of the `mcs_spinlock` structure but an array of these structures:

```C
static DEFINE_PER_CPU_ALIGNED(struct mcs_spinlock, mcs_nodes[4]);
```

This array allows making four attempts at lock acquisition for the four events in the following contexts:

* normal task context;
* hardware interrupt context;
* software interrupt context;
* non-maskable interrupt context.

Now let's return to the `qspinlock` structure and the `API` of the `queued spinlocks`. Before we move on to the `API`, notice that the `val` field of the `qspinlock` structure has the type `atomic_t`, which represents an atomic variable, i.e. a variable on which only one operation may be performed at a time. So, all operations on this field will be [atomic](https://en.wikipedia.org/wiki/Linearizability). For example, let's look at the API for reading the value of `val`:

```C
static __always_inline int queued_spin_is_locked(struct qspinlock *lock)
{
	return atomic_read(&lock->val);
}
```

Ok, now we know the data structures which represent a queued spinlock in the Linux kernel, and now it is time to look at the implementation of the `main` function from the `queued spinlocks` [API](https://en.wikipedia.org/wiki/Application_programming_interface):
|
||||
|
||||
```C
|
||||
#define arch_spin_lock(l) queued_spin_lock(l)
|
||||
```
|
||||
|
||||
Yes, this function is - `queued_spin_lock`. As we may understand from the function's name, it allows to acquire lock by the thread. This function is defined in the [include/asm-generic/qspinlock_types.h](https://github.com/torvalds/linux/blob/master/include/asm-generic/qspinlock_types.h) header file and its implementation looks:
|
||||
|
||||
```C
|
||||
static __always_inline void queued_spin_lock(struct qspinlock *lock)
|
||||
{
|
||||
u32 val;
|
||||
|
||||
val = atomic_cmpxchg_acquire(&lock->val, 0, _Q_LOCKED_VAL);
|
||||
if (likely(val == 0))
|
||||
return;
|
||||
queued_spin_lock_slowpath(lock, val);
|
||||
}
|
||||
```
|
||||
|
||||
Looks pretty easy, except the `queued_spin_lock_slowpath` function. We may see that it takes only one parameter. In our case this parameter will represent `queued spinlock` which will be locked. Let's consider the situation that `queue` with locks is empty for now and the first thread wanted to acquire lock. As we may see the `queued_spin_lock` function starts from the call of the `atomic_cmpxchg_acquire` macro. As you may guess from the name of this macro, it executes atomic [CMPXCHG](http://x86.renejeschke.de/html/file_module_x86_id_41.html) instruction which compares value of the second parameter (zero in our case) with the value of the first parameter (current state of the given spinlock) and if they are identical, it stores value of the `_Q_LOCKED_VAL` in the memory location which is pointed by the `&lock->val` and return the initial value from this memory location.
|
||||
|
||||
The `atomic_cmpxchg_acquire` macro is defined in the [include/linux/atomic.h](https://github.com/torvalds/linux/blob/master/include/linux/atomic.h) header file and expands to the call of the `atomic_cmpxchg` function:
|
||||
|
||||
```C
|
||||
#define atomic_cmpxchg_acquire atomic_cmpxchg
|
||||
```
|
||||
|
||||
which is architecture specific. We consider [x86_64](https://en.wikipedia.org/wiki/X86-64) architecture, so in our case this header file will be [arch/x86/include/asm/atomic.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/atomic.h) and the implementation of the `atomic_cmpxchg` function is just returns the result of the `cmpxchg` macro:
|
||||
|
||||
```C
|
||||
static __always_inline int atomic_cmpxchg(atomic_t *v, int old, int new)
|
||||
{
|
||||
return cmpxchg(&v->counter, old, new);
|
||||
}
|
||||
```
|
||||
|
||||
This macro is defined in the [arch/x86/include/asm/cmpxchg.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/cmpxchg.h) header file and looks like this:

```C
#define cmpxchg(ptr, old, new) \
    __cmpxchg(ptr, old, new, sizeof(*(ptr)))

#define __cmpxchg(ptr, old, new, size) \
    __raw_cmpxchg((ptr), (old), (new), (size), LOCK_PREFIX)
```

As we can see, the `cmpxchg` macro expands to the `__cmpxchg` macro with almost the same set of parameters. The new additional parameter is the size of the atomic value. The `__cmpxchg` macro adds the `LOCK_PREFIX` and expands to the `__raw_cmpxchg` macro, where `LOCK_PREFIX` is just the [LOCK](http://x86.renejeschke.de/html/file_module_x86_id_159.html) instruction prefix. After all, the `__raw_cmpxchg` macro does all the work for us:

```C
#define __raw_cmpxchg(ptr, old, new, size, lock)        \
({
        ...
        ...
        ...
        volatile u32 *__ptr = (volatile u32 *)(ptr);    \
        asm volatile(lock "cmpxchgl %2,%1"              \
                     : "=a" (__ret), "+m" (*__ptr)      \
                     : "r" (__new), "" (__old)          \
                     : "memory");                       \
        ...
        ...
        ...
})
```

After the `atomic_cmpxchg_acquire` macro has been executed, it returns the previous value of the memory location. So far only one thread has tried to acquire the lock, so `val` will be zero and we will return from the `queued_spin_lock` function:

```C
	val = atomic_cmpxchg_acquire(&lock->val, 0, _Q_LOCKED_VAL);
	if (likely(val == 0))
		return;
```

From this moment, our first thread holds the lock. Notice that this behavior differs from the behavior described in the `MCS` algorithm. The thread acquired the lock, but we didn't add it to the `queue`. As I already wrote, the implementation of the `queued spinlocks` concept in the Linux kernel is based on the `MCS` algorithm, but at the same time it has some differences like this one for optimization purposes.

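The fast path just described can be imitated in user space with C11 atomics. This is only an analogy sketch, not the kernel's code: the `toy_lock`, `toy_trylock` and `toy_unlock` names are invented for illustration, with `1` standing in for `_Q_LOCKED_VAL`:

```c
#include <stdatomic.h>

/* Toy analog of the qspinlock fast path: the lock word is 0 when the
 * lock is free and 1 (standing in for _Q_LOCKED_VAL) when it is held. */
typedef struct { atomic_int val; } toy_lock;

/* Try the fast path: a single compare-and-swap from 0 to 1.
 * atomic_compare_exchange_strong is the C11 counterpart of the
 * LOCK CMPXCHG sequence shown above. Returns 1 on success. */
static int toy_trylock(toy_lock *l)
{
	int expected = 0;
	return atomic_compare_exchange_strong(&l->val, &expected, 1);
}

static void toy_unlock(toy_lock *l)
{
	atomic_store(&l->val, 0);
}
```

A second caller would see `val == 1`, the CAS would fail, and the kernel would fall through to `queued_spin_lock_slowpath`; this sketch simply reports failure instead.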
So the first thread has acquired the lock; now let's consider the case where a second thread tries to acquire the same lock. The second thread will start from the same `queued_spin_lock` function, but `lock->val` will contain `1`, or `_Q_LOCKED_VAL`, because the first thread already holds the lock. So, in this case the `queued_spin_lock_slowpath` function will be called. The `queued_spin_lock_slowpath` function is defined in the [kernel/locking/qspinlock.c](https://github.com/torvalds/linux/blob/master/kernel/locking/qspinlock.c) source code file and starts from the following checks:

```C
void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
{
	if (pv_enabled())
		goto queue;

	if (virt_spin_lock(lock))
		return;

	...
	...
	...
}
```

which check the state of the `pvqspinlock`. The `pvqspinlock` is a `queued spinlock` in a [paravirtualized](https://en.wikipedia.org/wiki/Paravirtualization) environment. As this chapter is related only to synchronization primitives in the Linux kernel, we skip these and other parts which are not directly related to the topic of this chapter. After these checks, we compare our value which represents the lock with the value of the `_Q_PENDING_VAL` macro and do nothing while this is true:

```C
	if (val == _Q_PENDING_VAL) {
		while ((val = atomic_read(&lock->val)) == _Q_PENDING_VAL)
			cpu_relax();
	}
```

where `cpu_relax` is just a [NOP](https://en.wikipedia.org/wiki/NOP) instruction. Above, we saw that the lock contains a `pending` bit. This bit represents a thread which wanted to acquire the lock while it was already acquired by another thread and, at the same time, the `queue` was empty. In this case, the `pending` bit is set and the `queue` is not touched. This is done for optimization, because there is no need for the unnecessary latency which would be caused by the cache invalidation of touching the per-cpu `mcs_spinlock` array.

At the next step we enter the following loop:

```C
	for (;;) {
		if (val & ~_Q_LOCKED_MASK)
			goto queue;

		new = _Q_LOCKED_VAL;
		if (val == new)
			new |= _Q_PENDING_VAL;

		old = atomic_cmpxchg_acquire(&lock->val, val, new);
		if (old == val)
			break;

		val = old;
	}
```

The first `if` clause here checks whether the state of the lock (`val`) is in the locked or pending state. That would mean a first thread has already acquired the lock and a second thread has also tried to acquire it and is now in the pending state. In that case we need to start building the queue; we will consider this situation a little later. In our case, the first thread holds the lock and the second thread is trying to acquire it. After this check we create a new lock value in the locked state and compare it with the state of the previous lock. As you remember, `val` contains the state of `&lock->val`, which after the second thread has called the `atomic_cmpxchg_acquire` macro will be equal to `1`. Both the `new` and `val` values are equal, so we set the pending bit in the lock for the second thread. After this we need to check the value of `&lock->val` again, because the first thread may have released the lock before this moment. If the first thread has not released the lock yet, the value of `old` will be equal to the value of `val` (because `atomic_cmpxchg_acquire` will return the value from the memory location pointed to by `lock->val`, which is now `1`) and we will exit the loop. After exiting the loop, we wait until the first thread releases the lock, then clear the pending bit, acquire the lock and return:

```C
	smp_cond_acquire(!(atomic_read(&lock->val) & _Q_LOCKED_MASK));
	clear_pending_set_locked(lock);
	return;
```

Notice that we still did not touch the `queue`. We have no need for it, because for two threads it would just lead to unnecessary latency for memory access. In the other case, the first thread may have released its lock before this moment. In that case `lock->val` will contain `_Q_LOCKED_VAL | _Q_PENDING_VAL` and we will start to build the `queue`. We start to build the `queue` by getting the local copy of the `mcs_nodes` array of the processor which executes the thread:

```C
	node = this_cpu_ptr(&mcs_nodes[0]);
	idx = node->count++;
	tail = encode_tail(smp_processor_id(), idx);
```

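The packing done by `encode_tail` can be illustrated with a simplified round-trip sketch. The layout below (a 2-bit node index with the CPU number packed above it) mirrors the idea behind the kernel's `_Q_TAIL_IDX_OFFSET`/`_Q_TAIL_CPU_OFFSET` constants, but the exact values and the `toy_` names here are illustrative, not the kernel's:

```c
/* Simplified analog of encode_tail()/decode_tail(): the tail word packs
 * (cpu + 1) above a 2-bit per-cpu node index. Storing cpu + 1 lets a
 * tail of 0 mean "the queue is empty" while CPU 0 stays distinguishable. */
enum {
	TOY_TAIL_IDX_OFFSET = 16,
	TOY_TAIL_IDX_BITS   = 2,
	TOY_TAIL_CPU_OFFSET = 18,
};

static unsigned int toy_encode_tail(int cpu, int idx)
{
	return ((unsigned int)(cpu + 1) << TOY_TAIL_CPU_OFFSET)
	     | ((unsigned int)idx << TOY_TAIL_IDX_OFFSET);
}

static void toy_decode_tail(unsigned int tail, int *cpu, int *idx)
{
	*cpu = (int)(tail >> TOY_TAIL_CPU_OFFSET) - 1;
	*idx = (int)((tail >> TOY_TAIL_IDX_OFFSET)
	             & ((1u << TOY_TAIL_IDX_BITS) - 1));
}
```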
Additionally we calculate `tail`, which will indicate the tail of the `queue`, and `idx`, which represents an entry of the `mcs_nodes` array. After this we set `node` to point to the correct entry of the `mcs_nodes` array, set `locked` to zero because this thread didn't acquire the lock yet, and set `next` to `NULL` because we don't know anything about other `queue` entries:

```C
	node += idx;
	node->locked = 0;
	node->next = NULL;
```

We have already touched the `per-cpu` copy of the queue for the processor which executes the current thread that wants to acquire the lock, which means the owner of the lock may have released it by this moment. So we may try to acquire the lock again by calling the `queued_spin_trylock` function:

```C
	if (queued_spin_trylock(lock))
		goto release;
```

The `queued_spin_trylock` function is defined in the [include/asm-generic/qspinlock.h](https://github.com/torvalds/linux/blob/master/include/asm-generic/qspinlock.h) header file and does the same thing that the `queued_spin_lock` function does:

```C
static __always_inline int queued_spin_trylock(struct qspinlock *lock)
{
	if (!atomic_read(&lock->val) &&
	   (atomic_cmpxchg_acquire(&lock->val, 0, _Q_LOCKED_VAL) == 0))
		return 1;
	return 0;
}
```

If the lock was successfully acquired, we jump to the `release` label to release a node of the `queue`:

```C
release:
	this_cpu_dec(mcs_nodes[0].count);
```

because we don't need it anymore as the lock is acquired. If the `queued_spin_trylock` was unsuccessful, we update the tail of the queue:

```C
	old = xchg_tail(lock, tail);
```

and retrieve the previous tail. The next step is to check whether the `queue` is empty. If it is not, we need to link the previous entry with the new one:

```C
	if (old & _Q_TAIL_MASK) {
		prev = decode_tail(old);
		WRITE_ONCE(prev->next, node);

		arch_mcs_spin_lock_contended(&node->locked);
	}
```

After the queue entries are linked, we start to wait until we reach the head of the queue. Once we have reached it, we need to check for a new node which might have been added during this wait:

```C
	next = READ_ONCE(node->next);
	if (next)
		prefetchw(next);
```

If a new node was added, we prefetch the cache line of the memory pointed to by the next queue entry with the [PREFETCHW](http://www.felixcloutier.com/x86/PREFETCHW.html) instruction. We preload this pointer now for optimization purposes: we just became the head of the queue, which means an `MCS` unlock operation is upcoming and the next entry will be touched.

Yes, from this moment we are at the head of the `queue`. But before we are able to acquire the lock, we need to wait for at least two events: the current owner of the lock releasing it, and the second thread with the `pending` bit set acquiring the lock too:

```C
	smp_cond_acquire(!((val = atomic_read(&lock->val)) & _Q_LOCKED_PENDING_MASK));
```

After both threads have released the lock, the head of the `queue` will hold the lock. In the end we just need to update the tail of the `queue` and remove the current head from it.

That's all.

Conclusion
--------------------------------------------------------------------------------

This is the end of the second part of the [synchronization primitives](https://en.wikipedia.org/wiki/Synchronization_%28computer_science%29) chapter in the Linux kernel. In the previous [part](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/sync-1.html) we already met the first synchronization primitive provided by the Linux kernel - the `spinlock`, implemented as a `ticket spinlock`. In this part we saw another implementation of the `spinlock` mechanism - the `queued spinlock`. In the next part we will continue to dive into synchronization primitives in the Linux kernel.

If you have questions or suggestions, feel free to ping me on Twitter [0xAX](https://twitter.com/0xAX), drop me an [email](anotherworldofworld@gmail.com) or just create an [issue](https://github.com/0xAX/linux-insides/issues/new).

**Please note that English is not my first language and I am really sorry for any inconvenience. If you find any mistakes, please send me a PR to [linux-insides](https://github.com/0xAX/linux-insides).**

Links
--------------------------------------------------------------------------------

* [spinlock](https://en.wikipedia.org/wiki/Spinlock)
* [interrupt](https://en.wikipedia.org/wiki/Interrupt)
* [interrupt handler](https://en.wikipedia.org/wiki/Interrupt_handler)
* [API](https://en.wikipedia.org/wiki/Application_programming_interface)
* [Test and Set](https://en.wikipedia.org/wiki/Test-and-set)
* [MCS](http://www.cs.rochester.edu/~scott/papers/1991_TOCS_synch.pdf)
* [per-cpu variables](https://0xax.gitbooks.io/linux-insides/content/Concepts/per-cpu.html)
* [atomic instruction](https://en.wikipedia.org/wiki/Linearizability)
* [CMPXCHG instruction](http://x86.renejeschke.de/html/file_module_x86_id_41.html)
* [LOCK instruction](http://x86.renejeschke.de/html/file_module_x86_id_159.html)
* [NOP instruction](https://en.wikipedia.org/wiki/NOP)
* [PREFETCHW instruction](http://www.felixcloutier.com/x86/PREFETCHW.html)
* [x86_64](https://en.wikipedia.org/wiki/X86-64)
* [Previous part](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/sync-1.html)

Limits on resources in Linux
================================================================================

Each process in the system uses a certain amount of different resources like files, CPU time, memory and so on.

Such resources are not infinite, and we should have an instrument to manage them for each process. Sometimes it is useful to know the current limits for a certain resource or to change its value. In this post we will consider such instruments, which allow us to get information about the limits for a process and to increase or decrease such limits.

We will start from the userspace view and then look at how this is implemented in the Linux kernel.

There are three main fundamental [system calls](https://en.wikipedia.org/wiki/System_call) to manage resource limits for a process:

* `getrlimit`
* `setrlimit`
* `prlimit`

The first two allow a process to read and set limits on a system resource. The last one is an extension of the previous functions: `prlimit` allows us to set and read the resource limits of a process specified by [PID](https://en.wikipedia.org/wiki/Process_identifier). The definitions of these functions look like this.

The `getrlimit` is:

```C
int getrlimit(int resource, struct rlimit *rlim);
```

The `setrlimit` is:

```C
int setrlimit(int resource, const struct rlimit *rlim);
```

And the definition of the `prlimit` is:

```C
int prlimit(pid_t pid, int resource, const struct rlimit *new_limit,
            struct rlimit *old_limit);
```

In the first two cases, the functions take two parameters:

* `resource` - represents the resource type (we will see the available types later);
* `rlim` - a combination of `soft` and `hard` limits.

There are two types of limits:

* `soft`
* `hard`

The first provides the actual limit for a resource of a process. The second is the ceiling value of a `soft` limit and can be changed only by the superuser. So, a `soft` limit can never exceed the related `hard` limit.

Both these values are combined in the `rlimit` structure:

```C
struct rlimit {
	rlim_t rlim_cur;
	rlim_t rlim_max;
};
```

The last function looks a little more complex and takes `4` arguments. Besides the `resource` argument, it takes:

* `pid` - specifies the ID of the process on which `prlimit` should be executed;
* `new_limit` - provides new limit values if it is not `NULL`;
* `old_limit` - the current `soft` and `hard` limits will be placed here if it is not `NULL`.

It is exactly the `prlimit` function that is used by the [ulimit](https://www.gnu.org/software/bash/manual/html_node/Bash-Builtins.html#index-ulimit) util. We can verify this with the help of the [strace](https://linux.die.net/man/1/strace) util.

For example:

```
~$ strace ulimit -s 2>&1 | grep rl

prlimit64(0, RLIMIT_NPROC, NULL, {rlim_cur=63727, rlim_max=63727}) = 0
prlimit64(0, RLIMIT_NOFILE, NULL, {rlim_cur=1024, rlim_max=4*1024}) = 0
prlimit64(0, RLIMIT_STACK, NULL, {rlim_cur=8192*1024, rlim_max=RLIM64_INFINITY}) = 0
```

Here we can see `prlimit64`, not `prlimit`. The fact is that we see the underlying system call here instead of the library call.
Now let's look at the list of available resources:

| Resource          | Description |
|-------------------|------------------------------------------------------------------------------------------|
| RLIMIT_CPU        | CPU time limit given in seconds |
| RLIMIT_FSIZE      | the maximum size of files that a process may create |
| RLIMIT_DATA       | the maximum size of the process's data segment |
| RLIMIT_STACK      | the maximum size of the process stack in bytes |
| RLIMIT_CORE       | the maximum size of a [core](http://man7.org/linux/man-pages/man5/core.5.html) file |
| RLIMIT_RSS        | the number of bytes that can be allocated for a process in RAM |
| RLIMIT_NPROC      | the maximum number of processes that can be created by a user |
| RLIMIT_NOFILE     | the maximum number of file descriptors that can be opened by a process |
| RLIMIT_MEMLOCK    | the maximum number of bytes of memory that may be locked into RAM by [mlock](http://man7.org/linux/man-pages/man2/mlock.2.html) |
| RLIMIT_AS         | the maximum size of virtual memory in bytes |
| RLIMIT_LOCKS      | the maximum number of [flock](https://linux.die.net/man/1/flock) and locking related [fcntl](http://man7.org/linux/man-pages/man2/fcntl.2.html) calls |
| RLIMIT_SIGPENDING | the maximum number of [signals](http://man7.org/linux/man-pages/man7/signal.7.html) that may be queued for a user of the calling process |
| RLIMIT_MSGQUEUE   | the number of bytes that can be allocated for [POSIX message queues](http://man7.org/linux/man-pages/man7/mq_overview.7.html) |
| RLIMIT_NICE       | the maximum [nice](https://linux.die.net/man/1/nice) value that can be set by a process |
| RLIMIT_RTPRIO     | the maximum real-time priority value |
| RLIMIT_RTTIME     | the maximum number of microseconds that a process may be scheduled under a real-time scheduling policy without making a blocking system call |

If you look into the source code of open source projects, you will note that reading or updating a resource limit is a quite widely used operation.

For example: [systemd](https://github.com/systemd/systemd/blob/01a45898fce8def67d51332bccc410eb1e8710e7/src/core/main.c)

```C
/* Don't limit the coredump size */
(void) setrlimit(RLIMIT_CORE, &RLIMIT_MAKE_CONST(RLIM_INFINITY));
```

Or [haproxy](https://github.com/haproxy/haproxy/blob/25f067ccec52f53b0248a05caceb7841a3cb99df/src/haproxy.c):

```C
getrlimit(RLIMIT_NOFILE, &limit);
if (limit.rlim_cur < global.maxsock) {
	Warning("[%s.main()] FD limit (%d) too low for maxconn=%d/maxsock=%d. Please raise 'ulimit-n' to %d or more to avoid any trouble.\n",
		argv[0], (int)limit.rlim_cur, global.maxconn, global.maxsock, global.maxsock);
}
```

We have just seen a little of the resource-limit related stuff in userspace; now let's look at the same system calls in the Linux kernel.

Limits on resources in the Linux kernel
--------------------------------------------------------------------------------

The implementations of the `getrlimit` and `setrlimit` system calls look similar. Both execute the `do_prlimit` function, which is the core implementation of the `prlimit` system call, and copy the given `rlimit` from/to userspace.

The `getrlimit`:

```C
SYSCALL_DEFINE2(getrlimit, unsigned int, resource, struct rlimit __user *, rlim)
{
	struct rlimit value;
	int ret;

	ret = do_prlimit(current, resource, NULL, &value);
	if (!ret)
		ret = copy_to_user(rlim, &value, sizeof(*rlim)) ? -EFAULT : 0;

	return ret;
}
```

and the `setrlimit`:

```C
SYSCALL_DEFINE2(setrlimit, unsigned int, resource, struct rlimit __user *, rlim)
{
	struct rlimit new_rlim;

	if (copy_from_user(&new_rlim, rlim, sizeof(*rlim)))
		return -EFAULT;
	return do_prlimit(current, resource, &new_rlim, NULL);
}
```

The implementations of these system calls are defined in the [kernel/sys.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/kernel/sys.c) kernel source code file.

First of all, the `do_prlimit` function checks that the given resource is valid:

```C
	if (resource >= RLIM_NLIMITS)
		return -EINVAL;
```

and in the failure case returns the `-EINVAL` error. After this check passes successfully, if new limits were passed as a non-`NULL` value, the two following checks:

```C
	if (new_rlim) {
		if (new_rlim->rlim_cur > new_rlim->rlim_max)
			return -EINVAL;
		if (resource == RLIMIT_NOFILE &&
		    new_rlim->rlim_max > sysctl_nr_open)
			return -EPERM;
	}
```

verify that the given `soft` limit does not exceed the `hard` limit and, in the case when the given resource is the maximum number of file descriptors, that the hard limit is not greater than the `sysctl_nr_open` value. The value of `sysctl_nr_open` can be found via [procfs](https://en.wikipedia.org/wiki/Procfs):

```
~$ cat /proc/sys/fs/nr_open
1048576
```

After all of these checks, we lock the `tasklist` to be sure that [signal](http://man7.org/linux/man-pages/man7/signal.7.html) handler related things will not be destroyed while we are updating the limits for the given resource:

```C
	read_lock(&tasklist_lock);
	...
	...
	...
	read_unlock(&tasklist_lock);
```

We need to do this because the `prlimit` system call allows us to update the limits of another task by the given pid. While the task list is locked, we take the `rlimit` instance that is responsible for the given resource limit of the given process:

```C
	rlim = tsk->signal->rlim + resource;
```

where `tsk->signal->rlim` is just an array of `struct rlimit` instances that represent certain resources. If the `new_rlim` is not `NULL`, we just update its value. If `old_rlim` is not `NULL`, we fill it:

```C
	if (old_rlim)
		*old_rlim = *rlim;
```

That's all.

Conclusion
--------------------------------------------------------------------------------

This is the end of the second part that describes the implementation of system calls in the Linux kernel. If you have questions or suggestions, ping me on Twitter [0xAX](https://twitter.com/0xAX), drop me an [email](anotherworldofworld@gmail.com), or just create an [issue](https://github.com/0xAX/linux-internals/issues/new).

**Please note that English is not my first language and I am really sorry for any inconvenience. If you find any mistakes, please send me a PR to [linux-insides](https://github.com/0xAX/linux-internals).**

Links
--------------------------------------------------------------------------------

* [system calls](https://en.wikipedia.org/wiki/System_call)
* [PID](https://en.wikipedia.org/wiki/Process_identifier)
* [ulimit](https://www.gnu.org/software/bash/manual/html_node/Bash-Builtins.html#index-ulimit)
* [strace](https://linux.die.net/man/1/strace)
* [POSIX message queues](http://man7.org/linux/man-pages/man7/mq_overview.7.html)

# Interrupts and Interrupt Handling

In the following posts, we will cover interrupts and exception handling in the Linux kernel.

* [Interrupts and Interrupt Handling. Part 1.](interrupts-1.md) - describes interrupts and interrupt handling theory.
* [Interrupts in the Linux Kernel](interrupts-2.md) - describes stuff related to interrupts and exceptions handling from the early stage.
* [Early interrupt handlers](interrupts-3.md) - describes early interrupt handlers.
* [Interrupt handlers](interrupts-4.md) - describes the first non-early interrupt handlers.
* [Implementation of exception handlers](interrupts-5.md) - describes the implementation of some exception handlers such as double fault, divide by zero, etc.
* [Handling non-maskable interrupts](interrupts-6.md) - describes handling of non-maskable interrupts and the remaining interrupt handlers from the architecture-specific part.
* [External hardware interrupts](interrupts-7.md) - describes early initialization of code which is related to handling external hardware interrupts.
* [Non-early initialization of the IRQs](interrupts-8.md) - describes non-early initialization of code which is related to handling external hardware interrupts.
* [Softirq, Tasklets and Workqueues](interrupts-9.md) - describes the softirq, tasklet and workqueue concepts.
* [Last part](interrupts-10.md) - this is the last part of the `Interrupts and Interrupt Handling` chapter, where we will look at a real hardware driver and some interrupt-related stuff.