commit
b6a2853f2c
@ -0,0 +1,5 @@
|
||||
# Cgroups
|
||||
|
||||
This chapter describes the `control groups` mechanism in the Linux kernel.
|
||||
|
||||
* [Introduction](http://0xax.gitbooks.io/linux-insides/content/Cgroups/cgroups1.html)
|
@ -0,0 +1,449 @@
|
||||
Control Groups
|
||||
================================================================================
|
||||
|
||||
Introduction
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
This is the first part of a new chapter of the [linux insides](http://0xax.gitbooks.io/linux-insides/content/) book and, as you may guess from the part's name, it covers the [control groups](https://en.wikipedia.org/wiki/Cgroups) or `cgroups` mechanism in the Linux kernel.
|
||||
|
||||
`Cgroups` are a special mechanism provided by the Linux kernel which allows us to allocate `resources` such as processor time, number of processes per group, amount of memory per control group, or a combination of such resources, to a process or a set of processes. `Cgroups` are organized hierarchically, and in this respect the mechanism is similar to ordinary processes: they are hierarchical too, and child `cgroups` inherit a certain set of parameters from their parents. But they are not the same. The main difference between `cgroups` and ordinary processes is that many different hierarchies of control groups may exist simultaneously, while the process tree is always single. This is not accidental, because each control group hierarchy is attached to a set of control group `subsystems`.
|
||||
|
||||
One `control group subsystem` represents one kind of resource, such as processor time or the number of [pids](https://en.wikipedia.org/wiki/Process_identifier) (in other words, the number of processes) for a `control group`. The Linux kernel provides support for the following twelve `control group subsystems`:
|
||||
|
||||
* `cpuset` - assigns individual processor(s) and memory nodes to task(s) in a group;
* `cpu` - uses the scheduler to provide cgroup tasks access to the processor resources;
* `cpuacct` - generates reports about processor usage by a group;
* `io` - sets limits on reads from and writes to [block devices](https://en.wikipedia.org/wiki/Device_file) (this subsystem appears as `blkio` in the listings below);
* `memory` - sets a limit on memory usage by task(s) in a group;
* `devices` - allows or denies access to devices by task(s) in a group;
* `freezer` - allows suspending/resuming task(s) in a group;
* `net_cls` - allows marking network packets from task(s) in a group;
* `net_prio` - provides a way to dynamically set the priority of network traffic per network interface for a group;
* `perf_event` - provides access to [perf events](https://en.wikipedia.org/wiki/Perf_(Linux)) to a group;
* `hugetlb` - activates support for [huge pages](https://www.kernel.org/doc/Documentation/vm/hugetlbpage.txt) for a group;
* `pids` - sets a limit on the number of processes in a group.
|
||||
|
||||
Each of these control group subsystems depends on a related configuration option. For example, the `cpuset` subsystem is enabled via the `CONFIG_CPUSETS` kernel configuration option, the `io` subsystem via the `CONFIG_BLK_CGROUP` kernel configuration option, and so on. All of these kernel configuration options may be found in the `General setup → Control Group support` menu:
|
||||
|
||||
![menuconfig](http://oi66.tinypic.com/2rc2a9e.jpg)
|
||||
|
||||
You can see the control groups enabled on your computer via the [proc](https://en.wikipedia.org/wiki/Procfs) filesystem:
|
||||
|
||||
```
|
||||
$ cat /proc/cgroups
|
||||
#subsys_name hierarchy num_cgroups enabled
|
||||
cpuset 8 1 1
|
||||
cpu 7 66 1
|
||||
cpuacct 7 66 1
|
||||
blkio 11 66 1
|
||||
memory 9 94 1
|
||||
devices 6 66 1
|
||||
freezer 2 1 1
|
||||
net_cls 4 1 1
|
||||
perf_event 3 1 1
|
||||
net_prio 4 1 1
|
||||
hugetlb 10 1 1
|
||||
pids 5 69 1
|
||||
```
|
||||
|
||||
or via [sysfs](https://en.wikipedia.org/wiki/Sysfs):
|
||||
|
||||
```
|
||||
$ ls -l /sys/fs/cgroup/
|
||||
total 0
|
||||
dr-xr-xr-x 5 root root 0 Dec 2 22:37 blkio
|
||||
lrwxrwxrwx 1 root root 11 Dec 2 22:37 cpu -> cpu,cpuacct
|
||||
lrwxrwxrwx 1 root root 11 Dec 2 22:37 cpuacct -> cpu,cpuacct
|
||||
dr-xr-xr-x 5 root root 0 Dec 2 22:37 cpu,cpuacct
|
||||
dr-xr-xr-x 2 root root 0 Dec 2 22:37 cpuset
|
||||
dr-xr-xr-x 5 root root 0 Dec 2 22:37 devices
|
||||
dr-xr-xr-x 2 root root 0 Dec 2 22:37 freezer
|
||||
dr-xr-xr-x 2 root root 0 Dec 2 22:37 hugetlb
|
||||
dr-xr-xr-x 5 root root 0 Dec 2 22:37 memory
|
||||
lrwxrwxrwx 1 root root 16 Dec 2 22:37 net_cls -> net_cls,net_prio
|
||||
dr-xr-xr-x 2 root root 0 Dec 2 22:37 net_cls,net_prio
|
||||
lrwxrwxrwx 1 root root 16 Dec 2 22:37 net_prio -> net_cls,net_prio
|
||||
dr-xr-xr-x 2 root root 0 Dec 2 22:37 perf_event
|
||||
dr-xr-xr-x 5 root root 0 Dec 2 22:37 pids
|
||||
dr-xr-xr-x 5 root root 0 Dec 2 22:37 systemd
|
||||
```
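The `/proc/cgroups` table shown above has a simple format: a header line starting with `#`, then one row per subsystem with its name, hierarchy ID, number of cgroups and an enabled flag. As a minimal sketch of how a userspace tool might read it, consider the following helpers. The `cgroup_row` structure and both function names are made up for this illustration; they are not part of the kernel or of `libcgroup`:

```C
#include <stdio.h>

/* One row of /proc/cgroups: subsystem name, hierarchy ID,
 * number of cgroups in that hierarchy, and the enabled flag.
 * (Hypothetical type, used only in this sketch.) */
struct cgroup_row {
    char name[32];
    int hierarchy;
    int num_cgroups;
    int enabled;
};

/* Parse a single line of /proc/cgroups into `row`.
 * Returns 1 on success, 0 for the '#' header line or malformed input. */
int parse_cgroups_line(const char *line, struct cgroup_row *row)
{
    if (line[0] == '#')
        return 0;
    return sscanf(line, "%31s %d %d %d", row->name, &row->hierarchy,
                  &row->num_cgroups, &row->enabled) == 4;
}

/* Count enabled subsystems listed in the file at `path`
 * (normally "/proc/cgroups"). Returns -1 if the file cannot be opened. */
int count_enabled_subsystems(const char *path)
{
    char line[256];
    struct cgroup_row row;
    int count = 0;
    FILE *f = fopen(path, "r");

    if (!f)
        return -1;
    while (fgets(line, sizeof(line), f))
        if (parse_cgroups_line(line, &row) && row.enabled)
            count++;
    fclose(f);
    return count;
}
```

Calling `count_enabled_subsystems("/proc/cgroups")` on the machine from the listing above should report all twelve subsystems, since every row has `1` in the last column.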
|
||||
|
||||
As you may already guess, the `control groups` mechanism was not invented only for the needs of the Linux kernel itself, but mostly for userspace needs. To use a `control group`, we have to create it first. We may create a `cgroup` in two ways.
|
||||
|
||||
The first way is to create a subdirectory in any subsystem under `/sys/fs/cgroup` and add the pid of a task to the `tasks` file, which is created automatically right after we create the subdirectory.
|
||||
|
||||
The second way is to create/destroy/manage `cgroups` with the utilities from the `libcgroup` library (`libcgroup-tools` in Fedora).
|
||||
|
||||
Let's consider a simple example. The following [bash](https://www.gnu.org/software/bash/) script prints a line to the `/dev/tty` device, which represents the controlling terminal of the current process:
|
||||
|
||||
```shell
|
||||
#!/bin/bash
|
||||
|
||||
while :
|
||||
do
|
||||
echo "print line" > /dev/tty
|
||||
sleep 5
|
||||
done
|
||||
```
|
||||
|
||||
So, if we run this script, we will see the following result:
|
||||
|
||||
```
|
||||
$ sudo chmod +x cgroup_test_script.sh
|
||||
~$ ./cgroup_test_script.sh
|
||||
print line
|
||||
print line
|
||||
print line
|
||||
...
|
||||
...
|
||||
...
|
||||
```
|
||||
|
||||
Now let's go to the place where `cgroupfs` is mounted on our computer. As we just saw, this is the `/sys/fs/cgroup` directory, but you may mount it anywhere you want.
|
||||
|
||||
```
|
||||
$ cd /sys/fs/cgroup
|
||||
```
|
||||
|
||||
And now let's go to the `devices` subdirectory, which represents the kind of resource that allows or denies access to devices by tasks in a `cgroup`:
|
||||
|
||||
```
|
||||
# cd devices
|
||||
```
|
||||
|
||||
and create the `cgroup_test_group` directory there:
|
||||
|
||||
```
|
||||
# mkdir cgroup_test_group
|
||||
```
|
||||
|
||||
After the `cgroup_test_group` directory is created, the following files will be generated there:
|
||||
|
||||
```
|
||||
/sys/fs/cgroup/devices/cgroup_test_group$ ls -l
|
||||
total 0
|
||||
-rw-r--r-- 1 root root 0 Dec 3 22:55 cgroup.clone_children
|
||||
-rw-r--r-- 1 root root 0 Dec 3 22:55 cgroup.procs
|
||||
--w------- 1 root root 0 Dec 3 22:55 devices.allow
|
||||
--w------- 1 root root 0 Dec 3 22:55 devices.deny
|
||||
-r--r--r-- 1 root root 0 Dec 3 22:55 devices.list
|
||||
-rw-r--r-- 1 root root 0 Dec 3 22:55 notify_on_release
|
||||
-rw-r--r-- 1 root root 0 Dec 3 22:55 tasks
|
||||
```
|
||||
|
||||
For now we are interested in the `tasks` and `devices.deny` files. The first, `tasks`, should contain the pid(s) of processes which will be attached to the `cgroup_test_group`. The second, `devices.deny`, accepts rules that describe denied devices. By default a newly created group has no limits on device access. To forbid a device (in our case `/dev/tty`), we should write the following line to `devices.deny`:
|
||||
|
||||
```
|
||||
# echo "c 5:0 w" > devices.deny
|
||||
```
|
||||
|
||||
Let's go through this line step by step. The first letter, `c`, represents the type of the device. In our case `/dev/tty` is a `char device`. We can verify this from the output of the `ls` command:
|
||||
|
||||
```
|
||||
~$ ls -l /dev/tty
|
||||
crw-rw-rw- 1 root tty 5, 0 Dec 3 22:48 /dev/tty
|
||||
```
|
||||
|
||||
see the first `c` letter in the permissions list. The second part, `5:0`, is the major and minor numbers of the device. You can see these numbers in the output of `ls` too. And the last letter, `w`, forbids tasks to write to the specified device. So let's start the `cgroup_test_script.sh` script:
|
||||
|
||||
```
|
||||
~$ ./cgroup_test_script.sh
|
||||
print line
|
||||
print line
|
||||
print line
|
||||
...
|
||||
...
|
||||
```
|
||||
|
||||
and add the pid of this process to the `tasks` file of our group in the `devices` hierarchy:
|
||||
|
||||
```
|
||||
# echo $(pidof -x cgroup_test_script.sh) > /sys/fs/cgroup/devices/cgroup_test_group/tasks
|
||||
```
|
||||
|
||||
The result of this action will be as expected:
|
||||
|
||||
```
|
||||
~$ ./cgroup_test_script.sh
|
||||
print line
|
||||
print line
|
||||
print line
|
||||
print line
|
||||
print line
|
||||
print line
|
||||
./cgroup_test_script.sh: line 5: /dev/tty: Operation not permitted
|
||||
```
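The rule we wrote to `devices.deny` has the form `type major:minor access`, where the type is `c` (char), `b` (block) or `a` (all) and the access string is a combination of `r` (read), `w` (write) and `m` (mknod). The following is only a small sketch of how such a rule string could be built in a program; the `format_device_rule` helper is a name invented for this example:

```C
#include <stdio.h>

/* Format a rule for devices.allow / devices.deny:
 *   <type> <major>:<minor> <access>
 * type   - 'c' (char), 'b' (block) or 'a' (all)
 * access - any combination of 'r' (read), 'w' (write), 'm' (mknod)
 * Returns the number of characters written, or -1 if buf is too small.
 * (Hypothetical helper, used only in this sketch.) */
int format_device_rule(char *buf, size_t size, char type,
                       unsigned int maj, unsigned int min,
                       const char *access)
{
    int n = snprintf(buf, size, "%c %u:%u %s", type, maj, min, access);

    return (n < 0 || (size_t)n >= size) ? -1 : n;
}
```

With the arguments `'c'`, `5`, `0` and `"w"` this produces exactly the `c 5:0 w` string used above. A real tool would then write that string to the `devices.deny` file of the group, which requires the same privileges as the `echo` command did.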
|
||||
|
||||
A similar situation occurs when you run [docker](https://en.wikipedia.org/wiki/Docker_(software)) containers, for example:
|
||||
|
||||
```
|
||||
~$ docker ps
|
||||
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
|
||||
fa2d2085cd1c mariadb:10 "docker-entrypoint..." 12 days ago Up 4 minutes 0.0.0.0:3306->3306/tcp mysql-work
|
||||
|
||||
~$ cat /sys/fs/cgroup/devices/docker/fa2d2085cd1c8d797002c77387d2061f56fefb470892f140d0dc511bd4d9bb61/tasks | head -3
|
||||
5501
|
||||
5584
|
||||
5585
|
||||
...
|
||||
...
|
||||
...
|
||||
```
|
||||
|
||||
So, during startup of a `docker` container, `docker` will create a `cgroup` for processes in this container:
|
||||
|
||||
```
|
||||
$ docker exec -it mysql-work /bin/bash
|
||||
$ top
|
||||
PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
  1 mysql     20   0  963996 101268  15744 S   0.0  0.6   0:00.46 mysqld
 71 root      20   0   20248   3028   2732 S   0.0  0.0   0:00.01 bash
 77 root      20   0   21948   2424   2056 R   0.0  0.0   0:00.00 top
|
||||
```
|
||||
|
||||
And we may see this `cgroup` on the host machine:
|
||||
|
||||
```
|
||||
$ systemd-cgls
|
||||
|
||||
Control group /:
|
||||
-.slice
|
||||
├─docker
|
||||
│ └─fa2d2085cd1c8d797002c77387d2061f56fefb470892f140d0dc511bd4d9bb61
|
||||
│ ├─5501 mysqld
|
||||
│ └─6404 /bin/bash
|
||||
```
|
||||
|
||||
Now we know a little about the `control groups` mechanism, how to use it manually, and what its purpose is. It is time to look inside the Linux kernel source code and dive into the implementation of this mechanism.
|
||||
|
||||
Early initialization of control groups
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
Now that we have seen a little theory about the `control groups` mechanism of the Linux kernel, we may start to dive into the source code of the Linux kernel to get acquainted with this mechanism more closely. As always, we will start from the initialization of `control groups`. Initialization of `cgroups` is divided into two parts in the Linux kernel: early and late. In this part we will consider only the `early` part; the `late` part will be considered in the next parts.
|
||||
|
||||
Early initialization of `cgroups` starts from the call of the:
|
||||
|
||||
```C
|
||||
cgroup_init_early();
|
||||
```
|
||||
|
||||
function in [init/main.c](https://github.com/torvalds/linux/blob/master/init/main.c) during early initialization of the Linux kernel. This function is defined in the [kernel/cgroup.c](https://github.com/torvalds/linux/blob/master/kernel/cgroup.c) source code file and starts with the definition of the following two local variables:
|
||||
|
||||
```C
|
||||
int __init cgroup_init_early(void)
|
||||
{
|
||||
static struct cgroup_sb_opts __initdata opts;
|
||||
struct cgroup_subsys *ss;
|
||||
...
|
||||
...
|
||||
...
|
||||
}
|
||||
```
|
||||
|
||||
The `cgroup_sb_opts` structure is defined in the same source code file and looks like this:
|
||||
|
||||
```C
|
||||
struct cgroup_sb_opts {
|
||||
u16 subsys_mask;
|
||||
unsigned int flags;
|
||||
char *release_agent;
|
||||
bool cpuset_clone_children;
|
||||
char *name;
|
||||
bool none;
|
||||
};
|
||||
```
|
||||
|
||||
which represents the mount options of `cgroupfs`. For example, we may create a named cgroup hierarchy (with the name `my_cgrp`) with the `name=` option and without any subsystems:
|
||||
|
||||
```
|
||||
$ mount -t cgroup -oname=my_cgrp,none /mnt/cgroups
|
||||
```
|
||||
|
||||
The second variable, `ss`, is a pointer to the `cgroup_subsys` structure, which is defined in the [include/linux/cgroup-defs.h](https://github.com/torvalds/linux/blob/master/include/linux/cgroup-defs.h) header file and, as you may guess from the name of the type, represents a `cgroup` subsystem. This structure contains various fields and callback functions, such as:
|
||||
|
||||
```C
|
||||
struct cgroup_subsys {
|
||||
int (*css_online)(struct cgroup_subsys_state *css);
|
||||
void (*css_offline)(struct cgroup_subsys_state *css);
|
||||
...
|
||||
...
|
||||
...
|
||||
bool early_init:1;
|
||||
int id;
|
||||
const char *name;
|
||||
struct cgroup_root *root;
|
||||
...
|
||||
...
|
||||
...
|
||||
};
|
||||
```
|
||||
|
||||
Here, for example, the `css_online` and `css_offline` callbacks are called after a cgroup has successfully completed all allocations and before a cgroup is released, respectively. The `early_init` flag marks subsystems which may/should be initialized early. The `id` and `name` fields represent the unique identifier of a subsystem in the array of registered subsystems and the name of the subsystem, respectively. The last field, `root`, is a pointer to the root of a cgroup hierarchy.
|
||||
|
||||
Of course, the `cgroup_subsys` structure is bigger and has other fields, but this is enough for now. Now that we know the important structures related to the `cgroups` mechanism, let's return to the `cgroup_init_early` function. The main purpose of this function is to do early initialization of some subsystems. As you may already guess, these `early` subsystems should have `cgroup_subsys->early_init = 1`. Let's look at which subsystems may be initialized early.
|
||||
|
||||
After the definition of the two local variables we may see the following lines of code:
|
||||
|
||||
```C
|
||||
init_cgroup_root(&cgrp_dfl_root, &opts);
|
||||
cgrp_dfl_root.cgrp.self.flags |= CSS_NO_REF;
|
||||
```
|
||||
|
||||
Here we see the call of the `init_cgroup_root` function, which initializes the default unified hierarchy. After this we set the `CSS_NO_REF` flag in the state of this default `cgroup` to disable reference counting for this css. The `cgrp_dfl_root` is defined in the same source code file:
|
||||
|
||||
```C
|
||||
struct cgroup_root cgrp_dfl_root;
|
||||
```
|
||||
|
||||
Its `cgrp` field is a `cgroup` structure which, as you may already guess, represents a `cgroup` and is defined in the [include/linux/cgroup-defs.h](https://github.com/torvalds/linux/blob/master/include/linux/cgroup-defs.h) header file. We already know that a process is represented by the `task_struct` in the Linux kernel. The `task_struct` does not contain a direct link to the `cgroup` this task is attached to, but the `cgroup` may be reached via the `cgroups` field of the `task_struct`, which points to a `css_set` structure. The `css_set` structure holds an array of pointers to subsystem states:
|
||||
|
||||
```C
|
||||
struct css_set {
|
||||
...
|
||||
...
|
||||
....
|
||||
struct cgroup_subsys_state *subsys[CGROUP_SUBSYS_COUNT];
|
||||
...
|
||||
...
|
||||
...
|
||||
};
|
||||
```
|
||||
|
||||
And via the `cgroup_subsys_state`, we may get the `cgroup` that this process is attached to:
|
||||
|
||||
```C
|
||||
struct cgroup_subsys_state {
|
||||
...
|
||||
...
|
||||
...
|
||||
struct cgroup *cgroup;
|
||||
...
|
||||
...
|
||||
...
|
||||
};
|
||||
```
|
||||
|
||||
So, the overall picture of the `cgroups`-related data structures is the following:
|
||||
|
||||
```
|
||||
+-------------+         +---------------------+    +------------->+---------------------+          +----------------+
| task_struct |         |       css_set       |    |              | cgroup_subsys_state |          |     cgroup     |
+-------------+         |                     |    |              +---------------------+          +----------------+
|             |         |                     |    |              |                     |          |     flags      |
|             |         |                     |    |              +---------------------+          |  cgroup.procs  |
|             |         |                     |    |              |        cgroup       |--------->|       id       |
|             |         |                     |    |              +---------------------+          |      ....      |
|-------------+         |---------------------+----+                                               +----------------+
|   cgroups   | ------> | cgroup_subsys_state | array of cgroup_subsys_state
|-------------+         +---------------------+------------------>+---------------------+          +----------------+
|             |         |                     |                   | cgroup_subsys_state |          |     cgroup     |
+-------------+         +---------------------+                   +---------------------+          +----------------+
                                                                  |                     |          |     flags      |
                                                                  +---------------------+          |  cgroup.procs  |
                                                                  |        cgroup       |--------->|       id       |
                                                                  +---------------------+          |      ....      |
                                                                  |    cgroup_subsys    |          +----------------+
                                                                  +---------------------+
                                                                             |
                                                                             |
                                                                             ↓
                                                                  +---------------------+
                                                                  |    cgroup_subsys    |
                                                                  +---------------------+
                                                                  |         id          |
                                                                  |        name         |
                                                                  |      css_online     |
                                                                  |     css_offline     |
                                                                  |        attach       |
                                                                  |         ....        |
                                                                  +---------------------+
|
||||
```
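To make the chain of pointers from the picture more concrete, here is a simplified userspace model of it. This is only a sketch and not kernel code: the structures keep just the fields needed to follow the links, the two subsystem ids are assumed values, and `model_task_cgroup` is a name invented for this example:

```C
#include <stddef.h>

/* Simplified userspace stand-ins for the kernel structures; only the
 * fields needed to follow the pointers are kept. The subsystem ids and
 * SUBSYS_COUNT are hypothetical values for this sketch. */
enum { CPUSET_ID, CPU_ID, SUBSYS_COUNT };

struct cgroup {
    int id;
};

struct cgroup_subsys_state {
    struct cgroup *cgroup;      /* the cgroup this state belongs to */
};

struct css_set {
    struct cgroup_subsys_state *subsys[SUBSYS_COUNT];
};

struct task_struct {
    struct css_set *cgroups;    /* set of subsystem states of the task */
};

/* Follow task -> css_set -> cgroup_subsys_state -> cgroup for one subsystem.
 * Returns NULL if any link is missing or the id is out of range. */
struct cgroup *model_task_cgroup(const struct task_struct *task, int subsys_id)
{
    struct cgroup_subsys_state *css;

    if (!task || !task->cgroups || subsys_id < 0 || subsys_id >= SUBSYS_COUNT)
        return NULL;
    css = task->cgroups->subsys[subsys_id];
    return css ? css->cgroup : NULL;
}
```

The point of the indirection is visible even in this toy version: a task stores a single pointer, and the per-subsystem states behind it decide which `cgroup` the task belongs to in each hierarchy.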
|
||||
|
||||
|
||||
|
||||
So, `init_cgroup_root` fills `cgrp_dfl_root` with the default values. The next step is assigning the initial `css_set` to `init_task`, which represents the first process in the system:
|
||||
|
||||
```C
|
||||
RCU_INIT_POINTER(init_task.cgroups, &init_css_set);
|
||||
```
|
||||
|
||||
And the last big thing in the `cgroup_init_early` function is the initialization of `early cgroups`. Here we go over all registered subsystems, assign a unique identity number and the name of the subsystem, and call the `cgroup_init_subsys` function for subsystems which are marked as early:
|
||||
|
||||
```C
|
||||
for_each_subsys(ss, i) {
|
||||
ss->id = i;
|
||||
ss->name = cgroup_subsys_name[i];
|
||||
|
||||
if (ss->early_init)
|
||||
cgroup_init_subsys(ss, true);
|
||||
}
|
||||
```
|
||||
|
||||
The `for_each_subsys` here is a macro which is defined in the [kernel/cgroup.c](https://github.com/torvalds/linux/blob/master/kernel/cgroup.c) source code file and just expands to a `for` loop over the `cgroup_subsys` array. The definition of this array may be found in the same source code file, and it looks a little unusual:
|
||||
|
||||
```C
|
||||
#define SUBSYS(_x) [_x ## _cgrp_id] = &_x ## _cgrp_subsys,
|
||||
static struct cgroup_subsys *cgroup_subsys[] = {
|
||||
#include <linux/cgroup_subsys.h>
|
||||
};
|
||||
#undef SUBSYS
|
||||
```
|
||||
|
||||
Here the `SUBSYS` macro takes one argument (the name of a subsystem) and expands to one element of the `cgroup_subsys` array of cgroup subsystems. Additionally, we may see that the array is initialized with the content of the [linux/cgroup_subsys.h](https://github.com/torvalds/linux/blob/master/include/linux/cgroup_subsys.h) header file. If we look inside this header file, we will again see a set of `SUBSYS` macro invocations with the given subsystem names:
|
||||
|
||||
```C
|
||||
#if IS_ENABLED(CONFIG_CPUSETS)
|
||||
SUBSYS(cpuset)
|
||||
#endif
|
||||
|
||||
#if IS_ENABLED(CONFIG_CGROUP_SCHED)
|
||||
SUBSYS(cpu)
|
||||
#endif
|
||||
...
|
||||
...
|
||||
...
|
||||
```
|
||||
|
||||
This works because of the `#undef` statement after each definition of the `SUBSYS` macro, which lets the same header be included again with a different definition. Look at the `&_x ## _cgrp_subsys` expression. The `##` operator concatenates its left and right operands in a `C` macro. So, as we passed `cpuset`, `cpu`, etc., to the `SUBSYS` macro, `cpuset_cgrp_subsys` and `cpu_cgrp_subsys` should be defined somewhere. And that's true. If you look in the [kernel/cpuset.c](https://github.com/torvalds/linux/blob/master/kernel/cpuset.c) source code file, you will see this definition:
|
||||
|
||||
```C
|
||||
struct cgroup_subsys cpuset_cgrp_subsys = {
|
||||
...
|
||||
...
|
||||
...
|
||||
.early_init = true,
|
||||
};
|
||||
```
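The trick of redefining `SUBSYS` around the same list is easier to see in a small standalone sketch. The kernel keeps the list in a header that is included several times; in the example below a `SUBSYS_LIST` macro plays that role so everything fits in one file. The `demo_` names and `SUBSYS_LIST` itself are assumptions of this illustration, not kernel identifiers:

```C
/* Stand-in for <linux/cgroup_subsys.h>: a list of SUBSYS() invocations.
 * In the kernel the list lives in a header that is included several times;
 * here a list macro plays the same role so the sketch fits in one file. */
#define SUBSYS_LIST \
    SUBSYS(cpuset)  \
    SUBSYS(cpu)     \
    SUBSYS(cpuacct)

/* First expansion: one enum constant per subsystem, e.g. cpuset_cgrp_id. */
#define SUBSYS(_x) _x ## _cgrp_id,
enum demo_subsys_id {
    SUBSYS_LIST
    DEMO_SUBSYS_COUNT
};
#undef SUBSYS

/* Second expansion: an array of names indexed by those constants.
 * The '#' operator turns the macro argument into a string literal. */
#define SUBSYS(_x) [_x ## _cgrp_id] = #_x,
const char *demo_subsys_name[] = {
    SUBSYS_LIST
};
#undef SUBSYS
```

Adding one more `SUBSYS(...)` line to the list extends both the enum and the name table at once, which is why the identifiers and the array entries cannot drift apart.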
|
||||
|
||||
So the last step in the `cgroup_init_early` function is the initialization of early subsystems with the call of the `cgroup_init_subsys` function. The following early subsystems will be initialized:
|
||||
|
||||
* `cpuset`;
|
||||
* `cpu`;
|
||||
* `cpuacct`.
|
||||
|
||||
The `cgroup_init_subsys` function initializes the given subsystem with the default values. For example, it sets the root of the hierarchy, allocates space for the given subsystem with the call of the `css_alloc` callback function, links the subsystem with a parent if it exists, adds the allocated subsystem to the initial process, and so on.
|
||||
|
||||
That's all. From this moment early subsystems are initialized.
|
||||
|
||||
Conclusion
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
This is the end of the first part, which gives an introduction to the `control groups` mechanism in the Linux kernel. We covered some theory and the first steps of initialization of things related to the `control groups` mechanism. In the next part we will continue to dive into the more practical aspects of `control groups`.
|
||||
|
||||
If you have any questions or suggestions, write me a comment or ping me on [twitter](https://twitter.com/0xAX).
|
||||
|
||||
**Please note that English is not my first language, and I am really sorry for any inconvenience. If you find any mistakes, please send me a PR to [linux-insides](https://github.com/0xAX/linux-insides).**
|
||||
|
||||
Links
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
* [control groups](https://en.wikipedia.org/wiki/Cgroups)
|
||||
* [PID](https://en.wikipedia.org/wiki/Process_identifier)
|
||||
* [cpuset](http://man7.org/linux/man-pages/man7/cpuset.7.html)
|
||||
* [block devices](https://en.wikipedia.org/wiki/Device_file)
|
||||
* [huge pages](https://www.kernel.org/doc/Documentation/vm/hugetlbpage.txt)
|
||||
* [sysfs](https://en.wikipedia.org/wiki/Sysfs)
|
||||
* [proc](https://en.wikipedia.org/wiki/Procfs)
|
||||
* [cgroups kernel documentation](https://www.kernel.org/doc/Documentation/cgroup-v1/cgroups.txt)
|
||||
* [cgroups v2](https://www.kernel.org/doc/Documentation/cgroup-v2.txt)
|
||||
* [bash](https://www.gnu.org/software/bash/)
|
||||
* [docker](https://en.wikipedia.org/wiki/Docker_(software))
|
||||
* [perf events](https://en.wikipedia.org/wiki/Perf_(Linux))
|
||||
* [Previous chapter](https://0xax.gitbooks.io/linux-insides/content/MM/linux-mm-1.html)
|
@ -0,0 +1,7 @@
|
||||
# Internal `system` structures of the Linux kernel
|
||||
|
||||
This is not a usual chapter of `linux-insides`. As you may understand from the title, it mostly describes internal `system` structures of the Linux kernel, such as the `Interrupt Descriptor Table`, the `Global Descriptor Table` and many more.
|
||||
|
||||
Most of the information is taken from the official [Intel](http://www.intel.com/content/www/us/en/processors/architectures-software-developer-manuals.html) and [AMD](http://developer.amd.com/resources/developer-guides-manuals/) manuals.
|
@ -0,0 +1,190 @@
|
||||
Interrupt-descriptor table (IDT)
|
||||
================================================================================
|
||||
|
||||
There are three general sources of interrupts and exceptions:
|
||||
|
||||
* Exceptions - sync;
|
||||
* Software interrupts - sync;
|
||||
* External interrupts - async.
|
||||
|
||||
Types of Exceptions:
|
||||
|
||||
* Faults - are precise exceptions reported on the boundary `before` the instruction causing the exception. The saved `%rip` points to the faulting instruction;
|
||||
* Traps - are precise exceptions reported on the boundary `following` the instruction causing the exception. The saved `%rip` points to the instruction that follows the one which caused the trap;
|
||||
* Aborts - are imprecise exceptions. Because they are imprecise, aborts typically do not allow reliable program restart.
|
||||
|
||||
`Maskable` interrupts trigger the interrupt-handling mechanism only when RFLAGS.IF=1. Otherwise they are held pending for as long as the RFLAGS.IF bit is cleared to 0.
|
||||
|
||||
`Nonmaskable` interrupts (NMI) are unaffected by the value of the RFLAGS.IF bit. However, the occurrence of an NMI masks further NMIs until an IRET instruction is executed.
|
||||
|
||||
Specific exception and interrupt sources are assigned a fixed vector-identification number (also called an “interrupt vector” or simply “vector”). The interrupt vector is used by the interrupt-handling mechanism to locate the system-software service routine assigned to the exception or interrupt. Up to
|
||||
256 unique interrupt vectors are available. The first 32 vectors are reserved for predefined exception and interrupt conditions. They are defined in the [arch/x86/include/asm/traps.h](http://lxr.free-electrons.com/source/arch/x86/include/asm/traps.h#L121) header file:
|
||||
|
||||
```
|
||||
/* Interrupts/Exceptions */
|
||||
enum {
|
||||
X86_TRAP_DE = 0, /* 0, Divide-by-zero */
|
||||
X86_TRAP_DB, /* 1, Debug */
|
||||
X86_TRAP_NMI, /* 2, Non-maskable Interrupt */
|
||||
X86_TRAP_BP, /* 3, Breakpoint */
|
||||
X86_TRAP_OF, /* 4, Overflow */
|
||||
X86_TRAP_BR, /* 5, Bound Range Exceeded */
|
||||
X86_TRAP_UD, /* 6, Invalid Opcode */
|
||||
X86_TRAP_NM, /* 7, Device Not Available */
|
||||
X86_TRAP_DF, /* 8, Double Fault */
|
||||
X86_TRAP_OLD_MF, /* 9, Coprocessor Segment Overrun */
|
||||
X86_TRAP_TS, /* 10, Invalid TSS */
|
||||
X86_TRAP_NP, /* 11, Segment Not Present */
|
||||
X86_TRAP_SS, /* 12, Stack Segment Fault */
|
||||
X86_TRAP_GP, /* 13, General Protection Fault */
|
||||
X86_TRAP_PF, /* 14, Page Fault */
|
||||
X86_TRAP_SPURIOUS, /* 15, Spurious Interrupt */
|
||||
X86_TRAP_MF, /* 16, x87 Floating-Point Exception */
|
||||
X86_TRAP_AC, /* 17, Alignment Check */
|
||||
X86_TRAP_MC, /* 18, Machine Check */
|
||||
X86_TRAP_XF, /* 19, SIMD Floating-Point Exception */
|
||||
X86_TRAP_IRET = 32, /* 32, IRET Exception */
|
||||
};
|
||||
```
|
||||
|
||||
Error Codes
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
The processor exception-handling mechanism reports error and status information for some exceptions using an error code. The error code is pushed onto the stack by the exception-mechanism during the control transfer into the exception handler. The error code has two formats:
|
||||
|
||||
* the selector format, which is used by most error-reporting exceptions;
|
||||
* page fault format.
|
||||
|
||||
Here is the format of the selector error code:
|
||||
|
||||
```
|
||||
31 16 15 3 2 1 0
|
||||
+-------------------------------------------------------------------------------+
|
||||
| | | T | I | E |
|
||||
| Reserved | Selector Index | - | D | X |
|
||||
| | | I | T | T |
|
||||
+-------------------------------------------------------------------------------+
|
||||
```
|
||||
|
||||
Where:
|
||||
|
||||
* `EXT` - If this bit is set to 1, the exception source is external to the processor. If cleared to 0, the exception source is internal to the processor;
|
||||
* `IDT` - If this bit is set to 1, the error-code selector-index field references a gate descriptor located in the `interrupt-descriptor table`. If cleared to 0, the selector-index field references a descriptor in either the `global-descriptor table` or local-descriptor table `LDT`, as indicated by the `TI` bit;
|
||||
* `TI` - If this bit is set to 1, the error-code selector-index field references a descriptor in the `LDT`. If cleared to 0, the selector-index field references a descriptor in the `GDT`.
|
||||
* `Selector Index` - The selector-index field specifies the index into either the `GDT`, `LDT`, or `IDT`, as specified by the `IDT` and `TI` bits.
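
The bit positions listed above translate directly into shifts and masks. As a minimal sketch (the `selector_error` type and the `decode_selector_error` function are names made up for this example, not taken from the kernel), a selector-format error code could be decoded like this:

```C
#include <stdint.h>

/* Decoded fields of a selector-format error code.
 * (Hypothetical type, used only in this sketch.) */
struct selector_error {
    unsigned ext;    /* bit 0: source is external to the processor */
    unsigned idt;    /* bit 1: index refers to a gate in the IDT */
    unsigned ti;     /* bit 2: 1 = LDT, 0 = GDT (meaningful if idt == 0) */
    unsigned index;  /* bits 3..15: selector index */
};

/* Split the low 16 bits of an error code into its fields. */
struct selector_error decode_selector_error(uint32_t code)
{
    struct selector_error e;

    e.ext   = code & 1u;
    e.idt   = (code >> 1) & 1u;
    e.ti    = (code >> 2) & 1u;
    e.index = (code >> 3) & 0x1fffu;
    return e;
}
```

For example, an error code with only the `IDT` bit set and the value 13 in the index field describes a problem with the gate descriptor number 13 of the `IDT`.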
|
||||
|
||||
The page-fault error code format is:
|
||||
|
||||
```
|
||||
31 4 3 2 1 0
|
||||
+-------------------------------------------------------------------------------+
|
||||
| | | R | U | R | - |
|
||||
| Reserved | I/D | S | - | - | P |
|
||||
| | | V | S | W | - |
|
||||
+-------------------------------------------------------------------------------+
|
||||
```
|
||||
|
||||
Where:
|
||||
|
||||
* `I/D` - If this bit is set to 1, it indicates that the access that caused the page fault was an instruction fetch;
|
||||
* `RSV` - If this bit is set to 1, the page fault is a result of the processor reading a 1 from a reserved field within a page-translation-table entry;
|
||||
* `U/S` - If this bit is cleared to 0, an access in supervisor mode (`CPL=0, 1, or 2`) caused the page fault. If this bit is set to 1, an access in user mode (CPL=3) caused the page fault;
|
||||
* `R/W` - If this bit is cleared to 0, the access that caused the page fault is a memory read. If this bit is set to 1, the memory access that caused the page fault was a write;
|
||||
* `P` - If this bit is cleared to 0, the page fault was caused by a not-present page. If this bit is set to 1, the page fault was caused by a page-protection violation.
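
A handler usually tests these bits with masks. The sketch below gives each field a mask and shows one possible check built from them. The `PF_*` names and the helper function are chosen for this illustration only and should not be read as the kernel's own definitions:

```C
#include <stdint.h>

/* Bit masks for the page-fault error code, one per field described above.
 * The PF_* names are chosen for this sketch. */
enum {
    PF_P   = 1u << 0,  /* 0: not-present page, 1: protection violation */
    PF_RW  = 1u << 1,  /* 0: read access,      1: write access */
    PF_US  = 1u << 2,  /* 0: supervisor mode,  1: user mode */
    PF_RSV = 1u << 3,  /* 1: reserved bit set in a page-table entry */
    PF_ID  = 1u << 4   /* 1: instruction fetch */
};

/* Returns 1 if the fault was a user-mode write to a not-present page,
 * for example the first write to a page that is not mapped yet. */
int is_user_write_to_missing_page(uint32_t code)
{
    return (code & (PF_P | PF_RW | PF_US)) == (PF_RW | PF_US);
}
```

An error code of `0x6` (`U/S` and `R/W` set, `P` clear) matches this check, while `0x7` does not, because the set `P` bit means the page was present and the access violated its protection.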
|
||||
|
||||
Interrupt Control Transfers
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
The IDT may contain any of three kinds of gate descriptors:
|
||||
|
||||
* `Task Gate` - contains the segment selector for a TSS for an exception and/or interrupt handler task;
|
||||
* `Interrupt Gate` - contains segment selector and offset that the processor uses to transfer program execution to a handler procedure in an interrupt handler code segment;
|
||||
* `Trap Gate` - contains segment selector and offset that the processor uses to transfer program execution to a handler procedure in an exception handler code segment.
|
||||
|
||||
General format of gates is:
|
||||
|
||||
```
|
||||
127 96
|
||||
+-------------------------------------------------------------------------------+
|
||||
| |
|
||||
| Reserved |
|
||||
| |
|
||||
+-------------------------------------------------------------------------------+
|
||||
95 64
|
||||
+-------------------------------------------------------------------------------+
|
||||
| |
|
||||
| Offset 63..32 |
|
||||
| |
|
||||
+-------------------------------------------------------------------------------+
|
||||
63 48 47 46 44 42 39 34 32
|
||||
+-------------------------------------------------------------------------------+
|
||||
| | | D | | | | | | |
|
||||
| Offset 31..16 | P | P | 0 |Type |0 0 0 | 0 | 0 | IST |
|
||||
| | | L | | | | | | |
|
||||
+-------------------------------------------------------------------------------+
|
||||
31 16 15 0
|
||||
+-------------------------------------------------------------------------------+
|
||||
| | |
|
||||
| Segment Selector | Offset 15..0 |
|
||||
| | |
|
||||
+-------------------------------------------------------------------------------+
|
||||
```
|
||||
|
||||
Where
|
||||
|
||||
* `Selector` - Segment Selector for destination code segment;
|
||||
* `Offset` - Offset to handler procedure entry point;
|
||||
* `DPL` - Descriptor Privilege Level;
|
||||
* `P` - Segment Present flag;
|
||||
* `IST` - Interrupt Stack Table;
|
||||
* `TYPE` - one of: Local descriptor-table (LDT) segment descriptor, Task-state segment (TSS) descriptor, Call-gate descriptor, Interrupt-gate descriptor, Trap-gate descriptor or Task-gate descriptor.
|
||||
|
||||
An `IDT` descriptor is represented by the following structure in the Linux kernel (only for `x86_64`):
|
||||
|
||||
```C
|
||||
struct gate_struct64 {
|
||||
u16 offset_low;
|
||||
u16 segment;
|
||||
unsigned ist : 3, zero0 : 5, type : 5, dpl : 2, p : 1;
|
||||
u16 offset_middle;
|
||||
u32 offset_high;
|
||||
u32 zero1;
|
||||
} __attribute__((packed));
|
||||
```
|
||||
|
||||
which is defined in the [arch/x86/include/asm/desc_defs.h](http://lxr.free-electrons.com/source/arch/x86/include/asm/desc_defs.h#L51) header file.
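
Note that the 64-bit handler address is not stored in one piece: it is split across the `offset_low`, `offset_middle` and `offset_high` fields. The following sketch shows the split and the reassembly on a reduced structure that keeps only those three fields. The `gate_offset_parts` type and both function names are assumptions of this example:

```C
#include <stdint.h>

/* Only the three offset fields of a 64-bit gate descriptor; the real
 * structure also carries the segment selector, IST, type, DPL and P.
 * (Hypothetical type, used only in this sketch.) */
struct gate_offset_parts {
    uint16_t offset_low;     /* handler address bits 15..0  */
    uint16_t offset_middle;  /* handler address bits 31..16 */
    uint32_t offset_high;    /* handler address bits 63..32 */
};

/* Split a 64-bit handler address across the three fields. */
struct gate_offset_parts split_gate_offset(uint64_t addr)
{
    struct gate_offset_parts g;

    g.offset_low    = (uint16_t)(addr & 0xffff);
    g.offset_middle = (uint16_t)((addr >> 16) & 0xffff);
    g.offset_high   = (uint32_t)(addr >> 32);
    return g;
}

/* Reassemble the handler address from the three fields. */
uint64_t join_gate_offset(struct gate_offset_parts g)
{
    return (uint64_t)g.offset_low |
           ((uint64_t)g.offset_middle << 16) |
           ((uint64_t)g.offset_high << 32);
}
```

For an address like `0xffffffff81234567` the three fields hold `0x4567`, `0x8123` and `0xffffffff`, and joining them gives the original address back.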
|
||||
|
||||
A task gate descriptor does not contain the `IST` field, and its format differs from that of interrupt/trap gates:
|
||||
|
||||
```C
|
||||
struct ldttss_desc64 {
|
||||
u16 limit0;
|
||||
u16 base0;
|
||||
unsigned base1 : 8, type : 5, dpl : 2, p : 1;
|
||||
unsigned limit1 : 4, zero0 : 3, g : 1, base2 : 8;
|
||||
u32 base3;
|
||||
u32 zero1;
|
||||
} __attribute__((packed));
|
||||
```
|
||||
|
||||
Exceptions During a Task Switch
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
An exception can occur during a task switch while loading a segment selector. Page faults can also occur when accessing a TSS. In these cases, the hardware task-switch mechanism completes loading the new task state from the TSS, and then triggers the appropriate exception mechanism.
|
||||
|
||||
**In long mode, an exception cannot occur during a task switch, because the hardware task-switch mechanism is disabled.**
|
||||
|
||||
Nonmaskable interrupt
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
**TODO**
|
||||
|
||||
API
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
**TODO**
|
||||
|
||||
Interrupt Stack Table
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
**TODO**
|
@ -0,0 +1,488 @@
|
||||
Program startup process in userspace
|
||||
================================================================================
|
||||
|
||||
Introduction
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
Although [linux-insides](https://www.gitbook.com/book/0xax/linux-insides/details) mostly describes Linux kernel related topics, I have decided to write this one part, which is mostly related to userspace.
|
||||
|
||||
The fourth [part](https://0xax.gitbooks.io/linux-insides/content/SysCall/syscall-4.html) of the [System calls](https://en.wikipedia.org/wiki/System_call) chapter already describes what the Linux kernel does when we want to start a program. In this part I want to explore what happens when we run a program on a Linux machine from the userspace perspective.
|
||||
|
||||
I don't know about you, but at my university I learned that a `C` program starts executing from the function called `main`. And that's partly true. Whenever we start writing a new program, we begin with the following lines of code:
|
||||
|
||||
```C
|
||||
int main(int argc, char *argv[]) {
|
||||
// Entry point is here
|
||||
}
|
||||
```
|
||||
|
||||
But if you are interested in low-level programming, you may already know that the `main` function isn't the actual entry point of a program. You will be convinced of this after looking at the following simple program in a debugger:
|
||||
|
||||
```C
|
||||
int main(int argc, char *argv[]) {
|
||||
return 0;
|
||||
}
|
||||
```
|
||||
|
||||
Let's compile this and run in [gdb](https://www.gnu.org/software/gdb/):
|
||||
|
||||
```
|
||||
$ gcc -ggdb program.c -o program
|
||||
$ gdb ./program
|
||||
The target architecture is assumed to be i386:x86-64:intel
|
||||
Reading symbols from ./program...done.
|
||||
```
|
||||
|
||||
Let's execute the gdb `info` command with the `files` argument. The `info files` command prints information about debugging targets and the memory ranges occupied by different sections.
|
||||
|
||||
```
|
||||
(gdb) info files
|
||||
Symbols from "/home/alex/program".
|
||||
Local exec file:
|
||||
`/home/alex/program', file type elf64-x86-64.
|
||||
Entry point: 0x400430
|
||||
0x0000000000400238 - 0x0000000000400254 is .interp
|
||||
0x0000000000400254 - 0x0000000000400274 is .note.ABI-tag
|
||||
0x0000000000400274 - 0x0000000000400298 is .note.gnu.build-id
|
||||
0x0000000000400298 - 0x00000000004002b4 is .gnu.hash
|
||||
0x00000000004002b8 - 0x0000000000400318 is .dynsym
|
||||
0x0000000000400318 - 0x0000000000400357 is .dynstr
|
||||
0x0000000000400358 - 0x0000000000400360 is .gnu.version
|
||||
0x0000000000400360 - 0x0000000000400380 is .gnu.version_r
|
||||
0x0000000000400380 - 0x0000000000400398 is .rela.dyn
|
||||
0x0000000000400398 - 0x00000000004003c8 is .rela.plt
|
||||
0x00000000004003c8 - 0x00000000004003e2 is .init
|
||||
0x00000000004003f0 - 0x0000000000400420 is .plt
|
||||
0x0000000000400420 - 0x0000000000400428 is .plt.got
|
||||
0x0000000000400430 - 0x00000000004005e2 is .text
|
||||
0x00000000004005e4 - 0x00000000004005ed is .fini
|
||||
0x00000000004005f0 - 0x0000000000400610 is .rodata
|
||||
0x0000000000400610 - 0x0000000000400644 is .eh_frame_hdr
|
||||
0x0000000000400648 - 0x000000000040073c is .eh_frame
|
||||
0x0000000000600e10 - 0x0000000000600e18 is .init_array
|
||||
0x0000000000600e18 - 0x0000000000600e20 is .fini_array
|
||||
0x0000000000600e20 - 0x0000000000600e28 is .jcr
|
||||
0x0000000000600e28 - 0x0000000000600ff8 is .dynamic
|
||||
0x0000000000600ff8 - 0x0000000000601000 is .got
|
||||
0x0000000000601000 - 0x0000000000601028 is .got.plt
|
||||
0x0000000000601028 - 0x0000000000601034 is .data
|
||||
0x0000000000601034 - 0x0000000000601038 is .bss
|
||||
```
|
||||
|
||||
Note the `Entry point: 0x400430` line. Now we know the actual address of the entry point of our program. Let's put a breakpoint at this address, run our program and see what happens:
|
||||
|
||||
```
|
||||
(gdb) break *0x400430
|
||||
Breakpoint 1 at 0x400430
|
||||
(gdb) run
|
||||
Starting program: /home/alex/program
|
||||
|
||||
Breakpoint 1, 0x0000000000400430 in _start ()
|
||||
```
|
||||
|
||||
Interesting. We don't see the execution of the `main` function here; instead, another function is called. This function is `_start` and, as the debugger shows us, it is the actual entry point of our program. Where does this function come from? Who calls `main`, and when is it called? I will try to answer all of these questions in this post.
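The debugger's claim can also be cross-checked from inside a running program. The following is a minimal sketch, assuming Linux with glibc: `getauxval()` and `AT_ENTRY` come from `<sys/auxv.h>`, the `_start` symbol is assumed to be supplied by the usual C runtime startup object, and the helper name `entry_point_is_start` is made up for this example.

```C
#include <stdint.h>
#include <sys/auxv.h>

/* Provided by the C runtime startup object, not by our own source. */
extern void _start(void);

/* Returns 1 if the entry point reported by the kernel is `_start`. */
static int entry_point_is_start(void)
{
	/* AT_ENTRY holds the address at which the kernel starts the program. */
	unsigned long entry = getauxval(AT_ENTRY);

	return entry == (unsigned long)(uintptr_t)&_start;
}
```

When called from `main` in a normally linked program, `entry_point_is_start()` is expected to return `1`, which agrees with what gdb printed for the breakpoint.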
|
||||
|
||||
How the kernel starts a new program
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
First of all, let's take a look at the following simple `C` program:
|
||||
|
||||
```C
|
||||
// program.c
|
||||
|
||||
#include <stdlib.h>
|
||||
#include <stdio.h>
|
||||
|
||||
static int x = 1;
|
||||
|
||||
int y = 2;
|
||||
|
||||
int main(int argc, char *argv[]) {
|
||||
int z = 3;
|
||||
|
||||
printf("x + y + z = %d\n", x + y + z);
|
||||
|
||||
return EXIT_SUCCESS;
|
||||
}
|
||||
```
|
||||
|
||||
We can be sure that this program works as we expect. Let's compile it:
|
||||
|
||||
```
|
||||
$ gcc -Wall program.c -o sum
|
||||
```
|
||||
|
||||
and run:
|
||||
|
||||
```
|
||||
$ ./sum
|
||||
x + y + z = 6
|
||||
```
|
||||
|
||||
Ok, everything looks pretty good so far. You may already know that there is a special family of functions for starting programs - the [exec*](http://man7.org/linux/man-pages/man3/execl.3.html) family, which is built on top of the `execve` [system call](https://en.wikipedia.org/wiki/System_call). As we read in the man page:
|
||||
|
||||
> The exec() family of functions replaces the current process image with a new process image.
|
||||
|
||||
If you have read the fourth [part](https://0xax.gitbooks.io/linux-insides/content/SysCall/syscall-4.html) of the chapter which describes [system calls](https://en.wikipedia.org/wiki/System_call), you may know that, for example, the [execve](http://linux.die.net/man/2/execve) system call is defined in the [fs/exec.c](https://github.com/torvalds/linux/blob/master/fs/exec.c#L1859) source code file and looks like:
|
||||
|
||||
```C
|
||||
SYSCALL_DEFINE3(execve,
|
||||
const char __user *, filename,
|
||||
const char __user *const __user *, argv,
|
||||
const char __user *const __user *, envp)
|
||||
{
|
||||
return do_execve(getname(filename), argv, envp);
|
||||
}
|
||||
```
|
||||
|
||||
It takes an executable file name, a set of command line arguments and a set of environment variables. As you may guess, everything is done by the `do_execve` function. I will not describe the implementation of the `do_execve` function in detail because you can read about it [here](https://0xax.gitbooks.io/linux-insides/content/SysCall/syscall-4.html). In short, the `do_execve` function performs many checks, for example that `filename` is valid and that the limit of launched processes is not exceeded on our system. After all of these checks, this function parses our executable file, which is represented in the [ELF](https://en.wikipedia.org/wiki/Executable_and_Linkable_Format) format, creates a memory descriptor for the newly executed file and fills it with the appropriate values, such as the areas for the stack and the heap. When the setup of the new binary image is done, the `start_thread` function sets up the new process. This function is architecture-specific and for the [x86_64](https://en.wikipedia.org/wiki/X86-64) architecture its definition is located in the [arch/x86/kernel/process_64.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/process_64.c#L231) source code file.
|
||||
|
||||
The `start_thread` function sets new values for the [segment registers](https://en.wikipedia.org/wiki/X86_memory_segmentation) and the program execution address. From this point, the new process is ready to start. Once the [context switch](https://en.wikipedia.org/wiki/Context_switch) is done, control returns to userspace with the new register values and the new executable starts to execute.
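Seen from userspace, this whole kernel path is triggered by a single `execve` call. The sketch below, which assumes a POSIX system, starts a program in a child process and collects its exit status; the helper name `run_and_wait` is hypothetical and is not part of any library.

```C
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

extern char **environ;

/*
 * Start the program named by argv[0] in a child process and return its
 * exit status, or -1 if it could not be started or did not exit normally.
 */
static int run_and_wait(char *const argv[])
{
	int status;
	pid_t pid = fork();

	if (pid < 0)
		return -1;

	if (pid == 0) {
		/* Child: on success execve() replaces this image and never returns. */
		execve(argv[0], argv, environ);
		_exit(127);
	}

	/* Parent: wait for the new program to finish. */
	if (waitpid(pid, &status, 0) < 0)
		return -1;

	return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

For example, passing the vector `{ "/bin/sh", "-c", "exit 7", NULL }` should make `run_and_wait()` return `7`. The `fork` is there only so that the calling process survives; `execve` itself replaces the image of whichever process calls it.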
|
||||
|
||||
That's all from the kernel side. The Linux kernel prepares the binary image for execution, returns control to userspace when this preparation is finished, and execution of the image starts right after the context switch. But this does not answer questions like where `_start` comes from. Let's try to answer these questions in the next paragraph.
|
||||
|
||||
How does a program start in userspace
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
In the previous paragraph we saw how an executable file is prepared to run by the Linux kernel. Let's look at the same process, but from the userspace side. We already know that the entry point of each program is the `_start` function. But where does this function come from? It may come from a library. But if you remember correctly, we didn't explicitly link our program with any libraries when we compiled it:
|
||||
|
||||
```
|
||||
$ gcc -Wall program.c -o sum
|
||||
```
|
||||
|
||||
You may guess that `_start` comes from the [standard library](https://en.wikipedia.org/wiki/Standard_library) and that's true. If you compile our program again and pass the `-v` option to gcc, which enables `verbose mode`, you will see long output. The full output is not interesting for us, so let's look at the following steps:
|
||||
|
||||
First of all, our program should be compiled with `gcc`:
|
||||
|
||||
```
|
||||
$ gcc -v -ggdb program.c -o sum
|
||||
...
|
||||
...
|
||||
...
|
||||
/usr/libexec/gcc/x86_64-redhat-linux/6.1.1/cc1 -quiet -v program.c -quiet -dumpbase program.c -mtune=generic -march=x86-64 -auxbase test -ggdb -version -o /tmp/ccvUWZkF.s
|
||||
...
|
||||
...
|
||||
...
|
||||
```
|
||||
|
||||
The `cc1` compiler compiles our `C` source code and produces the `/tmp/ccvUWZkF.s` assembly file. After this we can see that our assembly file is assembled into an object file by the `GNU as` assembler:
|
||||
|
||||
```
|
||||
$ gcc -v -ggdb program.c -o sum
|
||||
...
|
||||
...
|
||||
...
|
||||
as -v --64 -o /tmp/cc79wZSU.o /tmp/ccvUWZkF.s
|
||||
...
|
||||
...
|
||||
...
|
||||
```
|
||||
|
||||
In the end our object file will be linked by `collect2`:
|
||||
|
||||
```
|
||||
$ gcc -v -ggdb program.c -o sum
|
||||
...
|
||||
...
|
||||
...
|
||||
/usr/libexec/gcc/x86_64-redhat-linux/6.1.1/collect2 -plugin /usr/libexec/gcc/x86_64-redhat-linux/6.1.1/liblto_plugin.so -plugin-opt=/usr/libexec/gcc/x86_64-redhat-linux/6.1.1/lto-wrapper -plugin-opt=-fresolution=/tmp/ccLEGYra.res -plugin-opt=-pass-through=-lgcc -plugin-opt=-pass-through=-lgcc_s -plugin-opt=-pass-through=-lc -plugin-opt=-pass-through=-lgcc -plugin-opt=-pass-through=-lgcc_s --build-id --no-add-needed --eh-frame-hdr --hash-style=gnu -m elf_x86_64 -dynamic-linker /lib64/ld-linux-x86-64.so.2 -o test /usr/lib/gcc/x86_64-redhat-linux/6.1.1/../../../../lib64/crt1.o /usr/lib/gcc/x86_64-redhat-linux/6.1.1/../../../../lib64/crti.o /usr/lib/gcc/x86_64-redhat-linux/6.1.1/crtbegin.o -L/usr/lib/gcc/x86_64-redhat-linux/6.1.1 -L/usr/lib/gcc/x86_64-redhat-linux/6.1.1/../../../../lib64 -L/lib/../lib64 -L/usr/lib/../lib64 -L. -L/usr/lib/gcc/x86_64-redhat-linux/6.1.1/../../.. /tmp/cc79wZSU.o -lgcc --as-needed -lgcc_s --no-as-needed -lc -lgcc --as-needed -lgcc_s --no-as-needed /usr/lib/gcc/x86_64-redhat-linux/6.1.1/crtend.o /usr/lib/gcc/x86_64-redhat-linux/6.1.1/../../../../lib64/crtn.o
|
||||
...
|
||||
...
|
||||
...
|
||||
```
|
||||
|
||||
Yes, we can see a long set of command line options which are passed to the linker. Let's approach this from another direction. We know that our program depends on the standard library (`libc`):
|
||||
|
||||
```
|
||||
$ ldd program
|
||||
linux-vdso.so.1 (0x00007ffc9afd2000)
|
||||
libc.so.6 => /lib64/libc.so.6 (0x00007f56b389b000)
|
||||
/lib64/ld-linux-x86-64.so.2 (0x0000556198231000)
|
||||
```
|
||||
|
||||
as we use some things from there, like `printf`. But that is not the only reason. That's why we get an error when we pass the `-nostdlib` option to the compiler:
|
||||
|
||||
```
|
||||
$ gcc -nostdlib program.c -o program
|
||||
/usr/bin/ld: warning: cannot find entry symbol _start; defaulting to 000000000040017c
|
||||
/tmp/cc02msGW.o: In function `main':
|
||||
/home/alex/program.c:11: undefined reference to `printf'
|
||||
collect2: error: ld returned 1 exit status
|
||||
```
|
||||
|
||||
Besides other errors, we also see that the `_start` symbol is undefined. So now we are sure that the `_start` function is provided together with the standard library. But even if we explicitly link our program with the standard library, the `_start` symbol is still not found:
|
||||
|
||||
```
|
||||
$ gcc -nostdlib -lc -ggdb program.c -o program
|
||||
/usr/bin/ld: warning: cannot find entry symbol _start; defaulting to 0000000000400350
|
||||
```
|
||||
|
||||
Ok, the compiler does not complain about undefined references to standard library functions because we linked our program with `/usr/lib64/libc.so.6`, but the `_start` symbol still isn't resolved. Let's return to the verbose output of `gcc` and look at the parameters of `collect2`. The most important thing that we may see is that our program is linked not only with the standard library, but also with some object files. The first object file is `/lib64/crt1.o`. And if we look inside this object file with the `objdump` utility, we will see the `_start` symbol:
|
||||
|
||||
```
|
||||
$ objdump -d /lib64/crt1.o
|
||||
|
||||
/lib64/crt1.o: file format elf64-x86-64
|
||||
|
||||
|
||||
Disassembly of section .text:
|
||||
|
||||
0000000000000000 <_start>:
|
||||
0: 31 ed xor %ebp,%ebp
|
||||
2: 49 89 d1 mov %rdx,%r9
|
||||
5: 5e pop %rsi
|
||||
6: 48 89 e2 mov %rsp,%rdx
|
||||
9: 48 83 e4 f0 and $0xfffffffffffffff0,%rsp
|
||||
d: 50 push %rax
|
||||
e: 54 push %rsp
|
||||
f: 49 c7 c0 00 00 00 00 mov $0x0,%r8
|
||||
16: 48 c7 c1 00 00 00 00 mov $0x0,%rcx
|
||||
1d: 48 c7 c7 00 00 00 00 mov $0x0,%rdi
|
||||
24: e8 00 00 00 00 callq 29 <_start+0x29>
|
||||
29: f4 hlt
|
||||
```
|
||||
|
||||
As `crt1.o` is a relocatable object file that has not been linked yet, we see only zero placeholders here instead of real addresses; the linker fills them in later through relocations. Let's look at the source code of the `_start` function. As this function is architecture-specific, the implementation of `_start` is located in the [sysdeps/x86_64/start.S](https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/x86_64/start.S;h=f1b961f5ba2d6a1ebffee0005f43123c4352fbf4;hb=HEAD) assembly file.
|
||||
|
||||
The `_start` function starts by clearing the `ebp` register, as the [ABI](https://software.intel.com/sites/default/files/article/402129/mpx-linux64-abi.pdf) suggests.
|
||||
|
||||
```assembly
|
||||
xorl %ebp, %ebp
|
||||
```
|
||||
|
||||
And after this we put the address of the termination function into the `r9` register:
|
||||
|
||||
```assembly
|
||||
mov %RDX_LP, %R9_LP
|
||||
```
|
||||
|
||||
As described in the [ELF](http://flint.cs.yale.edu/cs422/doc/ELF_Format.pdf) specification:
|
||||
|
||||
> After the dynamic linker has built the process image and performed the relocations, each shared object
|
||||
> gets the opportunity to execute some initialization code.
|
||||
> ...
|
||||
> Similarly, shared objects may have termination functions, which are executed with the atexit (BA_OS)
|
||||
> mechanism after the base process begins its termination sequence.
|
||||
|
||||
So we need to put the address of the termination function into the `r9` register, as it will later be passed to `__libc_start_main` as the sixth argument. Note that the address of the termination function is initially located in the `rdx` register. Registers other than `rdx` and `rsp` contain unspecified values. Actually, the main point of the `_start` function is to call `__libc_start_main`. So the next step is to prepare for this call.
|
||||
|
||||
The `__libc_start_main` function is defined in the [csu/libc-start.c](https://sourceware.org/git/?p=glibc.git;a=blob;f=csu/libc-start.c;h=0fb98f1606bab475ab5ba2d0fe08c64f83cce9df;hb=HEAD) source code file. Let's look at its signature:
|
||||
|
||||
```C
|
||||
STATIC int LIBC_START_MAIN (int (*main) (int, char **, char **),
|
||||
int argc,
|
||||
char **argv,
|
||||
__typeof (main) init,
|
||||
void (*fini) (void),
|
||||
void (*rtld_fini) (void),
|
||||
void *stack_end)
|
||||
```
|
||||
|
||||
It takes the address of the `main` function of a program, `argc` and `argv`. The `init` and `fini` functions are the constructor and destructor of the program. The `rtld_fini` is a termination function which will be called after the program exits, to terminate and free the dynamic section. The last parameter of `__libc_start_main` is a pointer to the stack of the program. Before we can call the `__libc_start_main` function, all of these parameters must be prepared and passed to it. Let's return to the [sysdeps/x86_64/start.S](https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/x86_64/start.S;h=f1b961f5ba2d6a1ebffee0005f43123c4352fbf4;hb=HEAD) assembly file and see what happens before the `__libc_start_main` function is called from there.
|
||||
|
||||
We can get all the arguments we need for the `__libc_start_main` function from the stack. When `_start` is called, our stack looks like this:
|
||||
|
||||
```
|
||||
+-----------------+
|       NULL      |
+-----------------+
|       envp      |
+-----------------+
|       NULL      |
+-----------------+
|       argv      |
+-----------------+
|       argc      | <- rsp
+-----------------+
```
|
||||
|
||||
After we have cleared the `ebp` register and saved the address of the termination function in the `r9` register, we pop an element from the stack into the `rsi` register, so after this `rsp` will point to the `argv` array and `rsi` will contain the count of command line arguments passed to the program:
|
||||
|
||||
```
|
||||
+-----------------+
|       NULL      |
+-----------------+
|       envp      |
+-----------------+
|       NULL      |
+-----------------+
|       argv      | <- rsp
+-----------------+
```
|
||||
|
||||
After this we move the address of the `argv` array to the `rdx` register:
|
||||
|
||||
```assembly
|
||||
popq %rsi
|
||||
mov %RSP_LP, %RDX_LP
|
||||
```
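The stack layout above also explains how the environment can be found once `argv` is known: the `envp` pointers begin right after the `NULL` that terminates `argv`. A small sketch of this pointer arithmetic follows; it uses an ordinary array in place of the real initial stack, and both helper names are hypothetical.

```C
#include <stddef.h>

/* Count entries of a NULL-terminated pointer array such as argv or envp. */
static size_t count_entries(char **vector)
{
	size_t n = 0;

	while (vector[n] != NULL)
		n++;
	return n;
}

/*
 * Given the argv pointer taken from the initial stack, return where envp
 * begins: just past argv's terminating NULL.
 */
static char **envp_after_argv(char **argv)
{
	return argv + count_entries(argv) + 1;
}
```

With a model stack such as `{ "sum", "-v", NULL, "HOME=/root", "TERM=xterm", NULL }`, `count_entries()` reports two arguments and `envp_after_argv()` points at the `HOME=/root` entry.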
|
||||
|
||||
From this moment we have `argc` and `argv`. We still need to put pointers to the constructor and destructor in the appropriate registers and pass a pointer to the stack. In the first two of the following lines we align the stack to a `16`-byte boundary, as suggested in the [ABI](https://software.intel.com/sites/default/files/article/402129/mpx-linux64-abi.pdf), and push `rax`, which contains garbage:
|
||||
|
||||
```assembly
|
||||
and $~15, %RSP_LP
|
||||
pushq %rax
|
||||
|
||||
pushq %rsp
|
||||
mov $__libc_csu_fini, %R8_LP
|
||||
mov $__libc_csu_init, %RCX_LP
|
||||
mov $main, %RDI_LP
|
||||
```
|
||||
|
||||
After aligning the stack we push the address of the stack, move the addresses of the destructor and constructor to the `r8` and `rcx` registers respectively, and move the address of the `main` symbol to `rdi`. From this moment we can call the `__libc_start_main` function from [csu/libc-start.c](https://sourceware.org/git/?p=glibc.git;a=blob;f=csu/libc-start.c;h=0fb98f1606bab475ab5ba2d0fe08c64f83cce9df;hb=HEAD).
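The alignment arithmetic is easy to check in `C`. The sketch below models the stack pointer as a plain integer; the helper names are invented for this example, and the two subtractions stand for the two 8-byte pushes from the listing above.

```C
#include <stdint.h>

/* Same operation as `and $~15, %rsp`: clear the low four bits. */
static uintptr_t align_down_16(uintptr_t sp)
{
	return sp & ~(uintptr_t)15;
}

/* Model of the `and` followed by the two 8-byte pushes in _start. */
static uintptr_t sp_before_call(uintptr_t sp)
{
	sp = align_down_16(sp);
	sp -= 8;	/* pushq %rax */
	sp -= 8;	/* pushq %rsp */
	return sp;
}
```

Because the two pushes together move the stack pointer by 16 bytes, the value returned by `sp_before_call()` is still a multiple of 16. This suggests why the otherwise meaningless `rax` is pushed: a single push of `rsp` would leave the stack misaligned at the call.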
|
||||
|
||||
Before we look at the `__libc_start_main` function, let's add `/lib64/crt1.o` and try to compile our program again:
|
||||
|
||||
```
|
||||
$ gcc -nostdlib /lib64/crt1.o -lc -ggdb program.c -o program
|
||||
/lib64/crt1.o: In function `_start':
|
||||
(.text+0x12): undefined reference to `__libc_csu_fini'
|
||||
/lib64/crt1.o: In function `_start':
|
||||
(.text+0x19): undefined reference to `__libc_csu_init'
|
||||
collect2: error: ld returned 1 exit status
|
||||
```
|
||||
|
||||
Now we see another error: both the `__libc_csu_fini` and `__libc_csu_init` functions are not found. We know that the addresses of these two functions are passed to `__libc_start_main` as parameters and that these functions are the constructor and destructor of our program. But what do `constructor` and `destructor` mean in terms of a `C` program? We already saw the quote from the [ELF](http://flint.cs.yale.edu/cs422/doc/ELF_Format.pdf) specification:
|
||||
|
||||
> After the dynamic linker has built the process image and performed the relocations, each shared object
|
||||
> gets the opportunity to execute some initialization code.
|
||||
> ...
|
||||
> Similarly, shared objects may have termination functions, which are executed with the atexit (BA_OS)
|
||||
> mechanism after the base process begins its termination sequence.
|
||||
|
||||
So the linker creates two special sections besides usual sections like `.text`, `.data` and others:
|
||||
|
||||
* `.init`
|
||||
* `.fini`
|
||||
|
||||
We can find them with the `readelf` utility:
|
||||
|
||||
```
|
||||
$ readelf -e test | grep init
|
||||
[11] .init PROGBITS 00000000004003c8 000003c8
|
||||
|
||||
$ readelf -e test | grep fini
|
||||
[15] .fini PROGBITS 0000000000400504 00000504
|
||||
```
|
||||
|
||||
These sections are placed at the start and at the end of the binary image and contain routines which are called the constructor and destructor respectively. The main point of these routines is to do some initialization and finalization, like initialization of global variables such as [errno](http://man7.org/linux/man-pages/man3/errno.3.html) and allocation and deallocation of memory for system routines, before and after the actual code of a program is executed.
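A closely related mechanism is available directly from `C`. The sketch below relies on the GCC/Clang `constructor` and `destructor` function attributes, which are compiler extensions and an assumption of this example; with these compilers such functions are normally recorded in the `.init_array` and `.fini_array` sections rather than in `.init` and `.fini` themselves. All function names here are hypothetical.

```C
#include <unistd.h>

static int constructor_ran;

/* Runs before main(); the compiler records it for the startup code. */
__attribute__((constructor))
static void demo_constructor(void)
{
	constructor_ran = 1;
}

/* Runs after main() returns or exit() is called. */
__attribute__((destructor))
static void demo_destructor(void)
{
	static const char msg[] = "demo_destructor: running after main\n";

	/* write() is used because it does not depend on stdio state. */
	ssize_t unused = write(STDERR_FILENO, msg, sizeof(msg) - 1);
	(void)unused;
}

/* Lets main() check that the constructor has already executed. */
static int constructor_ran_before_main(void)
{
	return constructor_ran;
}
```

Calling `constructor_ran_before_main()` as the first statement of `main` should return `1`, and the destructor's message should appear on standard error only after `main` has finished.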
|
||||
|
||||
As you may infer from the names of these functions, they are called before and after the `main` function. The definitions of the `.init` and `.fini` sections are located in `/lib64/crti.o`, and if we add this object file:
|
||||
|
||||
```
|
||||
$ gcc -nostdlib /lib64/crt1.o /lib64/crti.o -lc -ggdb program.c -o program
|
||||
```
|
||||
|
||||
we will not get any errors. But let's try to run our program and see what happens:
|
||||
|
||||
```
|
||||
$ ./program
|
||||
Segmentation fault (core dumped)
|
||||
```
|
||||
|
||||
Yeah, we got a segmentation fault. Let's look inside `/lib64/crti.o` with the `objdump` utility:
|
||||
|
||||
```
|
||||
$ objdump -D /lib64/crti.o
|
||||
|
||||
/lib64/crti.o: file format elf64-x86-64
|
||||
|
||||
|
||||
Disassembly of section .init:
|
||||
|
||||
0000000000000000 <_init>:
|
||||
0: 48 83 ec 08 sub $0x8,%rsp
|
||||
4: 48 8b 05 00 00 00 00 mov 0x0(%rip),%rax # b <_init+0xb>
|
||||
b: 48 85 c0 test %rax,%rax
|
||||
e: 74 05 je 15 <_init+0x15>
|
||||
10: e8 00 00 00 00 callq 15 <_init+0x15>
|
||||
|
||||
Disassembly of section .fini:
|
||||
|
||||
0000000000000000 <_fini>:
|
||||
0: 48 83 ec 08 sub $0x8,%rsp
|
||||
```
|
||||
|
||||
As I wrote above, the `/lib64/crti.o` object file contains the definitions of the `.init` and `.fini` sections, but we can also see the stub of a function here. Let's look at the source code, which is located in the [sysdeps/x86_64/crti.S](https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/x86_64/crti.S;h=e9d86ed08ab134a540e3dae5f97a9afb82cdb993;hb=HEAD) source code file:
|
||||
|
||||
```assembly
|
||||
	.section .init,"ax",@progbits
	.p2align 2
	.globl _init
	.type _init, @function
_init:
	/* Keep the stack 16-byte aligned for called functions. */
	subq $8, %rsp
#if PREINIT_FUNCTION_WEAK
	movq PREINIT_FUNCTION@GOTPCREL(%rip), %rax
	testq %rax, %rax
	je .Lno_weak_fn
	call *%rax
.Lno_weak_fn:
#else
	call PREINIT_FUNCTION
#endif
```
|
||||
|
||||
It contains the definition of the `.init` section. The assembly code keeps the stack 16-byte aligned, then loads the address of `PREINIT_FUNCTION` and, if it is zero, does not call it:
|
||||
|
||||
```
|
||||
00000000004003c8 <_init>:
|
||||
4003c8: 48 83 ec 08 sub $0x8,%rsp
|
||||
4003cc: 48 8b 05 25 0c 20 00 mov 0x200c25(%rip),%rax # 600ff8 <_DYNAMIC+0x1d0>
|
||||
4003d3: 48 85 c0 test %rax,%rax
|
||||
4003d6: 74 05 je 4003dd <_init+0x15>
|
||||
4003d8: e8 43 00 00 00 callq 400420 <__libc_start_main@plt+0x10>
|
||||
4003dd: 48 83 c4 08 add $0x8,%rsp
|
||||
4003e1: c3 retq
|
||||
```
|
||||
|
||||
where `PREINIT_FUNCTION` is `__gmon_start__`, which does the setup for profiling. You may note that there is no return instruction in [sysdeps/x86_64/crti.S](https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/x86_64/crti.S;h=e9d86ed08ab134a540e3dae5f97a9afb82cdb993;hb=HEAD). Actually that's why we got the segmentation fault. The epilogue of `_init` and `_fini` is placed in the [sysdeps/x86_64/crtn.S](https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/x86_64/crtn.S;h=e9d86ed08ab134a540e3dae5f97a9afb82cdb993;hb=HEAD) assembly file:
|
||||
|
||||
```assembly
|
||||
.section .init,"ax",@progbits
|
||||
addq $8, %rsp
|
||||
ret
|
||||
|
||||
.section .fini,"ax",@progbits
|
||||
addq $8, %rsp
|
||||
ret
|
||||
```
|
||||
|
||||
and if we add it to the compilation, our program will be successfully compiled and run!
|
||||
|
||||
```
|
||||
$ gcc -nostdlib /lib64/crt1.o /lib64/crti.o /lib64/crtn.o -lc -ggdb program.c -o program
|
||||
|
||||
$ ./program
|
||||
x + y + z = 6
|
||||
```
|
||||
|
||||
Conclusion
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
Now let's return to the `_start` function and go through the full chain of calls made before the `main` of our program is called.
|
||||
|
||||
By default the linker uses the `_start` symbol as the entry point of our programs, as specified in its default `ld` script:
|
||||
|
||||
```
|
||||
$ ld --verbose | grep ENTRY
|
||||
ENTRY(_start)
|
||||
```
|
||||
|
||||
The `_start` function is defined in the [sysdeps/x86_64/start.S](https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/x86_64/start.S;h=f1b961f5ba2d6a1ebffee0005f43123c4352fbf4;hb=HEAD) assembly file and does preparation, like getting `argc`/`argv` from the stack and preparing the stack, before the `__libc_start_main` function is called. The `__libc_start_main` function from the [csu/libc-start.c](https://sourceware.org/git/?p=glibc.git;a=blob;f=csu/libc-start.c;h=0fb98f1606bab475ab5ba2d0fe08c64f83cce9df;hb=HEAD) source code file registers the constructor and destructor of the application, which are called before and after `main`, starts up threading, performs some security related actions like setting the stack canary if needed, calls initialization related routines and in the end calls the `main` function of our application and exits with its result:
|
||||
|
||||
```C
|
||||
result = main (argc, argv, __environ MAIN_AUXVEC_PARAM);
|
||||
exit (result);
|
||||
```
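The order of events described here can be modelled with a few lines of ordinary `C`. This is a deliberately simplified, hypothetical model and not the glibc implementation: `start_main_sketch`, the `demo_*` functions and the `trace` buffer are all invented for illustration, and the sketch returns the result instead of calling `exit()`.

```C
#include <stddef.h>

typedef int (*main_fn)(int, char **, char **);

static char trace[8];		/* records the order of the calls */
static size_t trace_len;

static void note(char c)
{
	if (trace_len < sizeof(trace) - 1)
		trace[trace_len++] = c;
}

static void demo_init(void) { note('i'); }
static void demo_fini(void) { note('f'); }

static int demo_main(int argc, char **argv, char **envp)
{
	(void)argv;
	(void)envp;
	note('m');
	return argc;
}

/*
 * Simplified model of the flow described above: run the constructor,
 * call main, run the destructor and hand back main's result.
 */
static int start_main_sketch(main_fn program_main, int argc, char **argv,
			     void (*init)(void), void (*fini)(void))
{
	char **envp = &argv[argc + 1];	/* environment follows argv's NULL */
	int result;

	if (init != NULL)
		init();

	result = program_main(argc, argv, envp);

	if (fini != NULL)
		fini();

	return result;
}
```

Running `start_main_sketch(demo_main, 1, vec, demo_init, demo_fini)` on a vector such as `{ "sum", NULL, "HOME=/root", NULL }` leaves `imf` in `trace`, which is the constructor, `main`, destructor order.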
|
||||
|
||||
That's all.
|
||||
|
||||
Links
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
* [system call](https://en.wikipedia.org/wiki/System_call)
|
||||
* [gdb](https://www.gnu.org/software/gdb/)
|
||||
* [execve](http://linux.die.net/man/2/execve)
|
||||
* [ELF](https://en.wikipedia.org/wiki/Executable_and_Linkable_Format)
|
||||
* [x86_64](https://en.wikipedia.org/wiki/X86-64)
|
||||
* [segment registers](https://en.wikipedia.org/wiki/X86_memory_segmentation)
|
||||
* [context switch](https://en.wikipedia.org/wiki/Context_switch)
|
||||
* [System V ABI](https://software.intel.com/sites/default/files/article/402129/mpx-linux64-abi.pdf)
|
@ -0,0 +1,352 @@
|
||||
Synchronization primitives in the Linux kernel. Part 6.
|
||||
================================================================================
|
||||
|
||||
Introduction
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
This is the sixth part of the chapter which describes [synchronization primitives](https://en.wikipedia.org/wiki/Synchronization_(computer_science)) in the Linux kernel. In the previous parts we finished considering different [readers-writer lock](https://en.wikipedia.org/wiki/Readers%E2%80%93writer_lock) synchronization primitives. We will continue to learn about synchronization primitives in this part and start to consider a similar synchronization primitive which can be used to avoid the `writer starvation` problem. The name of this synchronization primitive is `seqlock` or `sequential lock`.
|
||||
|
||||
We know from the previous [part](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/sync-5.html) that a [readers-writer lock](https://en.wikipedia.org/wiki/Readers%E2%80%93writer_lock) is a special lock mechanism which allows concurrent access for read-only operations, while an exclusive lock is needed for writing or modifying data. As we may guess, this may lead to a problem which is called `writer starvation`. In other words, a writer process can't acquire a lock as long as at least one reader process which acquired the lock holds it. So, when contention is high, a writer process which wants to acquire a lock may wait for it for a long time.
|
||||
|
||||
The `seqlock` synchronization primitive can help solve this problem.
|
||||
|
||||
As in all previous parts of this [book](https://0xax.gitbooks.io/linux-insides/content), we will first consider this synchronization primitive from the theoretical side and only then consider the [API](https://en.wikipedia.org/wiki/Application_programming_interface) provided by the Linux kernel to manipulate `seqlocks`.
|
||||
|
||||
So, let's start.
|
||||
|
||||
Sequential lock
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
So, what is a `seqlock` synchronization primitive and how does it work? Let's try to answer these questions in this paragraph. Actually `sequential locks` were introduced in the Linux kernel 2.6.x. The main point of this synchronization primitive is to provide fast and lock-free access to shared resources for readers. Since the heart of the `sequential lock` synchronization primitive is the [spinlock](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/sync-1.html) synchronization primitive, `sequential locks` work best in situations where the protected resources are small and simple. Additionally, write access must be rare and should also be fast.
|
||||
|
||||
The operation of this synchronization primitive is based on a sequence counter. A `sequential lock` allows readers free access to a resource, but each reader must check for conflicts with a writer. To make this possible, the primitive introduces a special counter. The main algorithm of `sequential locks` is simple: each writer which acquires a sequential lock acquires a [spinlock](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/sync-1.html) and increments this counter. When the writer finishes, it increments the counter of the sequential lock again and releases the acquired spinlock to give access to other writers.
|
||||
|
||||
Read-only access works on the following principle: a reader gets the value of the `sequential lock` counter before it enters the [critical section](https://en.wikipedia.org/wiki/Critical_section) and compares it with the value of the same counter at the exit of the critical section. If the values are equal, there were no writers during this period. If the values are not equal, a writer has incremented the counter during the [critical section](https://en.wikipedia.org/wiki/Critical_section). This conflict means that the reading of the protected data must be repeated.
|
||||
|
||||
That's all. As we may see, the principle of operation of `sequential locks` is simple. Schematically, a reader looks like this:
|
||||
|
||||
```C
|
||||
unsigned int seq_counter_value;
|
||||
|
||||
do {
|
||||
seq_counter_value = get_seq_counter_val(&the_lock);
|
||||
//
|
||||
// do as we want here
|
||||
//
|
||||
} while (__retry__);
|
||||
```
|
||||
|
||||
The Linux kernel does not actually provide a `get_seq_counter_val()` function. Here it is just a stub, and so is `__retry__`. As I already wrote above, we will see the actual [API](https://en.wikipedia.org/wiki/Application_programming_interface) for this in the next paragraph of this part.
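
To make the algorithm described above a bit more tangible, here is a minimal user-space sketch built on C11 atomics. It is only a model under stated assumptions: the `demo_*` names are hypothetical and are not part of the kernel API, an `atomic_flag` stands in for the spinlock, and the default sequentially consistent atomic operations replace the lighter memory barriers that the kernel uses.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical user-space model; none of these names exist in the kernel. */
struct demo_seqlock {
    atomic_uint sequence;  /* even: no writer active, odd: write in progress */
    atomic_flag lock;      /* stand-in for the spinlock that serializes writers */
};

#define DEMO_SEQLOCK_INIT { 0, ATOMIC_FLAG_INIT }

/* Writer side: take the lock first, then make the counter odd. */
static void demo_write_begin(struct demo_seqlock *sl)
{
    while (atomic_flag_test_and_set(&sl->lock))
        ;  /* spin until the previous writer has left */
    atomic_fetch_add(&sl->sequence, 1);
}

/* Writer side: make the counter even again, then drop the lock. */
static void demo_write_end(struct demo_seqlock *sl)
{
    atomic_fetch_add(&sl->sequence, 1);
    atomic_flag_clear(&sl->lock);
}

/* Reader side: remember the counter before touching the data. */
static unsigned demo_read_begin(struct demo_seqlock *sl)
{
    return atomic_load(&sl->sequence);
}

/* Reader side: true if a writer was active at the start or has run since. */
static bool demo_read_retry(struct demo_seqlock *sl, unsigned start)
{
    return (start & 1u) || atomic_load(&sl->sequence) != start;
}
```

A reader would wrap its copy of the protected data in a `do { seq = demo_read_begin(&sl); ... } while (demo_read_retry(&sl, seq));` loop. In portable C11 the protected fields themselves would also have to be atomic objects, because a reader may access them while a writer is updating them.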
|
||||
|
||||
Ok, now we know what a `seqlock` synchronization primitive is and how it works. So we may go ahead and start to look at the [API](https://en.wikipedia.org/wiki/Application_programming_interface) which the Linux kernel provides for manipulation of synchronization primitives of this type.
|
||||
|
||||
Sequential lock API
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
So, now that we know a little about the `sequential lock` synchronization primitive from the theoretical side, let's look at its implementation in the Linux kernel. The whole `sequential lock` [API](https://en.wikipedia.org/wiki/Application_programming_interface) is located in the [include/linux/seqlock.h](https://github.com/torvalds/linux/blob/master/include/linux/seqlock.h) header file.
|
||||
|
||||
First of all we may see that the `sequential lock` mechanism is represented by the following type:
|
||||
|
||||
```C
|
||||
typedef struct {
|
||||
struct seqcount seqcount;
|
||||
spinlock_t lock;
|
||||
} seqlock_t;
|
||||
```
|
||||
|
||||
As we may see, the `seqlock_t` provides two fields. These fields represent the sequential lock counter, which we described above, and a [spinlock](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/sync-1.html) which protects the data from other writers. Note that the `seqcount` counter is represented by the `seqcount` type. The `seqcount` is a structure:
|
||||
|
||||
```C
|
||||
typedef struct seqcount {
|
||||
unsigned sequence;
|
||||
#ifdef CONFIG_DEBUG_LOCK_ALLOC
|
||||
struct lockdep_map dep_map;
|
||||
#endif
|
||||
} seqcount_t;
|
||||
```
|
||||
|
||||
which holds the counter of a sequential lock and a [lock validator](https://www.kernel.org/doc/Documentation/locking/lockdep-design.txt) related field.
|
||||
|
||||
As in the previous parts of this [chapter](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/), before we consider the [API](https://en.wikipedia.org/wiki/Application_programming_interface) of the `sequential lock` mechanism in the Linux kernel, we need to know how to initialize an instance of `seqlock_t`.
|
||||
|
||||
We saw in the previous parts that the Linux kernel often provides two approaches to initialize a given synchronization primitive. The same is true for the `seqlock_t` structure. A `seqlock_t` can be initialized in one of the two following ways:

* `statically`;
* `dynamically`.

Let's look at the first approach. We are able to initialize a `seqlock_t` statically with the `DEFINE_SEQLOCK` macro:
|
||||
|
||||
```C
|
||||
#define DEFINE_SEQLOCK(x) \
|
||||
seqlock_t x = __SEQLOCK_UNLOCKED(x)
|
||||
```
|
||||
|
||||
which is defined in the [include/linux/seqlock.h](https://github.com/torvalds/linux/blob/master/include/linux/seqlock.h) header file. As we may see, the `DEFINE_SEQLOCK` macro takes one argument and expands to the definition and initialization of the `seqlock_t` structure. Initialization occurs with the help of the `__SEQLOCK_UNLOCKED` macro which is defined in the same source code file. Let's look at the implementation of this macro:
|
||||
|
||||
```C
|
||||
#define __SEQLOCK_UNLOCKED(lockname) \
|
||||
{ \
|
||||
.seqcount = SEQCNT_ZERO(lockname), \
|
||||
.lock = __SPIN_LOCK_UNLOCKED(lockname) \
|
||||
}
|
||||
```
|
||||
|
||||
As we may see, the `__SEQLOCK_UNLOCKED` macro initializes the fields of the given `seqlock_t` structure. The first field, `seqcount`, is initialized with the `SEQCNT_ZERO` macro which expands to:
|
||||
|
||||
```C
|
||||
#define SEQCNT_ZERO(lockname) { .sequence = 0, SEQCOUNT_DEP_MAP_INIT(lockname)}
|
||||
```
|
||||
|
||||
So we just initialize the counter of the given sequential lock to zero. Additionally, we can see [lock validator](https://www.kernel.org/doc/Documentation/locking/lockdep-design.txt) related initialization which depends on the state of the `CONFIG_DEBUG_LOCK_ALLOC` kernel configuration option:
|
||||
|
||||
```C
|
||||
#ifdef CONFIG_DEBUG_LOCK_ALLOC
|
||||
# define SEQCOUNT_DEP_MAP_INIT(lockname) \
|
||||
.dep_map = { .name = #lockname } \
|
||||
...
|
||||
...
|
||||
...
|
||||
#else
|
||||
# define SEQCOUNT_DEP_MAP_INIT(lockname)
|
||||
...
|
||||
...
|
||||
...
|
||||
#endif
|
||||
```
|
||||
|
||||
As I already wrote in previous parts of this [chapter](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/), we will not consider [debugging](https://en.wikipedia.org/wiki/Debugging) and [lock validator](https://www.kernel.org/doc/Documentation/locking/lockdep-design.txt) related stuff in this part. So for now we just skip the `SEQCOUNT_DEP_MAP_INIT` macro. The second field of the given `seqlock_t` is `lock`, which is initialized with the `__SPIN_LOCK_UNLOCKED` macro defined in the [include/linux/spinlock_types.h](https://github.com/torvalds/linux/blob/master/include/linux/spinlock_types.h) header file. We will not consider the implementation of this macro here as it just initializes a [raw spinlock](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/sync-1.html) with architecture-specific methods (you may read more about spinlocks in the first parts of this [chapter](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/)).
|
||||
|
||||
We have considered the first way to initialize a sequential lock. Let's consider the second way to do the same, this time dynamically. We can initialize a sequential lock with the `seqlock_init` macro which is defined in the same [include/linux/seqlock.h](https://github.com/torvalds/linux/blob/master/include/linux/seqlock.h) header file.
|
||||
|
||||
Let's look at the implementation of this macro:
|
||||
|
||||
```C
|
||||
#define seqlock_init(x) \
|
||||
do { \
|
||||
seqcount_init(&(x)->seqcount); \
|
||||
spin_lock_init(&(x)->lock); \
|
||||
} while (0)
|
||||
```
|
||||
|
||||
As we may see, the `seqlock_init` macro expands into two other macros. The first one, `seqcount_init`, takes the counter of the given sequential lock and expands to a call of the `__seqcount_init` function:
|
||||
|
||||
```C
|
||||
# define seqcount_init(s) \
|
||||
do { \
|
||||
static struct lock_class_key __key; \
|
||||
__seqcount_init((s), #s, &__key); \
|
||||
} while (0)
|
||||
```
|
||||
|
||||
from the same header file. This function
|
||||
|
||||
```C
|
||||
static inline void __seqcount_init(seqcount_t *s, const char *name,
|
||||
struct lock_class_key *key)
|
||||
{
|
||||
lockdep_init_map(&s->dep_map, name, key, 0);
|
||||
s->sequence = 0;
|
||||
}
|
||||
```
|
||||
|
||||
just initializes the counter of the given `seqcount_t` with zero (the `lockdep_init_map` call belongs to the lock validator, which we skip here). The second call from the `seqlock_init` macro is the call of the `spin_lock_init` macro which we saw in the [first part](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/sync-1.html) of this chapter.
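
The two initialization styles can be tried outside of the kernel with a small mock. The sketch below only mirrors the shape of the macros shown above: every `toy_*` and `TOY_*` name is hypothetical, a plain `int` stands in for `spinlock_t`, and the lock validator parts are left out.

```c
/* Hypothetical mock of the two initialization styles; not kernel code. */
struct toy_seqcount {
    unsigned sequence;
};

struct toy_seqlock {
    struct toy_seqcount seqcount;
    int lock;                      /* 0 = unlocked; stand-in for spinlock_t */
};

/* Static initialization: designated initializers evaluated at compile time. */
#define TOY_SEQCNT_ZERO       { .sequence = 0 }
#define TOY_SEQLOCK_UNLOCKED  { .seqcount = TOY_SEQCNT_ZERO, .lock = 0 }
#define TOY_DEFINE_SEQLOCK(x) struct toy_seqlock x = TOY_SEQLOCK_UNLOCKED

/* Dynamic initialization: plain assignments executed at run time. */
#define toy_seqlock_init(x)            \
    do {                               \
        (x)->seqcount.sequence = 0;    \
        (x)->lock = 0;                 \
    } while (0)

/* A lock with static storage duration is defined and initialized at once. */
static TOY_DEFINE_SEQLOCK(toy_global_lock);

/* A lock embedded in another object is initialized when the object is set up. */
struct toy_device {
    int id;
    struct toy_seqlock lock;
};

static void toy_device_setup(struct toy_device *dev, int id)
{
    dev->id = id;
    toy_seqlock_init(&dev->lock);
}
```

The static form suits locks that live for the whole lifetime of the program, while the dynamic form is the natural choice for a lock that is a member of an object created at run time.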
|
||||
|
||||
So, now we know how to initialize a `sequential lock`. Let's look at how to use it. The Linux kernel provides the following [API](https://en.wikipedia.org/wiki/Application_programming_interface) to manipulate `sequential locks`:
|
||||
|
||||
```C
|
||||
static inline unsigned read_seqbegin(const seqlock_t *sl);
|
||||
static inline unsigned read_seqretry(const seqlock_t *sl, unsigned start);
|
||||
static inline void write_seqlock(seqlock_t *sl);
|
||||
static inline void write_sequnlock(seqlock_t *sl);
|
||||
static inline void write_seqlock_irq(seqlock_t *sl);
|
||||
static inline void write_sequnlock_irq(seqlock_t *sl);
|
||||
static inline void read_seqlock_excl(seqlock_t *sl);
|
||||
static inline void read_sequnlock_excl(seqlock_t *sl);
|
||||
```
|
||||
|
||||
and others. Before we move on to the implementation of this [API](https://en.wikipedia.org/wiki/Application_programming_interface), we must know that there are actually two types of readers. The first type of reader never blocks a writer, so a writer does not wait for such readers. The second type of reader takes the lock. In this case, the locking reader blocks the writer, which has to wait until the reader releases the lock.
|
||||
|
||||
First of all let's consider the first type of readers. The `read_seqbegin` function begins a seq-read [critical section](https://en.wikipedia.org/wiki/Critical_section).
|
||||
|
||||
As we may see this function just returns value of the `read_seqcount_begin` function:
|
||||
|
||||
```C
|
||||
static inline unsigned read_seqbegin(const seqlock_t *sl)
|
||||
{
|
||||
return read_seqcount_begin(&sl->seqcount);
|
||||
}
|
||||
```
|
||||
|
||||
In turn, the `read_seqcount_begin` function calls the `raw_read_seqcount_begin` function:
|
||||
|
||||
```C
|
||||
static inline unsigned read_seqcount_begin(const seqcount_t *s)
|
||||
{
|
||||
return raw_read_seqcount_begin(s);
|
||||
}
|
||||
```
|
||||
|
||||
which returns the value of the `sequential lock` counter. The core of this read looks like the closely related `raw_read_seqcount` helper:
|
||||
|
||||
```C
|
||||
static inline unsigned raw_read_seqcount(const seqcount_t *s)
|
||||
{
|
||||
unsigned ret = READ_ONCE(s->sequence);
|
||||
smp_rmb();
|
||||
return ret;
|
||||
}
|
||||
```
|
||||
|
||||
After we have obtained the initial value of the given `sequential lock` counter and done some work, we know from the previous paragraph of this part that we need to compare it with the current value of the counter of the same `sequential lock` before we exit from the critical section. We can achieve this with a call of the `read_seqretry` function. This function takes a `sequential lock` and the start value of the counter, and through a chain of functions:
|
||||
|
||||
```C
|
||||
static inline unsigned read_seqretry(const seqlock_t *sl, unsigned start)
|
||||
{
|
||||
return read_seqcount_retry(&sl->seqcount, start);
|
||||
}
|
||||
|
||||
static inline int read_seqcount_retry(const seqcount_t *s, unsigned start)
|
||||
{
|
||||
smp_rmb();
|
||||
return __read_seqcount_retry(s, start);
|
||||
}
|
||||
```
|
||||
|
||||
it calls the `__read_seqcount_retry` function:
|
||||
|
||||
```C
|
||||
static inline int __read_seqcount_retry(const seqcount_t *s, unsigned start)
|
||||
{
|
||||
return unlikely(s->sequence != start);
|
||||
}
|
||||
```
|
||||
|
||||
which just compares the current value of the counter of the given `sequential lock` with the initial value of this counter. If the values differ, a writer has modified the data in the meantime. If the initial value obtained from the `read_seqbegin()` function is odd, a writer was in the middle of updating the data when our reader began to act. In both cases the data can be in an inconsistent state, so we need to try to read it again.
|
||||
|
||||
This is a common pattern in the Linux kernel. For example, you may remember the `jiffies` concept from the [first part](https://0xax.gitbooks.io/linux-insides/content/Timers/timers-1.html) of the [timers and time management in the Linux kernel](https://0xax.gitbooks.io/linux-insides/content/Timers/) chapter. A sequential lock is used to obtain the value of `jiffies_64` on architectures where a 64-bit value cannot be read in one step (on [x86_64](https://en.wikipedia.org/wiki/X86-64) the kernel can read it directly):
|
||||
|
||||
```C
|
||||
u64 get_jiffies_64(void)
|
||||
{
|
||||
unsigned long seq;
|
||||
u64 ret;
|
||||
|
||||
do {
|
||||
seq = read_seqbegin(&jiffies_lock);
|
||||
ret = jiffies_64;
|
||||
} while (read_seqretry(&jiffies_lock, seq));
|
||||
return ret;
|
||||
}
|
||||
```
|
||||
|
||||
Here we just read the value of the counter of the `jiffies_lock` sequential lock and then copy the value of the `jiffies_64` system variable to `ret`. As this is a `do/while` loop, the body of the loop is executed at least once. After the body of the loop has been executed, we read the current value of the counter of the `jiffies_lock` and compare it with the initial value. If these values are not equal, the loop is repeated, otherwise `get_jiffies_64` returns the value stored in `ret`.
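
Why the retry matters for a 64-bit counter can be shown with a deterministic, single-threaded model. The sketch below is an illustration only: the `toy_*` names are hypothetical, the 64-bit value is deliberately stored as two 32-bit halves, and a callback plays the role of a writer that sneaks in between the two half reads. It uses no locking or barriers, so it models the logic and not real concurrency.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical single-threaded model of a torn 64-bit read. */
struct toy_clock {
    unsigned sequence;   /* even: stable, odd: update in progress */
    uint32_t lo, hi;     /* 64-bit tick count split into two halves */
    unsigned retries;    /* how many times a reader had to start over */
};

/* Writer: bump the counter, update both halves, bump the counter again. */
static void toy_clock_set(struct toy_clock *c, uint64_t ticks)
{
    c->sequence++;
    c->lo = (uint32_t)ticks;
    c->hi = (uint32_t)(ticks >> 32);
    c->sequence++;
}

typedef void (*toy_hook)(struct toy_clock *c);

/*
 * Reader: 'interfere', if not NULL, runs once between the two half reads
 * and simulates a writer that interrupts the reader.
 */
static uint64_t toy_clock_get(struct toy_clock *c, toy_hook interfere)
{
    unsigned seq;
    uint32_t lo, hi;

    for (;;) {
        seq = c->sequence;
        lo = c->lo;
        if (interfere) {
            interfere(c);
            interfere = NULL;
        }
        hi = c->hi;
        if (c->sequence == seq)
            break;           /* no writer ran: lo and hi belong together */
        c->retries++;        /* a writer ran: throw the halves away */
    }
    return ((uint64_t)hi << 32) | lo;
}

/* Simulated writer: the tick count crosses the 32-bit boundary. */
static void toy_tick_over_boundary(struct toy_clock *c)
{
    toy_clock_set(c, 0x100000000ULL);
}
```

If the clock holds `0xFFFFFFFF` and `toy_tick_over_boundary` runs between the two half reads, a reader without the counter check would combine the old low half with the new high half and return `0x1FFFFFFFF`. With the check, the reader notices the changed counter, retries once and returns `0x100000000`.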
|
||||
|
||||
We just saw the first type of readers, which block neither writers nor other readers. Let's consider the second type. It does not update the value of the `sequential lock` counter, but just acquires the `spinlock`:
|
||||
|
||||
```C
|
||||
static inline void read_seqlock_excl(seqlock_t *sl)
|
||||
{
|
||||
spin_lock(&sl->lock);
|
||||
}
|
||||
```
|
||||
|
||||
So no writer and no other locking reader can access the protected data. When the reader finishes, the lock must be released with the:
|
||||
|
||||
```C
|
||||
static inline void read_sequnlock_excl(seqlock_t *sl)
|
||||
{
|
||||
spin_unlock(&sl->lock);
|
||||
}
|
||||
```
|
||||
|
||||
function.
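
The difference between a locking reader and a writer can be sketched in user space as follows. This is a model under stated assumptions, not kernel code: the `excl_*` names are hypothetical, an `atomic_flag` stands in for the spinlock, and `excl_write_trylock` is an illustrative helper that exists only so the blocking can be observed without spinning forever in a single thread.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical user-space model; not the kernel's seqlock_t. */
struct excl_seqlock {
    atomic_uint sequence;  /* touched by writers only */
    atomic_flag lock;      /* shared by writers and locking readers */
};

#define EXCL_SEQLOCK_INIT { 0, ATOMIC_FLAG_INIT }

/* Locking reader: takes the lock and leaves the counter alone. */
static void excl_read_lock(struct excl_seqlock *sl)
{
    while (atomic_flag_test_and_set(&sl->lock))
        ;  /* spin until the current holder leaves */
}

static void excl_read_unlock(struct excl_seqlock *sl)
{
    atomic_flag_clear(&sl->lock);
}

/* Illustrative helper: try to start a write without spinning. */
static bool excl_write_trylock(struct excl_seqlock *sl)
{
    if (atomic_flag_test_and_set(&sl->lock))
        return false;                    /* a locking reader or a writer holds it */
    atomic_fetch_add(&sl->sequence, 1);  /* counter becomes odd */
    return true;
}

static void excl_write_unlock(struct excl_seqlock *sl)
{
    atomic_fetch_add(&sl->sequence, 1);  /* counter becomes even again */
    atomic_flag_clear(&sl->lock);
}
```

While a locking reader holds the lock, a writer cannot start and the counter keeps its value, so lockless readers running at the same time are not forced to retry.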
|
||||
|
||||
Now we know how a `sequential lock` works for readers. Let's consider how a writer acts when it wants to acquire a `sequential lock` to modify data. To acquire a `sequential lock`, a writer should use the `write_seqlock` function. If we look at the implementation of this function:
|
||||
|
||||
```C
|
||||
static inline void write_seqlock(seqlock_t *sl)
|
||||
{
|
||||
spin_lock(&sl->lock);
|
||||
write_seqcount_begin(&sl->seqcount);
|
||||
}
|
||||
```
|
||||
|
||||
we will see that it acquires the `spinlock` to prevent access from other writers and calls the `write_seqcount_begin` function. This function, through the `raw_write_seqcount_begin` helper, just increments the value of the `sequential lock` counter:
|
||||
|
||||
```C
|
||||
static inline void raw_write_seqcount_begin(seqcount_t *s)
|
||||
{
|
||||
s->sequence++;
|
||||
smp_wmb();
|
||||
}
|
||||
```
|
||||
|
||||
When a writer finishes modifying the data, the `write_sequnlock` function must be called to release the lock and give access to other writers or readers. Let's look at the implementation of the `write_sequnlock` function. It looks pretty simple:
|
||||
|
||||
```C
|
||||
static inline void write_sequnlock(seqlock_t *sl)
|
||||
{
|
||||
write_seqcount_end(&sl->seqcount);
|
||||
spin_unlock(&sl->lock);
|
||||
}
|
||||
```
|
||||
|
||||
First of all it calls the `write_seqcount_end` function which, through the `raw_write_seqcount_end` helper, increases the value of the counter of the `sequential lock` again:
|
||||
|
||||
```C
|
||||
static inline void raw_write_seqcount_end(seqcount_t *s)
|
||||
{
|
||||
smp_wmb();
|
||||
s->sequence++;
|
||||
}
|
||||
```
|
||||
|
||||
and in the end we just call the `spin_unlock` macro to give access for other readers or writers.
|
||||
|
||||
That's all about the `sequential lock` mechanism in the Linux kernel. Of course we did not consider the full [API](https://en.wikipedia.org/wiki/Application_programming_interface) of this mechanism in this part. But all other functions are based on those which we described here. For example, the Linux kernel also provides some safe macros/functions to use the `sequential lock` mechanism together with [interrupt handlers](https://en.wikipedia.org/wiki/Interrupt_handler) or [softirq](https://0xax.gitbooks.io/linux-insides/content/interrupts/interrupts-9.html): `write_seqlock_irq` and `write_sequnlock_irq`:
|
||||
|
||||
```C
|
||||
static inline void write_seqlock_irq(seqlock_t *sl)
|
||||
{
|
||||
spin_lock_irq(&sl->lock);
|
||||
write_seqcount_begin(&sl->seqcount);
|
||||
}
|
||||
|
||||
static inline void write_sequnlock_irq(seqlock_t *sl)
|
||||
{
|
||||
write_seqcount_end(&sl->seqcount);
|
||||
spin_unlock_irq(&sl->lock);
|
||||
}
|
||||
```
|
||||
|
||||
As we may see, these functions differ only in how they acquire and release the spinlock. They call `spin_lock_irq` and `spin_unlock_irq` instead of `spin_lock` and `spin_unlock`.
|
||||
|
||||
Another example is the pair of `write_seqlock_irqsave` and `write_sequnlock_irqrestore` functions, which do the same but use the `spin_lock_irqsave` and `spin_unlock_irqrestore` macros so that they can be used in [IRQ](https://en.wikipedia.org/wiki/Interrupt_request_(PC_architecture)) handlers.
|
||||
|
||||
That's all.
|
||||
|
||||
Conclusion
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
This is the end of the sixth part of the [synchronization primitives](https://en.wikipedia.org/wiki/Synchronization_%28computer_science%29) chapter in the Linux kernel. In this part we met a new synchronization primitive which is called a `sequential lock`. From the theoretical side, this synchronization primitive is very similar to a [readers-writer lock](https://en.wikipedia.org/wiki/Readers%E2%80%93writer_lock) synchronization primitive, but it avoids the `writer-starving` issue.
|
||||
|
||||
If you have questions or suggestions, feel free to ping me on twitter [0xAX](https://twitter.com/0xAX), drop me an [email](anotherworldofworld@gmail.com) or just create an [issue](https://github.com/0xAX/linux-insides/issues/new).
|
||||
|
||||
**Please note that English is not my first language and I am really sorry for any inconvenience. If you found any mistakes please send me PR to [linux-insides](https://github.com/0xAX/linux-insides).**
|
||||
|
||||
Links
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
* [synchronization primitives](https://en.wikipedia.org/wiki/Synchronization_(computer_science))
|
||||
* [readers-writer lock](https://en.wikipedia.org/wiki/Readers%E2%80%93writer_lock)
|
||||
* [spinlock](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/sync-1.html)
|
||||
* [critical section](https://en.wikipedia.org/wiki/Critical_section)
|
||||
* [lock validator](https://www.kernel.org/doc/Documentation/locking/lockdep-design.txt)
|
||||
* [debugging](https://en.wikipedia.org/wiki/Debugging)
|
||||
* [API](https://en.wikipedia.org/wiki/Application_programming_interface)
|
||||
* [x86_64](https://en.wikipedia.org/wiki/X86-64)
|
||||
* [Timers and time management in the Linux kernel](https://0xax.gitbooks.io/linux-insides/content/Timers/)
|
||||
* [interrupt handlers](https://en.wikipedia.org/wiki/Interrupt_handler)
|
||||
* [softirq](https://0xax.gitbooks.io/linux-insides/content/interrupts/interrupts-9.html)
|
||||
* [IRQ](https://en.wikipedia.org/wiki/Interrupt_request_(PC_architecture))
|
||||
* [Previous part](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/sync-5.html)
|
@ -1,14 +1,14 @@
|
||||
# Interrupts and Interrupt Handling
|
||||
|
||||
You will find a couple of posts which describe interrupts and exceptions handling in the linux kernel.
|
||||
In the following posts, we will cover interrupts and exceptions handling in the linux kernel.
|
||||
|
||||
* [Interrupts and Interrupt Handling. Part 1.](https://github.com/0xAX/linux-insides/blob/master/interrupts/interrupts-1.md) - describes an interrupts handling theory.
|
||||
* [Start to dive into interrupts in the Linux kernel](https://github.com/0xAX/linux-insides/blob/master/interrupts/interrupts-2.md) - this part starts to describe interrupts and exceptions handling related stuff from the early stage.
|
||||
* [Early interrupt handlers](https://github.com/0xAX/linux-insides/blob/master/interrupts/interrupts-3.md) - third part describes early interrupt handlers.
|
||||
* [Interrupt handlers](https://github.com/0xAX/linux-insides/blob/master/interrupts/interrupts-4.md) - fourth part describes first non-early interrupt handlers.
|
||||
* [Implementation of exception handlers](https://github.com/0xAX/linux-insides/blob/master/interrupts/interrupts-5.md) - descripbes implementation of some exception handlers as double fault, divide by zero and etc.
|
||||
* [Handling Non-Maskable interrupts](https://github.com/0xAX/linux-insides/blob/master/interrupts/interrupts-6.md) - describes handling of non-maskable interrupts and the rest of interrupts handlers from the architecture-specific part.
|
||||
* [Dive into external hardware interrupts](https://github.com/0xAX/linux-insides/blob/master/interrupts/interrupts-7.md) - this part describes early initialization of code which is related to handling of external hardware interrupts.
|
||||
* [Non-early initialization of the IRQs](https://github.com/0xAX/linux-insides/blob/master/interrupts/interrupts-8.md) - this part describes non-early initialization of code which is related to handling of external hardware interrupts.
|
||||
* [Softirq, Tasklets and Workqueues](https://github.com/0xAX/linux-insides/blob/master/interrupts/interrupts-9.md) - this part describes softirqs, tasklets and workqueues concepts.
|
||||
* [](https://github.com/0xAX/linux-insides/blob/master/interrupts/interrupts-10.md) - this is the last part of the interrupts and interrupt handling chapter and here we will see a real hardware driver and interrupts related stuff.
|
||||
* [Interrupts and Interrupt Handling. Part 1.](https://github.com/0xAX/linux-insides/blob/master/interrupts/interrupts-1.md) - describes interrupts and interrupt handling theory.
|
||||
* [Interrupts in the Linux Kernel](https://github.com/0xAX/linux-insides/blob/master/interrupts/interrupts-2.md) - describes stuffs related to interrupts and exceptions handling from the early stage.
|
||||
* [Early interrupt handlers](https://github.com/0xAX/linux-insides/blob/master/interrupts/interrupts-3.md) - describes early interrupt handlers.
|
||||
* [Interrupt handlers](https://github.com/0xAX/linux-insides/blob/master/interrupts/interrupts-4.md) - describes first non-early interrupt handlers.
|
||||
* [Implementation of exception handlers](https://github.com/0xAX/linux-insides/blob/master/interrupts/interrupts-5.md) - describes implementation of some exception handlers such as double fault, divide by zero etc.
|
||||
* [Handling non-maskable interrupts](https://github.com/0xAX/linux-insides/blob/master/interrupts/interrupts-6.md) - describes handling of non-maskable interrupts and remaining interrupt handlers from the architecture-specific part.
|
||||
* [External hardware interrupts](https://github.com/0xAX/linux-insides/blob/master/interrupts/interrupts-7.md) - describes early initialization of code which is related to handling external hardware interrupts.
|
||||
* [Non-early initialization of the IRQs](https://github.com/0xAX/linux-insides/blob/master/interrupts/interrupts-8.md) - describes non-early initialization of code which is related to handling external hardware interrupts.
|
||||
* [Softirq, Tasklets and Workqueues](https://github.com/0xAX/linux-insides/blob/master/interrupts/interrupts-9.md) - describes softirqs, tasklets and workqueues concepts.
|
||||
* [](https://github.com/0xAX/linux-insides/blob/master/interrupts/interrupts-10.md) - this is the last part of the `Interrupts and Interrupt Handling` chapter and here we will see a real hardware driver and some interrupts related stuff.
|
||||
|