commit
763e6358b7
@ -0,0 +1,8 @@
|
||||
# System calls
|
||||
|
||||
This chapter describes the `system call` concept in the Linux kernel. You will find here a couple of posts which describe the system call concept and its implementation:
|
||||
|
||||
* [Introduction to system call concept](http://0xax.gitbooks.io/linux-insides/content/SysCall/syscall-1.html) - this part is an introduction to the `system call` concept in the Linux kernel.
|
||||
* [How the Linux kernel handles a system call](http://0xax.gitbooks.io/linux-insides/content/SysCall/syscall-2.html) - this part describes how the Linux kernel handles a system call from a userspace application.
|
||||
* [vsyscall and vDSO](http://0xax.gitbooks.io/linux-insides/content/SysCall/syscall-2.html) - this part describes the `vsyscall` and `vDSO` concepts.
|
@ -0,0 +1,415 @@
|
||||
System calls in the Linux kernel. Part 1.
================================================================================
|
||||
|
||||
Introduction
--------------------------------------------------------------------------------
|
||||
|
||||
This post opens up a new chapter in the [linux-insides](http://0xax.gitbooks.io/linux-insides/content/) book and, as you may understand from the title, this chapter is devoted to the [System call](https://en.wikipedia.org/wiki/System_call) concept in the Linux kernel. The choice of topic for this chapter is not accidental. In the previous [chapter](http://0xax.gitbooks.io/linux-insides/content/interrupts/index.html) we saw interrupts and interrupt handling. The concept of system calls is very similar to that of interrupts, because the most common way to implement system calls is as software interrupts. We will see many different aspects that are related to the system call concept. For example, we will learn what happens when a system call occurs from userspace, and we will see the implementation of a couple of system call handlers in the Linux kernel, the [VDSO](https://en.wikipedia.org/wiki/VDSO) and [vsyscall](https://lwn.net/Articles/446528/) concepts and much more.
|
||||
|
||||
Before we dive into the implementation of system calls in the Linux kernel source code, it is good to know some theory about system calls. Let's do it in the following paragraph.
|
||||
|
||||
System call. What is it?
--------------------------------------------------------------------------------
|
||||
|
||||
A system call is just a userspace request for a kernel service. Yes, the operating system kernel provides many services. When your program wants to write to or read from a file, start listening for connections on a [socket](https://en.wikipedia.org/wiki/Network_socket), delete or create a directory, or even finish its work, it uses a system call. In other words, a system call is just a [C](https://en.wikipedia.org/wiki/C_%28programming_language%29) function that is placed in the kernel space and a user program can ask the kernel to do something via this function.
|
||||
|
||||
The Linux kernel provides a set of these functions and each architecture provides its own set. For example, the [x86_64](https://en.wikipedia.org/wiki/X86-64) provides [322](https://github.com/torvalds/linux/blob/master/arch/x86/entry/syscalls/syscall_64.tbl) system calls and the [x86](https://en.wikipedia.org/wiki/X86) provides [358](https://github.com/torvalds/linux/blob/master/arch/x86/entry/syscalls/syscall_32.tbl) different system calls. OK, a system call is just a function. Let's look at a simple `Hello world` example that's written in the assembly programming language:
|
||||
|
||||
```assembly
.data

msg:
    .ascii "Hello, world!\n"
    len = . - msg

.text
    .global _start

_start:
    movq  $1, %rax      /* system call number: write */
    movq  $1, %rdi      /* first argument: file descriptor 1 (stdout) */
    movq  $msg, %rsi    /* second argument: address of the string */
    movq  $len, %rdx    /* third argument: number of bytes to write */
    syscall

    movq  $60, %rax     /* system call number: exit */
    xorq  %rdi, %rdi    /* first argument: exit status 0 */
    syscall
```
|
||||
|
||||
We can compile the above with the following commands:
|
||||
|
||||
```
|
||||
$ gcc -c test.S
|
||||
$ ld -o test test.o
|
||||
```
|
||||
|
||||
and run it as follows:
|
||||
|
||||
```
|
||||
$ ./test
|
||||
Hello, world!
|
||||
```
|
||||
|
||||
OK, what do we see here? This simple code represents a `Hello world` assembly program for the Linux `x86_64` architecture. We can see two sections here:
|
||||
|
||||
* `.data`
|
||||
* `.text`
|
||||
|
||||
The first section, `.data`, stores the initialized data of our program (the `Hello world` string and its length in our case). The second section, `.text`, contains the code of our program. We can split the code of our program into two parts: the first part is everything up to the first `syscall` instruction, and the second part is between the first and second `syscall` instructions. First of all, what does the `syscall` instruction do in our code and in general? As we can read in the [64-ia-32-architectures-software-developer-vol-2b-manual](http://www.intel.com/content/www/us/en/processors/architectures-software-developer-manuals.html):
|
||||
|
||||
```
SYSCALL invokes an OS system-call handler at privilege level 0. It does so by
loading RIP from the IA32_LSTAR MSR (after saving the address of the instruction
following SYSCALL into RCX). (The WRMSR instruction ensures that the
IA32_LSTAR MSR always contain a canonical address.)
...
...
...
SYSCALL loads the CS and SS selectors with values derived from bits 47:32 of the
IA32_STAR MSR. However, the CS and SS descriptor caches are not loaded from the
descriptors (in GDT or LDT) referenced by those selectors.

Instead, the descriptor caches are loaded with fixed values. It is the
responsibility of OS software to ensure that the descriptors (in GDT or LDT) referenced
by those selector values correspond to the fixed values loaded into the descriptor
caches; the SYSCALL instruction does not ensure this correspondence.
```
|
||||
|
||||
The kernel sets this up by writing the address of `entry_SYSCALL_64`, which is defined in the [arch/x86/entry/entry_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/entry/entry_64.S) assembler file and is the entry point for the `SYSCALL` instruction, to the `IA32_LSTAR` [Model specific register](https://en.wikipedia.org/wiki/Model-specific_register):
|
||||
|
||||
```C
|
||||
wrmsrl(MSR_LSTAR, entry_SYSCALL_64);
|
||||
```
|
||||
|
||||
in the [arch/x86/kernel/cpu/common.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/cpu/common.c) source code file.
|
||||
|
||||
So, the `syscall` instruction invokes a handler of a given system call. But how does it know which handler to call? Actually it gets this information from the general purpose [registers](https://en.wikipedia.org/wiki/Processor_register). As you can see in the system call [table](https://github.com/torvalds/linux/blob/master/arch/x86/entry/syscalls/syscall_64.tbl), each system call has a unique number. In our example, the first system call is `write`, which writes data to the given file. Let's look in the system call table and try to find the `write` system call. As we can see, the [write](https://github.com/torvalds/linux/blob/master/arch/x86/entry/syscalls/syscall_64.tbl#L10) system call has number `1`. We pass the number of this system call through the `rax` register in our example. The next general purpose registers, `%rdi`, `%rsi` and `%rdx`, take the parameters of the `write` syscall. In our case, they are the [file descriptor](https://en.wikipedia.org/wiki/File_descriptor) (`1` is [stdout](https://en.wikipedia.org/wiki/Standard_streams#Standard_output_.28stdout.29) in our case), the pointer to our string, and the size of the data. Yes, you heard right: parameters for a system call. As I already wrote above, a system call is just a `C` function in the kernel space. In our case the first system call is `write`. This system call is defined in the [fs/read_write.c](https://github.com/torvalds/linux/blob/master/fs/read_write.c) source code file and looks like:
|
||||
|
||||
```C
|
||||
SYSCALL_DEFINE3(write, unsigned int, fd, const char __user *, buf,
|
||||
size_t, count)
|
||||
{
|
||||
...
|
||||
...
|
||||
...
|
||||
}
|
||||
```
|
||||
|
||||
Or in other words:
|
||||
|
||||
```C
|
||||
ssize_t write(int fd, const void *buf, size_t nbytes);
|
||||
```
|
||||
|
||||
Don't worry about the `SYSCALL_DEFINE3` macro for now, we'll come back to it.
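
To connect the register-level example with the C view, here is a minimal sketch that issues the same `write` request from C through the generic `syscall()` function of the C library. The helper name `raw_write` is hypothetical and exists only for illustration; `SYS_write` comes from `<sys/syscall.h>`.

```C
#define _GNU_SOURCE
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Hypothetical helper: pass the system call number and the three
 * arguments of write to the kernel via the C library's syscall().
 * Returns the number of bytes written, or -1 with errno set.
 */
long raw_write(int fd, const char *text)
{
    return syscall(SYS_write, fd, text, strlen(text));
}
```

Calling `raw_write(1, "Hello, world!\n")` should print the string and return `14`, the length of the message, just like the assembly version.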
|
||||
|
||||
The second part of our example is the same, but we call another system call. In this case we call the [exit](https://github.com/torvalds/linux/blob/master/arch/x86/entry/syscalls/syscall_64.tbl#L69) system call. This system call takes only one parameter:
|
||||
|
||||
* Return value
|
||||
|
||||
and handles the way our program exits. We can pass the name of our program to the [strace](https://en.wikipedia.org/wiki/Strace) util and we will see our system calls:
|
||||
|
||||
```
|
||||
$ strace ./test
|
||||
execve("./test", ["./test"], [/* 62 vars */]) = 0
|
||||
write(1, "Hello, world!\n", 14Hello, world!
|
||||
) = 14
|
||||
_exit(0) = ?
|
||||
|
||||
+++ exited with 0 +++
|
||||
```
|
||||
|
||||
In the first line of the `strace` output, we can see the [execve](https://github.com/torvalds/linux/blob/master/arch/x86/entry/syscalls/syscall_64.tbl#L68) system call that executes our program, and the second and third are the system calls that we have used in our program: `write` and `exit`. Note that we pass the parameters through the general purpose registers in our example. The order of the registers is not accidental. It is defined by the following agreement - [x86-64 calling conventions](https://en.wikipedia.org/wiki/X86_calling_conventions#x86-64_calling_conventions). This and other agreements for the `x86_64` architecture are explained in a special document - [System V Application Binary Interface. PDF](http://www.x86-64.org/documentation/abi.pdf). In general, the arguments of a function are placed either in registers or pushed on the stack. The right order is:
|
||||
|
||||
* `rdi`;
|
||||
* `rsi`;
|
||||
* `rdx`;
|
||||
* `rcx`;
|
||||
* `r8`;
|
||||
* `r9`.
|
||||
|
||||
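for the first six parameters of a function. If a function has more than six arguments, the remaining parameters are placed on the stack. Note that the kernel interface differs in one detail: the `syscall` instruction saves the return address in `rcx`, so a system call takes its fourth argument in `r10` instead of `rcx`.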
|
||||
|
||||
We do not use system calls in our code directly, but our program uses them when we want to print something, check access to a file or just write or read something to it.
|
||||
|
||||
For example:
|
||||
|
||||
```C
#include <stdio.h>

int main(int argc, char **argv)
{
    FILE *fp;
    char buff[255];

    fp = fopen("test.txt", "r");
    if (fp == NULL)
        return 1;               /* test.txt is missing or unreadable */

    if (fgets(buff, 255, fp) != NULL)
        printf("%s\n", buff);
    fclose(fp);

    return 0;
}
```
|
||||
|
||||
There are no `fopen`, `fgets`, `printf` and `fclose` system calls in the Linux kernel, but `open`, `read`, `write` and `close` instead. I think you know that these four functions `fopen`, `fgets`, `printf` and `fclose` are just functions that are defined in the `C` [standard library](https://en.wikipedia.org/wiki/GNU_C_Library). Actually these functions are wrappers for the system calls. We do not call system calls directly in our code, but use [wrapper](https://en.wikipedia.org/wiki/Wrapper_function) functions from the standard library instead. The main reason for this is simple: a system call must be performed quickly, very quickly. As a system call must be quick, it must be small. The standard library takes responsibility for performing system calls with the correct set of parameters and makes different checks before it calls the given system call. Let's compile our program with the following command:
|
||||
|
||||
```
|
||||
$ gcc test.c -o test
|
||||
```
|
||||
|
||||
and look at it with the [ltrace](https://en.wikipedia.org/wiki/Ltrace) util:
|
||||
|
||||
```
|
||||
$ ltrace ./test
|
||||
__libc_start_main([ "./test" ] <unfinished ...>
|
||||
fopen("test.txt", "r") = 0x602010
|
||||
fgets("Hello World!\n", 255, 0x602010) = 0x7ffd2745e700
|
||||
puts("Hello World!\n"Hello World!
|
||||
|
||||
) = 14
|
||||
fclose(0x602010) = 0
|
||||
+++ exited (status 0) +++
|
||||
```
|
||||
|
||||
The `ltrace` util displays the set of userspace calls of a program. The `fopen` function opens the given text file, the `fgets` function reads the file content into the `buff` buffer, the `puts` function prints it to `stdout` and the `fclose` function closes the file. And as I already wrote, all of these functions call an appropriate system call. For example, `puts` calls the `write` system call inside. We can see it if we add the `-S` option to the `ltrace` program:
|
||||
|
||||
```
|
||||
write@SYS(1, "Hello World!\n\n", 14) = 14
|
||||
```
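
For comparison, the same file-reading idea can be sketched with the thin wrappers `open`, `read`, `write` and `close`, which map much more directly onto system calls than the buffered `stdio` functions do. The helper name `print_first_chunk` is hypothetical, and on recent C library versions `open` may itself be implemented with the `openat` system call.

```C
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper: copy the first chunk of a file to stdout using
 * the thin wrappers around the open, read, write and close system calls.
 * Returns the number of bytes read, or -1 on error.
 */
ssize_t print_first_chunk(const char *path)
{
    char buff[255];
    ssize_t nread;
    int fd = open(path, O_RDONLY);

    if (fd < 0)
        return -1;

    nread = read(fd, buff, sizeof(buff));
    if (nread > 0 && write(STDOUT_FILENO, buff, (size_t)nread) < 0)
        nread = -1;

    close(fd);
    return nread;
}
```

Running a program that calls this helper under `strace` should show one line per wrapper call, without the buffering layer that `fgets` and `puts` add.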
|
||||
|
||||
Yes, system calls are ubiquitous. Each program needs to open, write or read files and network connections, allocate memory and do many other things that can be provided only by the kernel. The [proc](https://en.wikipedia.org/wiki/Procfs) file system contains special files in the format `/proc/pid/syscall` that expose the system call number and argument registers for the system call currently being executed by the process. For example, pid 1, which is [systemd](https://en.wikipedia.org/wiki/Systemd) for me:
|
||||
|
||||
```
|
||||
$ sudo cat /proc/1/comm
|
||||
systemd
|
||||
|
||||
$ sudo cat /proc/1/syscall
|
||||
232 0x4 0x7ffdf82e11b0 0x1f 0xffffffff 0x100 0x7ffdf82e11bf 0x7ffdf82e11a0 0x7f9114681193
|
||||
```
|
||||
|
||||
Here we see the system call with number `232`, which is the [epoll_wait](https://github.com/torvalds/linux/blob/master/arch/x86/entry/syscalls/syscall_64.tbl#L241) system call that waits for an I/O event on an [epoll](https://en.wikipedia.org/wiki/Epoll) file descriptor. Or, for example, the `emacs` editor where I'm writing this part:
|
||||
|
||||
```
|
||||
$ ps ax | grep emacs
|
||||
2093 ? Sl 2:40 emacs
|
||||
|
||||
$ sudo cat /proc/2093/comm
|
||||
emacs
|
||||
|
||||
$ sudo cat /proc/2093/syscall
|
||||
270 0xf 0x7fff068a5a90 0x7fff068a5b10 0x0 0x7fff068a59c0 0x7fff068a59d0 0x7fff068a59b0 0x7f777dd8813c
|
||||
```
|
||||
|
||||
Here it is the system call with number `270`, the [sys_pselect6](https://github.com/torvalds/linux/blob/master/arch/x86/entry/syscalls/syscall_64.tbl#L279) system call, which allows `emacs` to monitor multiple file descriptors.
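
The first field of such a line can be pulled out with a few lines of C. This is a minimal sketch; the helper name `parse_syscall_nr` is hypothetical, and it only handles the leading decimal number, not the hexadecimal register values that follow.

```C
#include <errno.h>
#include <stdlib.h>

/*
 * Hypothetical helper: extract the system call number from one line of
 * /proc/<pid>/syscall, such as "232 0x4 0x7ffdf82e11b0 ...".
 * Returns 0 and stores the number in *nr, or -1 if the line does not
 * start with a decimal number (for example when it reads "running").
 */
int parse_syscall_nr(const char *line, long *nr)
{
    char *end;
    long value;

    errno = 0;
    value = strtol(line, &end, 10);
    if (end == line || errno != 0)
        return -1;

    *nr = value;
    return 0;
}
```

The resulting number can then be looked up in the system call table for the architecture, as we did by hand for `232` and `270`.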
|
||||
|
||||
Now we know a little about system calls: what they are and why we need them. So let's look at the `write` system call that our program used.
|
||||
|
||||
Implementation of write system call
--------------------------------------------------------------------------------
|
||||
|
||||
Let's look at the implementation of this system call directly in the source code of the Linux kernel. As we already know, the `write` system call is defined in the [fs/read_write.c](https://github.com/torvalds/linux/blob/master/fs/read_write.c) source code file and looks like this:
|
||||
|
||||
```C
SYSCALL_DEFINE3(write, unsigned int, fd, const char __user *, buf,
		size_t, count)
{
	struct fd f = fdget_pos(fd);
	ssize_t ret = -EBADF;

	if (f.file) {
		loff_t pos = file_pos_read(f.file);
		ret = vfs_write(f.file, buf, count, &pos);
		if (ret >= 0)
			file_pos_write(f.file, pos);
		fdput_pos(f);
	}

	return ret;
}
```
|
||||
|
||||
First of all, the `SYSCALL_DEFINE3` macro is defined in the [include/linux/syscalls.h](https://github.com/torvalds/linux/blob/master/include/linux/syscalls.h) header file and expands to the definition of the `sys_name(...)` function. Let's look at this macro:
|
||||
|
||||
```C
#define SYSCALL_DEFINE3(name, ...) SYSCALL_DEFINEx(3, _##name, __VA_ARGS__)

#define SYSCALL_DEFINEx(x, sname, ...)                \
        SYSCALL_METADATA(sname, x, __VA_ARGS__)       \
        __SYSCALL_DEFINEx(x, sname, __VA_ARGS__)
```
|
||||
|
||||
As we can see, the `SYSCALL_DEFINE3` macro takes the `name` parameter, which represents the name of a system call, and a variadic number of parameters. This macro just expands to the `SYSCALL_DEFINEx` macro, which takes the number of parameters of the given system call and the `_##name` stub for the future name of the system call (you can read more about token concatenation with `##` in the [documentation](https://gcc.gnu.org/onlinedocs/cpp/Concatenation.html) of [gcc](https://en.wikipedia.org/wiki/GNU_Compiler_Collection)). Next we can see the `SYSCALL_DEFINEx` macro. This macro expands to the two following macros:
|
||||
|
||||
* `SYSCALL_METADATA`;
|
||||
* `__SYSCALL_DEFINEx`.
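
The token concatenation trick is easy to try in user space. The following is a simplified sketch, not kernel code: all `DEMO_*` and `demo_*` names are hypothetical, and the macros handle only a single parameter.

```C
/* User-space analogue of the kernel macros; all DEMO_* names are hypothetical. */
#define DEMO_DEFINE1(name, ...) DEMO_DEFINEx(_##name, __VA_ARGS__)
#define DEMO_DEFINEx(sname, type, arg) long demo##sname(type arg)

/* Expands to: long demo_double_it(long value) */
DEMO_DEFINE1(double_it, long, value)
{
    return value * 2;
}
```

As with `SYSCALL_DEFINE3`, the macro only produces the function header; the body in braces that follows the macro invocation becomes the body of the generated function.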
|
||||
|
||||
The implementation of the first macro, `SYSCALL_METADATA`, depends on the `CONFIG_FTRACE_SYSCALLS` kernel configuration option. As we can understand from the name of this option, it enables a tracer that catches the syscall entry and exit events. If this kernel configuration option is enabled, the `SYSCALL_METADATA` macro initializes the `syscall_metadata` structure that is defined in the [include/trace/syscall.h](https://github.com/torvalds/linux/blob/master/include/trace/syscall.h) header file and contains different useful fields such as the name of a system call, the number of a system call in the system call [table](https://github.com/torvalds/linux/blob/master/arch/x86/entry/syscalls/syscall_64.tbl), the number of parameters of a system call, the list of parameter types, etc.:
|
||||
|
||||
```C
#define SYSCALL_METADATA(sname, nb, ...)                             \
	...                                                              \
	...                                                              \
	...                                                              \
	struct syscall_metadata __used                                   \
	  __syscall_meta_##sname = {                                     \
		.name           = "sys"#sname,                               \
		.syscall_nr     = -1,                                        \
		.nb_args        = nb,                                        \
		.types          = nb ? types_##sname : NULL,                 \
		.args           = nb ? args_##sname : NULL,                  \
		.enter_event    = &event_enter_##sname,                      \
		.exit_event     = &event_exit_##sname,                       \
		.enter_fields   = LIST_HEAD_INIT(__syscall_meta_##sname.enter_fields), \
	};                                                               \
	static struct syscall_metadata __used                            \
	  __attribute__((section("__syscalls_metadata")))                \
	 *__p_syscall_meta_##sname = &__syscall_meta_##sname;
```
|
||||
|
||||
If the `CONFIG_FTRACE_SYSCALLS` kernel option is not enabled during kernel configuration, the `SYSCALL_METADATA` macro expands to an empty string:
|
||||
|
||||
```C
|
||||
#define SYSCALL_METADATA(sname, nb, ...)
|
||||
```
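
To see how the `"sys"#sname` stringification and the `##` concatenation work together, here is a much simplified user-space sketch. The `demo_metadata` structure and the `DEMO_METADATA` macro are hypothetical stand-ins and keep only a few of the fields of the real `syscall_metadata` structure.

```C
#include <stddef.h>

/* Hypothetical, simplified stand-in for the kernel's struct syscall_metadata. */
struct demo_metadata {
    const char *name;
    int syscall_nr;
    int nb_args;
    const char **types;
};

/* "sys" #sname stringifies sname and joins it with the adjacent literal. */
#define DEMO_METADATA(sname, nb, ...)                              \
    static const char *demo_types_##sname[] = { __VA_ARGS__ };     \
    static struct demo_metadata demo_meta_##sname = {              \
        .name       = "sys" #sname,                                \
        .syscall_nr = -1,                                          \
        .nb_args    = nb,                                          \
        .types      = nb ? demo_types_##sname : NULL,              \
    }

/* Defines demo_meta__write with the name "sys_write" and three argument types. */
DEMO_METADATA(_write, 3, "unsigned int", "const char *", "size_t");
```

The `syscall_nr` field starts out as `-1` here as well, since the number is not known at the point where the metadata is defined.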
|
||||
|
||||
The second macro, `__SYSCALL_DEFINEx`, expands to the following five function declarations and definitions:
|
||||
|
||||
```C
#define __SYSCALL_DEFINEx(x, name, ...)                                 \
        asmlinkage long sys##name(__MAP(x,__SC_DECL,__VA_ARGS__))       \
                __attribute__((alias(__stringify(SyS##name))));         \
                                                                        \
        static inline long SYSC##name(__MAP(x,__SC_DECL,__VA_ARGS__));  \
                                                                        \
        asmlinkage long SyS##name(__MAP(x,__SC_LONG,__VA_ARGS__));      \
                                                                        \
        asmlinkage long SyS##name(__MAP(x,__SC_LONG,__VA_ARGS__))       \
        {                                                               \
                long ret = SYSC##name(__MAP(x,__SC_CAST,__VA_ARGS__));  \
                __MAP(x,__SC_TEST,__VA_ARGS__);                         \
                __PROTECT(x, ret,__MAP(x,__SC_ARGS,__VA_ARGS__));       \
                return ret;                                             \
        }                                                               \
                                                                        \
        static inline long SYSC##name(__MAP(x,__SC_DECL,__VA_ARGS__))
```
|
||||
|
||||
The first, `sys##name`, is the definition of the syscall handler function with the given name - `sys_system_call_name`. The `__SC_DECL` macro takes the `__VA_ARGS__` and combines the type and the name of each call input parameter, because the macro definition is unable to determine the parameter types. The `__MAP` macro applies the `__SC_DECL` macro to the `__VA_ARGS__` arguments. The other functions that are generated by the `__SYSCALL_DEFINEx` macro are needed to protect against [CVE-2009-0029](https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2009-0029) and we will not dive into details about this here. OK, as a result of the `SYSCALL_DEFINE3` macro, we will have:
|
||||
|
||||
```C
|
||||
asmlinkage long sys_write(unsigned int fd, const char __user * buf, size_t count);
|
||||
```
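
The pattern behind the extra `SyS##name` and `SYSC##name` functions can be sketched in user space as follows. This is only an illustration of the idea as it appears in the macro above (an outer function that receives every argument as a `long` and an inner function with the real types); the names `demo_entry` and `demo_inner` are hypothetical.

```C
#include <stddef.h>

/* Inner function with the real parameter types (hypothetical names). */
static inline long demo_inner(unsigned int fd, size_t count)
{
    return (long)fd + (long)count;
}

/*
 * Outer entry point: every argument arrives as a full-width long and is
 * cast down explicitly, mirroring the SyS##name -> SYSC##name pattern.
 */
long demo_entry(long fd, long count)
{
    return demo_inner((unsigned int)fd, (size_t)count);
}
```

With this shape, any unexpected upper bits in a register-sized argument are dropped by the explicit cast before the real implementation sees the value.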
|
||||
|
||||
Now we know a little about the system call's definition and we can go back to the implementation of the `write` system call. Let's look at the implementation of this system call again:
|
||||
|
||||
```C
SYSCALL_DEFINE3(write, unsigned int, fd, const char __user *, buf,
		size_t, count)
{
	struct fd f = fdget_pos(fd);
	ssize_t ret = -EBADF;

	if (f.file) {
		loff_t pos = file_pos_read(f.file);
		ret = vfs_write(f.file, buf, count, &pos);
		if (ret >= 0)
			file_pos_write(f.file, pos);
		fdput_pos(f);
	}

	return ret;
}
```
|
||||
|
||||
As we already know and can see from the code, it takes three arguments:
|
||||
|
||||
* `fd` - file descriptor;
|
||||
* `buf` - buffer to write;
|
||||
* `count` - length of buffer to write.
|
||||
|
||||
and writes data from a buffer declared by the user to a given device or file. Note that the second parameter, `buf`, is defined with the `__user` attribute. The main purpose of this attribute is checking the Linux kernel code with the [sparse](https://en.wikipedia.org/wiki/Sparse) util. It is defined in the [include/linux/compiler.h](https://github.com/torvalds/linux/blob/master/include/linux/compiler.h) header file and depends on the `__CHECKER__` definition in the Linux kernel. That's all the useful meta-information related to our `sys_write` system call, so let's try to understand how this system call is implemented. As we can see, it starts with the definition of `f`, which has the `fd` structure type that represents a file descriptor in the Linux kernel, and we store the result of the call to the `fdget_pos` function in it. The `fdget_pos` function is defined in the same [source](https://github.com/torvalds/linux/blob/master/fs/read_write.c) code file and just wraps the call to the `__to_fd` function:
|
||||
|
||||
```C
static inline struct fd fdget_pos(int fd)
{
	return __to_fd(__fdget_pos(fd));
}
```
|
||||
|
||||
The main purpose of `fdget_pos` is to convert the given file descriptor, which is just a number, to the `fd` structure. Through a long chain of function calls, the `fdget_pos` function gets the file descriptor table of the current process, `current->files`, and tries to find the corresponding file descriptor number there. Once we have the `fd` structure for the given file descriptor number, we check it and return if the file does not exist. We get the current position in the file with a call to the `file_pos_read` function, which just returns the `f_pos` field of our file:
|
||||
|
||||
```C
static inline loff_t file_pos_read(struct file *file)
{
	return file->f_pos;
}
```
|
||||
|
||||
and call the `vfs_write` function. The `vfs_write` function is defined in the [fs/read_write.c](https://github.com/torvalds/linux/blob/master/fs/read_write.c) source code file and does the work for us: it writes the given buffer to the given file starting from the given position. We will not dive into details about the `vfs_write` function, because this function is only weakly related to the `system call` concept and mostly belongs to the [Virtual file system](https://en.wikipedia.org/wiki/Virtual_file_system) concept, which we will see in another chapter. After `vfs_write` has finished its work, we check the result and, if it finished successfully, we change the position in the file with the `file_pos_write` function:
|
||||
|
||||
```C
|
||||
if (ret >= 0)
|
||||
file_pos_write(f.file, pos);
|
||||
```
|
||||
|
||||
that just updates `f_pos` with the given position in the given file:
|
||||
|
||||
```C
static inline void file_pos_write(struct file *file, loff_t pos)
{
	file->f_pos = pos;
}
```
|
||||
|
||||
At the end of our `write` system call handler, we can see the call of the following function:
|
||||
|
||||
```C
|
||||
fdput_pos(f);
|
||||
```
|
||||
|
||||
which unlocks the `f_pos_lock` mutex that protects the file position during concurrent writes from threads that share the file descriptor.
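
The overall control flow of the handler (look up the descriptor, read the position, write, store the new position) can be modelled in user space. The sketch below is a toy model under strong assumptions: the "files" are fixed-size in-memory buffers, there is no locking, and all `demo_*` names are hypothetical and do not exist in the kernel.

```C
#include <errno.h>
#include <stddef.h>
#include <string.h>

#define DEMO_MAX_FILES 4
#define DEMO_FILE_SIZE 64

/* Toy stand-in for struct file: a position and a small in-memory buffer. */
struct demo_file {
    long f_pos;
    char data[DEMO_FILE_SIZE];
};

static struct demo_file demo_files[DEMO_MAX_FILES];

/* Toy stand-in for fdget_pos: map a descriptor number to a file or NULL. */
static struct demo_file *demo_fdget(unsigned int fd)
{
    return fd < DEMO_MAX_FILES ? &demo_files[fd] : NULL;
}

/* Toy stand-in for vfs_write: copy into the buffer and advance *pos. */
static long demo_vfs_write(struct demo_file *file, const char *buf,
                           size_t count, long *pos)
{
    size_t room = DEMO_FILE_SIZE - (size_t)*pos;

    if (count > room)
        count = room;               /* short write when the buffer is full */
    memcpy(file->data + *pos, buf, count);
    *pos += (long)count;
    return (long)count;
}

/* Same shape as the write handler: lookup, read pos, write, store pos. */
long demo_sys_write(unsigned int fd, const char *buf, size_t count)
{
    struct demo_file *file = demo_fdget(fd);
    long ret = -EBADF;

    if (file) {
        long pos = file->f_pos;
        ret = demo_vfs_write(file, buf, count, &pos);
        if (ret >= 0)
            file->f_pos = pos;
    }

    return ret;
}
```

Two consecutive calls on the same descriptor append one after the other because the position is stored back after each successful write, while an unknown descriptor yields `-EBADF` without touching any file.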
|
||||
|
||||
That's all.
|
||||
|
||||
We have seen the partial implementation of one system call provided by the Linux kernel. Of course we have missed some parts in the implementation of the `write` system call, because as I mentioned above, we look only at system call related code in this chapter and not at code related to other subsystems, such as the [Virtual file system](https://en.wikipedia.org/wiki/Virtual_file_system).
|
||||
|
||||
Conclusion
--------------------------------------------------------------------------------
|
||||
|
||||
This concludes the first part covering system call concepts in the Linux kernel. We have covered the theory of system calls so far and in the next part we will continue to dive into this topic, touching Linux kernel code related to system calls.
|
||||
|
||||
If you have questions or suggestions, feel free to ping me on twitter [0xAX](https://twitter.com/0xAX), drop me an [email](anotherworldofworld@gmail.com) or just create an [issue](https://github.com/0xAX/linux-internals/issues/new).
|
||||
|
||||
**Please note that English is not my first language and I am really sorry for any inconvenience. If you find any mistakes please send me a PR to [linux-insides](https://github.com/0xAX/linux-insides).**
|
||||
|
||||
Links
--------------------------------------------------------------------------------
|
||||
|
||||
* [system call](https://en.wikipedia.org/wiki/System_call)
|
||||
* [vdso](https://en.wikipedia.org/wiki/VDSO)
|
||||
* [vsyscall](https://lwn.net/Articles/446528/)
|
||||
* [general purpose registers](https://en.wikipedia.org/wiki/Processor_register)
|
||||
* [socket](https://en.wikipedia.org/wiki/Network_socket)
|
||||
* [C programming language](https://en.wikipedia.org/wiki/C_%28programming_language%29)
|
||||
* [x86](https://en.wikipedia.org/wiki/X86)
|
||||
* [x86_64](https://en.wikipedia.org/wiki/X86-64)
|
||||
* [x86-64 calling conventions](https://en.wikipedia.org/wiki/X86_calling_conventions#x86-64_calling_conventions)
|
||||
* [System V Application Binary Interface. PDF](http://www.x86-64.org/documentation/abi.pdf)
|
||||
* [GCC](https://en.wikipedia.org/wiki/GNU_Compiler_Collection)
|
||||
* [Intel manual. PDF](http://www.intel.com/content/www/us/en/processors/architectures-software-developer-manuals.html)
|
||||
* [system call table](https://github.com/torvalds/linux/blob/master/arch/x86/entry/syscalls/syscall_64.tbl)
|
||||
* [GCC macro documentation](https://gcc.gnu.org/onlinedocs/cpp/Concatenation.html)
|
||||
* [file descriptor](https://en.wikipedia.org/wiki/File_descriptor)
|
||||
* [stdout](https://en.wikipedia.org/wiki/Standard_streams#Standard_output_.28stdout.29)
|
||||
* [strace](https://en.wikipedia.org/wiki/Strace)
|
||||
* [standard library](https://en.wikipedia.org/wiki/GNU_C_Library)
|
||||
* [wrapper functions](https://en.wikipedia.org/wiki/Wrapper_function)
|
||||
* [ltrace](https://en.wikipedia.org/wiki/Ltrace)
|
||||
* [sparse](https://en.wikipedia.org/wiki/Sparse)
|
||||
* [proc file system](https://en.wikipedia.org/wiki/Procfs)
|
||||
* [Virtual file system](https://en.wikipedia.org/wiki/Virtual_file_system)
|
||||
* [systemd](https://en.wikipedia.org/wiki/Systemd)
|
||||
* [epoll](https://en.wikipedia.org/wiki/Epoll)
|
||||
* [Previous chapter](http://0xax.gitbooks.io/linux-insides/content/interrupts/index.html)
|
@ -0,0 +1,409 @@
|
||||
System calls in the Linux kernel. Part 2.
================================================================================
|
||||
|
||||
How does the Linux kernel handle a system call
--------------------------------------------------------------------------------
|
||||
|
||||
The previous [part](http://0xax.gitbooks.io/linux-insides/content/SysCall/syscall-1.html) was the first part of the chapter that describes the [system call](https://en.wikipedia.org/wiki/System_call) concepts in the Linux kernel.
|
||||
In the previous part we learned what a system call is in the Linux kernel, and in operating systems in general. This was introduced from a user-space perspective, and part of the [write](http://man7.org/linux/man-pages/man2/write.2.html) system call implementation was discussed. In this part we continue our look at system calls, starting with some theory before moving onto the Linux kernel code.
|
||||
|
||||
A user application does not make a system call directly. We did not write the `Hello world!` program like:
|
||||
|
||||
```C
int main(int argc, char **argv)
{
	...
	...
	...
	sys_write(fd1, buf, strlen(buf));
	...
	...
}
```
|
||||
|
||||
We can use something similar with the help of the [C standard library](https://en.wikipedia.org/wiki/GNU_C_Library) and it will look something like this:
|
||||
|
||||
```C
#include <unistd.h>

int main(int argc, char **argv)
{
	...
	...
	...
	write(fd1, buf, strlen(buf));
	...
	...
}
```
|
||||
|
||||
But anyway, `write` is not a direct system call and not a kernel function. An application must fill general purpose registers with the correct values in the correct order and use the `syscall` instruction to make the actual system call. In this part we will look at what occurs in the Linux kernel when the `syscall` instruction is met by the processor.
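
One visible difference between the kernel function and the library wrapper is error reporting: the handler returns a negative error code such as `-EBADF`, while the C library reports failure as `-1` and stores the code in `errno`. The following minimal sketch shows this from user space; the helper name `errno_for_bad_write` is hypothetical.

```C
#define _GNU_SOURCE
#include <errno.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Hypothetical helper: issue write on an invalid descriptor through
 * the C library's syscall() and report the errno value that it sets.
 * Returns 0 if the call unexpectedly succeeds.
 */
int errno_for_bad_write(void)
{
    errno = 0;
    if (syscall(SYS_write, -1, "x", (size_t)1) == -1)
        return errno;
    return 0;
}
```

On Linux this helper is expected to return `EBADF`, the same code that the `write` handler uses as its default return value when the descriptor lookup fails.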
|
||||
|
||||
Initialization of the system calls table
--------------------------------------------------------------------------------
|
||||
|
||||
From the previous part we know that the system call concept is very similar to an interrupt. Furthermore, system calls are traditionally implemented as software interrupts. So, when the processor executes a `syscall` instruction from a user application, control is transferred to a handler in the kernel, much like an exception transfers control to an exception handler. As we know, all exception handlers (or in other words kernel [C](https://en.wikipedia.org/wiki/C_%28programming_language%29) functions that react to an exception) are placed in the kernel code. But how does the Linux kernel find the address of the necessary system call handler for the related system call? The Linux kernel contains a special table called the `system call table`. The system call table is represented by the `sys_call_table` array in the Linux kernel, which is defined in the [arch/x86/entry/syscall_64.c](https://github.com/torvalds/linux/blob/master/arch/x86/entry/syscall_64.c) source code file. Let's look at its implementation:
|
||||
|
||||
```C
asmlinkage const sys_call_ptr_t sys_call_table[__NR_syscall_max+1] = {
	[0 ... __NR_syscall_max] = &sys_ni_syscall,
#include <asm/syscalls_64.h>
};
```
|
||||
|
||||
As we can see, `sys_call_table` is an array of `__NR_syscall_max + 1` elements, where the `__NR_syscall_max` macro represents the highest system call number for the given [architecture](https://en.wikipedia.org/wiki/List_of_CPU_architectures). This book is about the [x86_64](https://en.wikipedia.org/wiki/X86-64) architecture, so in our case `__NR_syscall_max` is `322`, which is the correct number at the time of writing (the current Linux kernel version is `4.2.0-rc8+`). We can see this macro in the header file generated by [Kbuild](https://www.kernel.org/doc/Documentation/kbuild/makefiles.txt) during kernel compilation - `include/generated/asm-offsets.h`:
|
||||
|
||||
```C
|
||||
#define __NR_syscall_max 322
|
||||
```
|
||||
|
||||
There will be the same number of system calls in the [arch/x86/entry/syscalls/syscall_64.tbl](https://github.com/torvalds/linux/blob/master/arch/x86/entry/syscalls/syscall_64.tbl#L331) for the `x86_64`. There are two important topics here: the type of the `sys_call_table` array, and the initialization of elements in this array. First of all, the type. The `sys_call_ptr_t` type represents a pointer to a system call handler, that is, one entry of the system call table. It is defined as a [typedef](https://en.wikipedia.org/wiki/Typedef) for a pointer to a function that returns nothing and does not take arguments:
|
||||
|
||||
```C
|
||||
typedef void (*sys_call_ptr_t)(void);
|
||||
```
|
||||
|
||||
The second thing is the initialization of the `sys_call_table` array. As we can see in the code above, all elements of our array that contain pointers to the system call handlers point to the `sys_ni_syscall`. The `sys_ni_syscall` function represents not-implemented system calls. To start with, all elements of the `sys_call_table` array point to the not-implemented system call. This is the correct initial behaviour, because at this point we only initialize the storage for the pointers to the system call handlers; it is populated later on. The implementation of the `sys_ni_syscall` is pretty easy, it just returns [-errno](http://man7.org/linux/man-pages/man3/errno.3.html), or `-ENOSYS` in our case:
|
||||
|
||||
```C
|
||||
asmlinkage long sys_ni_syscall(void)
|
||||
{
|
||||
return -ENOSYS;
|
||||
}
|
||||
```
|
||||
|
||||
The `-ENOSYS` error tells us that:
|
||||
|
||||
```
|
||||
ENOSYS Function not implemented (POSIX.1)
|
||||
```
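From user space this value is normally not seen directly. The following is a minimal sketch, assuming the usual Linux convention that a raw return value in the range `-4095..-1` encodes `-errno`, of how a C library wrapper could translate the kernel's `-ENOSYS` into the familiar `-1` plus `errno`. The `demo_syscall_ret` name and the `DEMO_MAX_ERRNO` constant are hypothetical and are not taken from the kernel or from glibc:

```C
#include <errno.h>

/* Assumption: raw kernel return values in [-4095, -1] encode -errno. */
#define DEMO_MAX_ERRNO 4095

/*
 * Hypothetical helper: converts a raw system call return value into the
 * user space convention (-1 with errno set on failure, value otherwise).
 */
long demo_syscall_ret(long raw)
{
	if (raw < 0 && raw >= -DEMO_MAX_ERRNO) {
		errno = (int)-raw;	/* e.g. -ENOSYS becomes errno = ENOSYS */
		return -1;
	}
	return raw;
}
```

With this sketch, passing `-ENOSYS` gives `-1` and leaves `ENOSYS` in `errno`, while a non-negative result is returned unchanged.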
|
||||
|
||||
Also a note on the `...` in the initialization of the `sys_call_table`. We can do it with a [GCC](https://en.wikipedia.org/wiki/GNU_Compiler_Collection) compiler extension called [Designated Initializers](https://gcc.gnu.org/onlinedocs/gcc/Designated-Inits.html). This extension allows us to initialize elements in non-fixed order and to initialize a whole range of elements with the same value. As you can see, we include the `asm/syscalls_64.h` header at the end of the array. This header file is generated by the special script at [arch/x86/entry/syscalls/syscalltbl.sh](https://github.com/torvalds/linux/blob/master/arch/x86/entry/syscalls/syscalltbl.sh) from the [syscall table](https://github.com/torvalds/linux/blob/master/arch/x86/entry/syscalls/syscall_64.tbl). The `asm/syscalls_64.h` contains invocations of the following macros:
|
||||
|
||||
```C
|
||||
__SYSCALL_COMMON(0, sys_read, sys_read)
|
||||
__SYSCALL_COMMON(1, sys_write, sys_write)
|
||||
__SYSCALL_COMMON(2, sys_open, sys_open)
|
||||
__SYSCALL_COMMON(3, sys_close, sys_close)
|
||||
__SYSCALL_COMMON(5, sys_newfstat, sys_newfstat)
|
||||
...
|
||||
...
|
||||
...
|
||||
```
|
||||
|
||||
The `__SYSCALL_COMMON` macro is defined in the same source code [file](https://github.com/torvalds/linux/blob/master/arch/x86/entry/syscall_64.c) and expands to the `__SYSCALL_64` macro, which in turn expands to a designated initializer for one element of the array:
|
||||
|
||||
```C
|
||||
#define __SYSCALL_COMMON(nr, sym, compat) __SYSCALL_64(nr, sym, compat)
|
||||
#define __SYSCALL_64(nr, sym, compat) [nr] = sym,
|
||||
```
|
||||
|
||||
So, after this, our `sys_call_table` takes the following form:
|
||||
|
||||
```C
|
||||
asmlinkage const sys_call_ptr_t sys_call_table[__NR_syscall_max+1] = {
|
||||
[0 ... __NR_syscall_max] = &sys_ni_syscall,
|
||||
[0] = sys_read,
|
||||
[1] = sys_write,
|
||||
[2] = sys_open,
|
||||
...
|
||||
...
|
||||
...
|
||||
};
|
||||
```
|
||||
|
||||
After this all elements that point to the non-implemented system calls will contain the address of the `sys_ni_syscall` function that just returns `-ENOSYS` as we saw above, and other elements will point to the `sys_syscall_name` functions.
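The same technique can be tried in user space. The following is a minimal sketch, not kernel code: the `demo_*` names, the tiny number space and the marker return values are all made up for illustration, and the `DEMO_SYSCALL_LIST` macro plays the role of the generated `asm/syscalls_64.h` header. It relies on the GCC/Clang range extension `[first ... last]`:

```C
#include <errno.h>

/* Hypothetical, tiny "system call" space: numbers 0..DEMO_NR_MAX. */
#define DEMO_NR_MAX 7

typedef long (*demo_call_ptr_t)(void);

/* Stand-ins for real handlers; the return values are arbitrary markers. */
static long demo_ni_call(void) { return -ENOSYS; }
static long demo_read(void)    { return 100; }
static long demo_write(void)   { return 101; }
static long demo_close(void)   { return 103; }

/* Plays the role of the generated asm/syscalls_64.h header. */
#define DEMO_SYSCALL_LIST \
	DEMO_SYSCALL(0, demo_read) \
	DEMO_SYSCALL(1, demo_write) \
	DEMO_SYSCALL(3, demo_close)

/* Each list entry expands to one designated initializer: [nr] = sym, */
#define DEMO_SYSCALL(nr, sym) [nr] = sym,

static const demo_call_ptr_t demo_call_table[DEMO_NR_MAX + 1] = {
	/* Range extension: every slot starts as "not implemented". */
	[0 ... DEMO_NR_MAX] = demo_ni_call,
	/* Later initializers override the default for implemented numbers. */
	DEMO_SYSCALL_LIST
};

/* Looks up and calls the handler; out-of-range numbers also give -ENOSYS. */
long demo_dispatch(unsigned long nr)
{
	if (nr > DEMO_NR_MAX)
		return -ENOSYS;
	return demo_call_table[nr]();
}
```

Calling `demo_dispatch(2)` ends up in `demo_ni_call`, because number `2` is absent from the list and keeps the default value. Note that some compilers warn about the overridden initializers when extra warnings are enabled.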
|
||||
|
||||
At this point, we have filled the system call table and the Linux kernel knows where each system call handler is. But the Linux kernel does not call a `sys_syscall_name` function immediately after it is instructed to handle a system call from a user space application. Remember the [chapter](http://0xax.gitbooks.io/linux-insides/content/interrupts/index.html) about interrupts and interrupt handling. When the Linux kernel gets control to handle an interrupt, it has to do some preparations like saving user space registers, switching to a new stack and many more tasks before it calls an interrupt handler. The situation is the same with system call handling. The preparation for handling a system call is the first thing, but before the Linux kernel starts these preparations, the entry point of a system call must be initialized, and only the Linux kernel knows how to perform this preparation. In the next paragraph we will see the process of the initialization of the system call entry in the Linux kernel.
|
||||
|
||||
Initialization of the system call entry
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
When a system call occurs in the system, where are the first bytes of code that start to handle it? As we can read in the Intel manual - [64-ia-32-architectures-software-developer-vol-2b-manual](http://www.intel.com/content/www/us/en/processors/architectures-software-developer-manuals.html):
|
||||
|
||||
```
|
||||
SYSCALL invokes an OS system-call handler at privilege level 0.
|
||||
It does so by loading RIP from the IA32_LSTAR MSR
|
||||
```
|
||||
|
||||
It means that we need to put the system call entry into the `IA32_LSTAR` [model specific register](https://en.wikipedia.org/wiki/Model-specific_register). This operation takes place during the Linux kernel initialization process. If you have read the fourth [part](http://0xax.gitbooks.io/linux-insides/content/interrupts/interrupts-4.html) of the chapter that describes interrupts and interrupt handling in the Linux kernel, you know that the Linux kernel calls the `trap_init` function during the initialization process. This function is defined in the [arch/x86/kernel/traps.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/traps.c) source code file and executes the initialization of the `non-early` exception handlers like divide error, [coprocessor](https://en.wikipedia.org/wiki/Coprocessor) error etc. Besides the initialization of the `non-early` exception handlers, this function calls the `cpu_init` function from the [arch/x86/kernel/cpu/common.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/cpu/common.c) source code file which, besides the initialization of `per-cpu` state, calls the `syscall_init` function from the same source code file.
|
||||
|
||||
This function performs the initialization of the system call entry point. Let's look at the implementation of this function. It does not take parameters and first of all it fills two model specific registers:
|
||||
|
||||
```C
|
||||
wrmsrl(MSR_STAR, ((u64)__USER32_CS)<<48 | ((u64)__KERNEL_CS)<<32);
|
||||
wrmsrl(MSR_LSTAR, entry_SYSCALL_64);
|
||||
```
|
||||
|
||||
The first model specific register - `MSR_STAR` holds the user code segment selector in bits `63:48`. These bits are used as the base for the `CS` and `SS` segment registers loaded by the `sysret` instruction, which provides functionality to return from a system call to user code with the related privilege. The `MSR_STAR` also holds the kernel code segment selector in bits `47:32`, which is used as the base selector for the `CS` and `SS` segment registers when user space applications execute a system call. In the second line of code we fill the `MSR_LSTAR` register with the `entry_SYSCALL_64` symbol that represents the system call entry. The `entry_SYSCALL_64` is defined in the [arch/x86/entry/entry_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/entry/entry_64.S) assembly file and contains code related to the preparation performed before a system call handler is executed (I already wrote about these preparations, read above). We will not consider the `entry_SYSCALL_64` now, but will return to it later in this chapter.
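To make the bit layout more tangible, here is a small user space sketch of how the `MSR_STAR` value is composed and how the selectors are derived from it again. It does not touch any real MSR. The selector values `0x10` and `0x23` are assumptions for illustration (they correspond to the usual `x86_64` Linux GDT layout as far as I know, but check your own kernel), the `+8`/`+16` offsets follow my reading of the Intel manual's description of `syscall` and 64-bit `sysret`, and all `demo_*` names are hypothetical:

```C
#include <stdint.h>

/* Assumed selector values, for illustration only. */
#define DEMO_KERNEL_CS  0x10
#define DEMO_USER32_CS  0x23

/* Mirrors the argument of wrmsrl(MSR_STAR, ...) in syscall_init. */
uint64_t demo_star_value(uint16_t user32_cs, uint16_t kernel_cs)
{
	return ((uint64_t)user32_cs << 48) | ((uint64_t)kernel_cs << 32);
}

/* Selectors loaded by syscall: CS = STAR[47:32], SS = STAR[47:32] + 8. */
uint16_t demo_syscall_cs(uint64_t star)
{
	return (uint16_t)(star >> 32);
}

uint16_t demo_syscall_ss(uint64_t star)
{
	return (uint16_t)(demo_syscall_cs(star) + 8);
}

/* Selectors loaded by a 64-bit sysret: based on STAR[63:48], RPL forced to 3. */
uint16_t demo_sysret_cs(uint64_t star)
{
	return (uint16_t)(((star >> 48) + 16) | 3);
}

uint16_t demo_sysret_ss(uint64_t star)
{
	return (uint16_t)(((star >> 48) + 8) | 3);
}
```

With the assumed values, the composed register is `0x0023001000000000`, the kernel side gets `CS = 0x10` and `SS = 0x18`, and the return to 64-bit user code gets `CS = 0x33` and `SS = 0x2b`.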
|
||||
|
||||
After we have set the entry point for system calls, we need to set the following model specific registers:
|
||||
|
||||
* `MSR_CSTAR` - target `rip` for the compatibility mode callers;
|
||||
* `MSR_IA32_SYSENTER_CS` - target `cs` for the `sysenter` instruction;
|
||||
* `MSR_IA32_SYSENTER_ESP` - target `esp` for the `sysenter` instruction;
|
||||
* `MSR_IA32_SYSENTER_EIP` - target `eip` for the `sysenter` instruction.
|
||||
|
||||
The values of these model specific registers depend on the `CONFIG_IA32_EMULATION` kernel configuration option. If this kernel configuration option is enabled, it allows legacy 32-bit programs to run under a 64-bit kernel. In the first case, if the `CONFIG_IA32_EMULATION` kernel configuration option is enabled, we fill `MSR_CSTAR` with the entry point for system calls in compatibility mode:
|
||||
|
||||
```C
|
||||
wrmsrl(MSR_CSTAR, entry_SYSCALL_compat);
|
||||
```
|
||||
|
||||
Then we write the kernel code segment to `MSR_IA32_SYSENTER_CS`, put zero into the stack pointer and write the address of the `entry_SYSENTER_compat` symbol to the [instruction pointer](https://en.wikipedia.org/wiki/Program_counter):
|
||||
|
||||
```C
|
||||
wrmsrl_safe(MSR_IA32_SYSENTER_CS, (u64)__KERNEL_CS);
|
||||
wrmsrl_safe(MSR_IA32_SYSENTER_ESP, 0ULL);
|
||||
wrmsrl_safe(MSR_IA32_SYSENTER_EIP, (u64)entry_SYSENTER_compat);
|
||||
```
|
||||
|
||||
Otherwise, if the `CONFIG_IA32_EMULATION` kernel configuration option is disabled, we write the `ignore_sysret` symbol to the `MSR_CSTAR`:
|
||||
|
||||
```C
|
||||
wrmsrl(MSR_CSTAR, ignore_sysret);
|
||||
```
|
||||
|
||||
that is defined in the [arch/x86/entry/entry_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/entry/entry_64.S) assembly file and just returns `-ENOSYS` error code:
|
||||
|
||||
```assembly
|
||||
ENTRY(ignore_sysret)
|
||||
mov $-ENOSYS, %eax
|
||||
sysret
|
||||
END(ignore_sysret)
|
||||
```
|
||||
|
||||
Now we need to fill `MSR_IA32_SYSENTER_CS`, `MSR_IA32_SYSENTER_ESP`, `MSR_IA32_SYSENTER_EIP` model specific registers as we did in the previous code when the `CONFIG_IA32_EMULATION` kernel configuration option was enabled. In this case (when the `CONFIG_IA32_EMULATION` configuration option is not set) we fill the `MSR_IA32_SYSENTER_ESP` and the `MSR_IA32_SYSENTER_EIP` with zero and put the invalid segment of the [Global Descriptor Table](https://en.wikipedia.org/wiki/Global_Descriptor_Table) to the `MSR_IA32_SYSENTER_CS` model specific register:
|
||||
|
||||
```C
|
||||
wrmsrl_safe(MSR_IA32_SYSENTER_CS, (u64)GDT_ENTRY_INVALID_SEG);
|
||||
wrmsrl_safe(MSR_IA32_SYSENTER_ESP, 0ULL);
|
||||
wrmsrl_safe(MSR_IA32_SYSENTER_EIP, 0ULL);
|
||||
```
|
||||
|
||||
You can read more about the `Global Descriptor Table` in the second [part](http://0xax.gitbooks.io/linux-insides/content/Booting/linux-bootstrap-2.html) of the chapter that describes the booting process of the Linux kernel.
|
||||
|
||||
At the end of the `syscall_init` function, we just mask flags in the [flags register](https://en.wikipedia.org/wiki/FLAGS_register) by writing the set of flags to the `MSR_SYSCALL_MASK` model specific register:
|
||||
|
||||
```C
|
||||
wrmsrl(MSR_SYSCALL_MASK,
|
||||
X86_EFLAGS_TF|X86_EFLAGS_DF|X86_EFLAGS_IF|
|
||||
X86_EFLAGS_IOPL|X86_EFLAGS_AC|X86_EFLAGS_NT);
|
||||
```
|
||||
|
||||
These flags will be cleared in the flags register each time the `syscall` instruction is executed. That's all, it is the end of the `syscall_init` function and it means that the system call entry is ready to work. Now we can see what will occur when a user application executes the `syscall` instruction.
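The effect of `MSR_SYSCALL_MASK` can be sketched with plain bit arithmetic. The snippet below is only a model, assuming that on `syscall` the processor saves the user flags in `r11` and clears every flag whose bit is set in the mask. The flag values are the documented `x86` bit positions, while the `demo_*` names and the `DEMO_` prefix are hypothetical:

```C
#include <stdint.h>

/* x86 flags register bits used in the mask above. */
#define DEMO_EFLAGS_TF   0x00000100ULL	/* trap flag */
#define DEMO_EFLAGS_IF   0x00000200ULL	/* interrupt enable flag */
#define DEMO_EFLAGS_DF   0x00000400ULL	/* direction flag */
#define DEMO_EFLAGS_IOPL 0x00003000ULL	/* I/O privilege level */
#define DEMO_EFLAGS_NT   0x00004000ULL	/* nested task */
#define DEMO_EFLAGS_AC   0x00040000ULL	/* alignment check */

#define DEMO_SYSCALL_MASK \
	(DEMO_EFLAGS_TF | DEMO_EFLAGS_DF | DEMO_EFLAGS_IF | \
	 DEMO_EFLAGS_IOPL | DEMO_EFLAGS_AC | DEMO_EFLAGS_NT)

struct demo_entry_flags {
	uint64_t r11;		/* copy of the user flags */
	uint64_t rflags;	/* flags seen by the kernel entry code */
};

/* Models what happens to the flags when syscall is executed. */
struct demo_entry_flags demo_syscall_flags(uint64_t user_rflags, uint64_t mask)
{
	struct demo_entry_flags out;

	out.r11 = user_rflags;
	out.rflags = user_rflags & ~mask;
	return out;
}
```

For example, user flags of `0x246` (with `IF` set) become `0x46` on entry, which matches the statement later in this part that interrupts are off when `entry_SYSCALL_64` starts.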
|
||||
|
||||
Preparation before system call handler will be called
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
As I already wrote, before a system call or an interrupt handler is called by the Linux kernel we need to do some preparations. The `idtentry` macro performs the preparations required before an exception handler is executed, the `interrupt` macro performs the preparations required before an interrupt handler is called and the `entry_SYSCALL_64` does the preparations required before a system call handler is executed.
|
||||
|
||||
The `entry_SYSCALL_64` is defined in the [arch/x86/entry/entry_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/entry/entry_64.S) assembly file and starts from the following macro:
|
||||
|
||||
```assembly
|
||||
SWAPGS_UNSAFE_STACK
|
||||
```
|
||||
|
||||
This macro is defined in the [arch/x86/include/asm/irqflags.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/irqflags.h) header file and expands to the `swapgs` instruction:
|
||||
|
||||
```C
|
||||
#define SWAPGS_UNSAFE_STACK swapgs
|
||||
```
|
||||
|
||||
which exchanges the current GS base register value with the value contained in the `MSR_KERNEL_GS_BASE` model specific register. In other words, `gs` now refers to the kernel per-cpu data instead of the user space value. After this we save the old stack pointer in the `rsp_scratch` [per-cpu](http://0xax.gitbooks.io/linux-insides/content/Concepts/per-cpu.html) variable and set up the stack pointer to point to the top of the stack for the current processor:
|
||||
|
||||
```assembly
|
||||
movq %rsp, PER_CPU_VAR(rsp_scratch)
|
||||
movq PER_CPU_VAR(cpu_current_top_of_stack), %rsp
|
||||
```
|
||||
|
||||
In the next step we push the stack segment and the old stack pointer to the stack:
|
||||
|
||||
```assembly
|
||||
pushq $__USER_DS
|
||||
pushq PER_CPU_VAR(rsp_scratch)
|
||||
```
|
||||
|
||||
After this we enable interrupts, because interrupts are `off` on entry, and save the general purpose [registers](https://en.wikipedia.org/wiki/Processor_register) (besides `bp`, `bx` and `r12` to `r15`), the flags, `-ENOSYS` for the non-implemented system call and the code segment register on the stack:
|
||||
|
||||
```assembly
|
||||
ENABLE_INTERRUPTS(CLBR_NONE)
|
||||
|
||||
pushq %r11
|
||||
pushq $__USER_CS
|
||||
pushq %rcx
|
||||
pushq %rax
|
||||
pushq %rdi
|
||||
pushq %rsi
|
||||
pushq %rdx
|
||||
pushq %rcx
|
||||
pushq $-ENOSYS
|
||||
pushq %r8
|
||||
pushq %r9
|
||||
pushq %r10
|
||||
pushq %r11
|
||||
sub $(6*8), %rsp
|
||||
```
|
||||
|
||||
When a system call occurs from the user's application, general purpose registers have the following state:
|
||||
|
||||
* `rax` - contains system call number;
|
||||
* `rcx` - contains return address to the user space;
|
||||
* `r11` - contains register flags;
|
||||
* `rdi` - contains first argument of a system call handler;
|
||||
* `rsi` - contains second argument of a system call handler;
|
||||
* `rdx` - contains third argument of a system call handler;
|
||||
* `r10` - contains fourth argument of a system call handler;
|
||||
* `r8` - contains fifth argument of a system call handler;
|
||||
* `r9` - contains sixth argument of a system call handler;
|
||||
|
||||
The other general purpose registers (such as `rbp`, `rbx` and `r12` to `r15`) are callee-preserved in the [C ABI](http://www.x86-64.org/documentation/abi.pdf). So we push the flags register on the top of the stack, then the user code segment, the return address to user space, the system call number, the first three arguments, the `rcx` register, the default `-ENOSYS` return value for a non-implemented system call and the other arguments on the stack.
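The resulting stack frame can be modelled in user space. The sketch below replays the pushes from the listing above on an ordinary array and reads the result back through a structure. The field order is my assumption of how the `x86_64` `pt_regs` structure is laid out, the segment values `0x2b` and `0x33` are assumed values for `__USER_DS` and `__USER_CS`, and every `demo_*` name is hypothetical:

```C
#include <errno.h>
#include <stdint.h>
#include <string.h>

#define DEMO_USER_DS 0x2b	/* assumed value of __USER_DS */
#define DEMO_USER_CS 0x33	/* assumed value of __USER_CS */

/* Frame as seen from the lowest address (last push) upwards. */
struct demo_pt_regs {
	uint64_t r15, r14, r13, r12, bp, bx;	/* room made by sub $(6*8), %rsp */
	uint64_t r11, r10, r9, r8;
	uint64_t ax;				/* holds -ENOSYS on entry */
	uint64_t cx, dx, si, di;
	uint64_t orig_ax;			/* system call number */
	uint64_t ip, cs, flags, sp, ss;
};

/* User register state at the moment the syscall instruction was executed. */
struct demo_user_regs {
	uint64_t rax, rcx, r11, rdi, rsi, rdx, r10, r8, r9, rsp;
};

static uint64_t *demo_push(uint64_t *sp, uint64_t value)
{
	*--sp = value;		/* the stack grows towards lower addresses */
	return sp;
}

/* stack_top must have at least 21 free slots below it. */
struct demo_pt_regs demo_build_frame(uint64_t *stack_top,
				     const struct demo_user_regs *u)
{
	struct demo_pt_regs regs;
	uint64_t *sp = stack_top;

	sp = demo_push(sp, DEMO_USER_DS);	/* pushq $__USER_DS */
	sp = demo_push(sp, u->rsp);		/* pushq PER_CPU_VAR(rsp_scratch) */
	sp = demo_push(sp, u->r11);		/* flags */
	sp = demo_push(sp, DEMO_USER_CS);
	sp = demo_push(sp, u->rcx);		/* return address */
	sp = demo_push(sp, u->rax);		/* system call number */
	sp = demo_push(sp, u->rdi);
	sp = demo_push(sp, u->rsi);
	sp = demo_push(sp, u->rdx);
	sp = demo_push(sp, u->rcx);
	sp = demo_push(sp, (uint64_t)-ENOSYS);
	sp = demo_push(sp, u->r8);
	sp = demo_push(sp, u->r9);
	sp = demo_push(sp, u->r10);
	sp = demo_push(sp, u->r11);
	sp -= 6;				/* sub $(6*8), %rsp */
	memset(sp, 0, 6 * sizeof(*sp));		/* zeroed here only for determinism */

	memcpy(&regs, sp, sizeof(regs));
	return regs;
}
```

Reading the frame back shows why the later code can use names such as `RAX(%rsp)` or `RIP(%rsp)`: after the last step each saved value sits at a fixed offset from the stack pointer.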
|
||||
|
||||
In the next step we check the `_TIF_WORK_SYSCALL_ENTRY` in the current `thread_info`:
|
||||
|
||||
```assembly
|
||||
testl $_TIF_WORK_SYSCALL_ENTRY, ASM_THREAD_INFO(TI_flags, %rsp, SIZEOF_PTREGS)
|
||||
jnz tracesys
|
||||
```
|
||||
|
||||
The `_TIF_WORK_SYSCALL_ENTRY` macro is defined in the [arch/x86/include/asm/thread_info.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/thread_info.h) header file and provides a set of the thread information flags that are related to system call tracing:
|
||||
|
||||
```C
|
||||
#define _TIF_WORK_SYSCALL_ENTRY \
|
||||
(_TIF_SYSCALL_TRACE | _TIF_SYSCALL_EMU | _TIF_SYSCALL_AUDIT | \
|
||||
_TIF_SECCOMP | _TIF_SINGLESTEP | _TIF_SYSCALL_TRACEPOINT | \
|
||||
_TIF_NOHZ)
|
||||
```
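The `testl`/`jnz` pair shown earlier is just a bit test on the thread flags. Here is a minimal sketch of the same decision in C. The bit positions are placeholders chosen for illustration and may not match your kernel version, and the `demo_*` and `DEMO_*` names are hypothetical:

```C
#include <stdint.h>

/* Placeholder bit positions, for illustration only. */
#define DEMO_TIF_SYSCALL_TRACE       (1u << 0)
#define DEMO_TIF_NEED_RESCHED        (1u << 3)	/* not part of the entry mask */
#define DEMO_TIF_SINGLESTEP          (1u << 4)
#define DEMO_TIF_SYSCALL_EMU         (1u << 6)
#define DEMO_TIF_SYSCALL_AUDIT       (1u << 7)
#define DEMO_TIF_SECCOMP             (1u << 8)
#define DEMO_TIF_NOHZ                (1u << 19)
#define DEMO_TIF_SYSCALL_TRACEPOINT  (1u << 28)

#define DEMO_TIF_WORK_SYSCALL_ENTRY \
	(DEMO_TIF_SYSCALL_TRACE | DEMO_TIF_SYSCALL_EMU | DEMO_TIF_SYSCALL_AUDIT | \
	 DEMO_TIF_SECCOMP | DEMO_TIF_SINGLESTEP | DEMO_TIF_SYSCALL_TRACEPOINT | \
	 DEMO_TIF_NOHZ)

enum demo_entry_path {
	DEMO_FASTPATH,	/* fall through to entry_SYSCALL_64_fastpath */
	DEMO_TRACESYS	/* jnz tracesys */
};

/* Equivalent of: testl $_TIF_WORK_SYSCALL_ENTRY, flags; jnz tracesys */
enum demo_entry_path demo_pick_entry_path(uint32_t ti_flags)
{
	return (ti_flags & DEMO_TIF_WORK_SYSCALL_ENTRY) ? DEMO_TRACESYS
							: DEMO_FASTPATH;
}
```

A thread with only a flag outside of the mask set (here `DEMO_TIF_NEED_RESCHED`) still takes the fast path, while any tracing related flag diverts it to the slower one.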
|
||||
|
||||
We will not consider debugging/tracing related stuff in this chapter, but will see it in a separate chapter that will be devoted to the debugging and tracing techniques in the Linux kernel. After the `tracesys` label, the next label is the `entry_SYSCALL_64_fastpath`. In the `entry_SYSCALL_64_fastpath` we check the `__SYSCALL_MASK` that is defined in the [arch/x86/include/asm/unistd.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/unistd.h) header file:
|
||||
|
||||
```C
|
||||
# ifdef CONFIG_X86_X32_ABI
|
||||
# define __SYSCALL_MASK (~(__X32_SYSCALL_BIT))
|
||||
# else
|
||||
# define __SYSCALL_MASK (~0)
|
||||
# endif
|
||||
```
|
||||
|
||||
where the `__X32_SYSCALL_BIT` is
|
||||
|
||||
```C
|
||||
#define __X32_SYSCALL_BIT 0x40000000
|
||||
```
|
||||
|
||||
As we can see, the `__SYSCALL_MASK` depends on the `CONFIG_X86_X32_ABI` kernel configuration option and represents the mask for the x32 [ABI](https://en.wikipedia.org/wiki/Application_binary_interface) in the 64-bit kernel.
|
||||
|
||||
So we check the value of the `__SYSCALL_MASK` and if the `CONFIG_X86_X32_ABI` is disabled we compare the value of the `rax` register to the maximum syscall number (`__NR_syscall_max`). Alternatively, if the `CONFIG_X86_X32_ABI` is enabled, we mask the `eax` register with the `__SYSCALL_MASK`, which clears the `__X32_SYSCALL_BIT`, and do the same comparison:
|
||||
|
||||
```assembly
|
||||
#if __SYSCALL_MASK == ~0
|
||||
cmpq $__NR_syscall_max, %rax
|
||||
#else
|
||||
andl $__SYSCALL_MASK, %eax
|
||||
cmpl $__NR_syscall_max, %eax
|
||||
#endif
|
||||
```
|
||||
|
||||
After this we check the result of the last comparison with the `ja` instruction, which jumps if both the `CF` and `ZF` flags are zero, i.e. if the system call number is above `__NR_syscall_max`:
|
||||
|
||||
```assembly
|
||||
ja 1f
|
||||
```
|
||||
|
||||
and if the system call number is valid, we move the fourth argument from the `r10` to the `rcx` to stay compliant with the [x86_64 C ABI](http://www.x86-64.org/documentation/abi.pdf) and execute the `call` instruction with the address of a system call handler:
|
||||
|
||||
```assembly
|
||||
movq %r10, %rcx
|
||||
call *sys_call_table(, %rax, 8)
|
||||
```
|
||||
|
||||
Note, the `sys_call_table` is the array that we saw above in this part. As we already know, the `rax` general purpose register contains the number of a system call and each element of the `sys_call_table` is 8 bytes in size. So we use the `*sys_call_table(, %rax, 8)` notation to find the correct offset in the `sys_call_table` array for the given system call handler.
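The number check and the address calculation can be written down in C as well. This is a sketch of the logic only, under the assumption that `__NR_syscall_max` is `322` as quoted earlier in this part. The function names are hypothetical, and the return value `-1` simply stands for "the `ja 1f` branch is taken":

```C
#include <stdint.h>

#define DEMO_NR_SYSCALL_MAX   322u		/* value quoted in the text */
#define DEMO_X32_SYSCALL_BIT  0x40000000u
#define DEMO_SYSCALL_MASK_X32 (~DEMO_X32_SYSCALL_BIT)

/* Returns the table index, or -1 when the number is out of range. */
long demo_checked_nr(uint64_t rax, int x32_abi_enabled)
{
	uint32_t eax;

	if (!x32_abi_enabled) {
		/* cmpq $__NR_syscall_max, %rax */
		return rax > DEMO_NR_SYSCALL_MAX ? -1 : (long)rax;
	}

	/* andl $__SYSCALL_MASK, %eax: a 32-bit operation on the low half */
	eax = (uint32_t)rax & DEMO_SYSCALL_MASK_X32;

	/* cmpl $__NR_syscall_max, %eax */
	return eax > DEMO_NR_SYSCALL_MAX ? -1 : (long)eax;
}

/* Address of the slot read by: call *sys_call_table(, %rax, 8) */
uint64_t demo_table_slot_address(uint64_t table_base, uint64_t nr)
{
	return table_base + nr * 8;
}
```

So a number with the x32 bit set, for example `0x40000001`, is rejected when the x32 ABI is disabled and is treated as number `1` when it is enabled.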
|
||||
|
||||
That's all. We did all the required preparations and the system call handler was called for the given system call, for example `sys_read`, `sys_write` or another system call handler that is defined with the `SYSCALL_DEFINE[N]` macro in the Linux kernel code.
|
||||
|
||||
Exit from a system call
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
After a system call handler finishes its work, we will return back to the [arch/x86/entry/entry_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/entry/entry_64.S), right after where we have called the system call handler:
|
||||
|
||||
```assembly
|
||||
call *sys_call_table(, %rax, 8)
|
||||
```
|
||||
|
||||
The next step after we've returned from a system call handler is to put the return value of the system call handler on to the stack. We know that a system call returns the result to the user program in the general purpose `rax` register, so we move its value on to the stack after the system call handler has finished its work:
|
||||
|
||||
```assembly
|
||||
movq %rax, RAX(%rsp)
|
||||
```
|
||||
|
||||
on the `RAX` place.
|
||||
|
||||
After this we can see the call of the `LOCKDEP_SYS_EXIT` macro from the [arch/x86/include/asm/irqflags.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/irqflags.h):
|
||||
|
||||
```assembly
|
||||
LOCKDEP_SYS_EXIT
|
||||
```
|
||||
|
||||
The implementation of this macro depends on the `CONFIG_DEBUG_LOCK_ALLOC` kernel configuration option that allows us to debug locks on exit from a system call. And again, we will not consider it in this chapter, but will return to it in a separate one. At the end of the `entry_SYSCALL_64` function we restore all general purpose registers besides `rcx` and `r11`, because the `rcx` register must contain the return address to the application that called the system call and the `r11` register contains the old [flags register](https://en.wikipedia.org/wiki/FLAGS_register). After all general purpose registers are restored, we fill `rcx` with the return address, the `r11` register with the flags and `rsp` with the old stack pointer:
|
||||
|
||||
```assembly
|
||||
RESTORE_C_REGS_EXCEPT_RCX_R11
|
||||
|
||||
movq RIP(%rsp), %rcx
|
||||
movq EFLAGS(%rsp), %r11
|
||||
movq RSP(%rsp), %rsp
|
||||
|
||||
USERGS_SYSRET64
|
||||
```
|
||||
|
||||
In the end we just call the `USERGS_SYSRET64` macro, which expands to the `swapgs` instruction, which again exchanges the user `GS` and kernel `GS`, and the `sysretq` instruction, which performs the exit from the system call handler:
|
||||
|
||||
```C
|
||||
#define USERGS_SYSRET64 \
|
||||
swapgs; \
|
||||
sysretq;
|
||||
```
|
||||
|
||||
Now we know what occurs when a user application calls a system call. The full path of this process is as follows:
|
||||
|
||||
* User application contains code that fills general purpose registers with the values (system call number and arguments of this system call);
|
||||
* Processor switches from the user mode to kernel mode and starts execution of the system call entry - `entry_SYSCALL_64`;
|
||||
* `entry_SYSCALL_64` switches to the kernel stack and saves some general purpose registers, the old stack and code segment, flags etc. on the stack;
|
||||
* `entry_SYSCALL_64` checks the system call number in the `rax` register, looks up the system call handler in the `sys_call_table` and calls it, if the number of the system call is correct;
|
||||
* If the system call number is not correct, jump to the exit from the system call;
|
||||
* After the system call handler finishes its work, restore general purpose registers, old stack, flags and return address and exit from the `entry_SYSCALL_64` with the `sysretq` instruction.
|
||||
|
||||
That's all.
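The whole path can be triggered from user space without any assembly. The following is a minimal sketch, assuming a Linux system with glibc, that asks the kernel for the process id through the generic `syscall(2)` wrapper, which loads the system call number and enters the kernel for us. The `demo_raw_getpid` name is hypothetical:

```C
#define _GNU_SOURCE
#include <assert.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Issues the getpid system call by number. On x86_64 the wrapper is
 * expected to end up executing the syscall instruction described above.
 */
long demo_raw_getpid(void)
{
	long pid = syscall(SYS_getpid);

	/* getpid is documented to always succeed, so a pid is positive. */
	assert(pid > 0);
	return pid;
}
```

The value returned here should be the same as the one returned by the ordinary `getpid` library function, since both go through the same system call handler.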
|
||||
|
||||
Conclusion
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
This is the end of the second part about the system calls concept in the Linux kernel. In the previous [part](http://0xax.gitbooks.io/linux-insides/content/SysCall/syscall-1.html) we saw theory about this concept from the user application view. In this part we continued to dive into the stuff which is related to the system call concept and saw what the Linux kernel does when a system call occurs.
|
||||
|
||||
If you have questions or suggestions, feel free to ping me on twitter [0xAX](https://twitter.com/0xAX), drop me an [email](anotherworldofworld@gmail.com) or just create an [issue](https://github.com/0xAX/linux-internals/issues/new).
|
||||
|
||||
**Please note that English is not my first language and I am really sorry for any inconvenience. If you found any mistakes please send me PR to [linux-insides](https://github.com/0xAX/linux-insides).**
|
||||
|
||||
Links
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
* [system call](https://en.wikipedia.org/wiki/System_call)
|
||||
* [write](http://man7.org/linux/man-pages/man2/write.2.html)
|
||||
* [C standard library](https://en.wikipedia.org/wiki/GNU_C_Library)
|
||||
* [list of cpu architectures](https://en.wikipedia.org/wiki/List_of_CPU_architectures)
|
||||
* [x86_64](https://en.wikipedia.org/wiki/X86-64)
|
||||
* [kbuild](https://www.kernel.org/doc/Documentation/kbuild/makefiles.txt)
|
||||
* [typedef](https://en.wikipedia.org/wiki/Typedef)
|
||||
* [errno](http://man7.org/linux/man-pages/man3/errno.3.html)
|
||||
* [gcc](https://en.wikipedia.org/wiki/GNU_Compiler_Collection)
|
||||
* [model specific register](https://en.wikipedia.org/wiki/Model-specific_register)
|
||||
* [intel 2b manual](http://www.intel.com/content/www/us/en/processors/architectures-software-developer-manuals.html)
|
||||
* [coprocessor](https://en.wikipedia.org/wiki/Coprocessor)
|
||||
* [instruction pointer](https://en.wikipedia.org/wiki/Program_counter)
|
||||
* [flags register](https://en.wikipedia.org/wiki/FLAGS_register)
|
||||
* [Global Descriptor Table](https://en.wikipedia.org/wiki/Global_Descriptor_Table)
|
||||
* [per-cpu](http://0xax.gitbooks.io/linux-insides/content/Concepts/per-cpu.html)
|
||||
* [general purpose registers](https://en.wikipedia.org/wiki/Processor_register)
|
||||
* [ABI](https://en.wikipedia.org/wiki/Application_binary_interface)
|
||||
* [x86_64 C ABI](http://www.x86-64.org/documentation/abi.pdf)
|
||||
* [previous chapter](http://0xax.gitbooks.io/linux-insides/content/SysCall/syscall-1.html)
|
@ -0,0 +1,404 @@
|
||||
System calls in the Linux kernel. Part 3.
|
||||
================================================================================
|
||||
|
||||
vsyscalls and vDSO
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
This is the third part of the [chapter](http://0xax.gitbooks.io/linux-insides/content/SysCall/index.html) that describes system calls in the Linux kernel. In the previous [part](http://0xax.gitbooks.io/linux-insides/content/SysCall/syscall-2.html) we saw the preparations made after a system call is issued by a userspace application and the process of handling a system call. In this part we will look at two concepts that are very close to the system call concept; they are called `vsyscall` and `vdso`.
|
||||
|
||||
We already know what a `system call` is. It is a special routine in the Linux kernel which a userspace application asks to do privileged tasks, like reading from or writing to a file, opening a socket, etc. As you may know, invoking a system call is an expensive operation in the Linux kernel, because the processor must interrupt the currently executing task and switch context to kernel mode, subsequently jumping back into userspace after the system call handler finishes its work. These two mechanisms - `vsyscall` and `vdso` - are designed to speed up this process for certain system calls, and in this part we will try to understand how these mechanisms work.
|
||||
|
||||
Introduction to vsyscalls
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
The `vsyscall` or `virtual system call` is the first and oldest mechanism in the Linux kernel that is designed to accelerate execution of certain system calls. The principle of work of the `vsyscall` concept is simple. The Linux kernel maps into user space a page that contains some variables and the implementation of some system calls. We can find information about this memory space in the Linux kernel [documentation](https://github.com/torvalds/linux/blob/master/Documentation/x86/x86_64/mm.txt) for the [x86_64](https://en.wikipedia.org/wiki/X86-64):
|
||||
|
||||
```
|
||||
ffffffffff600000 - ffffffffffdfffff (=8 MB) vsyscalls
|
||||
```
|
||||
|
||||
or:
|
||||
|
||||
```
|
||||
~$ sudo cat /proc/1/maps | grep vsyscall
|
||||
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0 [vsyscall]
|
||||
```
|
||||
|
||||
After this, these system calls will be executed in userspace and this means that there will be no [context switching](https://en.wikipedia.org/wiki/Context_switch). Mapping of the `vsyscall` page occurs in the `map_vsyscall` function that is defined in the [arch/x86/entry/vsyscall/vsyscall_64.c](https://github.com/torvalds/linux/blob/master/arch/x86/entry/vsyscall/vsyscall_64.c) source code file. This function is called during the Linux kernel initialization in the `setup_arch` function that is defined in the [arch/x86/kernel/setup.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/setup.c) source code file (we saw this function in the fifth [part](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-5.html) of the Linux kernel initialization process chapter).
|
||||
|
||||
Note that implementation of the `map_vsyscall` function depends on the `CONFIG_X86_VSYSCALL_EMULATION` kernel configuration option:
|
||||
|
||||
```C
|
||||
#ifdef CONFIG_X86_VSYSCALL_EMULATION
|
||||
extern void map_vsyscall(void);
|
||||
#else
|
||||
static inline void map_vsyscall(void) {}
|
||||
#endif
|
||||
```
|
||||
|
||||
As we can read in the help text, the `CONFIG_X86_VSYSCALL_EMULATION` configuration option does the following: `Enable vsyscall emulation`. Why emulate `vsyscall`? Actually, the `vsyscall` is a legacy [ABI](https://en.wikipedia.org/wiki/Application_binary_interface) because of security reasons. Virtual system calls have fixed addresses, which means that the `vsyscall` page is at the same location every time, and the location of this page is determined in the `map_vsyscall` function. Let's look at the implementation of this function:
|
||||
|
||||
```C
|
||||
void __init map_vsyscall(void)
|
||||
{
|
||||
extern char __vsyscall_page;
|
||||
unsigned long physaddr_vsyscall = __pa_symbol(&__vsyscall_page);
|
||||
...
|
||||
...
|
||||
...
|
||||
}
|
||||
```
|
||||
|
||||
As we can see, at the beginning of the `map_vsyscall` function we get the physical address of the `vsyscall` page with the `__pa_symbol` macro (we already saw the implementation of this macro in the fourth [part](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-4.html) of the Linux kernel initialization process). The `__vsyscall_page` symbol is defined in the [arch/x86/entry/vsyscall/vsyscall_emu_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/entry/vsyscall/vsyscall_emu_64.S) assembly source code file and has the following [virtual address](https://en.wikipedia.org/wiki/Virtual_address_space):
|
||||
|
||||
```
|
||||
ffffffff81881000 D __vsyscall_page
|
||||
```
|
||||
|
||||
in the `.data..page_aligned, aw` [section](https://en.wikipedia.org/wiki/Memory_segmentation) and contains calls of the three following system calls:
|
||||
|
||||
* `gettimeofday`;
|
||||
* `time`;
|
||||
* `getcpu`.
|
||||
|
||||
Or:
|
||||
|
||||
```assembly
|
||||
__vsyscall_page:
|
||||
mov $__NR_gettimeofday, %rax
|
||||
syscall
|
||||
ret
|
||||
|
||||
.balign 1024, 0xcc
|
||||
mov $__NR_time, %rax
|
||||
syscall
|
||||
ret
|
||||
|
||||
.balign 1024, 0xcc
|
||||
mov $__NR_getcpu, %rax
|
||||
syscall
|
||||
ret
|
||||
```
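Because each entry is padded to `1024` bytes with `.balign`, the user space address of every virtual system call is fixed and easy to compute. Here is a minimal sketch of this arithmetic, assuming the page is mapped at `0xffffffffff600000` as in the `/proc/1/maps` output shown earlier. The enum and function names are hypothetical:

```C
#include <stdint.h>

#define DEMO_VSYSCALL_ADDR 0xffffffffff600000ULL	/* start of the [vsyscall] mapping */
#define DEMO_VSYSCALL_SLOT 1024ULL			/* .balign 1024 */

/* Order of the entries inside __vsyscall_page. */
enum demo_vsyscall_nr {
	DEMO_VSYS_GETTIMEOFDAY = 0,
	DEMO_VSYS_TIME = 1,
	DEMO_VSYS_GETCPU = 2
};

/* Address a caller would jump to for the given virtual system call. */
uint64_t demo_vsyscall_entry(enum demo_vsyscall_nr nr)
{
	return DEMO_VSYSCALL_ADDR + (uint64_t)nr * DEMO_VSYSCALL_SLOT;
}
```

Under these assumptions `time` lives at `0xffffffffff600400` and `getcpu` at `0xffffffffff600800`. These fixed, predictable addresses are exactly the security concern mentioned above.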
|
||||
|
||||
Let's go back to the implementation of the `map_vsyscall` function; later we will return to the implementation of the `__vsyscall_page`. After we have got the physical address of the `__vsyscall_page`, we check the value of the `vsyscall_mode` variable and set the [fix-mapped](http://0xax.gitbooks.io/linux-insides/content/mm/linux-mm-2.html) address for the `vsyscall` page with the `__set_fixmap` macro:
|
||||
|
||||
```C
|
||||
if (vsyscall_mode != NONE)
|
||||
__set_fixmap(VSYSCALL_PAGE, physaddr_vsyscall,
|
||||
vsyscall_mode == NATIVE
|
||||
? PAGE_KERNEL_VSYSCALL
|
||||
: PAGE_KERNEL_VVAR);
|
||||
```
|
||||
|
||||
The `__set_fixmap` takes three arguments. The first is an index in the `fixed_addresses` [enum](https://en.wikipedia.org/wiki/Enumerated_type). In our case `VSYSCALL_PAGE` is the first element of the `fixed_addresses` enum for the `x86_64` architecture:
|
||||
|
||||
```C
|
||||
enum fixed_addresses {
|
||||
...
|
||||
...
|
||||
...
|
||||
#ifdef CONFIG_X86_VSYSCALL_EMULATION
|
||||
VSYSCALL_PAGE = (FIXADDR_TOP - VSYSCALL_ADDR) >> PAGE_SHIFT,
|
||||
#endif
|
||||
...
|
||||
...
|
||||
...
|
||||
```
|
||||
|
||||
It is equal to `511`. The second argument is the physical address of the page that has to be mapped and the third argument is the flags of the page. Note that the flags of the `VSYSCALL_PAGE` depend on the `vsyscall_mode` variable. They will be `PAGE_KERNEL_VSYSCALL` if the `vsyscall_mode` variable is `NATIVE` and `PAGE_KERNEL_VVAR` otherwise. Both macros (the `PAGE_KERNEL_VSYSCALL` and the `PAGE_KERNEL_VVAR`) will be expanded to the following flags:
|
||||
|
||||
```C
|
||||
#define __PAGE_KERNEL_VSYSCALL (__PAGE_KERNEL_RX | _PAGE_USER)
|
||||
#define __PAGE_KERNEL_VVAR (__PAGE_KERNEL_RO | _PAGE_USER)
|
||||
```
|
||||
|
||||
that represent access rights to the `vsyscall` page. Both contain the same `_PAGE_USER` flag, which means that the page can be accessed by a user-mode process running at a lower privilege level. The other flag depends on the value of the `vsyscall_mode` variable. The first set (`__PAGE_KERNEL_VSYSCALL`) is used if the `vsyscall_mode` is `NATIVE`. This means virtual system calls will be native `syscall` instructions. Otherwise the `vsyscall` page will have `PAGE_KERNEL_VVAR` if the `vsyscall_mode` variable is `EMULATE`. In this case virtual system calls are turned into traps and emulated reasonably. The `vsyscall_mode` variable gets its value in the `vsyscall_setup` function:
|
||||
|
||||
```C
|
||||
static int __init vsyscall_setup(char *str)
|
||||
{
|
||||
if (str) {
|
||||
if (!strcmp("emulate", str))
|
||||
vsyscall_mode = EMULATE;
|
||||
else if (!strcmp("native", str))
|
||||
vsyscall_mode = NATIVE;
|
||||
else if (!strcmp("none", str))
|
||||
vsyscall_mode = NONE;
|
||||
else
|
||||
return -EINVAL;
|
||||
|
||||
return 0;
|
||||
}
|
||||
|
||||
return -EINVAL;
|
||||
}
|
||||
```
|
||||
|
||||
This function is called during the early parsing of the kernel parameters:

```C
early_param("vsyscall", vsyscall_setup);
```

You can read more about the `early_param` macro in the sixth [part](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-6.html) of the chapter that describes the initialization process of the Linux kernel.

At the end of the `map_vsyscall` function we just check that the virtual address of the `vsyscall` page is equal to the value of `VSYSCALL_ADDR` with the [BUILD_BUG_ON](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-1.html) macro:

```C
BUILD_BUG_ON((unsigned long)__fix_to_virt(VSYSCALL_PAGE) !=
             (unsigned long)VSYSCALL_ADDR);
```

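The arithmetic behind the index `511` and behind this compile-time check can be reproduced outside the kernel. The following is only a sketch: the `SK_` names are hypothetical, the value of `SK_FIXADDR_TOP` and the 4096-byte page size are assumptions for `x86_64`, and `SK_FIX_TO_VIRT` assumes that a fixmap index is converted to an address by subtracting `index << PAGE_SHIFT` from the top of the fixmap area:

```C
#include <stdint.h>

/* Hypothetical stand-ins for the kernel macros; values are assumptions. */
#define SK_PAGE_SHIFT    12                              /* 4096-byte pages */
#define SK_VSYSCALL_ADDR UINT64_C(0xffffffffff600000)    /* address used in this part */
#define SK_FIXADDR_TOP   UINT64_C(0xffffffffff7ff000)    /* assumed top of the fixmap area */

/* Same expression as the VSYSCALL_PAGE enum initializer shown earlier */
#define SK_VSYSCALL_PAGE ((SK_FIXADDR_TOP - SK_VSYSCALL_ADDR) >> SK_PAGE_SHIFT)

/* Assumed index-to-address conversion: count pages downwards from the top */
#define SK_FIX_TO_VIRT(idx) (SK_FIXADDR_TOP - ((uint64_t)(idx) << SK_PAGE_SHIFT))

/* Userspace analogue of BUILD_BUG_ON: compilation fails if a check is false */
_Static_assert(SK_VSYSCALL_PAGE == 511, "index of the vsyscall page");
_Static_assert(SK_FIX_TO_VIRT(SK_VSYSCALL_PAGE) == SK_VSYSCALL_ADDR,
               "vsyscall page must map to the fixed address");
```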
That's all. The `vsyscall` page is set up. The result of all the above is the following: if we pass the `vsyscall=native` parameter on the kernel command line, virtual system calls will be handled as native `syscall` instructions in [arch/x86/entry/vsyscall/vsyscall_emu_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/entry/vsyscall/vsyscall_emu_64.S). The [glibc](https://en.wikipedia.org/wiki/GNU_C_Library) knows the addresses of the virtual system call handlers. Note that the virtual system call handlers are aligned to `1024` (or `0x400`) bytes:

```assembly
__vsyscall_page:
    mov $__NR_gettimeofday, %rax
    syscall
    ret

    .balign 1024, 0xcc
    mov $__NR_time, %rax
    syscall
    ret

    .balign 1024, 0xcc
    mov $__NR_getcpu, %rax
    syscall
    ret
```

The start address of the `vsyscall` page is always `ffffffffff600000`. So, [glibc](https://en.wikipedia.org/wiki/GNU_C_Library) knows the addresses of all virtual system call handlers. You can find the definitions of these addresses in the `glibc` source code:

```C
#define VSYSCALL_ADDR_vgettimeofday   0xffffffffff600000
#define VSYSCALL_ADDR_vtime           0xffffffffff600400
#define VSYSCALL_ADDR_vgetcpu         0xffffffffff600800
```

Every virtual system call request lands at the corresponding `VSYSCALL_ADDR_vsyscall_name` address inside the `__vsyscall_page`, where the number of the system call is put into the `rax` general purpose [register](https://en.wikipedia.org/wiki/Processor_register) and the native `x86_64` `syscall` instruction is executed.

In the second case, if we pass the `vsyscall=emulate` parameter on the kernel command line, an attempt to execute a virtual system call handler will cause a [page fault](https://en.wikipedia.org/wiki/Page_fault) exception. Remember that in this mode the `vsyscall` page has `__PAGE_KERNEL_VVAR` access rights, which forbid execution. The `do_page_fault` function is the `#PF` or page fault handler. It tries to understand the reason for the last page fault. One of the reasons can be the situation when a virtual system call is called and the `vsyscall` mode is `emulate`. In this case the `vsyscall` will be handled by the `emulate_vsyscall` function, which is defined in the [arch/x86/entry/vsyscall/vsyscall_64.c](https://github.com/torvalds/linux/blob/master/arch/x86/entry/vsyscall/vsyscall_64.c) source code file.

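To make the difference between the two sets of page flags more concrete, here is a small sketch that models them with a few x86 page table entry bits. It is an illustration under assumptions, not the kernel's definitions: the `SK_` names are hypothetical, only the present, user and no-execute bits are modelled, and the read-only executable and read-only non-executable kernel flag sets are assumed to differ only in the no-execute bit. Under these assumptions an instruction fetch from a page with the `VVAR` flags is not allowed, which is the fault that the `emulate` mode relies on:

```C
#include <stdint.h>

/* Assumed x86 page table entry bits; the SK_ names are hypothetical. */
#define SK_PAGE_PRESENT (UINT64_C(1) << 0)
#define SK_PAGE_USER    (UINT64_C(1) << 2)
#define SK_PAGE_NX      (UINT64_C(1) << 63)   /* no-execute */

/* Simplified models of the read-only kernel flag sets (assumption) */
#define SK_KERNEL_RX    (SK_PAGE_PRESENT)                /* executable */
#define SK_KERNEL_RO    (SK_PAGE_PRESENT | SK_PAGE_NX)   /* not executable */

/* Same composition as in the two macros shown earlier in this part */
#define SK_KERNEL_VSYSCALL (SK_KERNEL_RX | SK_PAGE_USER)
#define SK_KERNEL_VVAR     (SK_KERNEL_RO | SK_PAGE_USER)

/* Returns 1 if user-mode code may fetch instructions from such a page */
static int sk_user_can_execute(uint64_t flags)
{
    return (flags & SK_PAGE_PRESENT) != 0 &&
           (flags & SK_PAGE_USER) != 0 &&
           (flags & SK_PAGE_NX) == 0;
}
```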
The `emulate_vsyscall` function gets the number of the virtual system call and checks it. If the number is invalid, it prints an error and sends the [segmentation fault](https://en.wikipedia.org/wiki/Segmentation_fault) signal:

```C
...
...
...
vsyscall_nr = addr_to_vsyscall_nr(address);
if (vsyscall_nr < 0) {
    warn_bad_vsyscall(KERN_WARNING, regs, "misaligned vsyscall...");
    goto sigsegv;
}
...
...
...
sigsegv:
    force_sig(SIGSEGV, current);
    return true;
```

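The body of `addr_to_vsyscall_nr` is not shown in this part. A minimal sketch of how such a mapping could work, assuming the three handlers live in `1024`-byte slots starting at `ffffffffff600000` as described above, might look like this. The `sk_` function names and `SK_` constants are hypothetical:

```C
#include <errno.h>
#include <stdint.h>

#define SK_VSYSCALL_BASE  UINT64_C(0xffffffffff600000)
#define SK_VSYSCALL_SLOT  UINT64_C(0x400)   /* handlers are 1024 bytes apart */
#define SK_VSYSCALL_COUNT 3                 /* gettimeofday, time, getcpu */

/* Address of handler number nr, or 0 if there is no such handler */
static uint64_t sk_vsyscall_nr_to_addr(unsigned int nr)
{
    if (nr >= SK_VSYSCALL_COUNT)
        return 0;
    return SK_VSYSCALL_BASE + nr * SK_VSYSCALL_SLOT;
}

/* Inverse mapping: slot number, or -EINVAL for a misaligned or unknown address */
static int sk_vsyscall_addr_to_nr(uint64_t addr)
{
    int nr;

    /* Only bits 10 and 11 (the slot number) may differ from the base */
    if ((addr & ~UINT64_C(0xC00)) != SK_VSYSCALL_BASE)
        return -EINVAL;

    nr = (int)((addr & UINT64_C(0xC00)) >> 10);
    if (nr >= SK_VSYSCALL_COUNT)
        return -EINVAL;

    return nr;
}
```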
After it has checked the number of the virtual system call, it does some other checks, for example for `access_ok` violations, and executes the system call function that corresponds to the number of the virtual system call:

```C
switch (vsyscall_nr) {
    case 0:
        ret = sys_gettimeofday(
            (struct timeval __user *)regs->di,
            (struct timezone __user *)regs->si);
        break;
    ...
    ...
    ...
}
```

In the end we put the result of `sys_gettimeofday` or another virtual system call handler into the `ax` general purpose register, as we did with normal system calls, restore the [instruction pointer](https://en.wikipedia.org/wiki/Program_counter) register and add `8` bytes to the [stack pointer](https://en.wikipedia.org/wiki/Stack_register) register. This operation emulates the `ret` instruction:

```C
    regs->ax = ret;

do_ret:
    regs->ip = caller;
    regs->sp += 8;
    return true;
```

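The excerpt does not show how `caller` is obtained. The sketch below assumes that it is the 8-byte return address stored at the top of the stack, which is the value that a real `ret` instruction would pop. The `sk_regs` structure and the `sk_emulate_ret` function are hypothetical and only model the three registers involved:

```C
#include <stdint.h>
#include <string.h>

/* Hypothetical, reduced model of the saved user registers */
struct sk_regs {
    uint64_t ax;   /* return value of the system call */
    uint64_t ip;   /* instruction pointer */
    uint64_t sp;   /* stack pointer */
};

/* Store the result and emulate `ret`: pop the return address into ip */
static void sk_emulate_ret(struct sk_regs *regs, uint64_t result)
{
    uint64_t caller;

    /* Assumption: the return address is the 8-byte value at the top of the stack */
    memcpy(&caller, (const void *)(uintptr_t)regs->sp, sizeof(caller));

    regs->ax = result;
    regs->ip = caller;   /* continue right after the call instruction */
    regs->sp += 8;       /* the return address is removed from the stack */
}
```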
That's all. Now let's look at the modern concept - `vDSO`.

Introduction to vDSO
--------------------------------------------------------------------------------

As I already wrote above, `vsyscall` is an obsolete concept that has been replaced by the `vDSO` or `virtual dynamic shared object`. The main difference between the `vsyscall` and `vDSO` mechanisms is that `vDSO` maps memory pages into each process in a shared object [form](https://en.wikipedia.org/wiki/Library_%28computing%29#Shared_libraries), while `vsyscall` is static in memory and always has the same address. For the `x86_64` architecture it is called `linux-vdso.so.1`. All userspace applications are linked with this shared library via `glibc`. For example:

```
~$ ldd /bin/uname
    linux-vdso.so.1 (0x00007ffe014b7000)
    libc.so.6 => /lib64/libc.so.6 (0x00007fbfee2fe000)
    /lib64/ld-linux-x86-64.so.2 (0x00005559aab7c000)
```

Or:

```
~$ sudo cat /proc/1/maps | grep vdso
7fff39f73000-7fff39f75000 r-xp 00000000 00:00 0       [vdso]
```

Here we can see that the [uname](https://en.wikipedia.org/wiki/Uname) util was linked with three libraries:

* `linux-vdso.so.1`;
* `libc.so.6`;
* `ld-linux-x86-64.so.2`.

The first provides the `vDSO` functionality, the second is the `C` [standard library](https://en.wikipedia.org/wiki/C_standard_library) and the third is the program interpreter (you can read more about it in the part that describes [linkers](http://0xax.gitbooks.io/linux-insides/content/Misc/linkers.html)). So, the `vDSO` solves the limitations of `vsyscall`. The implementation of the `vDSO` is similar to `vsyscall`.

The initialization of the `vDSO` occurs in the `init_vdso` function, which is defined in the [arch/x86/entry/vdso/vma.c](https://github.com/torvalds/linux/blob/master/arch/x86/entry/vdso/vma.c) source code file. This function starts with the initialization of the `vDSO` image for 64-bit and, depending on the `CONFIG_X86_X32_ABI` kernel configuration option, of the image for the `x32` ABI:

```C
static int __init init_vdso(void)
{
    init_vdso_image(&vdso_image_64);

#ifdef CONFIG_X86_X32_ABI
    init_vdso_image(&vdso_image_x32);
#endif
```

Both calls initialize `vdso_image` structures. These structures are defined in two generated source code files: [arch/x86/entry/vdso/vdso-image-64.c](https://github.com/torvalds/linux/blob/master/arch/x86/entry/vdso/vdso-image-64.c) and `arch/x86/entry/vdso/vdso-image-x32.c`. These source code files are generated by the [vdso2c](https://github.com/torvalds/linux/blob/master/arch/x86/entry/vdso/vdso2c.c) program from different source code files that represent different approaches to calling a system call, like `int 0x80`, `sysenter`, etc. The full set of images depends on the kernel configuration.

For example, for the `x86_64` Linux kernel it will contain `vdso_image_64`:

```C
#ifdef CONFIG_X86_64
extern const struct vdso_image vdso_image_64;
#endif
```

But for the `x32` ABI - `vdso_image_x32`:

```C
#ifdef CONFIG_X86_X32
extern const struct vdso_image vdso_image_x32;
#endif
```

If our kernel is configured for the `x86` architecture, or for `x86_64` with compatibility mode, we will be able to call a system call with the `int 0x80` interrupt or with the `sysenter` instruction. If compatibility mode is enabled, we will also be able to call a system call with the native `syscall` instruction:

```C
#if defined CONFIG_X86_32 || defined CONFIG_COMPAT
  extern const struct vdso_image vdso_image_32_int80;
#ifdef CONFIG_COMPAT
  extern const struct vdso_image vdso_image_32_syscall;
#endif
 extern const struct vdso_image vdso_image_32_sysenter;
#endif
```

As we can understand from the name of the `vdso_image` structure, it represents an image of the `vDSO` for a certain mode of system call entry. This structure contains information about the size in bytes of the `vDSO` area, which is always a multiple of `PAGE_SIZE` (`4096` bytes), a pointer to the text mapping, the start and end addresses of the `alternatives` (sets of instructions with better alternatives for a certain type of processor), etc. For example, `vdso_image_64` looks like this:

```C
const struct vdso_image vdso_image_64 = {
    .data = raw_data,
    .size = 8192,
    .text_mapping = {
        .name = "[vdso]",
        .pages = pages,
    },
    .alt = 3145,
    .alt_len = 26,
    .sym_vvar_start = -8192,
    .sym_vvar_page = -8192,
    .sym_hpet_page = -4096,
};
```

Here `raw_data` contains the raw binary code of the 64-bit `vDSO` system calls, which occupies `2` pages:

```C
static struct page *pages[2];
```

or 8 kilobytes.

The `init_vdso_image` function is defined in the same source code file and just initializes `vdso_image.text_mapping.pages`. First of all this function calculates the number of pages and then initializes each `vdso_image.text_mapping.pages[number_of_page]` with the `virt_to_page` macro, which converts the given address to a `page` structure:

```C
void __init init_vdso_image(const struct vdso_image *image)
{
    int i;
    int npages = (image->size) / PAGE_SIZE;

    for (i = 0; i < npages; i++)
        image->text_mapping.pages[i] =
            virt_to_page(image->data + i*PAGE_SIZE);
    ...
    ...
    ...
}
```

The `init_vdso` function is passed to the `subsys_initcall` macro, which adds the given function to the `initcalls` list. All functions from this list will be called in the `do_initcalls` function from the [init/main.c](https://github.com/torvalds/linux/blob/master/init/main.c) source code file:

```C
subsys_initcall(init_vdso);
```

Ok, we just saw the initialization of the `vDSO` and the initialization of the `page` structures that are related to the memory pages that contain `vDSO` system calls. But where are these pages mapped? Actually they are mapped by the kernel when it loads a binary into memory. The Linux kernel calls the `arch_setup_additional_pages` function from the [arch/x86/entry/vdso/vma.c](https://github.com/torvalds/linux/blob/master/arch/x86/entry/vdso/vma.c) source code file, which checks that the `vDSO` is enabled for `x86_64` and calls the `map_vdso` function:

```C
int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
{
    if (!vdso64_enabled)
        return 0;

    return map_vdso(&vdso_image_64, true);
}
```

The `map_vdso` function is defined in the same source code file and maps pages for the `vDSO` and for the shared `vDSO` variables. The main differences between the `vsyscall` and the `vDSO` concepts are that the first has a static address that is the same every time, `ffffffffff600000`, while the second is loaded dynamically, and that the `vDSO` implements four system calls:

* `__vdso_clock_gettime`;
* `__vdso_getcpu`;
* `__vdso_gettimeofday`;
* `__vdso_time`.

while `vsyscall` implements only `3`.

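The dynamic mapping can also be observed from inside a process. The following sketch assumes a Linux system with `glibc`, where `getauxval(AT_SYSINFO_EHDR)` returns the address at which the kernel mapped the `vDSO` for the current process (or `0` if there is none). The `sk_` helper names are hypothetical. Since the `vDSO` is a shared object, the mapping is expected to start with the ELF magic bytes:

```C
#include <elf.h>
#include <stdint.h>
#include <string.h>
#include <sys/auxv.h>

/* Address of the vDSO in the current process, or 0 if it is not mapped */
static unsigned long sk_vdso_base(void)
{
    return getauxval(AT_SYSINFO_EHDR);
}

/* Returns 1 if the memory at base starts with the ELF magic ("\177ELF") */
static int sk_looks_like_elf(unsigned long base)
{
    if (base == 0)
        return 0;
    return memcmp((const void *)(uintptr_t)base, ELFMAG, SELFMAG) == 0;
}
```

Printing `sk_vdso_base()` in two runs of the same program will usually give different addresses when address space randomization is enabled, in contrast to the fixed `vsyscall` address.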
That's all.

Conclusion
--------------------------------------------------------------------------------

This is the end of the third part about the system calls concept in the Linux kernel. In the previous [part](http://0xax.gitbooks.io/linux-insides/content/SysCall/syscall-2.html) we saw the preparation that the Linux kernel does before a system call is handled, and the implementation of the exit from a system call handler. In this part we continued to dive into the stuff related to the system call concept and got to know two concepts that are very similar to the system call - the `vsyscall` and the `vDSO`.

After these three parts, we know almost everything that is related to system calls: we know what a system call is and why user applications need them, what occurs when a user application calls a system call and how the kernel handles system calls.

The next part will be the last part in this [chapter](http://0xax.gitbooks.io/linux-insides/content/SysCall/index.html) and we will see what occurs when a user runs a program.

If you have questions or suggestions, feel free to ping me on twitter [0xAX](https://twitter.com/0xAX), drop me an [email](anotherworldofworld@gmail.com) or just create an [issue](https://github.com/0xAX/linux-internals/issues/new).

**Please note that English is not my first language and I am really sorry for any inconvenience. If you find any mistakes please send me a PR to [linux-insides](https://github.com/0xAX/linux-insides).**

Links
--------------------------------------------------------------------------------

* [x86_64 memory map](https://github.com/torvalds/linux/blob/master/Documentation/x86/x86_64/mm.txt)
* [x86_64](https://en.wikipedia.org/wiki/X86-64)
* [context switching](https://en.wikipedia.org/wiki/Context_switch)
* [ABI](https://en.wikipedia.org/wiki/Application_binary_interface)
* [virtual address](https://en.wikipedia.org/wiki/Virtual_address_space)
* [Segmentation](https://en.wikipedia.org/wiki/Memory_segmentation)
* [enum](https://en.wikipedia.org/wiki/Enumerated_type)
* [fix-mapped addresses](http://0xax.gitbooks.io/linux-insides/content/mm/linux-mm-2.html)
* [glibc](https://en.wikipedia.org/wiki/GNU_C_Library)
* [BUILD_BUG_ON](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-1.html)
* [Processor register](https://en.wikipedia.org/wiki/Processor_register)
* [Page fault](https://en.wikipedia.org/wiki/Page_fault)
* [segmentation fault](https://en.wikipedia.org/wiki/Segmentation_fault)
* [instruction pointer](https://en.wikipedia.org/wiki/Program_counter)
* [stack pointer](https://en.wikipedia.org/wiki/Stack_register)
* [uname](https://en.wikipedia.org/wiki/Uname)
* [Linkers](http://0xax.gitbooks.io/linux-insides/content/Misc/linkers.html)
* [Previous part](http://0xax.gitbooks.io/linux-insides/content/SysCall/syscall-2.html)