Merge pull request #4 from 0xAX/master

Update 02.02.17
This commit is contained in:
Yaroslav Pronin 2017-02-02 20:34:03 +03:00 committed by GitHub
commit ce61a94581
18 changed files with 466 additions and 60 deletions

View File

@ -3,8 +3,8 @@
This chapter describes the linux kernel boot process. Here you will see a
couple of posts which describes the full cycle of the kernel loading process:
* [From the bootloader to kernel](http://0xax.gitbooks.io/linux-insides/content/Booting/linux-bootstrap-1.html) - describes all stages from turning on the computer to running the first instruction of the kernel.
* [First steps in the kernel setup code](http://0xax.gitbooks.io/linux-insides/content/Booting/linux-bootstrap-2.html) - describes first steps in the kernel setup code. You will see heap initialization, query of different parameters like EDD, IST and etc...
* [Video mode initialization and transition to protected mode](http://0xax.gitbooks.io/linux-insides/content/Booting/linux-bootstrap-3.html) - describes video mode initialization in the kernel setup code and transition to protected mode.
* [Transition to 64-bit mode](http://0xax.gitbooks.io/linux-insides/content/Booting/linux-bootstrap-4.html) - describes preparation for transition into 64-bit mode and details of transition.
* [Kernel Decompression](http://0xax.gitbooks.io/linux-insides/content/Booting/linux-bootstrap-5.html) - describes preparation before kernel decompression and details of direct decompression.
* [From the bootloader to kernel](linux-bootstrap-1.md) - describes all stages from turning on the computer to running the first instruction of the kernel.
* [First steps in the kernel setup code](linux-bootstrap-2.md) - describes first steps in the kernel setup code. You will see heap initialization, query of different parameters like EDD, IST and etc...
* [Video mode initialization and transition to protected mode](linux-bootstrap-3.md) - describes video mode initialization in the kernel setup code and transition to protected mode.
* [Transition to 64-bit mode](linux-bootstrap-4.md) - describes preparation for transition into 64-bit mode and details of transition.
* [Kernel Decompression](linux-bootstrap-5.md) - describes preparation before kernel decompression and details of direct decompression.

View File

@ -234,7 +234,7 @@ X+08000 +------------------------+
So, when the bootloader transfers control to the kernel, it starts at:
```
0x1000 + X + sizeof(KernelBootSector) + 1
X + sizeof(KernelBootSector) + 1
```
where `X` is the address of the kernel boot sector being loaded. In my case, `X` is `0x10000`, as we can see in a memory dump:

View File

@ -53,7 +53,7 @@ Earlier we saw definitions of different data types like `u16` etc. in the kernel
|------|------|-------|-----|------|----|-----|-----|-----|
| Size | 1 | 2 | 4 | 8 | 1 | 2 | 4 | 8 |
If you the read source code of the kernel, you'll see these very often and so it will be good to remember them.
If you read the source code of the kernel, you'll see these very often and so it will be good to remember them.
Heap API
--------------------------------------------------------------------------------

View File

@ -2,4 +2,4 @@
This chapter describes `control groups` mechanism in the Linux kernel.
* [Introduction](http://0xax.gitbooks.io/linux-insides/content/Cgroups/cgroups1.html)
* [Introduction](cgroups1.md)

View File

@ -2,6 +2,6 @@
This chapter describes various concepts which are used in the Linux kernel.
* [Per-CPU variables](http://0xax.gitbooks.io/linux-insides/content/Concepts/per-cpu.html)
* [CPU masks](http://0xax.gitbooks.io/linux-insides/content/Concepts/cpumask.html)
* [The initcall mechanism](https://0xax.gitbooks.io/linux-insides/content/Concepts/initcall.html)
* [Per-CPU variables](per-cpu.md)
* [CPU masks](cpumask.md)
* [The initcall mechanism](initcall.md)

View File

@ -5,6 +5,6 @@ Linux kernel provides different implementations of data structures like doubly l
This part considers the following data structures and algorithms:
* [Doubly linked list](https://github.com/0xAX/linux-insides/blob/master/DataStructures/dlist.md)
* [Radix tree](https://github.com/0xAX/linux-insides/blob/master/DataStructures/radix-tree.md)
* [Bit arrays](https://github.com/0xAX/linux-insides/blob/master/DataStructures/bitmap.md)
* [Doubly linked list](dlist.md)
* [Radix tree](radix-tree.md)
* [Bit arrays](bitmap.md)

View File

@ -4,13 +4,13 @@ You will find here a couple of posts which describe the full cycle of kernel ini
*Note* That there will not be a description of the all kernel initialization steps. Here will be only generic kernel part, without interrupts handling, ACPI, and many other parts. All parts which I have missed, will be described in other chapters.
* [First steps after kernel decompression](https://github.com/0xAX/linux-insides/blob/master/Initialization/linux-initialization-1.md) - describes first steps in the kernel.
* [Early interrupt and exception handling](https://github.com/0xAX/linux-insides/blob/master/Initialization/linux-initialization-2.md) - describes early interrupts initialization and early page fault handler.
* [Last preparations before the kernel entry point](https://github.com/0xAX/linux-insides/blob/master/Initialization/linux-initialization-3.md) - describes the last preparations before the call of the `start_kernel`.
* [Kernel entry point](https://github.com/0xAX/linux-insides/blob/master/Initialization/linux-initialization-4.md) - describes first steps in the kernel generic code.
* [Continue of architecture-specific initializations](https://github.com/0xAX/linux-insides/blob/master/Initialization/linux-initialization-5.md) - describes architecture-specific initialization.
* [Architecture-specific initializations, again...](https://github.com/0xAX/linux-insides/blob/master/Initialization/linux-initialization-6.md) - describes continue of the architecture-specific initialization process.
* [The End of the architecture-specific initializations, almost...](https://github.com/0xAX/linux-insides/blob/master/Initialization/linux-initialization-7.md) - describes the end of the `setup_arch` related stuff.
* [Scheduler initialization](https://github.com/0xAX/linux-insides/blob/master/Initialization/linux-initialization-8.md) - describes preparation before scheduler initialization and initialization of it.
* [RCU initialization](https://github.com/0xAX/linux-insides/blob/master/Initialization/linux-initialization-9.md) - describes the initialization of the [RCU](http://en.wikipedia.org/wiki/Read-copy-update).
* [End of the initialization](https://github.com/0xAX/linux-insides/blob/master/Initialization/linux-initialization-10.md) - the last part about linux kernel initialization.
* [First steps after kernel decompression](linux-initialization-1.md) - describes first steps in the kernel.
* [Early interrupt and exception handling](linux-initialization-2.md) - describes early interrupts initialization and early page fault handler.
* [Last preparations before the kernel entry point](linux-initialization-3.md) - describes the last preparations before the call of the `start_kernel`.
* [Kernel entry point](linux-initialization-4.md) - describes first steps in the kernel generic code.
* [Continue of architecture-specific initializations](linux-initialization-5.md) - describes architecture-specific initialization.
* [Architecture-specific initializations, again...](linux-initialization-6.md) - describes continue of the architecture-specific initialization process.
* [The End of the architecture-specific initializations, almost...](linux-initialization-7.md) - describes the end of the `setup_arch` related stuff.
* [Scheduler initialization](linux-initialization-8.md) - describes preparation before scheduler initialization and initialization of it.
* [RCU initialization](linux-initialization-9.md) - describes the initialization of the [RCU](http://en.wikipedia.org/wiki/Read-copy-update).
* [End of the initialization](linux-initialization-10.md) - the last part about linux kernel initialization.

View File

@ -19,6 +19,7 @@ On other languages
* [Chinese](https://github.com/MintCN/linux-insides-zh)
* [Spanish](https://github.com/leolas95/linux-insides)
* [Russian](https://github.com/proninyaroslav/linux-insides-ru)
LICENSE
-------------

View File

@ -33,6 +33,7 @@
* [How the Linux kernel handles a system call](SysCall/syscall-2.md)
* [vsyscall and vDSO](SysCall/syscall-3.md)
* [How the Linux kernel runs a program](SysCall/syscall-4.md)
* [Implementation of the open system call](SysCall/syscall-5.md)
* [Timers and time management](Timers/README.md)
* [Introduction](Timers/timers-1.md)
* [Clocksource framework](Timers/timers-2.md)

View File

@ -2,9 +2,9 @@
This chapter describes synchronization primitives in the Linux kernel.
* [Introduction to spinlocks](http://0xax.gitbooks.io/linux-insides/content/SyncPrim/sync-1.html) - the first part of this chapter describes implementation of spinlock mechanism in the Linux kernel.
* [Queued spinlocks](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/sync-2.html) - the second part describes another type of spinlocks - queued spinlocks.
* [Semaphores](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/sync-3.html) - this part describes implementation of `semaphore` synchronization primitive in the Linux kernel.
* [Mutual exclusion](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/sync-4.html) - this part describes - `mutex` in the Linux kernel.
* [Reader/Writer semaphores](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/sync-5.html) - this part describes special type of semaphores - `reader/writer` semaphores.
* [Sequential locks](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/sync-6.html) - this part describes sequential locks in the Linux kernel.
* [Introduction to spinlocks](sync-1.md) - the first part of this chapter describes implementation of spinlock mechanism in the Linux kernel.
* [Queued spinlocks](sync-2.md) - the second part describes another type of spinlocks - queued spinlocks.
* [Semaphores](sync-3.md) - this part describes implementation of `semaphore` synchronization primitive in the Linux kernel.
* [Mutual exclusion](sync-4.md) - this part describes - `mutex` in the Linux kernel.
* [Reader/Writer semaphores](sync-5.md) - this part describes special type of semaphores - `reader/writer` semaphores.
* [Sequential locks](sync-6.md) - this part describes sequential locks in the Linux kernel.

View File

@ -2,7 +2,8 @@
This chapter describes the `system call` concept in the linux kernel.
* [Introduction to system call concept](http://0xax.gitbooks.io/linux-insides/content/SysCall/syscall-1.html) - this part is introduction to the `system call` concept in the Linux kernel.
* [How the Linux kernel handles a system call](http://0xax.gitbooks.io/linux-insides/content/SysCall/syscall-2.html) - this part describes how the Linux kernel handles a system call from an userspace application.
* [vsyscall and vDSO](http://0xax.gitbooks.io/linux-insides/content/SysCall/syscall-3.html) - third part describes `vsyscall` and `vDSO` concepts.
* [How the Linux kernel runs a program](http://0xax.gitbooks.io/linux-insides/content/SysCall/syscall-4.html) - this part describes startup process of a program.
* [Introduction to system call concept](syscall-1.md) - this part is introduction to the `system call` concept in the Linux kernel.
* [How the Linux kernel handles a system call](syscall-2.md) - this part describes how the Linux kernel handles a system call from an userspace application.
* [vsyscall and vDSO](syscall-3.md) - third part describes `vsyscall` and `vDSO` concepts.
* [How the Linux kernel runs a program](syscall-4.md) - this part describes startup process of a program.
* [Implementation of the open system call](syscall-5.md) - this part describes implementation of the [open](http://man7.org/linux/man-pages/man2/open.2.html) system call.

403
SysCall/syscall-5.md Normal file
View File

@ -0,0 +1,403 @@
How does the `open` system call work
--------------------------------------------------------------------------------
Introduction
--------------------------------------------------------------------------------
This is the fifth part of the chapter that describes [system calls](https://en.wikipedia.org/wiki/System_call) mechanism in the Linux kernel. Previous parts of this chapter described this mechanism in general. Now I will try to describe implementation of different system calls in the Linux kernel. Previous parts from this chapter and parts from other chapters of the books describe mostly deep parts of the Linux kernel that are faintly visible or fully invisible from the userspace. But the Linux kernel code is not only about itself. The vast of the Linux kernel code provides ability to our code. Due to the linux kernel our programs can read/write from/to files and don't know anything about sectors, tracks and other parts of a disk structures, we can send data over network and don't build encapsulated network packets by hand and etc.
I don't know how about you, but it is interesting to me not only how an operating system works, but how do my software interacts with it. As you may know, our programs interacts with the kernel through the special mechanism which is called [system call](https://en.wikipedia.org/wiki/System_call). So, I've decided to write series of parts which will describe implementation and behavior of system calls which we are using every day like `read`, `write`, `open`, `close`, `dup` and etc.
I have decided to start from the description of the [open](http://man7.org/linux/man-pages/man2/open.2.html) system call. if you have written at least one `C` program, you should know that before we are able to read/write or execute other manipulations with a file we need to open it with the `open` function:
```C
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/stat.h>
#include <sys/types.h>
int main(int argc, char *argv) {
int fd = open("test", O_RDONLY);
if fd < 0 {
perror("Opening of the file is failed\n");
}
else {
printf("file sucessfully opened\n");
}
close(fd);
return 0;
}
```
In this case, the open is the function from standard library, but not system call. The standard library will call related system call for us. The `open` call will return a [file descriptor](https://en.wikipedia.org/wiki/File_descriptor) which is just an unique number within our process which is associated with the opened file. Now as we opened a file and got file descriptor as result of `open` call, we may start to interact with this file. We can write into, read from it and etc. List of opened file by a process is available via [proc](https://en.wikipedia.org/wiki/Procfs) filesystem:
```
$ sudo ls /proc/1/fd/
0 10 12 14 16 2 21 23 25 27 29 30 32 34 36 38 4 41 43 45 47 49 50 53 55 58 6 61 63 67 8
1 11 13 15 19 20 22 24 26 28 3 31 33 35 37 39 40 42 44 46 48 5 51 54 57 59 60 62 65 7 9
```
I am not going to describe more details about the `open` routine from the userspace view in this post, but mostly from the kernel side. if you are not very familiar with, you can get more info in the [man page](http://man7.org/linux/man-pages/man2/open.2.html).
So let's start.
Definition of the open system call
--------------------------------------------------------------------------------
If you have read the [fourth part](https://github.com/0xAX/linux-insides/blob/master/SysCall/syscall-4.md) of the [linux-insides](https://0xax.gitbooks.io/linux-insides/content/index.html) book, you should know that system calls are defined with the help of `SYSCALL_DEFINE` macro. So, the `open` system call is not exception.
Definition of the `open` system call is located in the [fs/open.c](https://github.com/torvalds/linux/blob/master/fs/open.c) source code file and looks pretty small for the first view:
```C
SYSCALL_DEFINE3(open, const char __user *, filename, int, flags, umode_t, mode)
{
if (force_o_largefile())
flags |= O_LARGEFILE;
return do_sys_open(AT_FDCWD, filename, flags, mode);
}
```
As you may guess, the `do_sys_open` function from the [same](https://github.com/torvalds/linux/blob/master/fs/open.c) source code file does the main job. But before this function will be called, let's consider the `if` clause from which the implementation of the `open` system call starts:
```C
if (force_o_largefile())
flags |= O_LARGEFILE;
```
Here we apply the `O_LARGEFILE` flag to the flags which were passed to `open` system call in a case when the `force_o_largefile()` will return true.
What is `O_LARGEFILE`? We may read this in the [man page](http://man7.org/linux/man-pages/man2/open.2.html) for the `open(2)` system call:
> O_LARGEFILE
>
> (LFS) Allow files whose sizes cannot be represented in an off_t (but can be represented in an off64_t) to be opened.
As we may read in the [GNU C Library Reference Manual](https://www.gnu.org/software/libc/manual/html_mono/libc.html#File-Position-Primitive):
> off_t
>
> This is a signed integer type used to represent file sizes.
> In the GNU C Library, this type is no narrower than int.
> If the source is compiled with _FILE_OFFSET_BITS == 64 this
> type is transparently replaced by off64_t.
and
> off64_t
>
> This type is used similar to off_t. The difference is that
> even on 32 bit machines, where the off_t type would have 32 bits,
> off64_t has 64 bits and so is able to address files up to 2^63 bytes
> in length. When compiling with _FILE_OFFSET_BITS == 64 this type
> is available under the name off_t.
So it is not hard to guess that the `off_t`, `off64_t` and `O_LARGEFILE` are about a file size. In the case of the Linux kernel, the `O_LARGEFILE` is used to disallow opening large files on 32bit systems if the caller didn't specify `O_LARGEFILE` flag during opening of a file. On 64bit systems we force on this flag in open system call. And the `force_o_largefile` macro from the [include/linux/fcntl.h](https://github.com/torvalds/linux/blob/master/include/linux/fcntl.h#L7) linux kernel header file confirms this:
```C
#ifndef force_o_largefile
#define force_o_largefile() (BITS_PER_LONG != 32)
#endif
```
This macro may be architecture-specific as for example for [IA-64](https://en.wikipedia.org/wiki/IA-64) architecture, but in our case the [x86_64](https://en.wikipedia.org/wiki/X86-64) does not provide definition of the `force_o_largefile` and it will be used from [include/linux/fcntl.h](https://github.com/torvalds/linux/blob/master/include/linux/fcntl.h#L7).
So, as we may see the `force_o_largefile` is just a macro which expands to the `true` value in our case of [x86_64](https://en.wikipedia.org/wiki/X86-64) architecture. As we are considering 64-bit architecture, the `force_o_largefile` will be expanded to `true` and the `O_LARGEFILE` flag will be added to the set of flags which were passed to the `open` system call.
Now as we considered meaning of the `O_LARGEFILE` flag and `force_o_largefile` macro, we can proceed to the consideration of the implementation of the `do_sys_open` function. As I wrote above, this function is defined in the [same](https://github.com/torvalds/linux/blob/master/fs/open.c) source code file and looks:
```C
long do_sys_open(int dfd, const char __user *filename, int flags, umode_t mode)
{
struct open_flags op;
int fd = build_open_flags(flags, mode, &op);
struct filename *tmp;
if (fd)
return fd;
tmp = getname(filename);
if (IS_ERR(tmp))
return PTR_ERR(tmp);
fd = get_unused_fd_flags(flags);
if (fd >= 0) {
struct file *f = do_filp_open(dfd, tmp, &op);
if (IS_ERR(f)) {
put_unused_fd(fd);
fd = PTR_ERR(f);
} else {
fsnotify_open(f);
fd_install(fd, f);
}
}
putname(tmp);
return fd;
}
```
Let's try to understand how the `do_sys_open` works step by step.
open(2) flags
--------------------------------------------------------------------------------
As you know the `open` system call takes set of `flags` as second argument that control opening a file and `mode` as third argument that specifies permission the permissions of a file if it is created. The `do_sys_open` function starts from the call of the `build_open_flags` function which does some checks that set of the given flags is valid and handles different conditions of flags and mode.
Let's look at the implementation of the `build_open_flags`. This function is defined in the [same](https://github.com/torvalds/linux/blob/master/fs/open.c) kernel file and takes three arguments:
* flags - flags that control opening of a file;
* mode - permissions for newly created file;
The last argument - `op` is represented with the `open_flags` structure:
```C
struct open_flags {
int open_flag;
umode_t mode;
int acc_mode;
int intent;
int lookup_flags;
};
```
which is defined in the [fs/internal.h](https://github.com/torvalds/linux/blob/master/fs/internal.h#L99) header file and as we may see it holds information about flags and access mode for internal kernel purposes. As you already may guess the main goal of the `build_open_flags` function is to fill an instance of this structure.
Implementation of the `build_open_flags` function starts from the definition of local variables and one of them is:
```C
int acc_mode = ACC_MODE(flags);
```
This local variable represents access mode and its initial value will be equal to the value of expanded `ACC_MODE` macro. This macro is defined in the [include/linux/fs.h](https://github.com/torvalds/linux/blob/master/include/linux/fs.h) and looks pretty interesting:
```C
#define ACC_MODE(x) ("\004\002\006\006"[(x)&O_ACCMODE])
#define O_ACCMODE 00000003
```
The `"\004\002\006\006"` is an array of four chars:
```
"\004\002\006\006" == {'\004', '\002', '\006', '\006'}
```
So, the `ACC_MODE` macro just expands to the accession to this array by `[(x) & O_ACCMODE]` index. As we just saw, the `O_ACCMODE` is `00000003`. By applying `x & O_ACCMODE` we will take the two least significant bits which are represents `read`, `write` or `read/write` access modes:
```C
#define O_RDONLY 00000000
#define O_WRONLY 00000001
#define O_RDWR 00000002
```
After getting value from the array by the calculated index, the `ACC_MODE` will be expanded to access mode mask of a file which will hold `MAY_WRITE`, `MAY_READ` and other information.
We may see following condition after we have calculated initial access mode:
```C
if (flags & (O_CREAT | __O_TMPFILE))
op->mode = (mode & S_IALLUGO) | S_IFREG;
else
op->mode = 0;
```
Here we reset permissions in `open_flags` instance if a opened file wasn't temporary and wasn't open for creation. This is because:
> if neither O_CREAT nor O_TMPFILE is specified, then mode is ignored.
In other case if `O_CREAT` or `O_TMPFILE` were passed we canonicalize it to a regular file because a directory should be created with the [opendir](http://man7.org/linux/man-pages/man3/opendir.3.html) system call.
At the next step we check that a file is not tried to be opened via [fanotify](http://man7.org/linux/man-pages/man7/fanotify.7.html) and without the `O_CLOEXEC` flag:
```C
flags &= ~FMODE_NONOTIFY & ~O_CLOEXEC;
```
We do this to not leak a [file descriptor](https://en.wikipedia.org/wiki/File_descriptor). By default, the new file descriptor is set to remain open across an `execve` system call, but the `open` system call supports `O_CLOEXEC` flag that can be used to change this default behaviour. So we do this to prevent leaking of a file descriptor when one thread opens a file to set `O_CLOEXEC` flag and in the same time the second process does a [fork](https://en.wikipedia.org/wiki/Fork_(system_call)) + [execve](https://en.wikipedia.org/wiki/Exec_(system_call)) and as you may remember that child will have copies of the parent's set of open file descriptors.
At the next step we check that if our flags contains `O_SYNC` flag, we apply `O_DSYNC` flag too:
```
if (flags & __O_SYNC)
flags |= O_DSYNC;
```
The `O_SYNC` flag guarantees that the any write call will not return before all data has been transferred to the disk. The `O_DSYNC` is like `O_SYNC` except that there is no requirement to wait for any metadata (like `atime`, `mtime` and etc.) changes will be written. We apply `O_DSYNC` in a case of `__O_SYNC` because it is implemented as `__O_SYNC|O_DSYNC` in the Linux kernel.
After this we must be sure that if a user wants to create temporary file, the flags should contain `O_TMPFILE_MASK` or in other words it should contain or `O_CREAT` or `O_TMPFILE` or both and also it should be writeable:
```C
if (flags & __O_TMPFILE) {
if ((flags & O_TMPFILE_MASK) != O_TMPFILE)
return -EINVAL;
if (!(acc_mode & MAY_WRITE))
return -EINVAL;
} else if (flags & O_PATH) {
flags &= O_DIRECTORY | O_NOFOLLOW | O_PATH;
acc_mode = 0;
}
```
as it is written in in the manual page:
> O_TMPFILE must be specified with one of O_RDWR or O_WRONLY
If we didn't pass `O_TMPFILE` for creation of a temporary file, we check the `O_PATH` flag at the next condition. The `O_PATH` flag allows us to obtain a file descriptor that may be used for two following purposes:
* to indicate a location in the filesystem tree;
* to perform operations that act purely at the file descriptor level.
So, in this case the file itself is not opened, but operations like `dup`, `fcntl` and other can be used. So, if all file content related operations like `read`, `write` and other are permitted, only `O_DIRECTORY | O_NOFOLLOW | O_PATH` flags can be used. We have finished with flags for this moment in the `build_open_flags` for this moment and we may fill our `open_flags->open_flag` with them:
```C
op->open_flag = flags;
```
Now we have filled `open_flag` field which represents flags that will control opening of a file and `mode` that will represent `umask` of a new file if we open file for creation. There are still to fill last flags in the our `open_flags` structure. The next is `op->acc_mode` which represents access mode to a opened file. We already filled the `acc_mode` local variable with the initial value at the beginning of the `build_open_flags` and now we check last two flags related to access mode:
```C
if (flags & O_TRUNC)
acc_mode |= MAY_WRITE;
if (flags & O_APPEND)
acc_mode |= MAY_APPEND;
op->acc_mode = acc_mode;
```
These flags are - `O_TRUNC` that will truncate an opened file to length `0` if it existed before we open it and the `O_APPEND` flag allows to open a file in `append mode`. So the opened file will be appended during write but not overwritten.
The next field of the `open_flags` structure is - `intent`. It allows us to know about our intention or in other words what do we really want to do with file, open it, create, rename it or something else. So we set it to zero if our flags contains the `O_PATH` flag as we can't do anything related to a file content with this flag:
```C
op->intent = flags & O_PATH ? 0 : LOOKUP_OPEN;
```
or just to `LOOKUP_OPEN` intention. Additionally we set `LOOKUP_CREATE` intention if we want to create new file and to be sure that a file didn't exist before with `O_EXCL` flag:
```C
if (flags & O_CREAT) {
op->intent |= LOOKUP_CREATE;
if (flags & O_EXCL)
op->intent |= LOOKUP_EXCL;
}
```
The last flag of the `open_flags` structure is the `lookup_flags`:
```C
if (flags & O_DIRECTORY)
lookup_flags |= LOOKUP_DIRECTORY;
if (!(flags & O_NOFOLLOW))
lookup_flags |= LOOKUP_FOLLOW;
op->lookup_flags = lookup_flags;
return 0;
```
We fill it with `LOOKUP_DIRECTORY` if we want to open a directory and `LOOKUP_FOLLOW` if we don't want to follow (open) [symlink](https://en.wikipedia.org/wiki/Symbolic_link). That's all. It is the end of the `build_open_flags` function. The `open_flags` structure is filled with modes and flags for a file opening and we can return back to the `do_sys_open`.
Actual opening of a file
--------------------------------------------------------------------------------
At the next step after `build_open_flags` function is finished and we have formed flags and modes for our file we should get the `filename` structure with the help of the `getname` function by name of a file which was passed to the `open` system call:
```C
tmp = getname(filename);
if (IS_ERR(tmp))
return PTR_ERR(tmp);
```
The `getname` function is defined in the [fs/namei.c](https://github.com/torvalds/linux/blob/master/fs/namei.c) source code file and looks:
```C
struct filename *
getname(const char __user * filename)
{
return getname_flags(filename, 0, NULL);
}
```
So, it just calls the `getname_flags` function and returns its result. The main goal of the `getname_flags` function is to copy a file path given from userland to kernel space. The `filename` structure is defined in the [include/linux/fs.h](https://github.com/torvalds/linux/blob/master/include/linux/fs.h) linux kernel header file and contains following fields:
* name - pointer to a file path in kernel space;
* uptr - original pointer from userland;
* aname - filename from [audit](https://linux.die.net/man/8/auditd) context;
* refcnt - reference counter;
* iname - a filename in a case when it will be less than `PATH_MAX`.
As I already wrote above, the main goal of the `getname_flags` function is to copy name of a file which was passed to the `open` system call from user space to kernel space with the strncpy_from_user function. The next step after a filename will be copied to kernel space is getting of new non-busy file descriptor:
```C
fd = get_unused_fd_flags(flags);
```
The `get_unused_fd_flags` function takes table of open files of the current process, minimum (`0`) and maximum (`RLIMIT_NOFILE`) possible number of a file descriptor in the system and flags that we have passed to the `open` system call and allocates file descriptor and mark it busy in the file descriptor table of the current process. The `get_unused_fd_flags` function sets or clears the `O_CLOEXEC` flag depends on its state in the passed flags.
The last and main step in the `do_sys_open` is the `do_filp_open` function:
```C
struct file *f = do_filp_open(dfd, tmp, &op);
if (IS_ERR(f)) {
put_unused_fd(fd);
fd = PTR_ERR(f);
} else {
fsnotify_open(f);
fd_install(fd, f);
}
```
The main goal of this function is to resolve given path name into `file` structure which represents an opened file of a process. If something going wrong and execution of the `do_filp_open` function will be failed, we should free new file descriptor with the `put_unused_fd` or in other way the `file` structure returned by the `do_filp_open` will be stored in the file descriptor table of the current process.
Now let's take a short look at the implementation of the `do_filp_open` function. This function is defined in the [fs/namei.c](https://github.com/torvalds/linux/blob/master/fs/namei.c) linux kernel source code file and starts from initialization of the `nameidata` structure. This structure will provide a link to a file [inode](https://en.wikipedia.org/wiki/Inode). Actually this is one of the main point of the `do_filp_open` function to acquire an `inode` by the filename given to `open` system call. After the `nameidata` structure will be initialized, the `path_openat` function will be called:
```C
filp = path_openat(&nd, op, flags | LOOKUP_RCU);
if (unlikely(filp == ERR_PTR(-ECHILD)))
filp = path_openat(&nd, op, flags);
if (unlikely(filp == ERR_PTR(-ESTALE)))
filp = path_openat(&nd, op, flags | LOOKUP_REVAL);
```
Note that it is called three times. Actually, the Linux kernel will open the file in [RCU](https://www.kernel.org/doc/Documentation/RCU/whatisRCU.txt) mode. This is the most efficient way to open a file. If this try will be failed, the kernel enters the normal mode. The third call is relatively rare, only in the [nfs](https://en.wikipedia.org/wiki/Network_File_System) file system is likely to be used. The `path_openat` function executes `path lookup` or in other words it tries to find a `dentry` (what the Linux kernel uses to keep track of the hierarchy of files in directories) corresponding to a path.
The `path_openat` function starts from the call of the `get_empty_flip()` function that allocates a new `file` structure with some additional checks like do we exceed amount of opened files in the system or not and etc. After we have got allocated new `file` structure we call the `do_tmpfile` or `do_o_path` functions in a case if we have passed `O_TMPFILE | O_CREATE` or `O_PATH` flags during call of the `open` system call. These both cases are quite specific, so let's consider quite usual case when we want to open already existed file and want to read/write from/to it.
In this case the `path_init` function will be called. This function performs some preporatory work before actual path lookup. This includes search of start position of path traversal and its metadata like `inode` of the path, `dentry inode` and etc. This can be `root` directory - `/` or current directory as in our case, because we use `AT_CWD` as starting point (see call of the `do_sys_open` at the beginning of the post).
The next step after the `path_init` is the [loop](https://github.com/torvalds/linux/blob/master/fs/namei.c#L3457) which executes the `link_path_walk` and `do_last`. The first function executes name resolution or in other words this function starts process of walking along a given path. It handles everything step by step except the last component of a file path. This handling includes checking of a permissions and getting a file component. As a file component is gotten, it is passed to `walk_component` that updates current directory entry from the `dcache` or asks underlying filesystem. This repeats before all path's components will not be handled in such way. After the `link_path_walk` will be executed, the `do_last` function will populate a `file` structure based on the result of the `link_path_walk`. As we reached last component of the given file path the `vfs_open` function from the `do_last` will be called.
This function is defined in the [fs/open.c](https://github.com/torvalds/linux/blob/master/fs/open.c) linux kernel source code file and the main goal of this function is to call an `open` operation of underlying filesystem.
That's all for now. We didn't consider **full** implementation of the `open` system call. We skip some parts like handling case when we want to open a file from other filesystem with different mount point, resolving symlinks and etc., but it should be not so hard to follow this stuff. This stuff does not included in **generic** implementation of open system call and depends on underlying filesystem. If you are interested in, you may lookup the `file_operations.open` callback function for a certain [filesystem](https://github.com/torvalds/linux/tree/master/fs).
Conclusion
--------------------------------------------------------------------------------
This is the end of the fifth part of the implementation of different system calls in the Linux kernel. If you have questions or suggestions, ping me on twitter [0xAX](https://twitter.com/0xAX), drop me an [email](anotherworldofworld@gmail.com), or just create an [issue](https://github.com/0xAX/linux-internals/issues/new). In the next part, we will continue to dive into system calls in the Linux kernel and see the implementation of the [read](http://man7.org/linux/man-pages/man2/read.2.html) system call.
**Please note that English is not my first language and I am really sorry for any inconvenience. If you find any mistakes please send me PR to [linux-insides](https://github.com/0xAX/linux-internals).**
Links
--------------------------------------------------------------------------------
* [system call](https://en.wikipedia.org/wiki/System_call)
* [open](http://man7.org/linux/man-pages/man2/open.2.html)
* [file descriptor](https://en.wikipedia.org/wiki/File_descriptor)
* [proc](https://en.wikipedia.org/wiki/Procfs)
* [GNU C Library Reference Manual](https://www.gnu.org/software/libc/manual/html_mono/libc.html#File-Position-Primitive)
* [IA-64](https://en.wikipedia.org/wiki/IA-64)
* [x86_64](https://en.wikipedia.org/wiki/X86-64)
* [opendir](http://man7.org/linux/man-pages/man3/opendir.3.html)
* [fanotify](http://man7.org/linux/man-pages/man7/fanotify.7.html)
* [fork](https://en.wikipedia.org/wiki/Fork_(system_call))
* [execve](https://en.wikipedia.org/wiki/Exec_(system_call))
* [symlink](https://en.wikipedia.org/wiki/Symbolic_link)
* [audit](https://linux.die.net/man/8/auditd)
* [inode](https://en.wikipedia.org/wiki/Inode)
* [RCU](https://www.kernel.org/doc/Documentation/RCU/whatisRCU.txt)
* [read](http://man7.org/linux/man-pages/man2/read.2.html)
* [previous part](https://0xax.gitbooks.io/linux-insides/content/SysCall/syscall-4.html)

View File

@ -2,6 +2,6 @@
This chapter describes various theoretical concepts and concepts which are not directly related to practice but useful to know.
* [Paging](http://0xax.gitbooks.io/linux-insides/content/Theory/Paging.html)
* [Elf64 format](http://0xax.gitbooks.io/linux-insides/content/Theory/ELF.html)
* [Inline assembly](http://0xax.gitbooks.io/linux-insides/content/Theory/asm.html)
* [Paging](Paging.md)
* [Elf64 format](ELF.md)
* [Inline assembly](asm.md)

View File

@ -2,10 +2,10 @@
This chapter describes timers and time management related concepts in the linux kernel.
* [Introduction](http://0xax.gitbooks.io/linux-insides/content/Timers/timers-1.html) - An introduction to the timers in the Linux kernel.
* [Introduction to the clocksource framework](https://github.com/0xAX/linux-insides/blob/master/Timers/timers-2.md) - Describes `clocksource` framework in the Linux kernel.
* [The tick broadcast framework and dyntick](https://github.com/0xAX/linux-insides/blob/master/Timers/timers-3.md) - Describes tick broadcast framework and dyntick concept.
* [Introduction to timers](https://github.com/0xAX/linux-insides/blob/master/Timers/timers-3.md) - Describes timers in the Linux kernel.
* [Introduction to the clockevents framework](https://github.com/0xAX/linux-insides/blob/master/Timers/timers-5.md) - Describes yet another clock/time management related framework : `clockevents`.
* [x86 related clock sources](https://github.com/0xAX/linux-insides/blob/master/Timers/timers-5.md) - Describes `x86_64` related clock sources.
* [Time related system calls in the Linux kernel](https://github.com/0xAX/linux-insides/blob/master/Timers/timers-7.md) - Describes time related system calls.
* [Introduction](timers-1.md) - An introduction to the timers in the Linux kernel.
* [Introduction to the clocksource framework](timers-2.md) - Describes `clocksource` framework in the Linux kernel.
* [The tick broadcast framework and dyntick](timers-3.md) - Describes tick broadcast framework and dyntick concept.
* [Introduction to timers](timers-4.md) - Describes timers in the Linux kernel.
* [Introduction to the clockevents framework](timers-5.md) - Describes yet another clock/time management related framework : `clockevents`.
* [x86 related clock sources](timers-6.md) - Describes `x86_64` related clock sources.
* [Time related system calls in the Linux kernel](timers-7.md) - Describes time related system calls.

View File

@ -2,13 +2,13 @@
In the following posts, we will cover interrupts and exceptions handling in the linux kernel.
* [Interrupts and Interrupt Handling. Part 1.](https://github.com/0xAX/linux-insides/blob/master/interrupts/interrupts-1.md) - describes interrupts and interrupt handling theory.
* [Interrupts in the Linux Kernel](https://github.com/0xAX/linux-insides/blob/master/interrupts/interrupts-2.md) - describes stuffs related to interrupts and exceptions handling from the early stage.
* [Early interrupt handlers](https://github.com/0xAX/linux-insides/blob/master/interrupts/interrupts-3.md) - describes early interrupt handlers.
* [Interrupt handlers](https://github.com/0xAX/linux-insides/blob/master/interrupts/interrupts-4.md) - describes first non-early interrupt handlers.
* [Implementation of exception handlers](https://github.com/0xAX/linux-insides/blob/master/interrupts/interrupts-5.md) - describes implementation of some exception handlers such as double fault, divide by zero etc.
* [Handling non-maskable interrupts](https://github.com/0xAX/linux-insides/blob/master/interrupts/interrupts-6.md) - describes handling of non-maskable interrupts and remaining interrupt handlers from the architecture-specific part.
* [External hardware interrupts](https://github.com/0xAX/linux-insides/blob/master/interrupts/interrupts-7.md) - describes early initialization of code which is related to handling external hardware interrupts.
* [Non-early initialization of the IRQs](https://github.com/0xAX/linux-insides/blob/master/interrupts/interrupts-8.md) - describes non-early initialization of code which is related to handling external hardware interrupts.
* [Softirq, Tasklets and Workqueues](https://github.com/0xAX/linux-insides/blob/master/interrupts/interrupts-9.md) - describes softirqs, tasklets and workqueues concepts.
* [](https://github.com/0xAX/linux-insides/blob/master/interrupts/interrupts-10.md) - this is the last part of the `Interrupts and Interrupt Handling` chapter and here we will see a real hardware driver and some interrupts related stuff.
* [Interrupts and Interrupt Handling. Part 1.](interrupts-1.md) - describes interrupts and interrupt handling theory.
* [Interrupts in the Linux Kernel](interrupts-2.md) - describes stuffs related to interrupts and exceptions handling from the early stage.
* [Early interrupt handlers](interrupts-3.md) - describes early interrupt handlers.
* [Interrupt handlers](interrupts-4.md) - describes first non-early interrupt handlers.
* [Implementation of exception handlers](interrupts-5.md) - describes implementation of some exception handlers such as double fault, divide by zero etc.
* [Handling non-maskable interrupts](interrupts-6.md) - describes handling of non-maskable interrupts and remaining interrupt handlers from the architecture-specific part.
* [External hardware interrupts](interrupts-7.md) - describes early initialization of code which is related to handling external hardware interrupts.
* [Non-early initialization of the IRQs](interrupts-8.md) - describes non-early initialization of code which is related to handling external hardware interrupts.
* [Softirq, Tasklets and Workqueues](interrupts-9.md) - describes softirqs, tasklets and workqueues concepts.
* [Last part](interrupts-10.md) - this is the last part of the `Interrupts and Interrupt Handling` chapter and here we will see a real hardware driver and some interrupts related stuff.

View File

@ -301,7 +301,7 @@ In the end of the `early_irq_init` function we return the return value of the `a
return arch_early_irq_init();
```
This function defined in the [kernel/apic/vector.c](https://github.com/torvalds/linux/blob/master/kernel/apic/vector.c) and contains only one call of the `arch_early_ioapic_init` function from the [kernel/apic/io_apic.c](https://github.com/torvalds/linux/blob/master/kernel/apic/io_apic.c). As we can understand from the `arch_early_ioapic_init` function's name, this function makes early initialization of the [I/O APIC](https://en.wikipedia.org/wiki/Advanced_Programmable_Interrupt_Controller). First of all it make a check of the number of the legacy interrupts wit the call of the `nr_legacy_irqs` function. If we have no legacy interrupts with the [Intel 8259](https://en.wikipedia.org/wiki/Intel_8259) programmable interrupt controller we set `io_apic_irqs` to the `0xffffffffffffffff`:
This function defined in the [kernel/apic/vector.c](https://github.com/torvalds/linux/blob/master/kernel/apic/vector.c) and contains only one call of the `arch_early_ioapic_init` function from the [kernel/apic/io_apic.c](https://github.com/torvalds/linux/blob/master/kernel/apic/io_apic.c). As we can understand from the `arch_early_ioapic_init` function's name, this function makes early initialization of the [I/O APIC](https://en.wikipedia.org/wiki/Advanced_Programmable_Interrupt_Controller). First of all it make a check of the number of the legacy interrupts with the call of the `nr_legacy_irqs` function. If we have no legacy interrupts with the [Intel 8259](https://en.wikipedia.org/wiki/Intel_8259) programmable interrupt controller we set `io_apic_irqs` to the `0xffffffffffffffff`:
```C
if (!nr_legacy_irqs())

View File

@ -28,7 +28,7 @@ where `NR_VECTORS` is count of the vector number and as you can remember from th
#define NR_VECTORS 256
```
So, in the start of the `init_IRQ` function we fill the `vecto_irq` [percpu](http://0xax.gitbooks.io/linux-insides/content/Concepts/per-cpu.html) array with the vector number of the `legacy` interrupts:
So, in the start of the `init_IRQ` function we fill the `vector_irq` [percpu](http://0xax.gitbooks.io/linux-insides/content/Concepts/per-cpu.html) array with the vector number of the `legacy` interrupts:
```C
void __init init_IRQ(void)
@ -241,7 +241,7 @@ do { \
} while (0)
```
As we can see, first of all it expands to the call of the `alloc_system_vector` function that checks the given vector number in the `user_vectors` bitmap (read previous [part](http://0xax.gitbooks.io/linux-insides/content/interrupts/interrupts-7.html) about it) and if it is not set in the `user_vectors` bitmap we set it. After this we test that the `first_system_vector` is greater than given interrupt vector number and if it is greater we assign it:
As we can see, first of all it expands to the call of the `alloc_system_vector` function that checks the given vector number in the `used_vectors` bitmap (read previous [part](http://0xax.gitbooks.io/linux-insides/content/interrupts/interrupts-7.html) about it) and if it is not set in the `used_vectors` bitmap we set it. After this we test that the `first_system_vector` is greater than given interrupt vector number and if it is greater we assign it:
```C
if (!test_bit(vector, used_vectors)) {

View File

@ -3,6 +3,6 @@
This chapter describes memory management in the linux kernel. You will see here a
couple of posts which describe different parts of the linux memory management framework:
* [Memblock](https://github.com/0xAX/linux-insides/blob/master/mm/linux-mm-1.md) - describes early `memblock` allocator.
* [Fix-Mapped Addresses and ioremap](https://github.com/0xAX/linux-insides/blob/master/mm/linux-mm-2.md) - describes `fix-mapped` addresses and early `ioremap`.
* [kmemcheck](https://github.com/0xAX/linux-insides/blob/master/mm/linux-mm-3.md) - third part describes `kmemcheck` tool.
* [Memblock](linux-mm-1.md) - describes early `memblock` allocator.
* [Fix-Mapped Addresses and ioremap](linux-mm-2.md) - describes `fix-mapped` addresses and early `ioremap`.
* [kmemcheck](linux-mm-3.md) - third part describes `kmemcheck` tool.