linux-insides/SysCall/syscall-2.md
2015-08-30 20:00:58 +06:00

System calls in the Linux kernel. Part 2.

How the Linux kernel handles a system call

The previous part was the first part of the chapter that describes the system call concept in the Linux kernel. We learned what a system call is in the Linux kernel and in operating system kernels in general, looked at the concept from user space, and even saw part of the implementation of the write system call. In this part we will continue to dive into this topic and, as we usually do in other chapters of this book, after some theory we will sink lower and lower and go directly to the Linux kernel code.

A user application does not call a system call handler directly. We did not write the Hello world! program like this:

int main(int argc, char **argv)
{
	...
	...
	...
	sys_write(fd1, buf, strlen(buf));
	...
	...
}

We can do something similar with the help of the C standard library, and it will look something like this:

#include <unistd.h>

int main(int argc, char **argv)
{
	...
	...
	...
	write(fd1, buf, strlen(buf));
	...
	...
}

But anyway, write is not a system call itself and not a kernel function. An application must fill the general purpose registers with the correct values in a fixed order and execute the syscall instruction to make the real system call. In this part we will learn what happens in the Linux kernel when the processor meets the syscall instruction.
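To make the register convention concrete, here is a minimal sketch of a write call made without the libc wrapper. The helper name raw_write is hypothetical, and the inline assembly assumes GCC or Clang on x86_64 Linux; on other targets the sketch falls back to the generic syscall() function from unistd.h.

```c
#include <stddef.h>
#include <sys/syscall.h>   /* SYS_write */
#include <unistd.h>        /* syscall(), used only by the fallback */

/* Hypothetical helper: issue the write system call directly. */
static long raw_write(int fd, const void *buf, size_t len)
{
#if defined(__x86_64__) && defined(__linux__)
	long ret;

	/*
	 * rax = system call number, rdi/rsi/rdx = first three arguments.
	 * The syscall instruction overwrites rcx (return address) and
	 * r11 (saved flags), so both are listed as clobbered.
	 */
	__asm__ volatile ("syscall"
			  : "=a"(ret)
			  : "a"((long)SYS_write), "D"((long)fd), "S"(buf), "d"(len)
			  : "rcx", "r11", "memory");
	return ret;	/* byte count on success, negative errno on failure */
#else
	/* Fallback: returns -1 and sets errno on failure. */
	return syscall(SYS_write, fd, buf, len);
#endif
}
```

Calling raw_write(1, "Hello world!\n", 13) should behave like the write call from the example above, only without going through the C library wrapper.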

Initialization of the system calls table

From the previous part we know that the system call concept is very similar to an interrupt. Historically, system calls were implemented as software interrupts (for example int 0x80 on 32-bit x86), and the syscall instruction works in a similar way: when the processor executes it, control is transferred from the user application to an entry point inside the kernel. As we know, all handlers (in other words, the kernel C functions that react to such an event) are placed in the kernel code. But how does the Linux kernel find the address of the handler for a given system call? The Linux kernel contains a special table called the system call table. The system call table is represented by the sys_call_table array, which is defined in the arch/x86/entry/syscall_64.c source code file. Let's look at its implementation:

asmlinkage const sys_call_ptr_t sys_call_table[__NR_syscall_max+1] = {
	[0 ... __NR_syscall_max] = &sys_ni_syscall,
    #include <asm/syscalls_64.h>
};

As we can see, sys_call_table is an array of __NR_syscall_max + 1 elements, where the __NR_syscall_max macro represents the maximum system call number for the given architecture. This book is about the x86_64 architecture, so in our case __NR_syscall_max is 322, and this is the correct number for now (at the time of writing, the Linux kernel version is 4.2.0-rc8+). We can see this macro in the header file generated by Kbuild during kernel compilation, include/generated/asm-offsets.h:

#define __NR_syscall_max 322

The same number of system calls can be found in arch/x86/entry/syscalls/syscall_64.tbl for x86_64. There are two things we must pay attention to: the type of the sys_call_table array and the initialization of its elements. First, the type. The sys_call_ptr_t represents a pointer to a system call handler. It is defined as a typedef for a pointer to a function that returns nothing and takes no arguments:

typedef void (*sys_call_ptr_t)(void);

The second thing is the initialization of the sys_call_table array. As we can see in the code above, all elements of the array, which holds pointers to the system call handlers, point to sys_ni_syscall. The sys_ni_syscall function represents not-implemented system calls. Yes, for now all elements of the sys_call_table array point to the not-implemented system call. This is correct behaviour, because at this point we only initialize the storage for the pointers to the system call handlers; we will fill it in later. The implementation of sys_ni_syscall is pretty simple: it just returns a negative errno value, -ENOSYS in our case:

asmlinkage long sys_ni_syscall(void)
{
	return -ENOSYS;
}

The -ENOSYS error tells us that:

ENOSYS          Function not implemented (POSIX.1)
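The negative errno convention is easier to see from the user space side. The following is a minimal sketch of how a C library wrapper might translate a raw kernel return value; the helper name demo_syscall_ret is hypothetical, and the assumption that values from -4095 to -1 denote errors follows the usual Linux convention rather than anything shown in this part.

```c
#include <errno.h>

/*
 * Hypothetical helper: convert a raw kernel return value into the
 * "-1 plus errno" form that C library wrappers expose.
 * Assumption: raw values in the range [-4095, -1] are -errno codes.
 */
static long demo_syscall_ret(long raw)
{
	if ((unsigned long)raw >= (unsigned long)-4095L) {
		errno = (int)-raw;	/* e.g. -ENOSYS becomes errno == ENOSYS */
		return -1;
	}
	return raw;			/* success: pass the value through */
}
```

With this convention, a handler that returns -ENOSYS is seen by the application as a call that returned -1 with errno set to ENOSYS.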

Also note the ... in the initialization of sys_call_table. We can write it thanks to a GCC extension to designated initializers that allows a range of elements to be initialized with the same value. Designated initializers also allow us to initialize elements in any order. As you may notice, we include the asm/syscalls_64.h header at the end of the array. This header file is generated by a special script, arch/x86/entry/syscalls/syscalltbl.sh, from the syscall table. The asm/syscalls_64.h header contains invocations of the following macro:

__SYSCALL_COMMON(0, sys_read, sys_read)
__SYSCALL_COMMON(1, sys_write, sys_write)
__SYSCALL_COMMON(2, sys_open, sys_open)
__SYSCALL_COMMON(3, sys_close, sys_close)
__SYSCALL_COMMON(5, sys_newfstat, sys_newfstat)
...
...
...

The __SYSCALL_COMMON macro is defined in the same source code file and expands to the __SYSCALL_64 macro, which in turn expands to an initializer for one element of the array:

#define __SYSCALL_COMMON(nr, sym, compat) __SYSCALL_64(nr, sym, compat)
#define __SYSCALL_64(nr, sym, compat) [nr] = sym,

So, after this, our sys_call_table takes the following form:

asmlinkage const sys_call_ptr_t sys_call_table[__NR_syscall_max+1] = {
	[0 ... __NR_syscall_max] = &sys_ni_syscall,
	[0] = sys_read,
	[1] = sys_write,
	[2] = sys_open,
	...
	...
	...
};

After this, all elements that correspond to non-implemented system calls will contain the address of the sys_ni_syscall function, which just returns -ENOSYS as we saw above, and the other elements will point to the sys_syscall_name functions.

At this point the system call table is filled and the Linux kernel knows where each system call handler is. But the Linux kernel does not call a sys_syscall_name function immediately after it gets control to handle a system call from a user space application. Remember the chapter about interrupts and interrupt handling. When the Linux kernel gets control to handle an interrupt, it has to make some preparations, such as saving the user space registers, switching to a new stack and more, before it calls the interrupt handler. The situation is the same for system call handling. The preparation before handling a system call is one thing, but before the Linux kernel can start these preparations, the system call entry point must be initialized, and only then does the Linux kernel know where to make them. In the next paragraph we will see how the system call entry is initialized in the Linux kernel.
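The same "fill everything with a default, then override selected slots" pattern can be tried in user space. This is only a sketch: all demo_ names and the table size are hypothetical, and the range designator is a GNU extension, so it assumes GCC or Clang (which may warn about overridden initializers).

```c
#include <errno.h>

#define DEMO_NR_MAX 7	/* hypothetical maximum number for this sketch */

typedef long (*demo_call_ptr_t)(void);

static long demo_ni(void)    { return -ENOSYS; }	/* default handler */
static long demo_read(void)  { return 100; }
static long demo_write(void) { return 101; }

/* Same shape as __SYSCALL_64: one designated initializer per entry. */
#define DEMO_SYSCALL(nr, sym) [nr] = sym,

static const demo_call_ptr_t demo_table[DEMO_NR_MAX + 1] = {
	/* First every slot points to the default handler... */
	[0 ... DEMO_NR_MAX] = demo_ni,
	/* ...then later initializers override individual slots. */
	DEMO_SYSCALL(0, demo_read)
	DEMO_SYSCALL(1, demo_write)
};
```

Here demo_table[0]() and demo_table[1]() reach the real handlers, while any other index up to DEMO_NR_MAX still returns -ENOSYS.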

Initialization of the system call entry

When a system call occurs in the system, where are the first bytes of code that start to handle it? As we can read in the Intel manual (64-ia-32-architectures-software-developer-vol-2b-manual):

SYSCALL invokes an OS system-call handler at privilege level 0.
It does so by loading RIP from the IA32_LSTAR MSR

It means that we need to put the system call entry point into the IA32_LSTAR model specific register. This operation takes place during the Linux kernel initialization process. If you have read the fourth part of the chapter that describes interrupts and interrupt handling in the Linux kernel, you know that the Linux kernel calls the trap_init function during initialization. This function is defined in the arch/x86/kernel/traps.c source code file and initializes the non-early exception handlers, such as divide error, coprocessor error and so on. Besides initializing the non-early exception handlers, this function calls the cpu_init function from the arch/x86/kernel/cpu/common.c source code file, which, besides initializing per-cpu state, calls the syscall_init function from the same source code file.

This function does all the work of initializing the system call entry point. Let's look at its implementation. It takes no parameters and first of all fills two model specific registers:

wrmsrl(MSR_STAR,  ((u64)__USER32_CS)<<48  | ((u64)__KERNEL_CS)<<32);
wrmsrl(MSR_LSTAR, entry_SYSCALL_64);

The first model specific register, MSR_STAR, holds the user code segment selector in bits 63:48. The sysret instruction, which returns from a system call to user code with the corresponding privilege level, uses these bits to load the CS and SS segment registers. MSR_STAR also holds the kernel code segment selector in bits 47:32, which is used as the base selector for the CS and SS segment registers when a user space application executes a system call. In the second line of code we fill the MSR_LSTAR register with the entry_SYSCALL_64 symbol, which represents the system call entry point. The entry_SYSCALL_64 is defined in the arch/x86/entry/entry_64.S assembly file and contains the code that performs the preparation before a system call handler is executed (I already wrote about these preparations above). We will not consider entry_SYSCALL_64 now, but will return to it later in this chapter.

After we have set the entry point for system calls, we need to set the following model specific registers:
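The way the two selectors are packed into MSR_STAR, and how the processor derives CS and SS from them, can be sketched in plain C. The selector values 0x10 and 0x23 are assumptions (typical values of __KERNEL_CS and __USER32_CS on x86_64), the function names are hypothetical, and the derivation rules follow the description of SYSCALL and SYSRET in the Intel manual.

```c
#include <stdint.h>

/* Assumed selector values, typical for x86_64 Linux. */
#define DEMO_KERNEL_CS 0x10u
#define DEMO_USER32_CS 0x23u

/* Same packing as the wrmsrl(MSR_STAR, ...) call above. */
static uint64_t demo_star(void)
{
	return ((uint64_t)DEMO_USER32_CS << 48) |
	       ((uint64_t)DEMO_KERNEL_CS << 32);
}

/* syscall: CS comes from bits 47:32, SS is the next descriptor (+8). */
static uint16_t demo_syscall_cs(uint64_t star)
{
	return (uint16_t)((star >> 32) & 0xfffc);
}

static uint16_t demo_syscall_ss(uint64_t star)
{
	return (uint16_t)(((star >> 32) & 0xffff) + 8);
}

/* sysret to 64-bit code: CS is bits 63:48 plus 16, SS is plus 8, RPL 3. */
static uint16_t demo_sysret64_cs(uint64_t star)
{
	return (uint16_t)((((star >> 48) & 0xffff) + 16) | 3);
}

static uint16_t demo_sysret_ss(uint64_t star)
{
	return (uint16_t)((((star >> 48) & 0xffff) + 8) | 3);
}
```

Under these assumptions, the sysret selectors come out as 0x33 for CS and 0x2b for SS, which is why the 32-bit user code selector, rather than the 64-bit one, is the value written to bits 63:48.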

  • MSR_CSTAR - target rip for the compatibility mode callers;
  • MSR_IA32_SYSENTER_CS - target cs for the sysenter instruction;
  • MSR_IA32_SYSENTER_ESP - target esp for the sysenter instruction;
  • MSR_IA32_SYSENTER_EIP - target eip for the sysenter instruction.

The values of these model specific registers depend on the CONFIG_IA32_EMULATION kernel configuration option. If this option is enabled, legacy 32-bit programs can run under a 64-bit kernel. In the first case, when CONFIG_IA32_EMULATION is enabled, we fill MSR_CSTAR with the entry point for system calls made in compatibility mode:

wrmsrl(MSR_CSTAR, entry_SYSCALL_compat);

We also write the kernel code segment to MSR_IA32_SYSENTER_CS, put zero into the stack pointer and write the address of the entry_SYSENTER_compat symbol to the instruction pointer:

wrmsrl_safe(MSR_IA32_SYSENTER_CS, (u64)__KERNEL_CS);
wrmsrl_safe(MSR_IA32_SYSENTER_ESP, 0ULL);
wrmsrl_safe(MSR_IA32_SYSENTER_EIP, (u64)entry_SYSENTER_compat);

Otherwise, if the CONFIG_IA32_EMULATION kernel configuration option is disabled, we write the ignore_sysret symbol to MSR_CSTAR:

wrmsrl(MSR_CSTAR, ignore_sysret);

which is defined in the arch/x86/entry/entry_64.S assembly file and just returns the -ENOSYS error code:

ENTRY(ignore_sysret)
	mov	$-ENOSYS, %eax
	sysret
END(ignore_sysret)

Now we need to fill the MSR_IA32_SYSENTER_CS, MSR_IA32_SYSENTER_ESP and MSR_IA32_SYSENTER_EIP model specific registers, as we did in the previous code for the case when CONFIG_IA32_EMULATION was enabled. In this case (when CONFIG_IA32_EMULATION is not set) we fill MSR_IA32_SYSENTER_ESP and MSR_IA32_SYSENTER_EIP with zero and put the invalid segment of the Global Descriptor Table into the MSR_IA32_SYSENTER_CS model specific register:

wrmsrl_safe(MSR_IA32_SYSENTER_CS, (u64)GDT_ENTRY_INVALID_SEG);
wrmsrl_safe(MSR_IA32_SYSENTER_ESP, 0ULL);
wrmsrl_safe(MSR_IA32_SYSENTER_EIP, 0ULL);

You can read more about the Global Descriptor Table in the second part of the chapter that describes the booting process of the Linux kernel.

At the end of the syscall_init function, we set the flags that must be masked in the flags register by writing them to the MSR_SYSCALL_MASK model specific register:

wrmsrl(MSR_SYSCALL_MASK,
	   X86_EFLAGS_TF|X86_EFLAGS_DF|X86_EFLAGS_IF|
	   X86_EFLAGS_IOPL|X86_EFLAGS_AC|X86_EFLAGS_NT);

The processor clears these flags in the flags register every time the syscall instruction is executed. That's all; this is the end of the syscall_init function, and it means that the system call entry is ready to work. Now we can see what happens when a user application executes the syscall instruction.

Preparation before the system call handler is called
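The effect of the mask can be sketched with ordinary integer arithmetic. The flag bit values below are the architectural positions of these bits in the flags register; the DEMO_ names and the helper function are hypothetical.

```c
#include <stdint.h>

/* Architectural bit positions in the flags register. */
#define DEMO_EFLAGS_TF   0x00000100u	/* trap flag */
#define DEMO_EFLAGS_IF   0x00000200u	/* interrupt enable flag */
#define DEMO_EFLAGS_DF   0x00000400u	/* direction flag */
#define DEMO_EFLAGS_IOPL 0x00003000u	/* I/O privilege level */
#define DEMO_EFLAGS_NT   0x00004000u	/* nested task */
#define DEMO_EFLAGS_AC   0x00040000u	/* alignment check */

/* Same set of flags that syscall_init writes to MSR_SYSCALL_MASK. */
#define DEMO_SYSCALL_MASK \
	(DEMO_EFLAGS_TF | DEMO_EFLAGS_DF | DEMO_EFLAGS_IF | \
	 DEMO_EFLAGS_IOPL | DEMO_EFLAGS_AC | DEMO_EFLAGS_NT)

/*
 * On syscall the processor first saves the old flags in r11 and then
 * clears every bit that is set in the mask.
 */
static uint64_t demo_flags_after_syscall(uint64_t user_flags)
{
	return user_flags & ~(uint64_t)DEMO_SYSCALL_MASK;
}
```

For example, user flags of 0x246 (with IF set) become 0x46 on entry, which matches the fact that interrupts are off when the entry code starts to run.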

As I already wrote, before a system call handler or an interrupt handler is called by the Linux kernel, some preparations are needed. The idtentry macro makes these preparations before an exception handler is executed, the interrupt macro makes them before an interrupt handler is called, and entry_SYSCALL_64 makes them before a system call handler is executed.

The entry_SYSCALL_64 is defined in the arch/x86/entry/entry_64.S assembly file and starts with the following macro:

SWAPGS_UNSAFE_STACK

This macro is defined in the arch/x86/include/asm/irqflags.h header file and expands to the swapgs instruction:

#define SWAPGS_UNSAFE_STACK	swapgs

which exchanges the current GS base register value with the value contained in the MSR_KERNEL_GS_BASE model specific register. In other words, GS now points to the kernel's per-cpu data instead of the user's GS base. After this we save the old stack pointer in the rsp_scratch per-cpu variable and set the stack pointer to the top of the kernel stack for the current processor:

movq	%rsp, PER_CPU_VAR(rsp_scratch)
movq	PER_CPU_VAR(cpu_current_top_of_stack), %rsp

In the next step we push the stack segment and the old stack pointer onto the stack:

pushq	$__USER_DS
pushq	PER_CPU_VAR(rsp_scratch)

After this we enable interrupts, because interrupts are off on entry, and save on the stack the flags, the code segment, the return address, the general purpose registers (except bp, bx and r12 through r15) and -ENOSYS as the default return value for a non-implemented system call:

ENABLE_INTERRUPTS(CLBR_NONE)

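pushq	%r11			/* user flags, saved in r11 by syscall */
pushq	$__USER_CS		/* user code segment */
pushq	%rcx			/* return address, saved in rcx by syscall */
pushq	%rax			/* system call number */
pushq	%rdi			/* first argument */
pushq	%rsi			/* second argument */
pushq	%rdx			/* third argument */
pushq	%rcx
pushq	$-ENOSYS		/* default return value */
pushq	%r8			/* fifth argument */
pushq	%r9			/* sixth argument */
pushq	%r10			/* fourth argument */
pushq	%r11
sub	$(6*8), %rsp		/* room for bp, bx, r12-r15 (not saved) */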

When a system call is made from a user application, the general purpose registers have the following state:

  • rax - contains system call number;
  • rcx - contains the return address to user space;
  • r11 - contains the saved flags register;
  • rdi - contains first argument of a system call handler;
  • rsi - contains second argument of a system call handler;
  • rdx - contains third argument of a system call handler;
  • r10 - contains fourth argument of a system call handler;
  • r8 - contains fifth argument of a system call handler;
  • r9 - contains sixth argument of a system call handler;

The other general purpose registers (rbp, rbx and r12 through r15) are callee-preserved in the C ABI. So we push the flags register on top of the stack, then the user code segment, the return address to user space, the system call number, the first three arguments, rcx, the -ENOSYS default return value for a non-implemented system call, and then the remaining arguments and r11.

In the next step we check the _TIF_WORK_SYSCALL_ENTRY flags in the current thread_info:
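The resulting stack frame can be pictured as a C structure whose fields appear in the reverse order of the pushes, because the stack grows towards lower addresses. The sketch below is modelled on the kernel's struct pt_regs, but the demo_ names are hypothetical and the layout is reconstructed only from the push sequence above.

```c
#include <stddef.h>
#include <stdint.h>

/* Lowest address first, i.e. the last pushed value comes first. */
struct demo_pt_regs {
	/* Reserved by sub $(6*8), %rsp and not saved on this path. */
	uint64_t r15, r14, r13, r12, bp, bx;
	/* Saved by the push sequence, from the last push to the first. */
	uint64_t r11, r10, r9, r8;
	uint64_t ax;		/* pushq $-ENOSYS: return value slot */
	uint64_t cx, dx, si, di;
	uint64_t orig_ax;	/* system call number */
	uint64_t ip;		/* return address taken from rcx */
	uint64_t cs;		/* $__USER_CS */
	uint64_t flags;		/* user flags taken from r11 */
	uint64_t sp;		/* old user stack pointer */
	uint64_t ss;		/* $__USER_DS */
};

/* Byte offsets of selected slots relative to the stack pointer. */
enum {
	DEMO_RAX    = offsetof(struct demo_pt_regs, ax),
	DEMO_RIP    = offsetof(struct demo_pt_regs, ip),
	DEMO_EFLAGS = offsetof(struct demo_pt_regs, flags),
	DEMO_RSP    = offsetof(struct demo_pt_regs, sp),
};
```

With this layout the return value slot sits 80 bytes above the stack pointer and the whole frame is 21 * 8 = 168 bytes, so the entry code can address any saved register as a fixed offset from rsp.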

testl	$_TIF_WORK_SYSCALL_ENTRY, ASM_THREAD_INFO(TI_flags, %rsp, SIZEOF_PTREGS)
jnz	tracesys

The _TIF_WORK_SYSCALL_ENTRY macro is defined in the arch/x86/include/asm/thread_info.h header file and provides the set of thread information flags that are related to system call tracing:

#define _TIF_WORK_SYSCALL_ENTRY \
    (_TIF_SYSCALL_TRACE | _TIF_SYSCALL_EMU | _TIF_SYSCALL_AUDIT |   \
    _TIF_SECCOMP | _TIF_SINGLESTEP | _TIF_SYSCALL_TRACEPOINT |     \
    _TIF_NOHZ)

We will not consider debugging/tracing related stuff in this chapter, but will see it in a separate chapter devoted to debugging and tracing techniques in the Linux kernel. If we did not jump to the tracesys label, the next label is entry_SYSCALL_64_fastpath. In entry_SYSCALL_64_fastpath we check the __SYSCALL_MASK, which is defined in the arch/x86/include/asm/unistd.h header file as:

# ifdef CONFIG_X86_X32_ABI
#  define __SYSCALL_MASK (~(__X32_SYSCALL_BIT))
# else
#  define __SYSCALL_MASK (~0)
# endif

where the __X32_SYSCALL_BIT is

#define __X32_SYSCALL_BIT	0x40000000

As we can see, __SYSCALL_MASK depends on the CONFIG_X86_X32_ABI kernel configuration option and represents the mask for the x32 ABI (32-bit pointers in 64-bit mode) in the 64-bit kernel.

So we check the value of __SYSCALL_MASK: if CONFIG_X86_X32_ABI is disabled, we compare the value of the rax register with the maximum syscall number (__NR_syscall_max); otherwise, if CONFIG_X86_X32_ABI is enabled, we mask the eax register with __SYSCALL_MASK (which clears the __X32_SYSCALL_BIT) and do the same comparison:

#if __SYSCALL_MASK == ~0
	cmpq	$__NR_syscall_max, %rax
#else
	andl	$__SYSCALL_MASK, %eax
	cmpl	$__NR_syscall_max, %eax
#endif

After this we check the result of the comparison with the ja instruction, which jumps if both the CF and ZF flags are zero, that is, if the system call number is greater than __NR_syscall_max as an unsigned value:

ja	1f

and if the system call number is valid, we move the fourth argument from r10 to rcx to follow the x86_64 C ABI and execute the call instruction with the address of the system call handler:

movq	%r10, %rcx
call	*sys_call_table(, %rax, 8)

Note that sys_call_table is the array that we saw above in this part. As we already know, the rax general purpose register contains the system call number and each element of sys_call_table is 8 bytes in size. So we use the *sys_call_table(, %rax, 8) notation to find the correct offset in the sys_call_table array for the given system call handler.

That's all. We made all the preparations and the system call handler was called for the given system call, for example sys_read, sys_write or another system call handler that is defined with the SYSCALL_DEFINE[N] macro in the Linux kernel code.
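The check and the indexed call described above can be modelled in C. This is a sketch of the branch where the x32 mask is applied; every demo_ name, the tiny three-entry table and the handlers are hypothetical and exist only to make the control flow visible.

```c
#include <errno.h>
#include <stdint.h>

#define DEMO_NR_MAX  2u			/* hypothetical maximum number */
#define DEMO_X32_BIT 0x40000000u	/* same value as __X32_SYSCALL_BIT */
#define DEMO_MASK    (~DEMO_X32_BIT)

typedef long (*demo_handler_t)(long, long, long, long, long, long);

static long demo_sum3(long a, long b, long c, long d, long e, long f)
{
	(void)d; (void)e; (void)f;
	return a + b + c;
}

static long demo_fourth(long a, long b, long c, long d, long e, long f)
{
	(void)a; (void)b; (void)c; (void)e; (void)f;
	return d;	/* fourth argument: arrives in r10, handed over in rcx */
}

static long demo_ni(long a, long b, long c, long d, long e, long f)
{
	(void)a; (void)b; (void)c; (void)d; (void)e; (void)f;
	return -ENOSYS;
}

static const demo_handler_t demo_table[DEMO_NR_MAX + 1] = {
	demo_sum3, demo_fourth, demo_ni,
};

static long demo_dispatch(uint64_t rax, const long args[6])
{
	/* andl $__SYSCALL_MASK, %eax: clear the x32 bit, keep 32 bits. */
	uint32_t nr = (uint32_t)rax & DEMO_MASK;

	/* cmpl + ja: unsigned compare, so huge or "negative" numbers fail. */
	if (nr > DEMO_NR_MAX)
		return -ENOSYS;	/* the value pre-loaded into the ax slot */

	/* call *sys_call_table(, %rax, 8): index times pointer size. */
	return demo_table[nr](args[0], args[1], args[2],
			      args[3], args[4], args[5]);
}
```

Because the comparison is unsigned, a number such as -1 is rejected by the same check that rejects numbers above the maximum, and the caller simply sees -ENOSYS.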

Exit from a system call

After a system call handler finishes its work, we return to arch/x86/entry/entry_64.S, right after the place where we called the system call handler:

call	*sys_call_table(, %rax, 8)

The next step after we have returned from a system call handler is to put the return value of the handler on the stack. We know that a system call returns its result to the user program in the general purpose rax register, so after the system call handler has finished its work we move its value to the stack:

movq	%rax, RAX(%rsp)

into the RAX slot of the saved registers.

After this we can see the LOCKDEP_SYS_EXIT macro from arch/x86/include/asm/irqflags.h:

LOCKDEP_SYS_EXIT

The implementation of this macro depends on the CONFIG_DEBUG_LOCK_ALLOC kernel configuration option, which allows us to debug locks on exit from a system call. Again, we will not consider it in this chapter, but will return to it in a separate one. At the end of entry_SYSCALL_64 we restore all general purpose registers except rcx and r11, because the rcx register must contain the return address to the application that made the system call and the r11 register must contain the old flags register. After the general purpose registers are restored, we fill rcx with the return address, r11 with the flags and rsp with the old stack pointer:

RESTORE_C_REGS_EXCEPT_RCX_R11

movq	RIP(%rsp), %rcx
movq	EFLAGS(%rsp), %r11
movq	RSP(%rsp), %rsp

USERGS_SYSRET64

At the end we use the USERGS_SYSRET64 macro, which expands to the swapgs instruction, which again exchanges the user GS and kernel GS, and the sysretq instruction, which exits from the system call:

#define USERGS_SYSRET64				\
	swapgs;	           				\
	sysretq;

Now we know what happens when a user application makes a system call. The full path of this process is the following:

  • The user application contains code that fills the general purpose registers with values (the system call number and the arguments of this system call);
  • The processor switches from user mode to kernel mode and starts executing the system call entry, entry_SYSCALL_64;
  • entry_SYSCALL_64 switches to the kernel stack and saves some general purpose registers, the old stack pointer, the code segment, the flags and so on, on the stack;
  • entry_SYSCALL_64 checks the system call number in the rax register, looks up the system call handler in sys_call_table and calls it if the system call number is valid;
  • If the system call number is not valid, it jumps to the exit from the system call;
  • After the system call handler finishes its work, entry_SYSCALL_64 restores the general purpose registers, the old stack pointer, the flags and the return address and exits with the sysretq instruction.

That's all.

Conclusion

This is the end of the second part about the system call concept in the Linux kernel. In the previous part we saw the theory behind this concept from the point of view of a user application. In this part we continued to dive into the stuff related to the system call concept and saw what the Linux kernel does when a system call occurs.

If you have questions or suggestions, feel free to ping me on twitter 0xAX, drop me an email or just create an issue.

Please note that English is not my first language and I am really sorry for any inconvenience. If you find any mistakes, please send me a PR to linux-insides.