2015-10-16 22:06:58 +08:00

Interrupts and Interrupt Handling. Part 3.

Interrupt handlers

This is the third part of the chapter about interrupt and exception handling. In the previous part we stopped in the setup_arch function from arch/x86/kernel/setup.c, at the point where the handlers for the following two exceptions are set:

  • #DB - debug exception, transfers control from the interrupted process to the debug handler;
  • #BP - breakpoint exception, caused by the int 3 instruction.

These exceptions allow the x86_64 architecture to have early exception processing for the purpose of debugging via kgdb.

As you may remember, we set these exception handlers in the early_trap_init function:

void __init early_trap_init(void)
{
        set_intr_gate_ist(X86_TRAP_DB, &debug, DEBUG_STACK);
        set_system_intr_gate_ist(X86_TRAP_BP, &int3, DEBUG_STACK);
        load_idt(&idt_descr);
}

from arch/x86/kernel/traps.c. We already saw the implementation of the set_intr_gate_ist and set_system_intr_gate_ist functions in the previous part, and now we will look at the implementation of these two early exception handlers.

Debug and Breakpoint exceptions

Ok, we set the interrupt gates for the #DB and #BP exceptions in the early_trap_init function, and now it is time to look at their handlers. But first of all let's look at the exceptions themselves. The first one, #DB or debug exception, occurs when a debug event occurs, for example an attempt to change the contents of a debug register. Debug registers are special registers which are present in x86 processors starting from the Intel 80386 and, as you can understand from the name, they are used for debugging. These registers allow us to set breakpoints on code and to trap reads or writes of data, and thus to track down the place of an error. The debug registers are privileged resources: a program can access them only in real-address mode or in protected mode at CPL 0. That's why we have used set_intr_gate_ist for #DB and not set_system_intr_gate_ist. The vector number of the #DB exception is 1 (we pass it as X86_TRAP_DB) and it has no error code:

----------------------------------------------------------------------------------------------
|Vector|Mnemonic|Description         |Type |Error Code|Source                                |
----------------------------------------------------------------------------------------------
|1     | #DB    |Debug exception     |F/T  |NO        |                                      |
----------------------------------------------------------------------------------------------

The second one is #BP or breakpoint exception. It occurs when the processor executes the int 3 instruction. We can add this instruction anywhere in our code. For example, let's look at this simple program:

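// breakpoint.c
#include <stdio.h>

int main() {
    int i = 0; // start from 0; an uninitialized counter makes the loop undefined
    while (i < 6) {
        printf("i equal to: %d\n", i);
        __asm__("int3"); // raises the #BP exception
        ++i;
    }
    return 0;
}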

If we compile and run this program, we will see the following output:

$ gcc breakpoint.c -o breakpoint
$ ./breakpoint
i equal to: 0
Trace/breakpoint trap

But if we run it under gdb, we will see our breakpoint and can continue the execution of the program:

$ gdb breakpoint
...
...
...
(gdb) run
Starting program: /home/alex/breakpoints 
i equal to: 0

Program received signal SIGTRAP, Trace/breakpoint trap.
0x0000000000400585 in main ()
=> 0x0000000000400585 <main+31>:	83 45 fc 01	add    DWORD PTR [rbp-0x4],0x1
(gdb) c
Continuing.
i equal to: 1

Program received signal SIGTRAP, Trace/breakpoint trap.
0x0000000000400585 in main ()
=> 0x0000000000400585 <main+31>:	83 45 fc 01	add    DWORD PTR [rbp-0x4],0x1
(gdb) c
Continuing.
i equal to: 2

Program received signal SIGTRAP, Trace/breakpoint trap.
0x0000000000400585 in main ()
=> 0x0000000000400585 <main+31>:	83 45 fc 01	add    DWORD PTR [rbp-0x4],0x1
...
...
...
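
The difference between the two runs above comes from what happens to the SIGTRAP signal that the process receives after #BP: without a debugger the default action terminates the program, while gdb intercepts the signal. As a minimal sketch, a program can also install its own SIGTRAP handler and keep running. The names run_breakpoints and on_sigtrap are made up for this illustration, and the raise() fallback for non-x86 machines is an assumption added only to keep the sketch portable:

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <string.h>

/* Number of SIGTRAP signals seen so far; sig_atomic_t is safe inside a handler. */
static volatile sig_atomic_t traps_seen;

static void on_sigtrap(int signo)
{
	(void)signo;
	traps_seen++;
}

/*
 * Hypothetical helper: execute `count` breakpoints and return how many
 * SIGTRAP signals our handler observed (or -1 if the handler could not
 * be installed).
 */
int run_breakpoints(int count)
{
	struct sigaction sa, old;
	int i;

	assert(count >= 0);

	memset(&sa, 0, sizeof(sa));
	sa.sa_handler = on_sigtrap;
	sigemptyset(&sa.sa_mask);
	if (sigaction(SIGTRAP, &sa, &old) != 0)
		return -1;

	traps_seen = 0;
	for (i = 0; i < count; i++) {
#if defined(__x86_64__) || defined(__i386__)
		/* #BP is a trap: the saved rip already points past int3,
		 * so returning from the handler resumes the loop. */
		__asm__ volatile("int3");
#else
		/* Assumption: elsewhere we simply raise the same signal. */
		raise(SIGTRAP);
#endif
	}

	/* Put the previous disposition back. */
	sigaction(SIGTRAP, &old, NULL);
	return (int)traps_seen;
}
```

Calling run_breakpoints(6) from main would go through all six iterations instead of stopping at the first int3, provided the program is not running under a debugger that takes the signal first.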

Now we know a little about these two exceptions and we can move on to consideration of their handlers.

Preparation before an interrupt handler

As you may have noticed, the set_intr_gate_ist and set_system_intr_gate_ist functions take the address of an exception handler in their second parameter:

  • &debug;
  • &int3.

You will not find these functions in the C code. All that can be found in the *.c/*.h files is the declaration of these functions in arch/x86/include/asm/traps.h:

asmlinkage void debug(void);
asmlinkage void int3(void);

But we can see the asmlinkage specifier here. asmlinkage is a special kernel macro built on gcc attributes. For C functions which are called from assembly, we need an explicit declaration of the calling convention. On 32-bit x86, a function marked with asmlinkage is compiled to retrieve its parameters from the stack; on x86_64 the macro does not change the calling convention, and here it mostly documents that the symbol is shared with assembly code. Both handlers are defined in the arch/x86/kernel/entry_64.S assembly source file with the idtentry macro:

idtentry debug do_debug has_error_code=0 paranoid=1 shift_ist=DEBUG_STACK
idtentry int3 do_int3 has_error_code=0 paranoid=1 shift_ist=DEBUG_STACK

Actually, debug and int3 are not the interrupt handlers themselves. Remember that before an interrupt/exception handler can be executed, some preparation is needed:

  • When an interrupt or exception occurs, the processor uses the exception or interrupt vector as an index to a descriptor in the IDT;
  • In legacy mode, the ss:esp registers are pushed on the stack only if the privilege level changes. In 64-bit mode, ss:rsp are pushed on the stack every time;
  • During stack switching with IST, the new ss selector is forced to null. The old ss and rsp are pushed on the new stack;
  • The rflags, cs, rip and, for the exceptions that have one, the error code are pushed on the stack;
  • Control is transferred to the interrupt handler;
  • After the interrupt handler finishes its work with the iret instruction, the old ss is popped from the stack and loaded into the ss register;
  • ss:rsp is popped from the stack unconditionally in 64-bit mode, and only if there is a privilege level change in legacy mode;
  • The iret instruction restores rip, cs and rflags;
  • The interrupted program continues its execution.

When control reaches the handler, the stack looks like this (offsets are relative to rsp, for an exception with an error code):

    +--------------------+
+40 |        ss          |
+32 |       rsp          |
+24 |      rflags        |
+16 |        cs          |
 +8 |       rip          |
  0 |    error code      |
    +--------------------+
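
To make the offsets in the diagram above concrete, here is a small user-space sketch that mirrors the frame as a C structure and checks the offsets at compile time. The structure name hw_exception_frame and the helper frame_from_user_mode are hypothetical and exist only for this illustration; the kernel itself does not define them:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mirror of the diagram: lowest address first. */
struct hw_exception_frame {
	uint64_t error_code;	/* +0, present only for some exceptions */
	uint64_t rip;		/* +8 */
	uint64_t cs;		/* +16 */
	uint64_t rflags;	/* +24 */
	uint64_t rsp;		/* +32 */
	uint64_t ss;		/* +40 */
};

/* Compile-time checks that the C layout matches the diagram. */
_Static_assert(offsetof(struct hw_exception_frame, rip) == 8, "rip at +8");
_Static_assert(offsetof(struct hw_exception_frame, cs) == 16, "cs at +16");
_Static_assert(offsetof(struct hw_exception_frame, rflags) == 24, "rflags at +24");
_Static_assert(offsetof(struct hw_exception_frame, rsp) == 32, "rsp at +32");
_Static_assert(offsetof(struct hw_exception_frame, ss) == 40, "ss at +40");

/*
 * The two low bits of the saved cs hold the privilege level of the
 * interrupted code: 3 means userspace, 0 means the kernel.
 */
int frame_from_user_mode(const struct hw_exception_frame *frame)
{
	assert(frame != NULL);
	return (frame->cs & 3) == 3;
}
```

The same two bits of the saved cs are what the entry code tests later to decide whether the exception came from userspace.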

Now we can look at the preparation before the processor transfers control to an interrupt/exception handler from the practical side. As I already wrote above, the first thirteen exception handlers are defined in the arch/x86/kernel/entry_64.S assembly file with the idtentry macro:

.macro idtentry sym do_sym has_error_code:req paranoid=0 shift_ist=-1
ENTRY(\sym)
...
...
...
END(\sym)
.endm

This macro defines an exception entry point and, as we can see, it takes five arguments:

  • sym - the name of the entry point, which is exported as a global symbol with .globl;
  • do_sym - the C function that handles the exception;
  • has_error_code:req - tells whether the exception has an error code. The :req qualifier tells the assembler that the argument is required;
  • paranoid - shows how we need to check the current mode;
  • shift_ist - shows which IST stack to use.

As we can see, our two exception handlers are almost the same:

idtentry debug do_debug has_error_code=0 paranoid=1 shift_ist=DEBUG_STACK
idtentry int3 do_int3 has_error_code=0 paranoid=1 shift_ist=DEBUG_STACK

They differ only in the global name and in the name of the C handler. Now let's look at how the idtentry macro is implemented. It starts with two checks:

	.if \shift_ist != -1 && \paranoid == 0
	.error "using shift_ist requires paranoid=1"
	.endif

	.if \has_error_code
	XCPT_FRAME
	.else
	INTR_FRAME
	.endif

The first check makes sure that an exception which uses the Interrupt Stack Table also has paranoid set; otherwise it emits an error with the .error directive. The second if clause checks whether the exception has an error code and invokes the XCPT_FRAME or INTR_FRAME macro depending on it. These macros just expand to a set of CFI directives which are used by GNU AS to manage call frames. The CFI directives are used only to generate dwarf2 unwind information for better backtraces and they don't change any code, so we will not go into detail about them, and from this point I will skip all code related to these directives. In the next step we check the error code again, and if the exception does not have one we push -1 on the stack in its place:

.ifeq \has_error_code
	pushq_cfi $-1
.endif

The pushq_cfi macro is defined in arch/x86/include/asm/dwarf2.h and expands to the pushq instruction, which pushes the given value, followed by a CFI annotation:

	.macro pushq_cfi reg
	pushq \reg
	CFI_ADJUST_CFA_OFFSET 8
	.endm

Pay attention to the $-1. We already know that when an exception occurs, the processor pushes ss, rsp, rflags, cs and rip on the stack. In the saved register frame they live at the following offsets:

#define RIP		16*8
#define CS		17*8
#define EFLAGS	18*8
#define RSP		19*8
#define SS		20*8

With pushq \reg we fill the slot right below RIP, which holds the error code of an exception:

#define ORIG_RAX	15*8

ORIG_RAX contains the error code of an exception, the IRQ number for a hardware interrupt and the system call number on system call entry. For an exception without an error code the slot holds the -1 placeholder. In the next step we can see the ALLOC_PT_GPREGS_ON_STACK macro, which allocates space for the 15 general purpose registers on the stack:

.macro ALLOC_PT_GPREGS_ON_STACK addskip=0
subq	$15*8+\addskip, %rsp
CFI_ADJUST_CFA_OFFSET 15*8+\addskip
.endm
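
The offsets ORIG_RAX, RIP, CS, EFLAGS, RSP and SS above describe one contiguous register frame. The following sketch models that frame in user space and checks the offsets at compile time. The structure name pt_regs_model, the field names and the helper pt_regs_base are assumptions made for this illustration (the field order follows the usual x86_64 layout); it is not the kernel's definition:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* User-space model of the saved register frame (assumed field names). */
struct pt_regs_model {
	uint64_t r15, r14, r13, r12, bp, bx;		/* extra registers */
	uint64_t r11, r10, r9, r8, ax, cx, dx, si, di;	/* C registers */
	uint64_t orig_ax;	/* error code, IRQ number or syscall number */
	uint64_t ip, cs, flags, sp, ss;			/* pushed by the CPU */
};

#define GPREGS_BYTES (15 * 8)	/* what ALLOC_PT_GPREGS_ON_STACK reserves */

_Static_assert(offsetof(struct pt_regs_model, orig_ax) == 15 * 8, "ORIG_RAX");
_Static_assert(offsetof(struct pt_regs_model, ip) == 16 * 8, "RIP");
_Static_assert(offsetof(struct pt_regs_model, cs) == 17 * 8, "CS");
_Static_assert(offsetof(struct pt_regs_model, flags) == 18 * 8, "EFLAGS");
_Static_assert(offsetof(struct pt_regs_model, sp) == 19 * 8, "RSP");
_Static_assert(offsetof(struct pt_regs_model, ss) == 20 * 8, "SS");

/*
 * Model of the first two steps of the entry code: an optional
 * "pushq $-1" followed by "subq $15*8, %rsp". Takes the stack pointer
 * as the CPU left it and returns the address where the frame starts.
 */
uint64_t pt_regs_base(uint64_t rsp_at_entry, int has_error_code)
{
	uint64_t rsp = rsp_at_entry;

	assert(rsp % 8 == 0);
	if (!has_error_code)
		rsp -= 8;		/* the -1 placeholder fills orig_ax */
	rsp -= GPREGS_BYTES;		/* room for the 15 registers */
	return rsp;
}
```

With or without a hardware error code, the result points to a frame of the same shape, which is why the code that follows can address CS(%rsp) without caring which exception it serves.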

After this we check paranoid. If it is equal to 1, we test the two low bits of the saved CS, which hold the privilege level of the interrupted code. If any of them is set, we came from userspace:

.if \paranoid
  .if \paranoid == 1
    CFI_REMEMBER_STATE
	testl $3, CS(%rsp)
	jnz 1f
  .endif
  call paranoid_entry
.else
  call error_entry
.endif
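
The nested .if directives above boil down to a small decision, which can be sketched in C as follows. The enum and the function choose_entry_path are hypothetical names used only to restate the assembly logic; they do not exist in the kernel:

```c
#include <assert.h>
#include <stdint.h>

enum entry_path { ERROR_ENTRY, PARANOID_ENTRY };

/*
 * Restates the assembly above:
 *  - paranoid == 0: always error_entry;
 *  - paranoid == 1: error_entry if the saved cs says "userspace"
 *    (testl $3 / jnz 1f), paranoid_entry otherwise;
 *  - any other non-zero value: always paranoid_entry.
 */
enum entry_path choose_entry_path(int paranoid, uint64_t saved_cs)
{
	assert(paranoid >= 0);

	if (paranoid == 0)
		return ERROR_ENTRY;
	if (paranoid == 1 && (saved_cs & 3) != 0)
		return ERROR_ENTRY;
	return PARANOID_ENTRY;
}
```

For our two handlers paranoid is 1, so a breakpoint hit in a user program takes the error_entry path, while one hit inside the kernel takes paranoid_entry.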

If we came from userspace, we jump to the label 1, which starts with the call error_entry instruction. error_entry saves all registers in the pt_regs structure, which represents an interrupt/exception stack frame and is defined in arch/x86/include/uapi/asm/ptrace.h. It saves the common and extra registers on the stack with:

SAVE_C_REGS 8
SAVE_EXTRA_REGS 8

from rdi to r15 and executes the swapgs instruction. This instruction gives the Linux kernel a way to obtain a pointer to the kernel data structures while saving the user's gsbase. After this we exit from error_entry with the ret instruction. After error_entry has finished, since we came from userspace, we need to switch to the kernel stack of the interrupted task:

	movq %rsp,%rdi
	call sync_regs

We have just saved all registers into the pt_regs structure in error_entry, so we put the address of pt_regs into rdi and call the sync_regs function from arch/x86/kernel/traps.c:

asmlinkage __visible notrace struct pt_regs *sync_regs(struct pt_regs *eregs)
{
	struct pt_regs *regs = task_pt_regs(current);
	*regs = *eregs;
	return regs;
}
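
What sync_regs does can be tried out in user space with a simplified model: the register frame is copied to another location and the caller continues with a pointer to the copy. In this sketch the structure regs_model is a trimmed stand-in for pt_regs, and the static variable task_stack_slot is a hypothetical replacement for what task_pt_regs(current) returns in the kernel:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Trimmed stand-in for pt_regs: only the part pushed by the CPU. */
struct regs_model {
	uint64_t ip, cs, flags, sp, ss;
};

/*
 * Hypothetical replacement for task_pt_regs(current): one fixed slot
 * that plays the role of the frame on the task's own kernel stack.
 */
static struct regs_model task_stack_slot;

/* Copy the frame and hand back a pointer to the copy, like sync_regs. */
struct regs_model *sync_regs_model(const struct regs_model *eregs)
{
	struct regs_model *regs = &task_stack_slot;

	assert(eregs != NULL);
	*regs = *eregs;		/* structure assignment copies every field */
	return regs;
}
```

The important property is that the returned pointer differs from the argument while the contents are equal, so the code can leave the original stack behind without losing the saved state.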

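This function copies the saved registers from the IST stack to the kernel stack of the current task and returns a pointer to the copy; this is how we switch off the IST stack if we came from user mode. After this we switch to the stack which we got from sync_regs: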

movq %rax,%rsp
movq %rsp,%rdi

and put the pointer to pt_regs into rdi again. In the last step we call the exception handler:

call \do_sym

So the real exception handlers are the do_debug and do_int3 functions. We will see these functions in this part, but a little later. First let's finish with the preparation before control is transferred to the handler. In the other case, when paranoid is set and either it is not 1 or we came from kernel space, we call paranoid_entry. It does almost the same as error_entry, but it checks the current mode in a slower but more accurate way:

ENTRY(paranoid_entry)
	SAVE_C_REGS 8
	SAVE_EXTRA_REGS 8
	...
	...
	movl $MSR_GS_BASE,%ecx
	rdmsr
	testl %edx,%edx
	js 1f	/* negative -> in kernel */
	SWAPGS
	...
	...
	ret
END(paranoid_entry)
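
The sign test in paranoid_entry relies on the fact that rdmsr returns the 64-bit value of the MSR in the edx:eax pair, so the sign bit of edx is the top bit of the GS base. A short sketch of that test in C follows; the function names are made up for the illustration, and it is an assumption of the sketch that kernel addresses have the top bit set while user addresses do not:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Model of "rdmsr; testl %edx,%edx; js 1f": take the high 32 bits of
 * the GS base and look at their sign bit.
 */
int gs_base_is_kernel(uint64_t gs_base)
{
	uint32_t edx = (uint32_t)(gs_base >> 32);

	return (edx & 0x80000000u) != 0;	/* negative -> in kernel */
}

/* SWAPGS is executed only when the GS base is still the user's one. */
int paranoid_needs_swapgs(uint64_t gs_base)
{
	return !gs_base_is_kernel(gs_base);
}
```

This check is slower than looking at the saved cs because it reads an MSR, but it still gives the right answer when the exception hits kernel code that has not executed swapgs yet.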

If edx is negative, the GS base already points to kernel data, so we are in kernel mode and SWAPGS is skipped. Now that all registers are stored on the stack and the mode is checked, we need to shift the IST stack pointer if an IST stack is set for the given exception, call the exception handler and restore the IST stack pointer:

	.if \shift_ist != -1
	subq $EXCEPTION_STKSZ, CPU_TSS_IST(\shift_ist)
	.endif

	call \do_sym

	.if \shift_ist != -1
	addq $EXCEPTION_STKSZ, CPU_TSS_IST(\shift_ist)
	.endif
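
The subq/addq pair around the call can be modelled as follows: while the handler runs, the IST slot in the TSS points one stack size lower, so an exception of the same kind that arrives in the meantime starts on a fresh area instead of overwriting the frame in use. Everything in this sketch is an assumption made for illustration: the tss_model structure, the zero-based slot index, the function names and the stack size of 4096 bytes:

```c
#include <assert.h>
#include <stdint.h>

#define EXCEPTION_STKSZ_MODEL 4096u	/* assumed size of one exception stack */
#define IST_SLOTS 7

struct tss_model {
	uint64_t ist[IST_SLOTS];	/* stack pointers used on IST switches */
};

typedef void (*handler_fn)(struct tss_model *tss, int slot, uint64_t *seen);

/* Run the handler with the IST slot moved down, then move it back. */
void call_with_shifted_ist(struct tss_model *tss, int slot,
			   handler_fn handler, uint64_t *seen)
{
	assert(slot >= 0 && slot < IST_SLOTS);

	tss->ist[slot] -= EXCEPTION_STKSZ_MODEL;	/* subq before the call */
	handler(tss, slot, seen);
	tss->ist[slot] += EXCEPTION_STKSZ_MODEL;	/* addq after the call */
}

/* Example handler: record where a nested exception would start. */
void record_ist(struct tss_model *tss, int slot, uint64_t *seen)
{
	*seen = tss->ist[slot];
}
```

After call_with_shifted_ist returns, the slot holds its original value again, which matches the symmetric addq in the assembly.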

In the last step, when the exception handler has finished its work, all registers are restored from the stack with the RESTORE_C_REGS and RESTORE_EXTRA_REGS macros and control is returned to the interrupted task. That's all. Now we know about the preparation before an interrupt/exception handler starts to execute, and we can go directly to the implementation of the handlers.

Implementation of interrupt and exception handlers

Both handlers, do_debug and do_int3, are defined in the arch/x86/kernel/traps.c source code file and have two things in common. All interrupt/exception handlers are marked with the dotraplinkage prefix, which expands to:

#define dotraplinkage __visible
#define __visible __attribute__((externally_visible))

which tells the compiler that something else uses this function (in our case these functions are called from the assembly interrupt preparation code). They also take two parameters:

  • pointer to the pt_regs structure which contains registers of the interrupted task;
  • error code.

First of all let's consider the do_debug handler. This function starts by getting the previous state with the ist_enter function from arch/x86/kernel/traps.c. We call it because we need to know whether we came to the handler from kernel mode or from user mode:

prev_state = ist_enter(regs);

The ist_enter function returns the previous context state and makes a couple of preparations before we continue to handle the exception. It starts with a check of the previous mode with the user_mode_vm macro. It takes the pt_regs structure, which contains the registers of the interrupted task, and returns 1 if we came from userspace and 0 if we came from kernel space. Depending on the previous mode, we call exception_enter if we came from userspace, or inform RCU if we came from kernel space:

...
if (user_mode_vm(regs)) {
	prev_state = exception_enter();
} else {
	rcu_nmi_enter();
	prev_state = IN_KERNEL;
}
...
...
...
return prev_state;

After this we load the DR6 debug register into the dr6 variable with the get_debugreg macro from arch/x86/include/asm/debugreg.h:

get_debugreg(dr6, 6);
dr6 &= ~DR6_RESERVED;

DR6 is the debug status register; it contains information about the reason why the #DB exception was raised. After we have loaded its value into the dr6 variable, we filter out all reserved bits (bits 4-11 and 16-31 in the DR6_RESERVED mask). In the next step we check the dr6 value and the previous mode with the following condition:

if (!dr6 && user_mode_vm(regs))
	user_icebp = 1;

If dr6 does not show any reason why we caught this trap and we came from user mode, we set user_icebp to one, which means that user code wants to get a SIGTRAP signal. In the next step we check whether it was a kmemcheck trap, and if yes we go to exit:

if ((dr6 & DR_STEP) && kmemcheck_trap(regs))
	goto exit;
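
The masking of dr6 and the user_icebp decision described above are plain bit operations, so they can be tried outside the kernel. In this sketch the constant values are written down from the x86 debug register definitions as I remember them and should be treated as assumptions, and the function names filter_dr6 and is_user_icebp are made up for the illustration:

```c
#include <assert.h>

/* Assumed values of the kernel's debug register constants. */
#define DR6_RESERVED	0xFFFF0FF0ul	/* reserved bits of DR6 */
#define DR_TRAP_BITS	0x0000000Ful	/* breakpoint 0-3 hit */
#define DR_STEP		0x00004000ul	/* single-step trap */

/* Same operation as "dr6 &= ~DR6_RESERVED". */
unsigned long filter_dr6(unsigned long raw)
{
	return raw & ~DR6_RESERVED;
}

/*
 * Same condition as "if (!dr6 && user_mode_vm(regs))": no recorded
 * reason plus user mode means the trap was requested by user code.
 */
int is_user_icebp(unsigned long dr6, int from_user)
{
	assert(from_user == 0 || from_user == 1);
	return !dr6 && from_user;
}
```

For example, a raw value of 0xFFFF4FF1 keeps only DR_STEP and the first breakpoint bit after filtering, while a raw value with nothing but reserved bits becomes 0 and, in user mode, counts as an icebp trap.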

After all these checks, we clear the DR6 register, clear the TIF_BLOCKSTEP thread flag (the flag that requests DEBUGCTLMSR_BTF, which provides single-stepping on branches), save the dr6 value for the current thread and increase the debug_stack_usage per-cpu variable with:

set_debugreg(0, 6);
clear_tsk_thread_flag(tsk, TIF_BLOCKSTEP);
tsk->thread.debugreg6 = dr6;
debug_stack_usage_inc();

As we have saved dr6, we can allow irqs again. This is done with the preempt_conditional_sti function:

static inline void preempt_conditional_sti(struct pt_regs *regs)
{
        preempt_count_inc();
        if (regs->flags & X86_EFLAGS_IF)
                local_irq_enable();
}
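
The idea of preempt_conditional_sti is that interrupts are re-enabled only if they were enabled in the interrupted context, which is recorded in the IF bit of the saved flags. A user-space model of this function and of its counterpart preempt_conditional_cli may look as follows. The cpu_model structure and the function names are hypothetical, and the body of the cli variant is my assumption that it simply mirrors the sti variant in reverse order:

```c
#include <assert.h>

#define X86_EFLAGS_IF 0x200ul	/* interrupt enable flag, bit 9 of rflags */

/* Hypothetical stand-in for the per-CPU state the kernel touches. */
struct cpu_model {
	int preempt_count;
	int irqs_enabled;
};

/* Model of preempt_conditional_sti: disable preemption, maybe enable irqs. */
void model_conditional_sti(struct cpu_model *cpu, unsigned long saved_flags)
{
	cpu->preempt_count++;
	if (saved_flags & X86_EFLAGS_IF)
		cpu->irqs_enabled = 1;
}

/* Assumed mirror image: maybe disable irqs, then re-enable preemption. */
void model_conditional_cli(struct cpu_model *cpu, unsigned long saved_flags)
{
	if (saved_flags & X86_EFLAGS_IF)
		cpu->irqs_enabled = 0;
	assert(cpu->preempt_count > 0);
	cpu->preempt_count--;
}
```

If the interrupted code ran with interrupts disabled, the saved flags have IF clear and the handler keeps them disabled for its whole run.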

You can read more about local_irq_enable and related stuff in the second part about interrupt handling in the Linux kernel. In the next step we check whether the previous mode was virtual 8086 mode and handle the trap:

if (regs->flags & X86_VM_MASK) {
	handle_vm86_trap((struct kernel_vm86_regs *) regs, error_code, X86_TRAP_DB);
	preempt_conditional_cli(regs);
	debug_stack_usage_dec();
	goto exit;
}
...
...
...
exit:
	ist_exit(regs, prev_state);

If we did not come from virtual 8086 mode, we need to check the dr6 value and the previous mode as we did above. Here we check whether the single-step bit is set and we did not come from user mode. In this case we clear the DR_STEP bit in the dr6 copy of the current thread, set the TIF_SINGLESTEP flag and clear the Trap flag in the saved flags:

if ((dr6 & DR_STEP) && !user_mode(regs)) {
        tsk->thread.debugreg6 &= ~DR_STEP;
        set_tsk_thread_flag(tsk, TIF_SINGLESTEP);
        regs->flags &= ~X86_EFLAGS_TF;
}

Then we get the SIGTRAP signal code:

si_code = get_si_code(tsk->thread.debugreg6);

and send the signal if dr6 reports a single step or a breakpoint, or if this is a user icebp trap:

if (tsk->thread.debugreg6 & (DR_STEP | DR_TRAP_BITS) || user_icebp)
	send_sigtrap(tsk, regs, error_code, si_code);
preempt_conditional_cli(regs);
debug_stack_usage_dec();
exit:
	ist_exit(regs, prev_state);
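
The last two decisions of do_debug, the fixup for a single step that happened in kernel mode and the condition for sending SIGTRAP, can be restated as a small user-space model. The debug_state structure and the function names are hypothetical, and the constant values are assumptions written down from the x86 definitions:

```c
#include <assert.h>
#include <stddef.h>

/* Assumed values of the kernel constants used below. */
#define DR_TRAP_BITS	0x0000000Ful	/* breakpoint 0-3 hit */
#define DR_STEP		0x00004000ul	/* single-step trap */
#define X86_EFLAGS_TF	0x00000100ul	/* trap flag, bit 8 of rflags */

/* Hypothetical bundle of the state that do_debug works on. */
struct debug_state {
	unsigned long dr6;	/* per-thread copy of DR6 */
	unsigned long flags;	/* saved rflags of the interrupted code */
	int singlestep_flag;	/* stands in for TIF_SINGLESTEP */
};

/* A single step inside the kernel is remembered, not signalled. */
void fixup_kernel_step(struct debug_state *s, int from_user)
{
	assert(s != NULL);
	if ((s->dr6 & DR_STEP) && !from_user) {
		s->dr6 &= ~DR_STEP;
		s->singlestep_flag = 1;
		s->flags &= ~X86_EFLAGS_TF;
	}
}

/* SIGTRAP goes out for a step, a breakpoint hit or a user icebp. */
int should_send_sigtrap(const struct debug_state *s, int user_icebp)
{
	assert(s != NULL);
	return (s->dr6 & (DR_STEP | DR_TRAP_BITS)) || user_icebp;
}
```

In this model a single step taken in kernel mode ends with DR_STEP cleared, so no signal is sent at that point, while the same step in user mode leaves the bit set and leads to SIGTRAP.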

In the end we disable irqs, decrement the value of debug_stack_usage and exit from the exception handler with the ist_exit function.

The second exception handler is do_int3, defined in the same source code file, arch/x86/kernel/traps.c. In do_int3 we do almost the same as in the do_debug handler: we get the previous state with ist_enter, increment and decrement the debug_stack_usage per-cpu variable, and enable and disable local interrupts. But of course there is one difference between these two handlers: we need to lock and then sync processor cores during breakpoint patching.

That's all.

Conclusion

This is the end of the third part about interrupts and interrupt handling in the Linux kernel. We saw the initialization of the Interrupt Descriptor Table with the #DB and #BP gates in the previous part, and in this part we started to dive into the preparation before control is transferred to an exception handler and into the implementation of some interrupt handlers. In the next part we will continue to dive into this theme, go further through the setup_arch function and try to understand the interrupt handling related stuff.

If you have any questions or suggestions, write me a comment or ping me on twitter.

Please note that English is not my first language, and I am really sorry for any inconvenience. If you find any mistakes, please send me a PR to linux-insides.