linux-insides/interrupts/interrupts-9.md
2015-08-09 20:11:15 +06:00

30 KiB

Interrupts and Interrupt Handling. Part 9.

Introduction to deferred interrupts (Softirq, Tasklets and irqs)

It is the ninth part of the linux-insides book and in the previous Previous part we saw implementation of the init_IRQ from that defined in the arch/x86/kernel/irqinit.c source code file. So, we will continue to dive into the initialization stuff which is related to the external hardware interrupts in this part.

After the init_IRQ function we can see the call of the softirq_init function in the init/main.c. This function defined in the kernel/softirq.c source code file and as we can understand from its name, this function makes initialization of the softirq or in other words initialization of the deferred interrupts. What is it deferreed intrrupt? We already saw a little bit about it in the ninth part of the chapter that describes initialization process of the Linux kernel. There are three types of deffered interrupts in the Linux kernel:

  • softirqs;
  • tasklets;
  • workqueues;

And we will see description of all of these types in this part. As I said, we saw only a little bit about this theme, so, now is time to dive deep into details about this theme.

Deferred interrupts

Interrupts may have different important characteristics and there are two among them:

  • Handler of an interrupt must execute quickly;
  • Sometime an interrupt handler must do a large amount of work.

As you can understand, it is almost impossible to make so that both characteristics were valid. Because of these, previously the handling of interrupts was splitted into two parts:

  • Top half;
  • Bottom half;

Once the Linux kernel was one of the ways the organization postprocessing, and which was called: the bottom half of the processor, but now it is already not actual. Now this term has remained as a common noun referring to all the different ways of organizing deffered processing of an interrupt. With the advent of parallelisms in the Linux kernel, all new schemes of implementation of the bottom half handlers are built on the performance of the processor specific kernel thread that called ksoftirqd (will be discussed below). The softirq mechanism represents handling of interrupts that are almost important as the handling of the hardware interrupts. The deferred processing of an interrupt suggests that some of the actions for an interrupt may be postponed to a later execution when the system will be less loaded. As you can suggests, an interrupt handler can do large amount of work that is impermissible as it executes in the context where interrupts are disabled. That's why processing of an interrupt can be splitted on two different parts. In the first part, the main handler of an interrupt does only minimal and the most important job. After this it schedules the second part and finishes its work. When the system less busy and context of the processor allows to handle interrupts, the second part starts its work and finishes to process remaing part of a deferred interrupt. That is main explanation of the deferred interrupt handling.

As I already wrote above, handling of deferred interrupts (or softirq in other words) and accordingly tasklets provided by a set of the special kernel threads (one thread per processor). Each processor has own thread and it is called ksoftirqd/n where the n is the number of the processor. We can see it in the output of the systemd-cgls util:

$ systemd-cgls -k | grep ksoft
├─   3 [ksoftirqd/0]
├─  13 [ksoftirqd/1]
├─  18 [ksoftirqd/2]
├─  23 [ksoftirqd/3]
├─  28 [ksoftirqd/4]
├─  33 [ksoftirqd/5]
├─  38 [ksoftirqd/6]
├─  43 [ksoftirqd/7]

The spawn_ksoftirqd function starts this these threads. As we can see this function called as early initcall:

early_initcall(spawn_ksoftirqd);

Deferred interrupts are determined statically at compile-time Linux kernel and the open_softirq function takes care of softirq initialization. The open_softirq function defined in the kernel/softirq.c:

void open_softirq(int nr, void (*action)(struct softirq_action *))
{
	softirq_vec[nr].action = action;
}

and as we can see this function uses two parameters:

  • the index of the softirq_vec array;
  • a pointer to the softirq function to be executed;

First of all let's look on the softirq_vec array:

static struct softirq_action softirq_vec[NR_SOFTIRQS] __cacheline_aligned_in_smp;

it defined in the same source code file. As we can see, the softirq_vec array may contain NR_SOFTIRQS or 10 types of softirqs that has type - softirq_action. First of all about its elements. In the actual version of the Linux kernel there are ten softirq vectors defined; two for tasklet processing, two for networking, two for the block layer, two for timers, and one each for the scheduler and read-copy-update processing. All of these kinds represented by the following enum:

enum
{
        HI_SOFTIRQ=0,
        TIMER_SOFTIRQ,
        NET_TX_SOFTIRQ,
        NET_RX_SOFTIRQ,
        BLOCK_SOFTIRQ,
        BLOCK_IOPOLL_SOFTIRQ,
        TASKLET_SOFTIRQ,
        SCHED_SOFTIRQ,
        HRTIMER_SOFTIRQ,
        RCU_SOFTIRQ,
        NR_SOFTIRQS
};

All names of these kinds of softirqs represented by the following array:

const char * const softirq_to_name[NR_SOFTIRQS] = {
        "HI", "TIMER", "NET_TX", "NET_RX", "BLOCK", "BLOCK_IOPOLL",
        "TASKLET", "SCHED", "HRTIMER", "RCU"
};

Or we can see it in the output of the /proc/softirqs:

~$ cat /proc/softirqs 
                    CPU0       CPU1       CPU2       CPU3       CPU4       CPU5       CPU6       CPU7       
          HI:          5          0          0          0          0          0          0          0
       TIMER:     332519     310498     289555     272913     282535     279467     282895     270979
      NET_TX:       2320          0          0          2          1          1          0          0
      NET_RX:     270221        225        338        281        311        262        430        265
       BLOCK:     134282         32         40         10         12          7          8          8
BLOCK_IOPOLL:          0          0          0          0          0          0          0          0
     TASKLET:     196835          2          3          0          0          0          0          0
       SCHED:     161852     146745     129539     126064     127998     128014     120243     117391
     HRTIMER:          0          0          0          0          0          0          0          0
         RCU:     337707     289397     251874     239796     254377     254898     267497     256624

As we can see the softirq_vec array has softirq_action types. This is main data structure related to the softirq mechanism, so all softirqs represented by the softirq_action structure. The softirq_action structure consists only of one the field: an action pointer to the softirq function:

struct softirq_action
{
         void    (*action)(struct softirq_action *);
};

So, after this we can understand thatthe open_softirq function fills the softirq_vec array with the given softirq_action. The registered deferred interrupt (with the call of the open_softirq function) for it to be queued for execution, it should be activated by the call of the raise_softirq function. This function takes only one parameter - softirq index nr. Let's look on its implementation:

void raise_softirq(unsigned int nr)
{
        unsigned long flags;

        local_irq_save(flags);
        raise_softirq_irqoff(nr);
        local_irq_restore(flags);
}

Here we can see the call of the raise_softirq_irqoff function between the local_irq_save and the local_irq_restore macros. The local_irq_save defined in the include/linux/irqflags.h header file and saves the state of the IF flag of the eflags register and disables interrupts on the local processor. The local_irq_restore macro defined in the same header file and does the opposite thing: restores the interrupt flag and enables interrupts. We disable interrupts here because a softirq interrupt runs in interrupt context and that one softirq (and no others) will be run.

The raise_softirq_irqoff function marks the softirq as deffered by setting the bit corresponding to the given index nr in the softirq bit mask (__softirq_pending) of the local processor. It does it with the help of the:

__raise_softirq_irqoff(nr);

macro. After this, it checks the result of the in_interrupt that returns irq_count value. We already saw the irq_count in the first part of this chapter and it is used to check if a CPU is already on an interrupt stack or not. We just exit from the raise_softirq_irqoff, restore IF flang and enable interrupts on the local processor If we are in interrupt context. In another way we call the wakeup_softirqd:

if (!in_interrupt())
	wakeup_softirqd();

Where the wakeup_softirqd function activates the ksoftirqd kernel thread of the local processor:

static void wakeup_softirqd(void)
{
	struct task_struct *tsk = __this_cpu_read(ksoftirqd);

    if (tsk && tsk->state != TASK_RUNNING)
        wake_up_process(tsk);
}

Each ksoftirqd kernel thread runs the run_ksoftirqd function that checks existence of deferred interrupts and calls the __do_softirq function depends on result. This function reads the __softirq_pending softirq bit mask of the local processor and executes the deferrable functions corresponding to every set bit. During executing a deferred function, new pending softirqs might occur. The main problem here that execution of the userspace code can be delayed for a long time while the __do_softirq function will handle deferred interrupts. For this purpose, it has the limit of the time when it must be finsihed:

unsigned long end = jiffies + MAX_SOFTIRQ_TIME;
...
...
...
restart:
while ((softirq_bit = ffs(pending))) {
	...
	h->action(h);
	...
}
...
...
...
pending = local_softirq_pending();
if (pending) {
	if (time_before(jiffies, end) && !need_resched() &&
		--max_restart)
            goto restart;
}
...			

Checks of the existence of the deferred interrupts performed periodically and there are some points where this check occurs. The main point where this situation occurs is the call of the do_IRQ function that defined in the arch/x86/kernel/irq.c and provides main possibilities for actual interrupt processing in the Linux kernel. When this function will finish to handle an interrupt, it calls the exiting_irq function from the arch/x86/include/asm/apic.h that expands to the call of the irq_exit function. The irq_exit checks deferred interrupts, current context and calls the invoke_softirq function:

if (!in_interrupt() && local_softirq_pending())
    invoke_softirq();

that executes the __do_softirq too. So what do we have in summary. Each softirq goes through the following stages: Registration of a softirq with the open_softirq function. Activation of a softirq by marking it as deferred with the raise_softirq function. After this, all marked softirqs will be runned in the next time the Linux kernel schedules a round of executions of deferrable functions. And execution of the deferred functions that have the same type.

As I already wrote, the softirqs are statically allocated and it is a problem for a kernel module that can be loaded. The second concept that built on top of softirq - the tasklets solves this problem.

Tasklets

If you will read source code of the Linux kernel that is related to the softirq, you will note that it is very rarely used there. The preferable way to implement deferrable functions are tasklets. As I already wrote above the tasklets are built on top of the softirq concept and generally on top of two softirqs:

  • TASKLET_SOFTIRQ;
  • HI_SOFTIRQ.

In sort words, tasklets are softirqs that can be allocated and initialized at runtime and unlike softirqs, tasklets that have the same type cannot be runned on a sereral processors in one time. Ok, now we know a little bit about the softirqs, of course previous text does not cover all aspects about this, but now we can directly look on the code and to know more about the softirqs step by step on practice and to know about tasklets. Let's back to the implementation of the softirq_init function that we talked about in the beginning of this part. This function defined in the kernel/softirq.c source code file, let's look on its implementation:

void __init softirq_init(void)
{
        int cpu;

        for_each_possible_cpu(cpu) {
                per_cpu(tasklet_vec, cpu).tail =
                        &per_cpu(tasklet_vec, cpu).head;
                per_cpu(tasklet_hi_vec, cpu).tail =
                        &per_cpu(tasklet_hi_vec, cpu).head;
        }

        open_softirq(TASKLET_SOFTIRQ, tasklet_action);
        open_softirq(HI_SOFTIRQ, tasklet_hi_action);
}

We can see defineition of the integer cpu variable at the beginning of the softirq_init function. Next we will use it as parameter for the for_each_possible_cpu macro that goes through the all possible processors in the system. If the possible processor is the new terminology for you, you can know more about it the CPU masks chapter. In short words, possible cpus is the set of processors that can be plugged in anytime during the life of that system boot. All possible processors stored in the cpu_possible_bits bitmap, you can find its definition in the kernel/cpu.c:

static DECLARE_BITMAP(cpu_possible_bits, CONFIG_NR_CPUS) __read_mostly;
...
...
...
const struct cpumask *const cpu_possible_mask = to_cpumask(cpu_possible_bits);

Ok, we defined the integer cpu variable and go through the all possible processors with the for_each_possible_cpu macro and makes initialization of the two following per-cpu variables:

  • tasklet_vec;
  • tasklet_hi_vec;

These two per-cpu variables defined in the same source code file as the softirq_init function and represent two tasklet_head structures:

static DEFINE_PER_CPU(struct tasklet_head, tasklet_vec);
static DEFINE_PER_CPU(struct tasklet_head, tasklet_hi_vec);

Where tasklet_head structure represent list of Tasklets and contains two fields, head and tail:

struct tasklet_head {
        struct tasklet_struct *head;
        struct tasklet_struct **tail;
};

The tasklet_struct structure defined in the include/linux/interrupt.h and represents the Tasklet. Previously we did not see this word in this book. Let's try to understand what is it tasklet. Actually, the tasklet is an one of mechanisms to handle deferred interrupt. Let's look on the implementation if the tasklet_struct structure:

struct tasklet_struct
{
        struct tasklet_struct *next;
        unsigned long state;
        atomic_t count;
        void (*func)(unsigned long);
        unsigned long data;
};

As we can see this structure contains five fields, they are:

  • Next tasklet in the scheduling queue;
  • State of the tasklet;
  • Represent current state of the tasklet, active or not;
  • Main callback of the tasklet;
  • Parameter of the callback.

In our case, we set only for initialize only two arrays of tasklets in the softirq_init function: the tasklet_vec and the tasklet_hi_vec. Tasklets and high-priority tasklets are stored in the tasklet_vec and tasklet_hi_vec arrays, respectively. So, we have initialized these arrays and now we can see two calls of the open_softirq function that defined in the kernel/softirq.c source code file:

open_softirq(TASKLET_SOFTIRQ, tasklet_action);
open_softirq(HI_SOFTIRQ, tasklet_hi_action);

in the end of the softirq_init function. The main purpose of the open_softirq function is initalization of softirq. Let's look on the implementation of the open_softirq function.

, in our case they are: tasklet_action and the tasklet_hi_action or the softirq function associated with the HI_SOFTIRQ softirq is named tasklet_hi_action and softirq function associated with the TASKLET_SOFTIRQ is named tasklet_action. The Linux kernel provides API for the manipulating of tasklets. First of all it is the tasklet_init function that takes tasklet_struct, function and parameter for it and initializes the given tasklet_struct with the given data:

void tasklet_init(struct tasklet_struct *t,
                  void (*func)(unsigned long), unsigned long data)
{
    t->next = NULL;
    t->state = 0;
    atomic_set(&t->count, 0);
    t->func = func;
    t->data = data;
}

There are additional methods to initialize a tasklet statically with the two following macros:

DECLARE_TASKLET(name, func, data);
DECLARE_TASKLET_DISABLED(name, func, data);	

The Linux kernel provides three following functions to mark a tasklet as ready to run:

void tasklet_schedule(struct tasklet_struct *t);
void tasklet_hi_schedule(struct tasklet_struct *t);
void tasklet_hi_schedule_first(struct tasklet_struct *t);

The first function schedules a tasklet with the normal priory, the second with the high priority and the third out of turn. Implementation of the all of these three functions is similar, so we will consider only the first - tasklet_schedule. Let's look on its implementation:

static inline void tasklet_schedule(struct tasklet_struct *t)
{
    if (!test_and_set_bit(TASKLET_STATE_SCHED, &t->state))
        __tasklet_schedule(t);
}

void __tasklet_schedule(struct tasklet_struct *t)
{
        unsigned long flags;

        local_irq_save(flags);
        t->next = NULL;
        *__this_cpu_read(tasklet_vec.tail) = t;
        __this_cpu_write(tasklet_vec.tail, &(t->next));
        raise_softirq_irqoff(TASKLET_SOFTIRQ);
        local_irq_restore(flags);
}

As we can see it checks and set the state of the given tasklet to the TASKLET_STATE_SCHED and executes the __tasklet_schedule with the given tasklet. The __tasklet_schedule looks very similar on the raise_softirq function that we saw above. It saves the interrupt flag and disables interrupts at the beginning. After this it updates tasklet_vec with the new tasklet and calls the raise_softirq_irqoff function that we saw above. When the Linux kernel scheduler will decide to run deferred functions, the tasklet_action function will be called for deferred functions which are associated with the TASKLET_SOFTIRQ and tasklet_hi_action for deferred functions which are associated with the HI_SOFTIRQ. These functions are very similar and there is only one difference between they is that the tasklet_action uses tasklet_vec, but the tasklet_hi_action uses tasklet_hi_vec.

Let's look on the implementation of the tasklet_action function:

static void tasklet_action(struct softirq_action *a)
{
    local_irq_disable();
    list = __this_cpu_read(tasklet_vec.head);
    __this_cpu_write(tasklet_vec.head, NULL);
    __this_cpu_write(tasklet_vec.tail, this_cpu_ptr(&tasklet_vec.head));
    local_irq_enable();

    while (list) {
		if (tasklet_trylock(t)) {
	        t->func(t->data);
            tasklet_unlock(t);
	    }
		...
		...
		...
    }
}

In the beginning of the tasketl_action function we disable interrupts for the local processor with the help of the local_irq_disable macro (you can read about this macro in the second part of this chapter). In the next step we take a head of the list that contains tasklets with normal priority and set this percpu list to NULL because all tasklets must be executed in a generaly way. After this we enable interrupts for the local processor and go through the list of taklets in the loop. In the every iteration of the loop we call the tasklet_trylock function for the given tasklet that update state of the given tasklet on TASKLET_STATE_RUN:

static inline int tasklet_trylock(struct tasklet_struct *t)
{
    return !test_and_set_bit(TASKLET_STATE_RUN, &(t)->state);
}

If this operation was successful we execute tasklet's action (it was set in the tasklet_init) and call the tasklet_unlock function that clears tasklet's TASKLET_STATE_RUN state.

In generall that's all about tasklets concept. Of course this does not cover full tasklets, but I think that it is good point from where you can continue to learn this concept.

The tasklets are widely used concept in the Linux kernel, but as I wrote in the beginning of this part there is third mechanism for deferred functions - workqueue. In the next paragraph we will know what is it.

Workqueues

The workqueue is another concept for handling deferred functions. It is similar on tasklets, but has some differences. Workqueue functions run in the context of a kernel process, but tasklet functions run in software interrupt context. This means that workqueue functions must not be atomic as tasklet functions. Tasklets always run on the processor from which they were originally submitted. Workqueues work in the same way, but only by default. The workqueue concept represented by the:

struct worker_pool {
    spinlock_t              lock;
    int                     cpu;
    int                     node;
    int                     id;
    unsigned int            flags;

    struct list_head        worklist;
    int                     nr_workers;
...
...
...

structure that defined in the kernel/workqueue.c source code file in the Linux kernel. I will not write the source code of this structure here, because it has quite a bit of fileds, but we will consider some of their feilds.

In its most basic form, the work queue subsystem is an interface for creating kernel threads to handle work that is queued from elsewhere. All of these kernel threads are called - worker threads. The work queue are maintained by the work_struct that defined in the include/linux/workqueue.h. Let's look on this structure:

struct work_struct {
    atomic_long_t data;
    struct list_head entry;
    work_func_t func;
#ifdef CONFIG_LOCKDEP
    struct lockdep_map lockdep_map;
#endif
};

Here are two things that we are interested: func - the function that will be scheduled by the workqueue and the data - parameter of this function. The Linux kernel provides special per-cpu threads that are called - kworker:

systemd-cgls -k | grep kworker
├─    5 [kworker/0:0H]
├─   15 [kworker/1:0H]
├─   20 [kworker/2:0H]
├─   25 [kworker/3:0H]
├─   30 [kworker/4:0H]
...
...
...

This process can be used to schedule the deferred functions of the workqueues (as ksoftirqd for softirqs). Besides this we can create new separate worker thread for a workqueue. The Linux kernel provides following macros for the creation of workqueue:

#define DECLARE_WORK(n, f) \
    struct work_struct n = __WORK_INITIALIZER(n, f)

for static creation. It takes two parameters: name of the workqueue and the workqueue function. For creation of workqueue in runtime, we can use the:

#define INIT_WORK(_work, _func)       \
    __INIT_WORK((_work), (_func), 0)

#define __INIT_WORK(_work, _func, _onstack)                     \
    do {                                                        \
            __init_work((_work), _onstack);                     \
            (_work)->data = (atomic_long_t) WORK_DATA_INIT();   \
            INIT_LIST_HEAD(&(_work)->entry);                    \
             (_work)->func = (_func);                           \
    } while (0)

macro that takes work_struct structure that has to be created and the function to be scheduled in this workqueue. After a work was created with the one of these macros, we need to put it to the workqueue. We can do it with the help of the queue_work or the queue_delayed_work functions:

static inline bool queue_work(struct workqueue_struct *wq,
                              struct work_struct *work)
{
    return queue_work_on(WORK_CPU_UNBOUND, wq, work);
}

The queue_work function just calls the queue_work_on function that queue work on specific processor. Note that in our case we pass the WORK_STRUCT_PENDING_BIT to the queue_work_on function. It is a part of the enum that defined in the include/linux/workqueue.h and represents workqueue which are not bound to any specific processor. The queue_work_on function tests and set the WORK_STRUCT_PENDING_BIT bit of the given work and executes the __queue_work function with the workqueue for the given processor and given work:

__queue_work(cpu, wq, work);

The __queue_work function gets the work pool. Yes not workqueue, but work pool. Actually all works are placed not in the workqueue, but to the work pool that represented by the worker_pool structure in the Linux kernel. As you can see above, the workqueue_struct structure has the pwqs field which is list of worker_pools. When we create a workqueue, it stands out for each processor the pool_workqueue. Each pool_workqueue associated with worker_pool, which is allocated on the same processor and corresponds to the type of priority queue. Through them workqueue interacts with worker_pool. So in the __queue_work function we set the cpu to the current processor with the raw_smp_processor_id (you can find information about this marco in the fouth part of the Linux kernel initialization process chapter), geting the pool_workqueue for the given workqueue_struct and insert the given work to the given workqueue:

static void __queue_work(int cpu, struct workqueue_struct *wq,
                         struct work_struct *work)
{
...
...
...
if (req_cpu == WORK_CPU_UNBOUND)
    cpu = raw_smp_processor_id();

if (!(wq->flags & WQ_UNBOUND))
    pwq = per_cpu_ptr(wq->cpu_pwqs, cpu);
else
    pwq = unbound_pwq_by_node(wq, cpu_to_node(cpu));
...
...
...
insert_work(pwq, work, worklist, work_flags);

As we can create works and workqueue, we need to know when they are execute. As I already wrote, all works executed by the kernel thread. When this kernel thread scheduled, it starts to execute works from the given workqueue. Each worker thread executes a loop inside the worker_thread function. This thread makes many different things and part of these things are similar to that we saw before in this part. As it started to work, it removes all work_struct or works from its workqueue.

That's all.

Conclusion

It is the end of the ninth part of the Interrupts and Interrupt Handling chapter and we continued to dive into external hardware interrupts in this part. In the previous part we saw initialization of the IRQs and main irq_desc structure. In this part we saw three concepts: the softirq, tasklet and workqueue that are used for the deferred functions.

The next part will be last part of the Interrupts and Interrupt Handling chapter and we will look on the real hardware driver and will try to learn how it works with the interrupts subsystem.

If you will have any questions or suggestions write me a comment or ping me at twitter.

Please note that English is not my first language, And I am really sorry for any inconvenience. If you will find any mistakes please send me PR to linux-internals.