Linux kernel memory management Part 1.

Introduction

Memory management is one of the most complex parts of an operating system kernel (and I think it is the most complex one). In the last preparations before the kernel entry point part we stopped right before the call to the start_kernel function. This function initializes all the kernel features (including architecture-dependent features) before the kernel runs the first init process. You may remember how we built early page tables, identity page tables and fixmap page tables at boot time. No complicated memory management is working at that point yet. Once the start_kernel function is called, we will see the transition to more complex data structures and techniques for memory management. To understand the initialization process of the Linux kernel well, we need a clear understanding of these techniques. This chapter provides an overview of the different parts of the Linux kernel memory management framework and its API, starting with memblock.

Memblock

Memblock is one of the methods of managing memory regions during the early bootstrap period, while the usual kernel memory allocators are not up and running yet. Previously it was called Logical Memory Block, but a patch by Yinghai Lu renamed it to memblock. The Linux kernel for the x86_64 architecture uses this method. We already met memblock in the Last preparations before the kernel entry point part, and now it's time to get better acquainted with it and see how it is implemented.

We will start learning about memblock from its data structures. Definitions of all the data structures can be found in the include/linux/memblock.h header file.

The first structure has the same name as this part and it is:

struct memblock {
         bool bottom_up;
         phys_addr_t current_limit;
         struct memblock_type memory;   /* array of memblock_region */
         struct memblock_type reserved; /* array of memblock_region */
#ifdef CONFIG_HAVE_MEMBLOCK_PHYS_MAP
         struct memblock_type physmem;
#endif
};

This structure contains five fields. The first is bottom_up, which enables allocating memory in bottom-up mode when it is true. The next field is current_limit, which describes the upper limit of the memory that may be allocated. The remaining three fields describe the types of memory: memory, reserved and, if the CONFIG_HAVE_MEMBLOCK_PHYS_MAP configuration option is enabled, physical memory. Here we meet yet another data structure, memblock_type. Let's look at its definition:

struct memblock_type {
	unsigned long cnt;
	unsigned long max;
	phys_addr_t total_size;
	struct memblock_region *regions;
};

This structure provides information about a memory type. It contains the number of memory regions inside the current memory block (cnt), the size of the allocated array of memory regions (max), the total size of all memory regions (total_size) and a pointer to the array of memblock_region structures (regions). memblock_region is a structure which describes a memory region. Its definition is:

struct memblock_region {
        phys_addr_t base;
        phys_addr_t size;
        unsigned long flags;
#ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
        int nid;
#endif
};

memblock_region provides the base address and size of the memory region as well as a flags field which can have the following values:

enum {
    MEMBLOCK_NONE	= 0x0,	/* No special request */
    MEMBLOCK_HOTPLUG	= 0x1,	/* hotpluggable region */
    MEMBLOCK_MIRROR	= 0x2,	/* mirrored region */
    MEMBLOCK_NOMAP	= 0x4,	/* don't add to kernel direct mapping */
};

memblock_region also provides an integer field, the NUMA node selector, if the CONFIG_HAVE_MEMBLOCK_NODE_MAP configuration option is enabled.
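Since the flag values are distinct bits, several of them can be combined in one flags field. The following is a minimal userspace sketch of that idea, not kernel code: struct demo_region and the demo_* helpers are hypothetical names introduced only for this illustration.

```c
#include <stdbool.h>

/* Flag values copied from the enum shown above. */
enum {
	MEMBLOCK_NONE    = 0x0,
	MEMBLOCK_HOTPLUG = 0x1,
	MEMBLOCK_MIRROR  = 0x2,
	MEMBLOCK_NOMAP   = 0x4,
};

/* Hypothetical reduced region type: only the flags field matters here. */
struct demo_region {
	unsigned long flags;
};

/* Set one or more flag bits without touching the others. */
static void demo_set_flag(struct demo_region *r, unsigned long flag)
{
	r->flags |= flag;
}

/* Clear one or more flag bits without touching the others. */
static void demo_clear_flag(struct demo_region *r, unsigned long flag)
{
	r->flags &= ~flag;
}

/* Test whether any of the given flag bits is set. */
static bool demo_has_flag(const struct demo_region *r, unsigned long flag)
{
	return (r->flags & flag) != 0;
}
```

A region marked with both MEMBLOCK_HOTPLUG and MEMBLOCK_NOMAP would then carry the value 0x5 in its flags field.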

Schematically we can imagine it as:

+---------------------------+   +---------------------------+
|         memblock          |   |                           |
|  _______________________  |   |                           |
| |        memory         | |   |       Array of the        |
| |      memblock_type    |-|-->|      memblock_region      |
| |_______________________| |   |                           |
|                           |   +---------------------------+
|  _______________________  |   +---------------------------+
| |       reserved        | |   |                           |
| |      memblock_type    |-|-->|       Array of the        |
| |_______________________| |   |      memblock_region      |
|                           |   |                           |
+---------------------------+   +---------------------------+

These three structures, memblock, memblock_type and memblock_region, are the core of Memblock. Now that we know about them, we can look at the Memblock initialization process.
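To make the relationship between the three structures more tangible, here is a rough userspace sketch. It assumes 64-bit physical addresses, leaves out the optional physmem and nid fields, and uses hypothetical names (DEMO_REGIONS, demo_memblock, demo_type_size) that do not exist in the kernel.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t phys_addr_t;	/* assumption: 64-bit physical addresses */

#define DEMO_REGIONS 4		/* hypothetical, much smaller than in the kernel */

struct memblock_region {
	phys_addr_t base;
	phys_addr_t size;
	unsigned long flags;
};

struct memblock_type {
	unsigned long cnt;
	unsigned long max;
	phys_addr_t total_size;
	struct memblock_region *regions;
};

struct memblock {
	bool bottom_up;
	phys_addr_t current_limit;
	struct memblock_type memory;
	struct memblock_type reserved;
};

/* Each memblock_type points at its own array of memblock_region. */
static struct memblock_region demo_memory_regions[DEMO_REGIONS];
static struct memblock_region demo_reserved_regions[DEMO_REGIONS];

static struct memblock demo_memblock = {
	.memory.regions   = demo_memory_regions,
	.memory.max       = DEMO_REGIONS,
	.reserved.regions = demo_reserved_regions,
	.reserved.max     = DEMO_REGIONS,
	.bottom_up        = false,
	.current_limit    = ~(phys_addr_t)0,
};

/* Walk the array behind one memblock_type and sum up the region sizes. */
static phys_addr_t demo_type_size(const struct memblock_type *type)
{
	phys_addr_t sum = 0;

	for (unsigned long i = 0; i < type->cnt; i++)
		sum += type->regions[i].size;
	return sum;
}
```

Everything that Memblock knows about memory is reachable from the single memblock variable: pick a type, then iterate over the first cnt entries of its regions array.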

Memblock initialization

The whole memblock API is declared in the include/linux/memblock.h header file, and all of these functions are implemented in the mm/memblock.c source code file. Let's look at the top of the source code file and we will see the initialization of the memblock structure:

struct memblock memblock __initdata_memblock = {
	.memory.regions		= memblock_memory_init_regions,
	.memory.cnt		    = 1,
	.memory.max		    = INIT_MEMBLOCK_REGIONS,

	.reserved.regions	= memblock_reserved_init_regions,
	.reserved.cnt		= 1,
	.reserved.max		= INIT_MEMBLOCK_REGIONS,

#ifdef CONFIG_HAVE_MEMBLOCK_PHYS_MAP
	.physmem.regions	= memblock_physmem_init_regions,
	.physmem.cnt		= 1,
	.physmem.max		= INIT_PHYSMEM_REGIONS,
#endif
	.bottom_up		    = false,
	.current_limit		= MEMBLOCK_ALLOC_ANYWHERE,
};

Here we can see the initialization of a variable of the memblock structure type which has the same name as the structure: memblock. First of all note __initdata_memblock. The definition of this macro looks like:

#ifdef CONFIG_ARCH_DISCARD_MEMBLOCK
    #define __init_memblock __meminit
    #define __initdata_memblock __meminitdata
#else
    #define __init_memblock
    #define __initdata_memblock
#endif

You can see that it depends on CONFIG_ARCH_DISCARD_MEMBLOCK. If this configuration option is enabled, memblock code will be put into the .init section and will be released after the kernel is booted up.

Next we can see the initialization of the memblock_type memory, memblock_type reserved and memblock_type physmem fields of the memblock structure. Here we are interested only in the memblock_type.regions initialization process. Note that every memblock_type field is initialized with an array of memblock_region structures:

static struct memblock_region memblock_memory_init_regions[INIT_MEMBLOCK_REGIONS] __initdata_memblock;
static struct memblock_region memblock_reserved_init_regions[INIT_MEMBLOCK_REGIONS] __initdata_memblock;
#ifdef CONFIG_HAVE_MEMBLOCK_PHYS_MAP
static struct memblock_region memblock_physmem_init_regions[INIT_PHYSMEM_REGIONS] __initdata_memblock;
#endif

The memory and reserved arrays each contain 128 memory regions (the physmem array is sized by a separate INIT_PHYSMEM_REGIONS macro). We can see it in the INIT_MEMBLOCK_REGIONS macro definition:

#define INIT_MEMBLOCK_REGIONS   128

Note that all arrays are also defined with the __initdata_memblock macro which we already saw in the memblock structure initialization (read above if you've forgotten).

The last two fields specify that bottom_up allocation is disabled and that the current limit of Memblock is:

#define MEMBLOCK_ALLOC_ANYWHERE (~(phys_addr_t)0)

which is 0xffffffffffffffff when phys_addr_t is 64 bits wide, as it is on x86_64.
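The value of this macro follows the width of phys_addr_t, because the bitwise NOT is applied after the cast. A small userspace sketch of this, where the two typedefs and function names are hypothetical stand-ins for a 64-bit and a 32-bit phys_addr_t:

```c
#include <stdint.h>

/* Assumption: fixed-width integers model the two possible phys_addr_t widths. */
typedef uint64_t demo_phys_addr64_t;
typedef uint32_t demo_phys_addr32_t;

/* All bits set in a 64-bit physical address type. */
static uint64_t demo_alloc_anywhere_64(void)
{
	return ~(demo_phys_addr64_t)0;
}

/* All bits set in a 32-bit physical address type, widened for comparison. */
static uint64_t demo_alloc_anywhere_32(void)
{
	return ~(demo_phys_addr32_t)0;
}
```

In both cases the result is the largest address the type can represent, so the limit places no restriction on where memory may be allocated.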

At this point the initialization of the memblock structure is finished and we can have a look at the Memblock API.

Memblock API

OK, we have finished with the initialization of the memblock structure and now we can look at the Memblock API and its implementation. As I said above, memblock is implemented entirely in mm/memblock.c. To understand how memblock works and how it is implemented, let's look at its usage first. There are a couple of places in the Linux kernel where memblock is used. For example, let's take the memblock_x86_fill function from arch/x86/kernel/e820.c. This function goes through the memory map provided by e820 and adds usable memory regions, including those reserved by the kernel, to the memblock with the memblock_add function. Since memblock_add is the first function we have met, let's start with it.

This function takes a physical base address and the size of the memory region as arguments and adds them to the memblock. The memblock_add function does not do anything special in its body, but just calls the:

memblock_add_range(&memblock.memory, base, size, MAX_NUMNODES, 0);

function. We pass the memory block type (memory), the physical base address and the size of the memory region, the maximum number of nodes, which is 1 if CONFIG_NODES_SHIFT is not set in the configuration file or 1 << CONFIG_NODES_SHIFT if it is set, and the flags. The memblock_add_range function adds a new memory region to the memory block. It starts by checking the size of the given region and just returns if it is zero. After this, memblock_add_range checks whether the memblock structure already holds memory regions of the given memblock_type. If there are none, we just fill a new memblock_region with the given values and return (we already saw the implementation of this in the First touch of the linux kernel memory manager framework). If the memblock_type is not empty, we start adding the new memory region to the memblock with the given memblock_type.

First of all we get the end of the memory region with the:

phys_addr_t end = base + memblock_cap_size(base, &size);

memblock_cap_size adjusts size so that base + size will not overflow. Its implementation is pretty easy:

static inline phys_addr_t memblock_cap_size(phys_addr_t base, phys_addr_t *size)
{
	return *size = min(*size, (phys_addr_t)ULLONG_MAX - base);
}

memblock_cap_size returns the new size, which is the smaller of the given size and ULLONG_MAX - base.
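To see the capping in action outside the kernel, here is a small userspace sketch. It assumes a 64-bit phys_addr_t and uses UINT64_MAX in place of ULLONG_MAX; demo_cap_size is a hypothetical name for this illustration.

```c
#include <stdint.h>

typedef uint64_t phys_addr_t;	/* assumption: 64-bit physical addresses */

/*
 * Userspace rendition of the helper above: shrink *size so that
 * base + *size cannot wrap around, and return the (possibly reduced) size.
 */
static phys_addr_t demo_cap_size(phys_addr_t base, phys_addr_t *size)
{
	phys_addr_t room = UINT64_MAX - base;	/* bytes left above base */

	if (*size > room)
		*size = room;
	return *size;
}
```

For a region that starts 0x10 bytes below the top of the address space, a requested size of 0x100 is reduced to 0x10, while a region that fits is left untouched.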

Now that we have the end address of the new memory region, memblock_add_range checks for overlap and merge conditions with the memory regions that have been added before. Insertion of the new memory region into the memblock consists of two steps:

  • Adding of non-overlapping parts of the new memory area as separate regions;
  • Merging of all neighboring regions.

We go through all the already stored memory regions and check for overlap with the new region:

	for (i = 0; i < type->cnt; i++) {
		struct memblock_region *rgn = &type->regions[i];
		phys_addr_t rbase = rgn->base;
		phys_addr_t rend = rbase + rgn->size;

		if (rbase >= end)
			break;
		if (rend <= base)
			continue;
		...
	}

If the new memory region does not overlap with any region already stored in the memblock, it is inserted as a whole. This is the first step: we only count how many new regions are needed, check whether they fit into the regions array and call memblock_double_array if they do not:

while (type->cnt + nr_new > type->max)
	if (memblock_double_array(type, obase, size) < 0)
		return -ENOMEM;
insert = true;
goto repeat;

memblock_double_array doubles the size of the given regions array. Then we set insert to true and go to the repeat label. In the second step, starting from the repeat label we go through the same loop and insert the current memory region into the memory block with the memblock_insert_region function:

	if (base < end) {
		nr_new++;
		if (insert)
			memblock_insert_region(type, i, base, end - base,
					       nid, flags);
	}

Since we set insert to true in the first step, memblock_insert_region is now called. memblock_insert_region has almost the same implementation that we saw when we inserted a new region into the empty memblock_type (see above). This function gets the region at the insertion index idx:

struct memblock_region *rgn = &type->regions[idx];

and shifts it and all the regions after it one slot up with memmove to make room for the new region:

memmove(rgn + 1, rgn, (type->cnt - idx) * sizeof(*rgn));

After this it fills the memblock_region fields of the new memory region (base, size, etc.) and increases the region count and the total size of the memblock_type. At the end of its execution, memblock_add_range calls memblock_merge_regions, which merges neighboring compatible regions.

In the second case the new memory region can overlap already stored regions. For example we already have region1 in the memblock:

0                    0x1000
+-----------------------+
|                       |
|                       |
|        region1        |
|                       |
|                       |
+-----------------------+

And now we want to add region2 to the memblock with the following base and end addresses:

0x100                 0x2000
+-----------------------+
|                       |
|                       |
|        region2        |
|                       |
|                       |
+-----------------------+

In this case we set the base address of the new memory region to the end address of the overlapped region with:

base = min(rend, end);

So it will be 0x1000 in our case. Then we insert it as we already did in the second step with:

if (base < end) {
	nr_new++;
	if (insert)
		memblock_insert_region(type, i, base, end - base, nid, flags);
}

In this case we insert only the non-overlapping higher portion of the new region (the lower portion is already covered by the overlapped memory region) and then merge the regions with memblock_merge_regions. As I said above, the memblock_merge_regions function merges neighboring compatible regions. It goes through all memory regions of the given memblock_type, takes two neighboring memory regions, type->regions[i] and type->regions[i + 1], and checks whether they cannot be merged: the end address of the first region is not equal to the base address of the second region, the regions belong to different nodes, or their flags differ. If any of these conditions holds, it moves on to the next pair:

while (i < type->cnt - 1) {
	struct memblock_region *this = &type->regions[i];
	struct memblock_region *next = &type->regions[i + 1];
	if (this->base + this->size != next->base ||
	    memblock_get_region_node(this) !=
	    memblock_get_region_node(next) ||
	    this->flags != next->flags) {
		BUG_ON(this->base + this->size > next->base);
		i++;
		continue;
	}

If none of these conditions is true, the two regions can be merged and we add the size of the next region to the size of the first region:

this->size += next->size;

As we update the size of the first memory region with the size of the next memory region, we move all memory regions which are after the (next) memory region one index backwards with the memmove function:

memmove(next, next + 1, (type->cnt - (i + 2)) * sizeof(*next));

The memmove here moves all regions which are located after the next region to the base address of the next region. In the end we just decrease the count of the memory regions which belong to the memblock_type:

type->cnt--;

After this we will get two memory regions merged into one:

0                                             0x2000
+------------------------------------------------+
|                                                |
|                                                |
|                   region1                      |
|                                                |
|                                                |
+------------------------------------------------+

To get there, we decreased the count of regions in the memblock of the given type, increased the size of the first region and shifted all regions located after the next region into its place.

That's all. This is the whole principle of the work of the memblock_add_range function.
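The whole procedure can be put together in a simplified userspace sketch. This is not the kernel implementation: it assumes 64-bit addresses, ignores nid and flags, and uses a hypothetical fixed-size array (MAX_REGIONS) that makes add_range fail instead of calling memblock_double_array. The names region_list, insert_at, merge_adjacent and add_range are invented for this illustration.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

typedef uint64_t phys_addr_t;	/* assumption: 64-bit physical addresses */
#define MAX_REGIONS 8		/* hypothetical fixed capacity */

struct region { phys_addr_t base, size; };
struct region_list { unsigned long cnt; struct region regions[MAX_REGIONS]; };

/* Open a slot at idx by shifting the tail one entry up, then fill it. */
static void insert_at(struct region_list *l, unsigned long idx,
		      phys_addr_t base, phys_addr_t size)
{
	struct region *rgn = &l->regions[idx];

	memmove(rgn + 1, rgn, (l->cnt - idx) * sizeof(*rgn));
	rgn->base = base;
	rgn->size = size;
	l->cnt++;
}

/* Merge every pair of neighbors where one ends exactly where the next begins. */
static void merge_adjacent(struct region_list *l)
{
	unsigned long i = 0;

	while (i + 1 < l->cnt) {
		struct region *cur = &l->regions[i];
		struct region *next = &l->regions[i + 1];

		if (cur->base + cur->size != next->base) {
			i++;
			continue;
		}
		cur->size += next->size;
		memmove(next, next + 1, (l->cnt - (i + 2)) * sizeof(*next));
		l->cnt--;
	}
}

/* Returns 0 on success, -1 if the fixed array cannot hold the new pieces. */
static int add_range(struct region_list *l, phys_addr_t base, phys_addr_t size)
{
	phys_addr_t obase = base, end;
	unsigned long i, nr_new;
	bool insert = false;

	if (size > UINT64_MAX - base)	/* same idea as memblock_cap_size */
		size = UINT64_MAX - base;
	if (!size)
		return 0;
	end = base + size;
repeat:
	base = obase;
	nr_new = 0;
	for (i = 0; i < l->cnt; i++) {
		phys_addr_t rbase = l->regions[i].base;
		phys_addr_t rend = rbase + l->regions[i].size;

		if (rbase >= end)
			break;
		if (rend <= base)
			continue;
		if (rbase > base) {	/* uncovered gap below this region */
			nr_new++;
			if (insert)
				insert_at(l, i++, base, rbase - base);
		}
		base = rend < end ? rend : end;	/* skip the covered part */
	}
	if (base < end) {		/* uncovered remainder at the top */
		nr_new++;
		if (insert)
			insert_at(l, i, base, end - base);
	}
	if (!insert) {			/* first pass only counted */
		if (l->cnt + nr_new > MAX_REGIONS)
			return -1;
		insert = true;
		goto repeat;
	}
	merge_adjacent(l);
	return 0;
}
```

With region1 covering 0 to 0x1000 already in the list, adding the range from 0x100 to 0x2000 inserts only the part from 0x1000 to 0x2000, and the merge step then leaves a single region from 0 to 0x2000, just like in the diagrams above.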

There is also the memblock_reserve function which does the same as memblock_add, but with one difference: it adds the region to memblock.reserved instead of memblock.memory.

Of course this is not the full API. Memblock provides APIs not only for adding memory and reserved memory regions, but also:

  • memblock_remove - removes memory region from memblock;
  • memblock_find_in_range - finds free area in given range;
  • memblock_free - releases memory region in memblock;
  • for_each_mem_range - iterates through memblock areas.

and many more....

Getting info about memory regions

Memblock also provides an API for getting information about allocated memory regions in the memblock. It is split into two parts:

  • get_allocated_memblock_memory_regions_info - getting info about memory regions;
  • get_allocated_memblock_reserved_regions_info - getting info about reserved regions.

Implementation of these functions is easy. Let's look at get_allocated_memblock_reserved_regions_info for example:

phys_addr_t __init_memblock get_allocated_memblock_reserved_regions_info(
					phys_addr_t *addr)
{
	if (memblock.reserved.regions == memblock_reserved_init_regions)
		return 0;

	*addr = __pa(memblock.reserved.regions);

	return PAGE_ALIGN(sizeof(struct memblock_region) *
			  memblock.reserved.max);
}

First of all this function checks whether memblock.reserved.regions still points to the static memblock_reserved_init_regions array. If it does, the array was never reallocated and we just return zero. Otherwise we write the physical address of the reserved memory regions array to the given address and return the page-aligned size of the allocated array. Note the PAGE_ALIGN macro used for the alignment. It depends on the page size:

#define PAGE_ALIGN(addr) ALIGN(addr, PAGE_SIZE)

The implementation of the get_allocated_memblock_memory_regions_info function is the same. There is only one difference: memblock_type.memory is used instead of memblock_type.reserved.
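The logic of both functions can be tried out in userspace with a sketch like the following. It assumes 4 KiB pages, returns an ordinary pointer instead of a physical address obtained with __pa, and uses hypothetical names (DEMO_PAGE_SIZE, DEMO_PAGE_ALIGN, demo_init_regions, demo_allocated_info) that are not part of the kernel.

```c
#include <stddef.h>
#include <stdint.h>

#define DEMO_PAGE_SIZE 4096UL	/* assumption: 4 KiB pages */
/* Round x up to the next multiple of the (power of two) page size. */
#define DEMO_PAGE_ALIGN(x) (((x) + DEMO_PAGE_SIZE - 1) & ~(DEMO_PAGE_SIZE - 1))
#define DEMO_INIT_REGIONS 128

struct demo_region { uint64_t base, size; unsigned long flags; };
struct demo_type { unsigned long cnt, max; struct demo_region *regions; };

/* Static array that every list starts with, like the *_init_regions arrays. */
static struct demo_region demo_init_regions[DEMO_INIT_REGIONS];

/*
 * Returns 0 while the list still uses the static array. Otherwise stores
 * the address of the current array in *addr and returns its size rounded
 * up to whole pages.
 */
static size_t demo_allocated_info(const struct demo_type *type,
				  const struct demo_region **addr)
{
	if (type->regions == demo_init_regions)
		return 0;

	*addr = type->regions;
	return DEMO_PAGE_ALIGN(sizeof(struct demo_region) * type->max);
}
```

The pointer comparison is the whole trick: as long as the regions pointer is equal to the address of the static array, nothing was allocated dynamically and there is nothing to report.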

Memblock debugging

There are many calls to memblock_dbg in the memblock implementation. If you pass the memblock=debug option on the kernel command line, these calls print their messages. Actually memblock_dbg is just a macro which expands to a conditional printk:

#define memblock_dbg(fmt, ...) \
         if (memblock_debug) printk(KERN_INFO pr_fmt(fmt), ##__VA_ARGS__)

For example you can see a call of this macro in the memblock_reserve function:

memblock_dbg("memblock_reserve: [%#016llx-%#016llx] flags %#02lx %pF\n",
		     (unsigned long long)base,
		     (unsigned long long)base + size - 1,
		     flags, (void *)_RET_IP_);

And you will see something like this:

[Image: kernel log with memblock debug output]
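The same pattern of a debug macro guarded by a runtime flag can be tried in userspace. The sketch below uses fprintf instead of printk, and demo_debug, demo_printed, demo_dbg and demo_reserve are hypothetical names. Unlike the bare if in the macro above, the body is wrapped in do { } while (0), a common way to keep such a macro safe inside an if/else.

```c
#include <stdio.h>

static int demo_debug;		/* hypothetical stand-in for memblock_debug */
static int demo_printed;	/* counts emitted messages, for illustration only */

/* Print only when debugging is enabled; behaves like a single statement. */
#define demo_dbg(...)						\
	do {							\
		if (demo_debug) {				\
			fprintf(stderr, __VA_ARGS__);		\
			demo_printed++;				\
		}						\
	} while (0)

/* Report a reserved range in a format similar to the kernel message. */
static void demo_reserve(unsigned long long base, unsigned long long size)
{
	demo_dbg("demo_reserve: [%#018llx-%#018llx]\n",
		 base, base + size - 1);
}
```

With demo_debug left at zero, demo_reserve prints nothing. After setting it to a non-zero value, every call writes one line to stderr.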

Memblock also has support in debugfs. If you run the kernel on an architecture other than x86, you can access:

  • /sys/kernel/debug/memblock/memory
  • /sys/kernel/debug/memblock/reserved
  • /sys/kernel/debug/memblock/physmem

to get a dump of the memblock contents.

Conclusion

This is the end of the first part about linux kernel memory management. If you have questions or suggestions, ping me on twitter 0xAX, drop me an email or just create an issue.

Please note that English is not my first language and I am really sorry for any inconvenience. If you found any mistakes please send me a PR to linux-insides.