# Tutorial 10 - DMA Memory

There's a secret I haven't told you! A certain part of our code doesn't work anymore since the [virtual memory](../0D_virtual_memory) tutorial. There is a regression that manifests in the `Videocore Mailbox` driver. It will only work until **paging and caching** is switched on. Afterwards, the `call()` method will fail.

Why is that? The reason is that in our code, the RPi's processor is sharing a `DRAM buffer` with the `Videocore` device. In other words, the concept of **shared memory** is used. Let's recall a simplified version of the protocol:

1. The RPi `CPU` checks the `STATUS` MMIO register of the `Videocore` to see if a message can be written.
2. If so, the `CPU` writes the address of the `DRAM buffer` in which the actual message is stored into the `Videocore`'s `WRITE` MMIO register.
3. The `CPU` checks the `STATUS` and `READ` MMIO registers to see if the `Videocore` has answered.
4. If so, the `CPU` checks the first `u32` word of the earlier provided `DRAM buffer` to see if the response is valid (the `Videocore` puts its answer into the same buffer in which the original request was stored. This is what is commonly called a `DMA` transaction).

At step **4**, things break. The reason is that code and **page tables** were set up in a way that the `DRAM buffer` used for message exchange between the `CPU` and the `Videocore` is attributed as _cacheable_.

So when the `CPU` is writing to the buffer, the contents might not get written back to `DRAM` in time before the notification of a new message is signaled to the `Videocore` via the `WRITE` MMIO register (which is correctly attributed as device memory in the page tables and hence not cached). Even if the contents would land in `DRAM` in time, the `Videocore`'s answer, which overwrites the same buffer, would not be reflected in the `CPU`'s cache, since there is no coherency mechanism in place between the two. The RPi `CPU` would read back the same values it put into the buffer itself when setting up the message, and not the `DRAM` content that contains the answer.

![DMA block diagram](../doc/dma_0.png)

The regression did not manifest yet because the Mailbox is only used before paging and caching is switched on, and never afterwards. However, now is a good time to fix this.

## An Allocator for DMA Memory

The first step is to introduce a region of _non-cacheable DRAM_ in the `KERNEL_VIRTUAL_LAYOUT` in `memory.rs`:

```rust
Descriptor {
    name: "DMA heap pool",
    virtual_range: || RangeInclusive::new(map::virt::DMA_HEAP_START, map::virt::DMA_HEAP_END),
    translation: Translation::Identity,
    attribute_fields: AttributeFields {
        mem_attributes: MemAttributes::NonCacheableDRAM,
        acc_perms: AccessPermissions::ReadWrite,
        execute_never: true,
    },
},
```

When you saw the inferior performance of non-cacheable mapped DRAM compared to cacheable DRAM in the [cache performance tutorial](../0E_cache_performance) earlier and asked yourself why anybody would ever want this: Exactly for the use-case at hand!

Theoretically, some linker hacks could be used to ensure that the `Videocore` is using a buffer that is statically linked to the DMA heap pool once paging and caching is turned on. However, in real-world kernels, it is common to frequently map/allocate and unmap/free chunks of `DMA` memory at runtime, for example in device drivers for DMA-capable devices. Hence, let's introduce an `allocator`.

### Bump Allocation

As always in the tutorials, a simple implementation is used for getting started with the basic concepts of a topic, and upgrades are introduced when they are needed.

In a `bump allocator`, when a request comes in, it always returns the next possible aligned region of its heap until it runs out of memory. What makes it really simple is that it doesn't provide means for freeing memory again. When no more memory is left, game over.

Conveniently enough, [Rust already provides memory allocation APIs](https://doc.rust-lang.org/alloc/alloc/index.html). There is an [Alloc](https://doc.rust-lang.org/alloc/alloc/trait.Alloc.html) and a [GlobalAlloc](https://doc.rust-lang.org/alloc/alloc/trait.GlobalAlloc.html) trait. The latter is intended for realizing a _default allocator_, meaning it would be the allocator used for any standard language constructs that automatically allocate something on the heap, for example a [Box](https://doc.rust-lang.org/alloc/boxed/index.html). There can only be one global allocator, so the tutorials will make use of it for cacheable DRAM later. Hence, for the DMA bump allocator, [Alloc](https://doc.rust-lang.org/alloc/alloc/trait.Alloc.html) will be used.

What is also really nice is that for both traits, only the `alloc()` method needs to be implemented. If this is done, you automatically get a bunch of additional default methods for free, e.g. `alloc_zeroed()`.

Here is the implementation in `memory/bump_allocator.rs`:

```rust
use core::alloc::{Alloc, AllocErr, Layout};
use core::ptr::NonNull;

pub struct BumpAllocator {
    next: usize,
    pool_end: usize,
    name: &'static str,
}

unsafe impl Alloc for BumpAllocator {
    unsafe fn alloc(&mut self, layout: Layout) -> Result<NonNull<u8>, AllocErr> {
        let start = crate::memory::aligned_addr_unchecked(self.next, layout.align());
        let end = start + layout.size();

        if end <= self.pool_end {
            self.next = end;

            println!(
                "[i] {}:\n      Allocated Addr {:#010X} Size {:#X}",
                self.name,
                start,
                layout.size()
            );

            Ok(NonNull::new_unchecked(start as *mut u8))
        } else {
            Err(AllocErr)
        }
    }

    // A bump allocator doesn't care
    unsafe fn dealloc(&mut self, _ptr: NonNull<u8>, _layout: Layout) {}
}
```
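The `aligned_addr_unchecked()` helper that `alloc()` calls is not shown in the excerpt. Here is a minimal sketch of what it could look like, under the assumption that alignments are powers of two; the actual implementation in the tutorial's `memory.rs` may differ:

```rust
/// Round `addr` up to the next multiple of `alignment`.
///
/// A sketch: assumes `alignment` is a power of two and does not check
/// for arithmetic overflow, hence the "unchecked" in the name.
pub const fn aligned_addr_unchecked(addr: usize, alignment: usize) -> usize {
    (addr + (alignment - 1)) & !(alignment - 1)
}
```

For example, with `alignment == 0x10`, an `addr` of `0x201` is rounded up to `0x210`, while an already aligned `0x200` is returned unchanged.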
The `alloc()` method returns a pointer to memory. However, it is safer to operate with [slices](https://doc.rust-lang.org/alloc/slice/index.html), since they are intrinsically bounds-checked. Therefore, the `BumpAllocator` gets an additional method called `alloc_slice_zeroed()`, which wraps around `alloc_zeroed()` provided by the `Alloc` trait and on success returns a `&'a mut [T]`.
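A sketch of how such a wrapper could look; the parameter names and the `Result<_, ()>` error type are assumptions for illustration, not verbatim tutorial code:

```rust
impl BumpAllocator {
    /// Allocate a zeroed slice of `count_of_items` items of type `T`.
    ///
    /// A sketch: wraps the `alloc_zeroed()` default method of the `Alloc`
    /// trait and converts the returned raw pointer into a bounds-checked
    /// slice of the requested length.
    pub fn alloc_slice_zeroed<'a, T>(
        &mut self,
        count_of_items: usize,
        alignment: usize,
    ) -> Result<&'a mut [T], ()> {
        let size_in_byte = count_of_items * core::mem::size_of::<T>();
        let layout = Layout::from_size_align(size_in_byte, alignment).map_err(|_| ())?;

        let ptr = unsafe { self.alloc_zeroed(layout) }.map_err(|_| ())?.as_ptr();

        Ok(unsafe { core::slice::from_raw_parts_mut(ptr as *mut T, count_of_items) })
    }
}
```

The `Videocore` driver shown below is a direct user of this method.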
### Global Instance

A global instance of the allocator is needed, and since its methods demand _mutable references_ to `self`, it is wrapped into a `NullLock`, which was introduced in the [last tutorial](../0F_globals_synchronization_println):

```rust
/// The global allocator for DMA-able memory. That is, memory which is tagged
/// non-cacheable in the page tables.
static DMA_ALLOCATOR: sync::NullLock<memory::BumpAllocator> =
    sync::NullLock::new(memory::BumpAllocator::new(
        memory::map::virt::DMA_HEAP_START as usize,
        memory::map::virt::DMA_HEAP_END as usize,
        "Global DMA Allocator",
    ));
```

## Videocore Driver

The `Videocore` driver has to be changed to use the allocator during instantiation, and in contrast to earlier, this can now fail:

```rust
let ret = crate::DMA_ALLOCATOR.lock(|d| d.alloc_slice_zeroed(MBOX_SIZE, MBOX_ALIGNMENT));

if ret.is_err() {
    return Err(());
}
```

## Reorg of the Kernel Init

Since the `Videocore` now depends on the `DMA Allocator`, its initialization must now happen _after_ the `MMU init`, which turns on **paging and caching**. This, in turn, means that the `PL011 UART`, which is used for printing and needs the `Videocore` for its setup, has to shift its init as well. So there is a lot of shuffling happening. In summary, the new init procedure would be:

1. GPIO
2. MMU
3. Videocore
4. PL011 UART

That is a bit unfortunate, because if anything goes wrong at `MMU` or `Videocore` init, we cannot print any fault info on the console. For this reason, the `MiniUart` from the earlier tutorials is revived, because it only needs the `GPIO` driver to set itself up. So here is the revamped init:

1. GPIO
2. MiniUart
3. MMU
4. Videocore
5. PL011 UART

Using this procedure, the `MiniUart` can report faults for any of the subsequent stages like `MMU` or `Videocore` init. If all is successful and the more capable `PL011 UART` comes online, we can let it conveniently replace the `MiniUart` through the `CONSOLE.replace_with()` scheme introduced in the [last tutorial](../0F_globals_synchronization_println).

### Make it Fault

If you feel curious and want to put all the theory to action, take a look at the code in `main.rs` for the DMA allocator instantiation and try the changes in the comments:

```rust
/// The global allocator for DMA-able memory. That is, memory which is tagged
/// non-cacheable in the page tables.
static DMA_ALLOCATOR: sync::NullLock<memory::BumpAllocator> =
    sync::NullLock::new(memory::BumpAllocator::new(
        memory::map::virt::DMA_HEAP_START as usize,
        memory::map::virt::DMA_HEAP_END as usize,
        "Global DMA Allocator",
        // Try the following arguments instead to see the PL011 UART init
        // fail. It will cause the allocator to use memory that is marked
        // cacheable and therefore not DMA-safe. The communication with the
        // Videocore will therefore fail.

        // 0x00600000 as usize,
        // 0x007FFFFF as usize,
        // "Global Non-DMA Allocator",
    ));
```

This might only work on the real HW and not in QEMU.

## QEMU

On the actual HW, it is possible to reprogram the same `GPIO` pins at runtime to use either the `MiniUart` or the `PL011`, and as a result the console output of both is sent through the same USB-serial dongle. This is transparent to the user. On QEMU, unfortunately, this multiplexing is not possible and two different virtual terminals must be used. As a result, you'll see that the QEMU output now looks a bit different and provides separate views for the two `UARTs`.

## Output

```console
ferris@box:~$ make raspboot

[0] MiniUart online.
[1] Press a key to continue booting... Greetings fellow Rustacean!
[2] MMU online.

[i] Kernel memory layout:
      0x00000000 - 0x0007FFFF | 512 KiB | C   RW PXN | Kernel stack
      0x00080000 - 0x00083FFF |  16 KiB | C   RO PX  | Kernel code and RO data
      0x00084000 - 0x0008700F |  12 KiB | C   RW PXN | Kernel data and BSS
      0x00200000 - 0x005FFFFF |   4 MiB | NC  RW PXN | DMA heap pool
      0x3F000000 - 0x3FFFFFFF |  16 MiB | Dev RW PXN | Device MMIO

[i] Global DMA Allocator:
      Allocated Addr 0x00200000 Size 0x90

[3] Videocore Mailbox set up (DMA mem heap allocation successful).
[4] PL011 UART online. Output switched to it.

$>
```