Last edited on 20231121.

A bootloader for CHERIoT

My previous notes on CHERIoT explained how to run programs on the Arty board. Unfortunately, testing a different program required re-synthesizing the bitfile each time, which could take more than 10 minutes.

Fortunately, David Chisnall has now provided a small bootloader that can load a hex file over UART.

These notes describe what the code does, as far as I understand it.

Below, I reproduce the boot.S code with my comments.

.include "assembly-helpers.s"

Directive for including assembly-helpers.s which contains multiple useful macros.

    .section .text, "ax", @progbits
.zero 0x80
    .globl start
    .p2align 2
    .type start,@function

Declare a .text section with the flags "ax", that is, allocatable and executable (and read-only, since the writable flag "w" is absent). .zero 0x80 fills the first 0x80 (128) bytes of the section with zeros, and .globl start makes the symbol start global in the symbol table. .p2align 2 aligns the current location to a 2^2 = 4-byte (32-bit) boundary. Finally, the last directive marks the symbol start as a function.
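To make the alignment arithmetic concrete, here is a small Python sketch of the rounding that .p2align performs on the current offset. The helper name p2align is mine, and the sketch assumes that nothing precedes .zero 0x80 in this file's .text section, so the offset at that point is 0x80.

```python
def p2align(offset: int, power: int) -> int:
    """Round offset up to the next multiple of 2**power."""
    alignment = 1 << power
    return (offset + alignment - 1) & ~(alignment - 1)

# After .zero 0x80 the offset is 0x80, which is already a multiple of 4,
# so under the assumption above .p2align 2 adds no padding before start.
print(hex(p2align(0x80, 2)))  # 0x80
print(hex(p2align(0x81, 2)))  # 0x84
```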

start:
	// ca0 (first argument) contains the read-write root
	cspecialr        ca0, mtdc

cspecialr cd, scr is an alias for cspecialrw cd, scr, c0. Thus, this instruction simply reads the special capability register mtdc and copies its value into ca0. This requires the PCC to have the permit_access_system_registers permission, which holds at reset since the PCC then contains the executable root capability. mtdc stands for machine trap data capability. More importantly, it contains the read-write memory root capability at reset. CHERIoT extends RISC-V's general-purpose registers to 65 bits (64 bits + 1 tag bit). The registers are referred to as c0, ..., c15, while their integer (address) parts retain the standard names x0, ..., x15. The RV32E ABI register names are reused, thus ca0 (a0) corresponds to capability register c10 (integer register x10).
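The naming scheme can be summarized with a small lookup. This is only an illustrative Python sketch: the table is the standard RV32E ABI naming, and the function register_names is a name I made up for this note.

```python
# Standard RV32E ABI names for x0..x15.
RV32E_ABI_NAMES = [
    "zero", "ra", "sp", "gp", "tp", "t0", "t1", "t2",
    "s0", "s1", "a0", "a1", "a2", "a3", "a4", "a5",
]

def register_names(abi_name: str) -> tuple:
    """Return (integer register, capability register, capability ABI name)."""
    index = RV32E_ABI_NAMES.index(abi_name)
    # The capability view of x0 is written cnull rather than czero.
    cap_abi = "cnull" if abi_name == "zero" else "c" + abi_name
    return (f"x{index}", f"c{index}", cap_abi)

print(register_names("a0"))  # ('x10', 'c10', 'ca0')
print(register_names("t0"))  # ('x5', 'c5', 'ct0')
```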

	// Zero the tag memory
	li               a1, 0x200fe000
	csetaddr         ca0, ca0, a1
	li               a1, 0x20100000
	cjal              zero_memory

li is a pseudo-instruction that stands for Load Immediate. This sequence writes the immediate 0x200fe000 into register a1 and sets the address of the capability in ca0 to it. 0x20100000 is then written to a1, and the routine zero_memory is called. Addresses 0x200fe000 to 0x20100000 correspond to the "shadow" memory used for the temporal safety mechanism, and the routine zeroes them.
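As a quick sanity check on the size of that region, the following Python sketch computes it from the two immediates in the listing. The constant names are mine. The word count assumes one 4-byte store per loop iteration, as in zero_memory.

```python
SHADOW_START = 0x200FE000  # first li immediate in the listing
SHADOW_END = 0x20100000    # second li immediate in the listing
WORD_BYTES = 4             # zero_memory stores one 32-bit word per iteration

size = SHADOW_END - SHADOW_START
words = size // WORD_BYTES
print(f"{size} bytes ({size // 1024} KiB), {words} word stores")
# prints: 8192 bytes (8 KiB), 2048 word stores
```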

	// No bounds on stack, grows down from the end of IRAM
	li               sp, 0x20080000
	csetaddr         csp, ca0, sp
	auipcc           cra, 0

We land back here when zero_memory returns. The first two instructions derive the stack capability csp from ca0, with its address set to 0x20080000, the end address of IRAM. Finally, auipcc cra, 0 copies the current PCC into the capability return address register cra.

	// Call the C++ entry point
	la_abs           t0, rom_loader_entry
	csetaddr         cra, cra, t0
	cjalr            cra

la_abs is one of the macros available in assembly-helpers.s and loads the absolute address of a symbol. Hence, the first instruction loads the address of rom_loader_entry (from boot.cc) into register t0 (i.e., x5). The address is then used to modify the capability in cra, which is then jumped to. More precisely, cjalr cra expands into cjalr cra, cra, which seals the return address into cra and replaces the PCC with the given capability. rom_loader_entry takes only one argument, which by convention is passed in ca0. The capability originally came from mtdc and has all the necessary permissions and bounds. Its address is not used, and is reset by rom_loader_entry.

	// Zero all of the memory that we haven't loaded into.
	cspecialr        ca1, mtdc
	csetaddr         ca0, ca1, a0
	li               a1, 0x20080000
	cjal              zero_memory

After returning, a0 contains the end address of the loaded code. The root capability is read again from mtdc, and memory is zeroed from that address up to 0x20080000, the end address of IRAM.
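The effect of this step can be modelled in Python on a byte array. This is only a sketch under assumptions: the array stands for the range from 0x20040000 (the load address used later in the listing) to the end of IRAM, the 1 KiB image and the 0xaa fill pattern are hypothetical, and zero_unloaded is a name I made up. It uses a slice assignment rather than the word-by-word loop of the real routine.

```python
LOAD_ADDRESS = 0x20040000  # where the loaded binary starts (from the listing)
IRAM_END = 0x20080000      # end address of IRAM (from the listing)

def zero_unloaded(memory: bytearray, loaded_end: int) -> int:
    """Zero everything from loaded_end up to IRAM_END; return the byte count."""
    first = loaded_end - LOAD_ADDRESS
    last = IRAM_END - LOAD_ADDRESS
    memory[first:last] = bytes(last - first)
    return last - first

# Hypothetical scenario: memory holds stale data, then a 1 KiB image is loaded.
iram = bytearray(b"\xaa" * (IRAM_END - LOAD_ADDRESS))
image = bytes(range(256)) * 4
iram[0:len(image)] = image

cleared = zero_unloaded(iram, LOAD_ADDRESS + len(image))
print(cleared)  # 261120 bytes zeroed, the image itself is untouched
```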

	// Jump to the newly loaded binary.
	// This could be a relative jump, but I'd need to get the relocations right
	// and we have 32 KiB of IROM so wasting a few bytes doesn't really matter.
	auipcc           cra, 0
	li               t0, 0x20040000
	csetaddr         cra, cra, t0
	cjr              cra

This builds an executable root capability with address 0x20040000, which is where rom_loader_entry started writing the loaded code. cjr cra expands into cjalr cnull, cra, that is, the return address is discarded and only a jump is executed.

zero_memory:
	csw              zero, 0(ca0)
	cincoffset       ca0, ca0, 4
	blt              a0, a1, zero_memory;
	cret

The routine writes a zero to memory at ca0.address and increments a0 by 4 bytes (32 bits). If a0 is less than a1, it continues zeroing memory; otherwise it returns. Thus, the routine zeroes all memory from a0 (ca0.address) up to, but not including, a1. This zeroes memory 32 bits at a time, but I wonder if it wouldn't be faster to zero 64 bits at a time using csc c0, 0(ca0) instead, since, as far as I understand, the null capability is untagged, so this would also remove tags from memory (which are not guaranteed to be zeroed at reset).
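To check my reading of the loop, here is a small Python model of it. It is a sketch, not the real thing: memory is a dictionary from word addresses to values, the stride parameter is my addition to compare 32-bit stores with the hypothetical 64-bit variant, and capabilities, bounds and tags are not modelled. Note that the store happens before the comparison, so the loop always writes at least once.

```python
def zero_memory(memory: dict, start: int, end: int, stride: int = 4):
    """Model of the loop: store, advance, then test. Returns (a0, stores)."""
    addr = start
    stores = 0
    while True:
        # csw zero, 0(ca0): one store covering `stride` bytes.
        for offset in range(0, stride, 4):
            memory[addr + offset] = 0
        stores += 1
        # cincoffset ca0, ca0, 4
        addr += stride
        # blt a0, a1, zero_memory (signed compare; the addresses used in
        # boot.S are positive as signed 32-bit values)
        if not addr < end:
            break
    return addr, stores

stale = {a: 0xDEADBEEF for a in range(0x100, 0x128, 4)}
print(zero_memory(dict(stale), 0x100, 0x120))            # (288, 8)
print(zero_memory(dict(stale), 0x100, 0x120, stride=8))  # (288, 4)
print(zero_memory(dict(stale), 0x120, 0x120))            # (292, 1)
```

In this model the 64-bit variant halves the number of loop iterations, and the last call shows that one word is written even when the range is empty.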