Last edited on 20231121.
A bootloader for CHERIoT
My previous notes on CHERIoT explained how to run programs on the Arty board. Unfortunately, testing a different program required re-synthesizing the bitfile each time, which could take more than 10 minutes.
Fortunately, David Chisnall has now provided a small bootloader that can load a hex file over UART.
These notes try to describe what the code does from my understanding.
Below, I reproduce the code with my comments.
Directive for including a file which contains multiple useful macros.
.section .text, "ax", @progbits
This declares the .text section, which is read-only and executable. The section starts with
0x80 (128) bytes initialized to zero, and the symbol is then made visible
in the symbol table.
.p2align 2 aligns the current location to 2^2 = 4 bytes
(32 bits). Finally, the last directive marks the symbol as
being a function name.
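The alignment arithmetic behind .p2align can be sketched in a few lines of Python. This is only an illustration of the rounding rule; the function name p2align is my own, and the example locations are made up.

```python
def p2align(location, power):
    """Round location up to the next multiple of 2**power."""
    alignment = 1 << power
    return (location + alignment - 1) & ~(alignment - 1)

# After 0x80 bytes of zeros the location is already 4-byte aligned.
print(hex(p2align(0x80, 2)))  # 0x80
# A hypothetical unaligned location is rounded up.
print(hex(p2align(0x81, 2)))  # 0x84
```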
// ca0 (first argument) contains the read-write root
cspecialr ca0, mtdc
cspecialr cd, scr is an alias for
cspecialrw cd, scr, c0. Thus, this instruction simply reads the special register
mtdc and copies its value into register
ca0, assuming that the PCC has
permit_access_system_registers enabled (which holds at reset, since PCC contains
the executable root capability).
mtdc is a special capability register
that stands for machine trap data capability. More importantly, it contains the
memory root capability at reset. CHERIoT extends RISC-V's general
purpose registers to 65 bits (64 bits + 1 tag bit). The registers are
referred to as
c0, ..., c15, while their integer (address) parts retain the
x0, ..., x15 names. The RV32E ABI register names are reused; thus
ca0 (integer register a0) corresponds to capability register
c10 (integer register x10).
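As a quick reference, the naming scheme can be written down as a small lookup. This is a sketch based on the standard RV32E ABI names; the helper capability_name is hypothetical and only illustrates how the names line up.

```python
# RV32E integer register ABI names, indexed by register number (x0..x15).
ABI_NAMES = ["zero", "ra", "sp", "gp", "tp", "t0", "t1", "t2",
             "s0", "s1", "a0", "a1", "a2", "a3", "a4", "a5"]

def capability_name(abi_name):
    """Return (numeric capability name, ABI capability name)."""
    number = ABI_NAMES.index(abi_name)
    # Register 0 is the null capability; the others get a "c" prefix.
    abi_cap = "cnull" if abi_name == "zero" else "c" + abi_name
    return "c%d" % number, abi_cap

for name in ("a0", "sp", "ra", "t0"):
    print(name, capability_name(name))
```

For example, a0 maps to c10 / ca0 and t0 maps to c5 / ct0, which matches the registers used in the code below.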
// Zero the tag memory
li a1, 0x200fe000
csetaddr ca0, ca0, a1
li a1, 0x20100000
li is a pseudo-instruction that stands for Load Immediate; thus, this sequence
of code writes the immediate
0x200fe000 into register
a1, and sets the address
of the capability contained in
ca0 to it.
0x20100000 is then written to
a1 and the routine
zero_memory is called. Addresses
0x200fe000 to 0x20100000
correspond to the "shadow" memory used for the temporal safety mechanism
and are zeroed by the routine.
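To make the effect of csetaddr concrete, here is a toy model in Python. It is only a sketch: the Capability class, its fields, and the full-address-space bounds of memory_root are my own assumptions for illustration, and the model ignores permissions, sealing, and representability checks.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Capability:
    """Toy capability holding only the fields discussed in these notes."""
    tag: bool
    base: int
    top: int
    address: int

def csetaddr(cap, new_address):
    """Return a copy of cap whose address is new_address (32-bit wrap)."""
    return replace(cap, address=new_address & 0xFFFFFFFF)

# Hypothetical stand-in for the memory root read from mtdc.
memory_root = Capability(tag=True, base=0, top=1 << 32, address=0)

ca0 = csetaddr(memory_root, 0x200FE000)
print(hex(ca0.address))  # 0x200fe000
# Bounds and tag are unchanged; only the address moved.
print(ca0.base == memory_root.base and ca0.top == memory_root.top)  # True
```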
// No bounds on stack, grows down from the end of IRAM
li sp, 0x20080000
csetaddr csp, ca0, sp
auipcc cra, 0
We land back here when
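The li instruction loads 0x20080000, which is the end address of IRAM, into sp, and csetaddr then builds the stack pointer capability csp from ca0 with that address
(csp.address). Finally, the current PCC is saved into the capability return address register cra.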
// Call the C++ entry point
la_abs t0, rom_loader_entry
csetaddr cra, cra, t0
la_abs is one of the macros available in the included file;
it loads the absolute address of a symbol. Hence, the first instruction
loads the address of rom_loader_entry into t0
(x5). The address is then used to modify the capability
cra, which is then jumped to. More precisely,
cjalr cra expands into
cjalr cra, cra,
which seals the return address into
cra and replaces the PCC with the given capability.
rom_loader_entry takes only one argument, which is by convention passed in
ca0. The capability was originally from
mtdc and has all the
necessary permissions and bounds. Its address is not used, and is reset by
// Zero all of the memory that we haven't loaded into.
cspecialr ca1, mtdc
csetaddr ca0, ca1, a0
li a1, 0x20080000
a0 contains the end address of the region the code was loaded into,
and memory is zeroed from that address up to
0x20080000, which is the end address
of the IRAM.
// Jump to the newly loaded binary.
// This could be a relative jump, but I'd need to get the relocations right
// and we have 32 KiB of IROM so wasting a few bytes doesn't really matter.
auipcc cra, 0
li t0, 0x20040000
csetaddr cra, cra, t0
This sets an executable root capability with address
0x20040000, which is where
rom_loader_entry started writing the loaded code.
cjr cra expands into
cjalr cnull, cra; that is, the return address is ignored and only a jump is performed.
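The addresses that appear in the code can be collected into a small summary. The constants below are taken from the listing; the constant names, the helper names, and the example end address passed to bytes_to_zero are hypothetical.

```python
# Addresses as they appear in the bootloader code.
LOAD_ADDRESS = 0x20040000   # where the loaded binary starts
IRAM_END = 0x20080000       # end of IRAM, also the initial stack pointer
SHADOW_START = 0x200FE000   # start of the region zeroed first
SHADOW_END = 0x20100000     # end of that region

def size_kib(start, end):
    """Size of the half-open range [start, end) in KiB."""
    return (end - start) // 1024

def bytes_to_zero(loaded_end):
    """Bytes cleared by the second zero_memory call."""
    return IRAM_END - loaded_end

print(size_kib(LOAD_ADDRESS, IRAM_END))    # 256
print(size_kib(SHADOW_START, SHADOW_END))  # 8
# Hypothetical image of 0x1234 bytes loaded at LOAD_ADDRESS.
print(bytes_to_zero(LOAD_ADDRESS + 0x1234))  # 257484
```

So, under these numbers, at most 256 KiB are available between the load address and the end of IRAM, and the first zeroing pass covers 8 KiB.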
csw zero, 0(ca0)
cincoffset ca0, ca0, 4
blt a0, a1, zero_memory;
The routine writes a zero at
ca0.address in memory and increments
the address of ca0 by 4
bytes (32 bits). If
a0 is less than
a1, it continues zeroing memory;
otherwise it returns. Thus, the routine zeroes all memory between
a0 and a1. This zeroes memory 32 bits at a time, but I wonder if it wouldn't be
faster to zero 64 bits at a time using
csc c0, 0(ca0) instead since, as far
as I understand, the null capability is untagged, so this would also remove tags
from memory (which are not guaranteed to be zeroed at reset).
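The loop structure can be modelled in Python to count stores. This is a rough sketch, not the real routine: memory is a plain bytearray indexed from 0, the step parameter is my own addition to compare 4-byte and 8-byte strides, and the model says nothing about actual timing or about tags.

```python
def zero_memory(memory, start, end, step=4):
    """Store first, advance, then compare, like the assembly loop.

    Returns the number of stores performed.
    """
    address = start
    stores = 0
    while True:
        for i in range(step):          # one store of `step` bytes
            memory[address + i] = 0
        address += step                # cincoffset ca0, ca0, step
        stores += 1
        if not address < end:          # blt a0, a1, zero_memory
            break
    return stores

memory = bytearray(b"\xff" * 32)
print(zero_memory(memory, 8, 24))          # 4
print(memory[8:24] == bytes(16))           # True
print(zero_memory(bytearray(32), 8, 24, step=8))  # 2
```

Because the store happens before the comparison, the model always writes at least once, even when start equals end. For the 0x200fe000 to 0x20100000 range this gives 2048 stores with a 4-byte stride and 1024 with an 8-byte stride.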