14/05/2025

Barium No-Boot (5): from UP to SMP: go multi-core!

Intro

Well, we have started our SoC, configured the necessary PLLs and clocks, set up a minimal set of hardware blocks (GPIO and UART), made it output information through a simple debug interface, enabled caches and learned how to set the ALU speed. In the first post of this series we reviewed the benefits of learning a modern ARMv8-A machine. Among others, we mentioned one significant thing to learn these days: multi-core ALUs. In this post we will start up all cores of our SoC (iMX8MP) and run independent code on each of them.

So what do we have at this moment, at Stage II? An educational platform that has entry points for working in C and assembly language (to learn the ARMv8 architecture) and works in single-core mode. Today we want to rework it into a multi-core platform which, in addition to meeting the same technical requirements (entry points for working in C and assembly language), will be able to execute separate code on each ALU core. We want to go SMP!

Redesign

We are talking about the SMP feature today, but the core start-up routine is small compared to the rework we have to do this time. Thus, we'll cover it later and start our lesson with an (almost complete) project redesign to support multi-core development, which is actually the biggest part of the work to go multi-core.

We will go from the top level to the bottom because that makes today's work easier to explain and understand. We'll redesign our main C-function (barium.c) and C-runtime (crt0_64.s), modify the startup code (start_64.s) and even the linker script (barium.lds).

C-code

Let's start from the top level: the organization of the payload C-code in our main C-file, barium.c. We want a multi-core platform that has an entry point in C for each ALU core (as well as in assembly language, but we'll cover this later). From the architecture point of view, we could implement a separate routine for each ALU, from barium_main_0() to barium_main_N(), and that would look pretty clear to a human. But this would look more like asymmetric multiprocessing (AMP) programming. Thus, we'll keep a single main function consisting of code common to all cores and code specific to the core executing the function. This way we get to feel what it is like to write code that is executed in parallel in a multi-core environment. Hence, barium_main() will stay in the project and get a new parameter, _CoreNo, to determine the number of the core it is running on.

When we were working in single-core mode, one ALU was running: it initialized the entire system and executed our payload code. That was the zeroth ALU. This time we'll leave this part of the project intact: the zeroth ALU will initialize the system, including a new step, starting the other ALUs. After that, just as at the previous stage, it will fall into the measurement loop. The third core will loop reading from and writing to the UART. The remaining two will fall into a dead loop at this stage.

Let's say we want each ALU to output information about itself, like our zeroth did at previous stages. It will be interesting to review our initialization routine and some aspects of the CPU start-up process. But we can expect a mess in the UART if we start all cores in parallel. And, believe me, we will have a mess of symbols in the UART. This is because we do not have any synchronization or locking mechanisms yet. So how should we output information about (and from) each ALU core when they all run this code almost simultaneously? We'll do some desynchronization. For that reason we will have an array of close but different values, one for each core: uint32_t desync_loops []:


   uint32_t desync_loops [] = { 0, 250000, 350000, 450000 };

These values are the numbers of empty loop iterations that each core performs before proceeding to its payload, specifically to outputting initial information via UART. There are no statistics or calculations behind those values; I just chose them by experiment. But this idea gives us correct initial output from all cores, from zeroth to third. Of course, each core produces output only once, except the third, which continues to use the UART in both directions in an infinite loop after outputting its initial information.
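
A minimal sketch of such a desynchronization delay. The helper name desync_spin(), the bounds check and the volatile counter are illustrative assumptions and are not taken from the repository; only the loop counts come from the array above:

```c
#include <stdint.h>

/* Per-core delay lengths, as listed in the post (found by experiment). */
static const uint32_t desync_loops[] = { 0, 250000, 350000, 450000 };

/* Hypothetical helper: spin for the number of empty iterations assigned
 * to the given core and return how many iterations were performed.
 * The counter is volatile so the compiler cannot remove the empty loop. */
uint32_t desync_spin(uint32_t core_no)
{
    volatile uint32_t i;

    if (core_no >= sizeof(desync_loops) / sizeof(desync_loops[0]))
        return 0;               /* unknown core: no delay */

    for (i = 0; i < desync_loops[core_no]; i++)
        ;                       /* empty loop body on purpose */

    return i;
}
```

Each core would call something like desync_spin(_CoreNo) once, before touching the UART, so that the initial outputs do not overlap.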

So, our barium_main() function performs the desynchronization loop and after that, in the case of the zeroth ALU core, executes the newly added init_board() routine, which initializes the hardware of our SoC (PLL, clocks, GPIO, UART), sets the ALU speed, outputs general information about Barium and starts the remaining three cores. We have a new routine for the last step:


    void start_cpu(uint32_t _Core, uintptr_t _EntryPoint);

And use it as shown below:


    start_cpu(1, (uintptr_t)_crt0_main);
    start_cpu(2, (uintptr_t)_crt0_main);
    start_cpu(3, (uintptr_t)_crt0_main);

Where the entry point, _crt0_main, is declared as:


    extern void _crt0_main(void);

After that comes the part of the code that is executed by all ALU cores (including the zeroth): outputting information about itself. And, finally, there are two core-specific parts: one for the zeroth core, which generates measurement impulses, and one for the third, a loop that outputs symbols read from the UART.
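
The flow described above can be sketched roughly as follows. This is only an illustration: the enum, dispatch_core() and the stub are hypothetical names, and the infinite loops of the real barium_main() are replaced by a returned role so that the sketch can be run on a host:

```c
#include <stdint.h>

/* Hypothetical names for what each core ends up doing. */
enum core_role { ROLE_MEASURE, ROLE_UART_ECHO, ROLE_IDLE };

/* Counts how often the (stubbed) board initialization ran. */
static int board_inits;

/* Stub standing in for init_board(): the real routine sets up PLLs,
 * clocks, GPIO and UART and starts cores 1 to 3. */
static void init_board_stub(void)
{
    board_inits++;
}

/* Sketch of the per-core dispatch inside a single main function:
 * core 0 initializes the board, then every core picks its final loop. */
enum core_role dispatch_core(uint32_t core_no)
{
    if (core_no == 0)
        init_board_stub();          /* only the zeroth core does this */

    /* Common part: every core would print its own information here. */

    switch (core_no) {
    case 0:  return ROLE_MEASURE;   /* GPIO measurement impulses */
    case 3:  return ROLE_UART_ECHO; /* read from and write to UART */
    default: return ROLE_IDLE;      /* cores 1 and 2: dead loop */
    }
}
```

The point of the design is that all cores enter the same function and diverge only on the core number, which is what distinguishes this from the barium_main_0() to barium_main_N() variant.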

That's all here. You can see the rest in the code. start_cpu() routine itself will be reviewed later.

C-runtime

Now let's dive a little deeper. The next level is the C-runtime, our crt0_64.S file. As far as machine code is concerned (the program itself and the process of writing it), all ALU cores of a homogeneous multi-core system (or all homogeneous ALU cores of a heterogeneous system) work in exactly the same way. Do you remember what the C-runtime is for? C-code uses a stack, and the C-runtime is there to set the Stack Pointer (SP). A stack is used by every separate execution flow, be it code running on a whole ALU core or a separate process, a so-called «task» or «thread», which runs in the time slot given to it. And every execution flow needs its own stack because each stack is used independently. SP is a per-core, so-called special-purpose register. That means that today we have to do the same thing in the C-runtime as we did at previous stages, namely prepare our system for C-code, but this time for each of its cores. Thus, we have to set the stack pointer for each ALU core. And, of course, this must be done in assembly language.

But how can we organize as many as four stacks? Let's just put the remaining three stacks below the zeroth. Let's say each ALU core has a stack of equal size. Hence, the formula for each SP is: Stack Base Top - (Stack Size * Core No.).
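
The formula can be checked on a host with a few lines of C. The function name core_stack_top() is hypothetical; the real calculation is done in assembly language in the C-runtime:

```c
#include <stdint.h>

/* SP = stack base top - (stack size * core number).
 * Core 0 gets the base top itself, every further core sits one
 * stack size lower than the previous one. */
uint64_t core_stack_top(uint64_t sp_base_top, uint64_t stack_size,
                        uint32_t core_no)
{
    return sp_base_top - stack_size * core_no;
}
```

For example, with a base top of 920000h and a stack size of 1000h, core_stack_top(0x920000, 0x1000, 3) gives 91D000h for the third core.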

You may have noticed that the extension of the crt0 assembly file is now capitalized: .S. A capitalized extension means that the file must be passed through the preprocessor before assembly. Yes, we want our code to be a little configurable via environment variables at this stage; in particular, we want the stack top and stack size to be configurable.

As shown above, we use _crt0_main as entry point for all our newly powered up cores. And here is our SP calculation routine:


    # Get core number from MPIDR and store it as third parameter:
    mrs x2, MPIDR_EL1
    and x2, x2, 0xFF
    
    # Set stack pointer (per core) saving it as second parameter to
    # barium_main function. Calculate stack pointer for current core.
    # SP = SP base top - (core stack size * core No.):
    mov x5, STACK_SIZE
    mov x6, SP_BASE_TOP
    mul x1, x2, x5
    sub x1, x6, x1

ARMv8 has a per-core system register called the Reset Vector Base Address Register (RVBAR). Let's pass it as the fourth parameter of barium_main() to examine later:


    # Store the RVBAR as fourth parameter:
    mrs x3, RVBAR_EL3

After setting the stack pointer we are ready for C-code and branch to our redesigned barium_main() routine. This is what our multi-core entry point in assembly language looks like. Here we can add assembly code that depends on the core number (the x2 register in our case), so different cores can perform different routines in assembly language too. That's all we have to do at this level.

Startup

Let's proceed to the lowest-level startup routine, the _start function. There is not much to rework here. This is the smallest part of the work: just enable the SMP feature of our ARMv8 cores by setting bit 6 of the CPU Extended Control Register:


    # Load CPU Extended Control Register:
    mrs x4, S3_1_C15_C2_1
    # Enable SMP:
    orr x4, x4, #(1 << 6)
    # Set CPU Extended Control Register:
    msr S3_1_C15_C2_1, x4

That's all we have to do at this level today. As far as the code is concerned, the project redesign is complete. But it is hard for a human to perceive the architecture of a multi-core application when it is described in text and written in code in a direct, single-flow manner. To give a better overview, here is an architecture scheme that shows the routines executed by each core and the order of those routines:

Barium multi-core architecture

Linker script

As we are passing the entry point address to our start_cpu() routine, we have to be able to get this address at runtime. Who (or what) is responsible for resolving addresses? It's the linker: it collects all symbols from all object files being linked and resolves all addresses in the resulting executable file. We are working on bare-metal, and each machine (a SoC and its BROM) loads the executable to a different address. Hence, we have to inform the linker about the memory layout of our application and, specifically, where the memory starts. This will be the base address of our .text section. We also have to describe this section. If we omit the memory layout description, the linker will assume that memory (and our code) starts from zero (which would be wrong on perhaps all modern machines) and the addresses will be resolved wrongly. As we want our project to be a little configurable, we will pass our linker script through the preprocessor (as we did with the C-runtime assembly language file). This lets us specify the memory start address via the environment. The memory start should always be equal to the address the BROM loads our image to. And we already have that variable in our environment: IMAGE_LOAD_ADDR. We used it to build the boot image header at previous stages. This time we will add a description of our memory (on-chip RAM) to the linker script:


    MEMORY
    {
        .OCRAM : ORIGIN = IMAGE_LOAD_ADDR,
        LENGTH = (576 * 1024)
    }

And describe the location of .text section:


    .text (READONLY) :
    {
        . = ALIGN(8);
        *(.text*)
    } >.OCRAM

The linker does not pass its script through the preprocessor, regardless of the file name. We have to do it ourselves. Thus, we rename our linker script to .LDS and pass this file through the preprocessor manually. Below is the Makefile rule that makes .lds from .LDS:


    $(PRODUCT).lds:
        @$(CPP) -P -DIMAGE_LOAD_ADDR=$(IMAGE_LOAD_ADDR) $(PRODUCT).LDS -o $(PRODUCT).lds

So now our project redesign is complete. Our build system compiles the source files, the linker resolves addresses into correct values, _start enables the SMP feature, _crt0_main sets a separate stack pointer for each of the four ALUs, and barium_main() initializes the SoC and starts all ALUs at their entry points. The new stage is set up and ready to be run.

ALU start-up

The only thing left to look at today is the most platform- and hardware-specific routine: starting the ALU cores. This is done in the gpc.c/gpc.h files (General Power Controller), where we perform the ALU start-up sequence. This routine is not architectural and is pretty straightforward, so we will not dwell on it. You'll see that the first step is to set the entry point address for the core. After that comes the reset and power-up routine. The lines of code have comments referencing datasheet page numbers.

This is how we go from UP (uniprocessor) to SMP (symmetric multiprocessing) on the iMX8MP SoC.

The results

It's time to play, get results, review them and draw conclusions. Let's see what we get:


    Barium No-Boot V0.3 (iMX8MP)
    Build: 03:22:03, May 13 2025
    Running at: 2200MHz

    ALU Core №: 0
    Reset VBAR: 0000000000000000
    Initial PC: 0000000000920020
    Initial SP: 0000000000920000
    Current SP: 000000000091FFC0

    ALU Core №: 1
    Reset VBAR: 0000000000920020
    Initial PC: 0000000000920020
    Initial SP: 000000000091F000
    Current SP: 000000000091EFC0

    ALU Core №: 2
    Reset VBAR: 0000000000920020
    Initial PC: 0000000000920020
    Initial SP: 000000000091E000
    Current SP: 000000000091DFC0

    ALU Core №: 3
    Reset VBAR: 0000000000920020
    Initial PC: 0000000000920020
    Initial SP: 000000000091D000
    Current SP: 000000000091CFC0

    Awaiting symbols from UART:

We see four similar outputs, one from each core. First, let's check how our stack pointers are calculated and set: 920000h for the zeroth core, 91F000h for the first, 91E000h for the second and 91D000h for the third. So, everything is fine here: our stack pointers are calculated correctly, going down from 920000h in steps of 1000h per core. Also notice that the last 12 bits of the current SP of each ALU are the same: FC0h. That's because we read the current SP in the same code on each ALU, and the «consumption» of stack up to the line where we read it (40h bytes) is identical for each core in our project. Thus, the memory map of our application (built with the default values of IMAGE_LOAD_ADDR and STACK_SIZE) looks like this:

Barium memory map

We've also moved the reading of the initial PC from _start to _crt0_main. That was done to be able to get it on all cores, because _start is executed by the zeroth core only and the remaining cores never run that code. So the initial PC is no longer the address of the first instruction of our application. That's why it is not equal to IMAGE_LOAD_ADDR anymore but to the address of the ALUs' entry point instead.

What else is remarkable here? We've added the _RVBAR parameter to our barium_main() function. You may have already noticed that cores one to three have the same Reset VBAR: 920020h. That is the actual address of our entry point (_crt0_main) for those cores. But the zeroth core has an RVBAR of all zeros. That is how hardware sets it for the zeroth core: it is the address where the BROM is located.

Another interesting point. You may have noticed that in previous posts, when we started and configured the zeroth ALU, the «Initial SP» we output was the stack pointer value before we set it, that is, the value set by the BROM. But this time we output the value we've calculated. You may be curious what the initial stack pointer values of the other ALUs are. I wondered about that too, and checked. The reason this functionality is omitted in this post is that the initial stack pointer values of the remaining ALUs are actually garbage.

That's all for today. You can clone the final repository from Barium No-Boot (iMX8MP) (see «Stage III» directory).

12/03/2025

Barium No-Boot (4): measurements, rev up iMX8MP

In previous posts we made a basic application that is minimal but sufficient for studying and playing, and that is able to output debug information. It sets a lot of PLLs and clocks except, maybe, the one most interesting for most of us: the ARM PLL, the PLL which controls the speed of the ALU. We will research it in this part. The announcement of this post is partial, and this is on purpose: it will be more interesting for you to read without knowing the complete plan of this post.

The research plan

We can't measure the ALU clock directly because it has no outputs outside of the SoC. So, how should we act if we want to know whether it has changed at all? We need to do something that depends on ALU speed and that we can measure. What could it be? UART? No. We configure the UART to output data at a particular baud rate, but that has nothing to do with ALU speed. UART configuration affects its TX and RX pins only. Thus, we need a different «interface» to the SoC that lets us measure its clock. Let's turn to theory for a moment. A so-called ALU «cycle» is one tick of the ALU clock, but a single clock tick is not equal to an instruction; an instruction may take from one to thousands of cycles. (Instructions that take zero cycles are out of the scope of this post.) But we know that any particular instruction takes the same number of cycles at any ALU speed. Thus, the real execution time of any block of instructions will depend directly on ALU speed. We just need some measurable points in some «reference» block of instructions. The most obvious block of instructions is an empty infinite loop, which translates into an unconditional branch back to itself, leaving even the link register intact. What could the rest of our setup be? The clearest way to measure something on a SoC is GPIO. Hence, in this post we will initialise the GPIO controller, one of its banks and one of the pads of this bank, and put a GPIO toggling function into an infinite loop. After that we'll be able to measure impulses with a scope, change the ARM PLL speed while leaving the reference code intact, and measure impulses again.

As we've reviewed the workflow on embedded systems without any BSP, bootloader or kernel (bare-metal) in previous posts, I'll not provide such details in this and further parts. You will see what is done and how in the code. Some code lines have comments referencing datasheet page numbers. If you ever need to change the GPIO number, for example, you can look in the datasheet around the page numbers that you'll see in the code.

The practice

Today we will add two files to our project: gpio.c and gpio.h. I have a handy GPIO on my board: GPIO number 16 on bank 4. So, we add GPIO initialisation and set its direction in the barium_main() function:


    /* Init GPIO and set 16th of fourth bank as output: */
    gpio_init();
    gpio_set_dir(GPIO_BANK4, 16, GPIO_OUT);

After that we add our measurement code — the reference code we have mentioned earlier:


    /* Generate impulses for measurements: */
    loop_meas:
        gpio_toggle_val(GPIO_BANK4, 16);
        for (lVal = 0; lVal < IMPULSE_LEN; lVal++);
        goto loop_meas;

IMPULSE_LEN can be any value, but I've chosen 1000000 to fit the loop to comfortable scope view settings.

Build, write to SD-card, connect scope, start the board and see:


Here we see that the period is 568ms wide. The period width is uninformative in itself. This is just how much time our SoC spends executing our reference loop at the default ARM PLL settings. From «PLL setting by ROM» we remember that the ARM PLL is set to 1GHz. Let's reconfigure it and see what measurements we get. The ARM cores of iMX8MP are driven by the so-called PLL1416x. I could not find any documentation describing it, so we will make do with the SoC's datasheet (ARM PLL), see which registers we need to write values to, and generate the exact values using the spreadsheet I've managed to create from information I've gathered here and there. You'll find the spreadsheet file in the repository and can play with it. The methodology is simple: you change the Main divider (the biggest number in the leftmost column) in steps of 25 and watch the «Clock (MHz)» column. The rest is done by the spreadsheet's formulas. The table displays the register value in hex and dec, and two more values, Pre div and Scaler. We can configure the ARM PLL by using three values (Main div, Pre div and Scaler) in a formula, or just write the exact value from the rightmost columns.

All the code is added to the clocks.c/clocks.h files. I use the three-value formula instead of an exact single value, because that is how I did it during my research. Let's rev our SoC up to 1.8GHz, its claimed maximum (p. 95). According to the spreadsheet we will use the coefficients 225, 3, 0:


raw_writel((225 << 12) | (3 << 4) | (0), CCM_ANALOG_ARM_PLL_FDIV_CTL0);
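
A small host-side sketch of what the spreadsheet computes. It assumes a 24MHz reference clock and the formula Fout = Fref * Main div / (Pre div * 2^Scaler); both are assumptions that happen to be consistent with the numbers in this post (225, 3, 0 gives 1800MHz, and a Main div step of 25 gives a 200MHz step), not facts taken from the datasheet. The function names are hypothetical:

```c
#include <stdint.h>

/* Assumption: 24 MHz reference clock feeding the ARM PLL. */
#define PLL_REF_MHZ 24u

/* Pack the three coefficients the same way the raw_writel() call does:
 * main divider from bit 12, pre-divider from bit 4, scaler from bit 0. */
uint32_t arm_pll_fdiv(uint32_t main_div, uint32_t pre_div, uint32_t scaler)
{
    return (main_div << 12) | (pre_div << 4) | scaler;
}

/* Assumed formula: Fout = Fref * main_div / (pre_div * 2^scaler). */
uint32_t arm_pll_mhz(uint32_t main_div, uint32_t pre_div, uint32_t scaler)
{
    return (PLL_REF_MHZ * main_div) / (pre_div << scaler);
}
```

With these assumptions, arm_pll_fdiv(225, 3, 0) yields the register value E1030h and arm_pll_mhz(225, 3, 0) yields 1800.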

Build, write to SD-card, start the board and see:

 

And what do we see here? 500ms. The period has narrowed, but not as much as we expected: 568ms at 1GHz should have become 568/1.8, that is, about 316ms at 1.8GHz! What is going on here? To make a long story short (this is still not the most interesting part of it), I measured different speeds and none of the periods turned out to be proportional to the ARM clock. A short summary is shown in the following table. I've also made a few calculations:

We see the clocks we set, the periods we measured and two rows with the values as they should be if the reference (the true value) were taken at the slowest or at the fastest clock. Whichever direction they are calculated in, the calculated values have a linear relation to clock speed. But the measured ones are non-linear and look barely related to clock speed at all. The table shows that something is really unclear or even wrong.
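
The «as it should be» values are simple scaling of a reference measurement. A minimal sketch of that calculation (the function name is hypothetical, and rounding to the nearest millisecond is my choice for the illustration):

```c
#include <stdint.h>

/* Expected period at new_mhz if the execution time of the reference
 * loop scaled exactly with the clock: period * ref_clock / new_clock,
 * rounded to the nearest millisecond. */
uint32_t expected_period_ms(uint32_t ref_period_ms, uint32_t ref_mhz,
                            uint32_t new_mhz)
{
    return (ref_period_ms * ref_mhz + new_mhz / 2) / new_mhz;
}
```

For the measurement above, expected_period_ms(568, 1000, 1800) gives 316, against the 500ms actually seen on the scope.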

What could go wrong? NXP's PLL1416x is undocumented, so we don't know exactly how it works, and we can't be sure what speeds it really sets. Another possible reason for such results is that our reference code was modified. But we can't examine the PLL and we are sure our code stays intact between builds. So let's try to find a different cause of such ARM ALU behavior. We remember that at this stage the ALU runs from OCRAM, the slowest of all RAMs. Let's assume that the nonlinearity between our clock speed and period measurements is caused by OCRAM. What can we do now? We can use caches. According to the specification, iMX8MP has 32kB of instruction cache and the same amount of data cache. Therefore our tiny 1-2kB application will fit in the SoC's cache entirely. Let's turn on both caches by adding this code to our start_64.s file:


    # Load System Control Register (EL3):
    mrs x3, SCTLR_EL3
    # Turn on Data Cache:
    orr x3, x3, #(1 << 2)
    # Turn on Instruction Cache:
    orr x3, x3, #(1 << 12)
    # Set System Control Register (EL3):
    msr SCTLR_EL3, x3

Let's check what we get on default 1GHz now:

The change turned out to be so dramatic that I had to retune my scope to get a pretty-looking picture! Now our reference code executes in 34ms at 1GHz instead of 568ms with caches turned off at the same ALU clock speed! And here is the new resulting table, showing that not only is everything faster, but it is also linear in both directions:

Now everything is in order and clear, and the hypothesis is confirmed: OCRAM timings were distorting the ARM ALU speed measurements. But there is more to explore. As I mentioned earlier, we have a tool for calculating ARM PLL speeds, the spreadsheet. And maybe you noticed that the picture above («iMX8MP ARM PLL Settings») shows some clock speeds far above the standard 1.8GHz. Yes, I have checked that, and here is the result I got:

The table contains columns for 1.2GHz and 1.6GHz, and this is on purpose. The thing is that NXP's documentation claims that even 1.8GHz is a so-called «overdrive» of this SoC's ALU. And my Linux system (NXP's BSP) confirms that: it works at two speeds only, 1.2GHz and 1.6GHz. I've measured both standard speeds and the highest speed I've managed to make the SoC work at, to show the difference between the clock speeds actually used and the maximum I got out of this SoC's ALU.

The result


    Barium No-Boot V0.2 (iMX8MP)
    Build: 22:00:00, Mar 12 2025
    Initial PC: 0000000000920000
    BootROM SP: 0000000000916ED0
    Current EL: 0000000000000003
    Running at: 2200MHz

In previous posts we were reading, planning, preparing and learning. Today we've learned how to use GPIO on this SoC and how to configure the ARM PLL speed. But today is a special day: we have a nice and exceptional result, a real practical one. We've significantly revved up iMX8MP. Disclosing undocumented features is one more benefit of bare-metal studying (or exploring).

You can clone the final repository from Barium No-Boot (iMX8MP) (see «Stage II» directory).

12/02/2025

The real «Hello World» from embedder (3): result, play — Barium No-Boot!

Finally...

Well, we've discussed the goals and benefits of bare-metal development, have gathered all the information we need to work on a specific SoC, and are ready to start writing code. In this post, we will practice. We will go through the whole process of bare-metal coding, compiling and linking and, finally, will get a real stand-alone application for a real, specific ARM64/ARMv8-A SoC. Can you imagine: it will be less than 1kB in size (including all necessary headers)! Still, this tiny application will perform the minimal task: outputting strings via UART. And even more, it will read from the UART and output those bytes! We will also add some functionality to our application, do some interesting stuff and perform some experiments.

The plan

The plan we need to carry out to reach the final result looks like:
1. Organise our code (.s and .c files).
2. CPU start code (assembly code).
3. Initialise the stack pointer (assembly code).
4. Initialise PLLs, clocks, pads, UART hardware-block (C code).
5. Write some payload code (write to and read from UART).
6. Compile and link our code into binary file.
7. Make our binary acceptable by our SoC as boot-code.
8. Put our boot-block in a proper place.
9. Configure the board for booting from media we need.
10. Power on and play!

Hands-on code

Let's get it started.

Organise our code — .s and .c files.

This is bare-metal, we don't have a bootloader and kernel behind us. Thus, there are some routines we have to do in assembly language prior to any C-code. So, we will have assembly files. Actually two. They can be combined into a single .s-file, but we will stick to tradition. These files (with their common names on 64-bit platforms) are start_64.s and crt0_64.s. But what exactly should be done in assembly files (language) on bare-metal? The answer is: the things we can't do in C. We can access registers of hardware blocks via volatile pointers, as this is done via so-called memory-mapped input/output. But we can't access ALU registers in C directly, thus the start_64.s file contains the very first code that cannot be written in C. Specifically, CPU start-up code: applying errata workarounds, switching modes and exception levels, setting up interrupt vector tables, etc. Now let's go on with crt0. You are probably familiar with it (or have at least heard of it): the so-called «C-runtime zero». In user-space, this is usually an object file which is implicitly added to our code by the linker ahead of the file containing the main() function written in a high-level language. In user-space, crt0 prepares argv and argc, which are passed to the developer's main() function, sets up pointers to the environment variables the code starts with, etc. As you can guess, this file is unnecessary if we are going to write in assembly language only. Its job is to prepare the environment (be it a user-space program or bare-metal code) for C-code, in our case to set the stack pointer to some valid address. After all necessary assembly code routines are done, we can break out to C-code. Today everything we've discussed in the previous post will be done in C-code. Assembly language files are for startup code and remain our window to the ARMv8 architecture, our sandbox, which is one of our goals.
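
The memory-mapped access via volatile pointers mentioned above can be sketched like this. The helper names mmio_write32() and mmio_read32() are hypothetical and are not the ones used in the repository:

```c
#include <stdint.h>

/* Write a 32-bit value to a memory-mapped register. The volatile
 * qualifier forces the compiler to perform the access exactly as
 * written instead of caching or dropping it. */
static inline void mmio_write32(uintptr_t addr, uint32_t val)
{
    *(volatile uint32_t *)addr = val;
}

/* Read a 32-bit value from a memory-mapped register. */
static inline uint32_t mmio_read32(uintptr_t addr)
{
    return *(volatile uint32_t *)addr;
}
```

On the target, addr would be a register address taken from the datasheet; on a host the same helpers can be tried against the address of an ordinary variable.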

CPU start code — assembly code.

CPU start code. In this example we don't need to do anything here, but let's save some values for further research; believe me, it'll be interesting to explore. According to the ABI, x0 will be the first parameter of the called function, x1 the second and x2 the third. Let's save the initial program counter, stack pointer and exception level while they are intact, at the very beginning of our code. We will save the initial PC, SP and EL by storing them in the x0, x1 and x2 registers. Later you will see how they get into our main C-function and are output to UART. So, start_64.s will look like this:


    .arch armv8-a
    .globl _start
    _start:
    # Store the address of the very first instruction. This will be the
    # address BROM puts our code to. According to ABI x0 will be the
    # first parameter of called function. Save current PC to x0:
    adr x0, .

    # Store initial stack pointer address as the second parameter:
    mov x1, sp

    # Save Current EL to x2, and it will get to third parameter:
    mrs x2, CurrentEL

    # Branch to crt0's main:
    b _main

Initialise the stack pointer — assembly code.

In our case, there is only one thing we have to do in crt0 to get prepared for C-code: set the stack pointer. The stack pointer is set by just writing a memory address value to the SP register. We want our x0, x1 and x2 registers to pass through this file to the C-function, so we avoid using these registers here. So, crt0_64.s will look like this:


    .arch armv8-a
    .globl _main
    _main:
    # Set stack pointer at the top of OCRAM 97FFFFh (p.706):
    # Move it to x3 without last Fh (aligned to 16):
    mov x3, 0xFFF0
    movk x3, 0x0097, LSL #16

    # Set stack pointer:
    mov sp, x3

    # Finally, break out to C-code:
    b barium_main

Initialise PLLs, clocks, pads, UART hardware-block — C code.

After that, we have to initialise PLLs, clocks, pads and UART controller. I suppose the way we should do this on our SoC is explained well enough in previous post, so I will not spend time on describing it further here. You will see it in the code.

Write some payload code — write to and read from UART.

Payload code. Same as in step 4: since we have initialised all the hardware and have some high-level functions, the code becomes self-explanatory. The only things I want to explain are uart_chars_counter and UART_FIFO_MAX. On each output we increment a special counter, and when it reaches the maximum length of the FIFO we wait for the real transfer to finish by waiting for the TXDC bit in the UART Status register.
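
The counting idea can be sketched as follows. This is not the code from the repository: the function name and its register-pointer parameters are hypothetical, and the FIFO depth and the TXDC bit position used here are placeholders for illustration, not values taken from the datasheet:

```c
#include <stdint.h>

/* Placeholder values, assumed for illustration only. */
#define UART_FIFO_MAX  32u
#define UART_TXDC_BIT  (1u << 3)

static uint32_t uart_chars_counter;

/* Hypothetical output routine: write one character to the TX register
 * and, once a FIFO's worth of characters has been queued, wait until
 * the status register reports that the transfer is complete. */
void uart_putc_sketch(volatile uint32_t *tx_reg,
                      volatile const uint32_t *status_reg, char c)
{
    *tx_reg = (uint32_t)(unsigned char)c;

    if (++uart_chars_counter >= UART_FIFO_MAX) {
        while ((*status_reg & UART_TXDC_BIT) == 0)
            ;                       /* busy-wait for TXDC */
        uart_chars_counter = 0;
    }
}
```

Waiting only once per FIFO's worth of characters keeps the common path short while still preventing the FIFO from overflowing.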

Compile and link our code into binary file.

Compile and link. Compilation and even cross-compilation is nothing new: we compile separate source files, link them into a single ELF-file, then dump its pure code into a binary file. You'll see a linker script in this repository. Actually, you can build the project without it. I've used it just to remove some unneeded sections that GCC adds to the binary. But we should bear in mind that this is bare-metal. That means that the machine starts from the very first instruction it finds in our code. Actually, BootROM branches to the address it loads our application to. Thus, we have to put our start function in the very first place. This is done by respecting the order of object files when linking the resulting code. That is, we put the file containing our first code first, and make the function we want the SoC to start with the first function in that file.

Make our binary acceptable by our SoC as boot-code.

Making the binary acceptable to a SoC is a very important step, but I'll not bother you with all those details. I've made a host tool named mkbb_imx8. It is derived from NXP's mkimage_imx8. It gathers some information, fills the necessary structures and writes them to a proprietary header, adding it to the specified binary. The only interesting thing here is the address to load the image to. This field tells BootROM where to put the loaded application (and jump to). In my tool it is defined as DEFAULT_LOAD_ADDR in the source code or can be passed as the third, optional parameter. The command line parameter takes priority over the define. If you ever need to develop such a tool for a particular SoC, you can follow the same way I did. No tricks here: all you need is to find the structure of the proprietary header for the specific SoC. Usually this information can be found in the vendor's BSP build system, in a host tool often named mkimage_XXX. You will see the mkbb_imx8 code in a separate directory in the repository.

Put our boot-block in a proper place.

The result of mkbb_imx8 is a binary file which will be accepted by the iMX8MP BootROM. The last thing we have to do on our host machine is to put our boot-block in a proper place. iMX8MP expects it at an offset of 32kB on the media. Thus, we will write our application to the SD-card with this command:


dd if=./barium of=/dev/sdX bs=1k seek=32 ; sync

I said it'll be less than 1kB in size. The code (with its data) is 954 bytes, and it appeared to be 1018 bytes with iMX headers. Maybe we've made one of the most beautiful kilobytes all over the world.

Configure the board for booting from media we need.

If you are reading this, perhaps, you have some board (proprietary or kit) with iMX8MP SoC. If so, you may already have it configured to boot from SD-card or already know how to do it. If not, you need to configure the board for booting from SD-card. From iMX8MP SoC perspective boot device selection is described in section 6.1.5 — «Boot devices (internal boot)» (p. 713). Refer to your product manual to see how it is done on a particular board.

Power on and play!

Insert SD-card into your board, connect your UART converter to UART2 interface of the board, start your favorite serial communication program and power on (or reset) the board. You'll see:


Barium No-Boot V0.1 (iMX8MP)
Build: 18:00:00, Feb 12 2025
Initial PC: 0000000000920000
BootROM SP: 0000000000916ED0
Current EL: 0000000000000003

What about the naming? «Barium» stands for «Bare», because barium, besides its consonance with «bare», is a metal element. Thus «bare-metal» is replaced with «Barium». «No-Boot» means it does not boot any kernel or anything at all.

Well, what do we see here? First, the initial PC is 920000h. This is our DEFAULT_LOAD_ADDR (in my Makefile the address is passed via a command line argument). You can play with it and set it here and there according to the OCRAM memory map (p. 707). The datasheet claims that the OCRAM free area starts from 918000h, but the lowest address I could put the application at is 920000h. BootROM does not allow putting it at 918000h.

But what is more interesting is the stack pointer BootROM leaves for us. BootROM does not allow us to place our application in its reserved area, but it leaves SP at 916ED0h. We can use this: leave SP intact and have a stack from 916ED0h down to 900000h. That is 16ED0h (93,904) bytes, which is plenty of space, and it lies in the area BootROM keeps off-limits for our code, so it would be unusable anyway.

And we see that we run in third exception level, EL3 — the highest one. We have carte blanche to do whatever we want on this machine.
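For reference, here is a minimal sketch of how such a value can be obtained. On AArch64 the CurrentEL system register keeps the exception level in bits [3:2], so the raw value at EL3 is Ch and has to be shifted to get the 3 printed above. The function names and the BARE_METAL_TARGET guard are hypothetical; the repository code may do this differently.

```c
#include <stdint.h>

/* CurrentEL keeps the exception level in bits [3:2]. */
static unsigned current_el_from_raw(uint64_t raw)
{
    return (unsigned)((raw >> 2) & 0x3u);
}

#if defined(__aarch64__) && defined(BARE_METAL_TARGET)
/*
 * Only for code running at EL1 or higher on the target:
 * CurrentEL is not accessible from EL0 (user space).
 * BARE_METAL_TARGET is a hypothetical build flag.
 */
static inline uint64_t read_current_el_raw(void)
{
    uint64_t value;
    __asm__ volatile("mrs %0, CurrentEL" : "=r"(value));
    return value;
}
#endif
```

The decoding part is plain C, so it can be checked on the host: a raw value of Ch decodes to 3 and 4h decodes to 1.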

What conclusion can be drawn from this? The First Platform Loader (BootROM) of iMX8MP leaves the SoC in a condition where we can omit any initialization that is usually done in assembly language: as we mentioned earlier, the stack pointer is already set to a valid address. Since setting the stack pointer is the only crucial thing we have to do in assembly code for such a simple example, we can skip this step by excluding all .s-files. Nothing else has to be changed, just don't forget to put the main .c-file (in our case barium.c) first in your objects list, and the main function (in our case barium_main()) must be the first function in this file. Keep in mind that this is bare-metal, and the first function in your code is the entry point regardless of what the ENTRY() linker directive points to. (You won't even see it in my .lds-file.) Although our goal was the exact opposite, the whole bare-metal project can work without a single .s-file on iMX8MP! You can try it. Thus, we can call iMX8MP a very high-level SoC, or some kind of «C-SoC».

You can clone the final repository from Barium No-Boot (iMX8MP) (see «Stage I» directory).

That's all. Finally we have a good window to learn ARMv8 aka ARM64 aka Aarch64 on a real hardware — we have assembly files and C-files to play with, we have build system for this project — Makefile, linker script, we have host tool to make proper boot image, and we have some kind of a debug interface — UART. That looks like sufficient setup for further studying.

16/01/2025

The real «Hello World» from embedder (2): practice, prepare

Between theory and code

In the previous post we've reviewed bare-metal development: what it is, its theory, its workflow, what information we need and where to get it, its benefits, bottlenecks and limitations. How a SoC is organised and how it works is something most of us know, one way or another, from theory (school or university) and/or practice (working at a high level or via some HAL). The next post will describe a process well known to all of us: coding. But what is between those areas of theory and practice? What lies between knowing how PLLs and clocks work, knowing that a UART is configured by writing some values to some registers, and the process of writing device-tree nodes and calling HAL functions? How does the magic of making a certain SoC function really arise? This post describes the process of preparing to code: the practice of getting information, gathering it and planning the work. Excuse me for not feeling sorry for you: not a single line of code will be written in this post, but I will describe this (middle) part of the job down to the single bit. This is to let you see clearly enough how it is done.

The plan

The plan we need to carry out to reach our goals looks like:
1. Choose the SoC we will work on and get its documentation.
2. Find out the condition BROM leaves SoC in, and see what is initialised for us and what is not.
3. Find out what exactly PLLs and clocks we have to configure to start up hardware-blocks.
4. Configure UART TX and RX pads.
5. Configure UART hardware-block.
6. Output some string.
7. Finish with infinite loop outputting characters received from UART.
8. Make proper boot image and put it in the place BROM of our SoC expects it to be.

The practice

Let's get it started.

1. Choose the SoC we will work on and get its documentation.

We will work on NXP i.MX 8M Plus (iMX8MP). This is a multi-core (mine is quad-core) ARM Cortex-A53 (ARMv8-A) SoC with an additional Cortex-M7 core. Just what we need, and interesting to play with. Remember and bear in mind what we were talking about in the previous post: bare-metal development is strongly tied to a certain SoC. Thus, if you are about to develop a stand-alone application for any other SoC, the practice we will do in this post is not for you. You can read it as an example of the workflow only. Once we have chosen the SoC, we download the datasheet describing it. In our case it is «i.MX 8M Plus Applications Processor Reference Manual» (IMX8MPRM.pdf, I have Rev. 3, 08/2024). And, to entertain you a little, here is the SoC itself:

NXP iMX 8M Plus

2. Find out the condition BROM leaves SoC in, and see what is initialised for us and what is not.

To find out the condition BROM leaves the SoC in, we look for the section that describes what BROM enables and what it does not. This is section 6.1.4.2 «Boot block activation» (p. 706). This section claims that BROM of iMX8MP activates (in addition to some others) these blocks: Clock Control Module (CCM) and Input/Output Multiplexer Controller (IOMUXC). We will boot from an SD-Card, thus the Ultra Secured Digital Host Controller (USDHC) will be enabled as well (but this is obvious). Let's proceed to the clocks BROM has initialised for us. In the next section, 6.1.4.3 «Clocks at boot time» (p. 707), table 6-3 «PLL setting by ROM» shows which PLLs are enabled: ARM PLL at 1GHz, System PLL1 at 800MHz and System PLL2 at 1GHz. Let's remember the PLLs we have enabled: System PLL1 and System PLL2. We don't care about ARM PLL at this point, as it drives the ALU only and is already enabled and configured by BROM. Proceed to Table 6-6 «CCGR setting by ROM» (p. 708) and see which clocks are enabled and which are not. Scrolling down to the UARTs (p. 710), we see that BROM enables none of them. iMX8MP boards usually use the second UART for debug, so let's remember that its clock number is 74.

3. Find out what exactly PLLs and clocks we have to configure to start up hardware-blocks.

From the previous paragraph we see that none of the UART hardware-blocks is enabled by BROM. Well, we have to find out how to enable one ourselves. Let's start with the CCM structure, which is described in section 5.1 «Clock Control Module (CCM)» (p. 227). Looking at Figure 5-1 «CCM Block Diagram» (p. 228) we see that clock ticks pass from clock generators (on the left side) via PLLs (or bypassing them), then to the CCM's Clock Root Generator, which has clock slices; the clock slices form clock roots, which finally come out to hardware-blocks (on the right side). Well, this scheme looks more complicated than the one we discussed in the previous post. The idea of Clock Roots becomes clearer if we look at 5.1.2 «Clock Root Selects» (p. 228). Let's scroll down to Slice Index №95 (p. 241). Slice 95 is the Clock Root of UART2. In the column «Source Select» we see that it can be driven by a few sources. We will drive our UART by SYSTEM_PLL2_DIV5. Let's remember its value: 010b. As we know already, System PLL2 is enabled at 1GHz. Here we need to configure its outputs and ensure that its DIV5 output is enabled (1GHz divided by 5 is 200MHz; we'll need this value later). This is done by configuring the System PLL2 General Function Control Register, which is described in section 5.1.8.32 «SYS PLL2 General Function Control Register» (p. 509). We will set all PLL_DIVx_CLKE bits and PLL_CLKE. The address of this register is ANAMIX base + 104h. After we have enabled the PLL outputs, we have to select the proper clock root for the UART2 hardware block. This is done by configuring CCM_TARGET_ROOT №95. It is described in section 5.1.7.10 «Target Register (CCM_TARGET_ROOTn)» (p. 412). We see that here we need to set enable (bit 28) to 1 and MUX (bits 24-26) to the value we remembered earlier, 010b. The address of this register is CCM base + 8000h + 95 (the slice index we need) * 80h. The resulting value we have to write to the register is 12000000h.
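The arithmetic of the last step can be written down as a short C sketch. The macro and function names are made up for illustration, and the CCM base address used here (30380000h) is an assumption that should be checked against the memory map in the reference manual; the slice index, bit positions and MUX value are the ones gathered above.

```c
#include <stdint.h>

#define CCM_BASE            0x30380000u /* assumption: verify in the memory map */
#define CCM_TARGET_ROOT(n)  (CCM_BASE + 0x8000u + (uint32_t)(n) * 0x80u)

#define UART2_CLK_ROOT      95u         /* slice index of the UART2 clock root */
#define TARGET_ROOT_ENABLE  (1u << 28)  /* enable bit */
#define TARGET_ROOT_MUX(m)  (((uint32_t)(m) & 0x7u) << 24) /* MUX, bits 24-26 */
#define MUX_SYS_PLL2_DIV5   0x2u        /* 010b */

/* Offset of CCM_TARGET_ROOT95 from the CCM base. */
static uint32_t uart2_target_root_offset(void)
{
    return CCM_TARGET_ROOT(UART2_CLK_ROOT) - CCM_BASE;
}

/* Value that enables the root and selects SYSTEM_PLL2_DIV5 as its source. */
static uint32_t uart2_target_root_value(void)
{
    return TARGET_ROOT_ENABLE | TARGET_ROOT_MUX(MUX_SYS_PLL2_DIV5);
}
```

The value comes out as 12000000h, matching the hand calculation, and the register sits at offset AF80h from the CCM base (8000h + 95 * 80h).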

4. Configure UART TX and RX pads.

Well, PLLs and clocks are configured and enabled. Now we have to find out how to configure the UART pads. First, let's set the proper AF for our UART. Alternative functions are described in table 8.1.1.1 «Muxing Options» (p. 1287). Let's scroll to UART2 (p. 1307). Here we see that the UART2_RX port can be routed to one of these pads: UART2_RXD, SD2_DATA0, SD1_DATA3 and SAI3_TXFS. The first one is what we need. The UART2_TX port can be routed to one of these pads: UART2_TXD, SD2_DATA1, SD1_DATA2 and SAI3_TXC. The first one is what we need. Both UART2_TXD and UART2_RXD have a mode called ALT0. Let's proceed to section 8.2.4 «IOMUXC Memory Map/Register Definition» (p. 1344). In this table we need to find our UART2_RXD and UART2_TXD; they are at the bottom of page 1350 and at the top of page 1351 respectively. Here we see that their reset values are both 5h (remember that value for a while) and the absolute addresses are 30330228h for UART2_RXD and 3033022Ch for UART2_TXD. Then click on the link in the right column. From section 8.2.4.134 «SW_MUX_CTL_PAD_UART2_RXD SW MUX Control Register» (p. 1540) and section 8.2.4.135 «SW_MUX_CTL_PAD_UART2_TXD SW MUX Control Register» (p. 1542) we see that MUX_MODE is represented by the lowest 3 bits of these registers. Also, we see that 5h (the value we've remembered recently) corresponds to 101b. That means that both pads we need are routed to functions we don't need, GPIO5 24 and GPIO5 25 in this case. Thus, we have to configure those pads correctly for our needs: set both to zero (ALT0). To achieve that, we need to write zeroes to 30330228h and 3033022Ch to set the proper alternative functions for those pads. But that's not all we have to do to make the UART pads function correctly. In addition to setting the AF, we need to configure the physical parameters of those pads. This is done by setting two SW_PAD_CTL registers: the UART2 RXD pad control register, section 8.2.4.286 «SW_PAD_CTL_PAD_UART2_RXD SW PAD Control Register» (p. 1837), and the UART2 TXD pad control register, section 8.2.4.287 «SW_PAD_CTL_PAD_UART2_TXD SW PAD Control Register» (p. 1839). After inspecting these descriptions, we conclude that zero is a good value for both of them. And there is one last step left. UART RX is a little special, because it works as an input function. Thus, we need to select the input for it. This is done by configuring the DAISY register, which is described in section 8.2.4.376 «UART2_UART_RXD_MUX_SELECT_INPUT DAISY Register» (p. 1922). Here we see that 110b «SELECT_UART2_RXD_ALT0 — Selecting Pad: UART2_RXD for Mode: ALT0» (p. 1923) is what we need. Thus, we'll write 6h to 303305F0h.
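The register writes collected in this step fit naturally into a small table. The sketch below is only one possible way to organise them, with hypothetical names (reg_write, uart2_pad_setup, uart2_pads_init); the code in the next post may look different. The two SW_PAD_CTL registers are left out on purpose, because their addresses are not quoted above; they would be two more rows with the value zero.

```c
#include <stddef.h>
#include <stdint.h>

struct reg_write {
    uint32_t addr;   /* absolute register address */
    uint32_t value;  /* value to write */
};

/* Addresses and values as gathered from the reference manual above. */
static const struct reg_write uart2_pad_setup[] = {
    { 0x30330228u, 0x0u }, /* SW_MUX_CTL_PAD_UART2_RXD: MUX_MODE = ALT0 */
    { 0x3033022Cu, 0x0u }, /* SW_MUX_CTL_PAD_UART2_TXD: MUX_MODE = ALT0 */
    { 0x303305F0u, 0x6u }, /* UART2_UART_RXD_MUX_SELECT_INPUT: 110b */
};

#define UART2_PAD_SETUP_COUNT \
    (sizeof(uart2_pad_setup) / sizeof(uart2_pad_setup[0]))

/* Plain 32-bit store to a memory-mapped register (target only). */
static inline void mmio_write32(uintptr_t addr, uint32_t value)
{
    *(volatile uint32_t *)addr = value;
}

/* Applies the table; must run on the SoC, never on a host machine. */
void uart2_pads_init(void)
{
    for (size_t i = 0; i < UART2_PAD_SETUP_COUNT; i++) {
        mmio_write32((uintptr_t)uart2_pad_setup[i].addr,
                     uart2_pad_setup[i].value);
    }
}
```

Keeping addresses and values in one table makes it easy to compare the code with the manual line by line.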

5. Configure UART hardware-block.

The last thing we have left is to configure the UART hardware-block. It is done by familiar steps: reading register values (optional), modifying those values (optional) or forming them from scratch, writing values to registers, and waiting for condition flags (optional). Actually, UART is a simple hardware-block, so I will not explain the specific process of configuring it. You will see it in the code, which will be presented in the next post.

6. Output some string.

After UART is configured and running, the game starts. Now we are prepared and ready to output some strings. This is also done by writing a byte to some address (UART register) and controlling TX empty flag to avoid buffer overrun. Here I'll skip detailed description of this process too — see it in the code.

7. Finish with infinite loop outputting characters received from UART.

We will finish with infinite loop outputting characters received from UART. This is done by controlling RX empty flag and reading received byte (UART register) when flag becomes unset.
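Steps 6 and 7 boil down to two polling helpers. The sketch below assumes the usual i.MX UART layout (URXD at offset 0h, UTXD at 40h, UTS at B4h, with TXEMPTY in bit 6 and RXEMPTY in bit 5 of UTS); treat these offsets and bits as assumptions to verify in the UART chapter of the reference manual. The function names are hypothetical and the real code in the next post may differ.

```c
#include <stdint.h>

/* Assumed i.MX UART register offsets and UTS flag bits. */
#define UART_URXD    0x00u
#define UART_UTXD    0x40u
#define UART_UTS     0xB4u
#define UTS_TXEMPTY  (1u << 6)
#define UTS_RXEMPTY  (1u << 5)

static inline uint32_t reg_read(volatile uint32_t *base, uint32_t off)
{
    return base[off / 4u];
}

static inline void reg_write(volatile uint32_t *base, uint32_t off, uint32_t v)
{
    base[off / 4u] = v;
}

/* Wait until the TX FIFO is empty, then send one byte. */
static void uart_putc(volatile uint32_t *base, char c)
{
    while (!(reg_read(base, UART_UTS) & UTS_TXEMPTY)) {
        /* busy-wait */
    }
    reg_write(base, UART_UTXD, (uint8_t)c);
}

/* Wait until the RX FIFO is no longer empty, then read one byte. */
static char uart_getc(volatile uint32_t *base)
{
    while (reg_read(base, UART_UTS) & UTS_RXEMPTY) {
        /* busy-wait */
    }
    return (char)(reg_read(base, UART_URXD) & 0xFFu);
}

/* Send a NUL-terminated string byte by byte. */
static void uart_puts(volatile uint32_t *base, const char *s)
{
    while (*s != '\0') {
        uart_putc(base, *s++);
    }
}
```

On the target, base would point to the UART2 register block, and the final echo loop is simply an endless loop that passes the result of uart_getc to uart_putc. Passing the base as a pointer also allows the helpers to be exercised on a host against a plain array standing in for the registers.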

8. Make proper boot image and put it in the place BROM of our SoC expects it to be.

To make our application load and run on the SoC we've chosen, we have to prepare a proper boot image and put it in the place the BROM of our SoC expects it to be. This is done by a tool (mkbb_imx8), which is derived from the standard NXP mkimage_imx8. I will not explain how it works or how it was developed, but in the next post I will show how to use it to generate a boot block for our SoC, and how and where to place it.

Conclusion

We've made it. Now we have gathered all the information we need to start writing code for the iMX8MP SoC and we are ready to proceed. And now you know what lies between the theory and the daily routine of an embedded developer. In the next post we will develop the application: we will write assembly and C code, compile it, link the objects into a binary, make a bootable image of it, and put it in the right place on the storage media. It'll be a very small program that will run on the iMX8MP and on this SoC only. But it will give us a platform for learning the ARM64 machine. You'll be able to play with the ARMv8 machine from the ground up, both in assembly language and in C-code: start (kick) or not start its cores, switch or not switch exception levels, output values of registers, and so on. Finally, we'll have a wide-open window to the ARM64 machine!

17/12/2024

The real «Hello World» from embedder (1): preface, theory

Preface

Any learning process eventually comes up to some examples, some practice. In the case of software development, practice starts from so-called «Hello World». «Hello World» is the very first code every software developer on every platform, API, or framework should do on his very first days or steps. Embedded software engineers are not an exception.

There are a lot of «Hello Worlds» out there on the Internet and in books. Those «Hello Worlds» come in all kinds of programming languages, APIs, and frameworks. But the vast majority of those «all kinds» are in pure software: Java, JS, C++, etc. Thus, embedded software developers, those of them who want to study machine architecture, get in touch and play with it, suffer from a lack of examples; they have no starting point.

You can argue with that — there are examples in assembly language for all types of architectures, assembly language should let us study machine architecture and its behaviour. Yes, there is a lot of examples in assembly language, but those examples are for very high (let's say — top) levels of runtime environment — Linux and Mac user-space. This approach to learning machine architecture doesn't lead to understanding of how it works and how it is organised.

Actually, it gives us a very narrow slit to machine architecture. Even in kernel-space, we are very limited in our abilities to learn the machine. That's because all the hardware initialisation is already done for (and before) us by the bootloader and the kernel itself. The second problem is the so-called «conceptions and abstractions of operating systems». Those «abstractions» hide a lot of the machine architecture from us, while «conceptions» give rise to a lot of software that serves the OS itself and makes it what it is from the user's point of view. Thus, even working in kernel-space, we learn more of those conceptions and principles than of the machine.

Let's go on and discuss our last chance: MCUs (Micro-Controller Units), which are cheap, popular, easy to start up and do let us be as close to the machine as possible. MCUs always had a crucial difference from CPUs and SoCs (System on Chip): they have no MMU and usually have a single-core ALU. After the beginning of the ARM64 era, the times when not only phones and tablets but also desktops (and even servers) and laptops are running on ARM64, MCUs eventually got one more drawback: ARM32 (the most common architecture among MCUs that are worth anything at all) became outdated even as a single-core platform for studying. Now we are living in the days when it's almost useless to learn the ARM32 architecture. It's more a waste of time, because the difference between the current and almost omnipresent ARM64 and the rapidly becoming obsolete ARM32 is enormous.

Thus, we come to that point where we need a comfortable, big enough window (let's say — better a door or a nice gate) to the machine that represents modern ARM64 architecture. Usually it is done by programming on an emulator. Developing for an emulator is like a game because of its abstraction. Games (and gaming) give you minimal or even zero risks by the price of insignificant wins. To get real valuable practice and results, we need a real «Hello World» on real hardware.

Learning a machine architecture (and its behaviour) is done by so-called «Bare Metal» programming. «Bare» means that you are working on a machine without an OS, without a kernel, and even without a bootloader. Actually, you develop code that works instead of a bootloader or runs where and when the very first part of user bootloader (SPL — Second Platform Loader) usually does. «Metal» stands for hardware or machine and means that you work on a real hardware — not a virtual machine or an emulator.

In these topics we'll cut a good window for studying modern ARM64 machine. This will be the «Hello World» from embedded engineer and a nice toy for embedded developer.

Theory

Let's start with a little overview of a computing machine, specifically about its heart — CPU or SoC. SoC provides a set of functions (functionalities); each functionality is provided by an according and separate hardware-block. Thus, SoC consists of hardware-blocks or represents a set of hardware-blocks. Each hardware-block is driven by a clock; each clock is derived from PLL, Phase Lock Loop — a small circuit that generates one or more frequencies from an input clock source. Each PLL is driven by an oscillator. So, turning a hardware-block on, besides powering it, is just enabling the clock it is connected to. Actually, you will not find such element like «clock» on your board. «Clock», as an element — is just a conception of embedded developers. Really, «clock» — is one of the outputs of PLL, which can be enabled or disabled by software. Real clocking sources are oscillators and PLLs. You can see clock generating scheme below.

Clock generating scheme

But enabling the clock is not enough to start using the functionality of a hardware-block. In most cases there are two more things. The first: SoCs usually provide more functionalities than they have pins or, as we more often say in embedded software, pads. This leads to the so-called «alternative function» (AF) conception. It means that some pads can, as we say in software theory, be configured to a specific (alternative) function, I2C SDA or UART TX for example. Configuring AFs is done by the Input/Output Multiplexing Controller (IOMUXC) or, as it is more commonly called, Pin Multiplexer (PINMUX). PINMUX is a built-in hardware-block, so it has to be turned on (clocked) too, like other hardware-blocks. From the hardware point of view, PINMUX simply routes signals from the SoC's pads to the inputs/outputs of the specified hardware-block: to the I2C bus controller or to the UART controller in the example above. In other words, PINMUX switches signals between hardware-blocks inside the SoC on one end and the SoC's balls (or pads) that we can see on the chip package on the other. You can see the alternative functions multiplexing scheme below.

Alternative functions multiplexing scheme

Second thing is configuration of hardware-block. This is done by code. After hardware-block is clocked (enabled) it stays in some default condition. We need to write specific values to specific registers to make it function according to our needs.

Clocks for hardware-blocks and corresponding PLLs are enabled by code. Hardware-blocks are configured by code too. But no program can run on a non-clocked CPU. So who and how starts the main PLLs and clocks that drive the ALU? This functionality is hardcoded into so-called BootROM, or BROM. BROM is microcode of SoC itself; it is located inside SoC. BROM initialises the essential hardware minimum required to start user machine code. It starts up a minimal amount of hardware-blocks and after that performs as FPL (see below). BROMs are very different in functionality — some of them just start code from built-in memory, not even loading it into RAM (like BROMs of MCUs do), some of them initialise a bunch of hardware-blocks and can even operate filesystems (like BROM of Raspberry's Broadcom does). The most common condition BROM leaves SoCs in is: one ALU core is started, SRAM (or OCRAM — OnChip RAM) initialised, and one of the boot sources initialised. Initialisation of the rest functionality is left for user code.

Any SoC uses its own, proprietary FPL (First Platform Loader). FPL is a part of BROM we've discussed earlier. To run our code on a SoC we need so-called «boot image», which is made out of our code. We need to say few words about contents of boot image. It is not a regular file we get from compiler. And the architecture mismatch is not the only reason. Usually compiler builds ELF (Executable and Linkable Format) file. ELF-file contains a lot of extra information which is used by OS or some other environment. Also ELF-file can include debug information, symbol names, etc. But while working on bare-metal we have to get rid of all OS-specific, debug and other environmental information. This is because SoC will execute raw code only and treat any binary data as straight, linear stream of code (with some addition of data). Bare-metal needs raw code to function correctly. If we try to load whole ELF-file as a bare-metal application, most probably we'll get some kind of illegal instruction exception because that extra information in ELF-file will not match with correct instructions codes. The process of making raw code, or raw binary from ELF-file is called stripping. Now, let's proceed with FPL. Any FPL expects boot-image in a specific format and in a specific place. Thus, after we've compiled our source code to (set of) object files, linked that objects into single binary, and stripped it down to pure machine code, we have to pack it and make it comply with specific requirements of a particular SoC and its BROM. This is done by a tool, which usually is named «mkimage», sometimes something like «mkimage_XXX», where XXX is replaced by the name of SoC or the name of a family of SoCs. Usually, mkimage adds some headers, which BROM reads, and sometimes CRCs, to machine code. BROM uses this information to verify the boot image. Then we have to put this image in a particular place — usually on a SD-Card or eMMC with some offset from zero. 
Offset is needed to make bootable media also usable as storage media — it leaves space for partition table and filesystems.
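A quick way to catch the mistake described above, an ELF file packed instead of raw code, is to look at the first four bytes of the image: every ELF file starts with the magic bytes 7Fh 'E' 'L' 'F'. The helper below is only an illustrative sketch with a hypothetical name; with GNU toolchains the conversion to a raw binary itself is typically done with objcopy -O binary.

```c
#include <stddef.h>
#include <string.h>

/*
 * Returns non-zero when the buffer starts with the ELF magic,
 * i.e. the file was not converted to a raw binary.
 */
static int looks_like_elf(const unsigned char *buf, size_t len)
{
    static const unsigned char magic[4] = { 0x7F, 'E', 'L', 'F' };

    return len >= sizeof magic && memcmp(buf, magic, sizeof magic) == 0;
}
```

A host tool could call such a check on its input before adding the boot header and refuse to continue when it returns non-zero.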

This is what we have about hardware. Let's proceed to software. There's not too much to do and discuss here. We can work in assembly language forever — this is interesting and may be useful. By limiting ourself to assembly language only we can omit using stack. But anyway, at some point we want to dive out into C-code. At this point we'll need stack because C-compiler will use it intensively. Thus, we have to initialise it — that's all we need to know at the moment. Initialisation of the stack is done by setting the stack pointer register to a value representing a valid memory address. Stack (most often) grows down, so we have to choose an address for it according to its behaviour — to prevent it from touching the bottom of RAM and from destroying our application in case it is set above. And, of course, we can't set it higher than the top of accessible RAM.
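The constraints on the stack address can be expressed as a small check. This is a sketch under stated assumptions: the function name and parameters are hypothetical, the stack is assumed to grow down from sp, and the 16-byte alignment test reflects what the AArch64 procedure call standard expects from SP.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * The stack occupies [sp - stack_size, sp) and grows down.
 * The application occupies [app_start, app_end).
 */
static bool stack_is_safe(uint64_t sp, uint64_t stack_size,
                          uint64_t ram_base, uint64_t ram_size,
                          uint64_t app_start, uint64_t app_end)
{
    uint64_t ram_top = ram_base + ram_size;

    if (sp & 0xFu) {
        return false;               /* SP must be 16-byte aligned */
    }
    if (sp > ram_top) {
        return false;               /* above the top of accessible RAM */
    }
    if (sp < ram_base || sp - ram_base < stack_size) {
        return false;               /* would run past the bottom of RAM */
    }

    uint64_t lowest = sp - stack_size;

    if (lowest < app_end && app_start < sp) {
        return false;               /* would overwrite the application */
    }
    return true;
}
```

Such a check never has to run on the target; it is handy as a host-side sanity test for the addresses chosen in a linker script or Makefile.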

Now let's proceed to our specific task. Let's say we want to make our SBC (Single-Board Computer) output a «Hello World». How can we do that? Let's confess that outputting (drawing) strings on a display is a little complicated task for a bare-metal beginner. So we'll output our «Hello World» via the most common debug interface all over the world — UART. UART functionality is provided by a particular hardware-block. So we need to start clocking that hardware-block, set its pads and configure it.

First, to enable a clock for UART we have to know which clock exactly we need to enable on the exact SoC. As we mentioned earlier, every clock is derived from a particular PLL. Thus, we need to find out what PLL provides the clock we need. All this information is presented in a datasheet and is rigidly tied to a certain SoC. The second part is to set AF — to configure pads for UART — its TX and RX. Here we need to turn on PINMUX (PLL, clock) and configure pads we need to functions we need. And the last part — configuration is done by writing values to a memory location mapped to an address of hardware-block — PINMUX (pads) and UART (baud rate, parity, etc.) in our case. What configuration must be written is described in a datasheet and, again, is rigidly tied to a certain SoC.

After writing this code we have to compile, link and strip the resulting file to raw machine code, make boot-image by forming out BROM header and adding it to machine code we've got on the previous stage. Then put our boot-image to specific place on a specific media and power on SBC.

Theory ends at this point. In the next part, we'll choose a specific SoC and gather the information needed to design the «Hello World» for it. With this information we will draw up the plan of steps for the third part.