14/05/2025

Barium No-Boot (5): from UP to SMP: go multi-core!

Intro

Well, we have started our SoC, configured the necessary PLLs and clocks, set up a minimal set of hardware blocks (GPIO and UART), made it output information through a simple debug interface, enabled caches and learned how to set the ALU speed. In the first post of this series we reviewed the benefits of learning a modern ARMv8-A machine. Among others, we mentioned one important thing to learn these days: multi-core ALUs. In this post we will start up all cores of our SoC (iMX8MP) and run independent code on each of them.

So what do we have at this moment, at Stage II? An educational platform that has entry points for working in C and assembly language (to learn the ARMv8 architecture) and works in single-core mode. Today we want to rework it into a multi-core platform which, in addition to meeting the same technical requirements (entry points for working in C and assembly language), will be able to execute separate code on each ALU core. We want to go SMP!

Redesign

We are talking about the SMP feature today, but the core start-up routine is small compared to the rework we have to do this time. Thus, we'll cover it later and start our lesson with an (almost complete) project redesign for multi-core development, which is actually the biggest part of the work needed to go multi-core.

We will go from the top level to the bottom because that makes today's work easier to explain and understand. We'll redesign our main C function (barium.c) and C-runtime (crt0_64.s), modify the startup code (start_64.s) and even the linker script (barium.lds).

C-code

Let's start from the top level: the organization of the payload C-code in our main C-file, barium.c. We want a multi-core platform that has an entry point in C (as well as in assembly language, but we'll cover this later) for each ALU core. From the architecture point of view, we could implement a separate routine for each ALU, from barium_main_0() to barium_main_N(), and the code would look pretty clear to a human. But this would look more like asymmetric multiprocessing (AMP) programming. Thus, we'll keep a single main function consisting of parts common to all cores and parts specific to the core the function is executed by. This is to get a feel for writing code that is executed in parallel in a multi-core environment. Hence, barium_main() will stay in the project and get a new parameter, _CoreNo, which tells it the number of the core it is running on.

When we were working in single-core mode, one ALU was running: it initialized the entire system and executed our payload code. That was the zeroth ALU. This time we'll leave this part of the project intact: the zeroth ALU will initialize the system, including a new step, starting the other ALUs. After that, just like at the previous stage, it will fall into the measure loop. The third core will run a loop that reads from and writes to the UART. The remaining two will fall into a dead loop at this stage.

Let's say we want each ALU to output information about itself, like our zeroth did at previous stages. It will be interesting to review our initialization routine and some aspects of the CPU start-up process. But we can assume that we will get a mess in the UART if we start all cores in parallel. And, believe me, we will have a mess of symbols in the UART. This is because we do not have any synchronization or locking mechanisms yet. Hence, how should we output information about (and from) each ALU core that runs this code almost simultaneously? We'll do some desynchronization. For that reason we will have an array of close but different values, one for each core, uint32_t desync_loops []:


   uint32_t desync_loops [] = { 0, 250000, 350000, 450000 };

These values are the numbers of empty loop iterations which every core will perform before proceeding to its payload, specifically to outputting initial information via UART. There are no statistics or calculations behind those values: I just chose them by experiment. But this idea gives us correct initial output from all cores, from the zeroth to the third. Of course, the output is done once by each core, except the third, which continues to use the UART in both directions in an infinite loop after outputting its initial information.
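The delay itself can be sketched in a few lines of C. This is only an illustration of the idea, not the code from the repository: the helper names desync_delay() and desync_core() are made up here, and the volatile counter is one possible way to keep the compiler from optimizing the empty loop away.

```c
#include <assert.h>
#include <stdint.h>

/* Per-core iteration counts, as listed in the post. */
static const uint32_t desync_loops[] = { 0, 250000, 350000, 450000 };

/* Hypothetical helper: spin for `loops` empty iterations and return the
 * number of iterations performed. The counter is volatile so that the
 * compiler cannot remove the loop. */
uint32_t desync_delay(uint32_t loops)
{
    volatile uint32_t i;

    for (i = 0; i < loops; i++) {
        /* empty body: the loop only burns time */
    }
    return i;
}

/* Hypothetical helper: run the delay that belongs to the given core. */
uint32_t desync_core(uint32_t core_no)
{
    assert(core_no < sizeof(desync_loops) / sizeof(desync_loops[0]));
    return desync_delay(desync_loops[core_no]);
}
```

Each core would call something like desync_core(_CoreNo) once, at the very beginning of barium_main(). The real time each value buys depends on the core clock and the compiler output, which is why the numbers have to be found by experiment.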

So, our barium_main() function performs the desynchronization loop and after that, in the case of the zeroth ALU core, executes the newly added init_board() routine, which initializes the hardware of our SoC (PLL, clocks, GPIO, UART), sets the ALU speed, outputs general information about Barium and starts the remaining three cores. We have a new routine for the last step:


    void start_cpu(uint32_t _Core, uintptr_t _EntryPoint);

And use it as shown below:


    start_cpu(1, (uintptr_t)_crt0_main);
    start_cpu(2, (uintptr_t)_crt0_main);
    start_cpu(3, (uintptr_t)_crt0_main);

Where the entry point, _crt0_main, is declared as:


    extern void _crt0_main(void);

After that comes the part of the code which is executed by all ALU cores (including the zeroth): outputting information about the core itself. And, finally, there are two core-specific parts of code: one for the zeroth core, which generates measure impulses, and one for the third, a loop that outputs symbols read from the UART.
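To make the order of the common and per-core parts easier to see, here is a small host-side model of that flow. It does not touch any hardware: the step names and the steps_for_core() function are invented for this illustration, and the function only reports which parts of barium_main() a given core would pass through.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative step flags; these names are not from the Barium sources. */
enum step {
    STEP_DESYNC     = 1u << 0,  /* per-core delay loop (all cores)      */
    STEP_INIT_BOARD = 1u << 1,  /* init_board(), zeroth core only       */
    STEP_PRINT_INFO = 1u << 2,  /* print core information (all cores)   */
    STEP_MEASURE    = 1u << 3,  /* zeroth core: measure impulses        */
    STEP_UART_ECHO  = 1u << 4,  /* third core: echo symbols from UART   */
    STEP_DEAD_LOOP  = 1u << 5   /* first and second cores: dead loop    */
};

/* Return the set of steps the given core goes through, in the order
 * described in the text above. */
uint32_t steps_for_core(uint32_t core_no)
{
    uint32_t steps = STEP_DESYNC;

    assert(core_no < 4);

    if (core_no == 0)
        steps |= STEP_INIT_BOARD;   /* also starts cores 1..3 */

    steps |= STEP_PRINT_INFO;       /* common part */

    if (core_no == 0)
        steps |= STEP_MEASURE;
    else if (core_no == 3)
        steps |= STEP_UART_ECHO;
    else
        steps |= STEP_DEAD_LOOP;

    return steps;
}
```

In the real function each flag corresponds to a block of code guarded by a check of _CoreNo, and the last three steps never return.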

That's all here. You can see the rest in the code. The start_cpu() routine itself will be reviewed later.

C-runtime

Now let's dive a little deeper: the next level is the C-runtime, our crt0_64.S file. As far as machine code is concerned (the program itself and the process of writing it), all ALU cores of a homogeneous multi-core system (or all homogeneous ALU cores of a heterogeneous system) work in exactly the same way. Do you remember what the C-runtime is for? C-code uses a stack, and the C-runtime is there to set the Stack Pointer (SP). A stack is used by every separate execution flow, be it code running on a whole ALU core, or a separate process, a so-called «task» or «thread», which runs in a time slot given to it. And every execution flow needs a separate stack because it uses it independently. SP is a per-core, so-called special-purpose register. That means that today we have to do the same thing in the C-runtime as we did at previous stages: prepare our system, but in this case each of its cores, for C-code. Thus, we have to set the stack pointer for each ALU core. And, of course, this must be done in assembly language.

But how can we organize as many as four stacks? Let's just put the remaining three stacks below the zeroth one. Let's say each ALU core will have a stack of equal size. Hence, the formula for each SP will be: Stack Base Top - (Stack Size * Core No.).
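The formula is easy to try out on the host. In the sketch below the values of SP_BASE_TOP and STACK_SIZE are assumptions (920000h and 1000h, picked to match the output shown in the results section), and stack_top_for_core() is a name made up for this example.

```c
#include <assert.h>
#include <stdint.h>

#define NUM_CORES   4u
#define SP_BASE_TOP 0x920000u   /* assumed default: top of the zeroth stack */
#define STACK_SIZE  0x1000u     /* assumed default: stack size per core     */

/* SP = Stack Base Top - (Stack Size * Core No.) */
uint64_t stack_top_for_core(uint32_t core_no)
{
    assert(core_no < NUM_CORES);
    return (uint64_t)SP_BASE_TOP - (uint64_t)STACK_SIZE * core_no;
}
```

With these values the four stack tops come out as 920000h, 91F000h, 91E000h and 91D000h. Since a stack grows downwards, core N owns the STACK_SIZE bytes just below its own top, and the stacks do not overlap.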

You may have noticed that the extension of the crt0 assembly file is capitalized now: .S. A capitalized extension means that this file must be passed through the preprocessor before it is assembled. Yes, we want our code to be a little configurable via environment variables at this stage; in particular, we want the stack top and stack size to be configurable.

As shown above, we use _crt0_main as the entry point for all our newly powered-up cores. And here is our SP calculation routine:


    # Get core number from MPIDR and store it as third parameter:
    mrs x2, MPIDR_EL1
    and x2, x2, 0xFF
    
    # Set stack pointer (per core) saving it as second parameter (x1) to
    # barium_main function. Calculate stack pointer for current core.
    # SP = SP base top - (core stack size * core No.):
    mov x5, STACK_SIZE
    mov x6, SP_BASE_TOP
    mul x1, x2, x5
    sub x1, x6, x1

ARMv8 has a per-core system register called the Reset Vector Base Address Register (RVBAR). Let's pass it as the fourth parameter of barium_main() to examine later:


    # Store the RVBAR as fourth parameter:
    mrs x3, RVBAR_EL3

After setting the stack pointer we are ready for C-code and branch to our redesigned barium_main() routine. This is what our multi-core entry point looks like in assembly language. Since the core number is at hand (the x2 register in our case), we can now add assembly code that performs different routines depending on the core. That's all we have to do at this level.
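The masking step in _crt0_main deserves a short illustration. The low eight bits of MPIDR_EL1 hold affinity level 0, which is what the routine above uses as the core number, while other bits of the register are not zero (bit 31, for example, is reserved and reads as one). The C sketch below mirrors the and instruction; the function names and the sample register values are made up for the example.

```c
#include <assert.h>
#include <stdint.h>

/* Extract affinity level 0 (bits [7:0]) from an MPIDR_EL1 value,
 * the same operation as "and x2, x2, 0xFF" in _crt0_main. */
uint32_t core_no_from_mpidr(uint64_t mpidr)
{
    return (uint32_t)(mpidr & 0xFF);
}

/* Usage example with made-up register values: bit 31 is set in both,
 * so the raw value cannot be used as a core number without masking. */
void check_mpidr_examples(void)
{
    assert(core_no_from_mpidr(0x80000000ull) == 0);
    assert(core_no_from_mpidr(0x80000003ull) == 3);
}
```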

Startup

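Let's proceed to the lowest-level startup routine, the _start function. There is not much to rework here, it is the smallest part of the work: we just enable the SMP bit (bit 6) in the CPU Extended Control Register. Note that this register is implementation-defined: it belongs to the Cortex-A53 cores of our SoC rather than to the ARMv8 architecture itself: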


    # Load CPU Extended Control Register:
    mrs x4, S3_1_C15_C2_1
    # Enable SMP:
    orr x4, x4, #(1 << 6)
    # Set CPU Extended Control Register:
    msr S3_1_C15_C2_1, x4

That's all we have to do at this level today. As far as the code is concerned, the project redesign is complete. But it is hard for a human to perceive the architecture of a multi-core application when it is described in text and written in code in a linear, single-flow manner. To give a better overview, here is an architecture scheme which shows the routines executed by each core and their order:

Barium multi-core architecture

Linker script

As we are passing the entry point address to our start_cpu() routine, we have to be able to get this address at runtime. Who (or what) is responsible for resolving addresses? It's the linker: it collects all symbols from all object files being linked and resolves all addresses in the resulting executable file. We are working on bare metal, and each machine (a SoC and its BROM) loads the executable to a different address. Hence, we have to inform the linker about the memory layout of our application and, specifically, where the memory starts: this will be the base address of our .text section. Also, we have to describe this section. If we omit the memory layout description, the linker will assume that memory (and our code) starts at zero (which would be wrong on probably all modern machines) and the addresses will be resolved wrongly. As we want our project to be a little configurable, we will pass our linker script through the preprocessor (as we did with the C-runtime assembly file). This lets us specify the memory start address via the environment. The memory start must always be equal to the address the BROM loads our image to. And we already have that variable in our environment: IMAGE_LOAD_ADDR. We used it to build the boot image header at previous stages. This time we will add a description of our memory (on-chip RAM) to the linker script:


    MEMORY
    {
        .OCRAM : ORIGIN = IMAGE_LOAD_ADDR,
        LENGTH = (576 * 1024)
    }

And describe the location of .text section:


    .text (READONLY) :
    {
        . = ALIGN(8);
        *(.text*)
    } >.OCRAM

The linker does not pass its script through the preprocessor, regardless of the file name. We have to do it ourselves. Thus, we rename our linker script to .LDS and pass this file through the preprocessor manually. Below is the Makefile rule that makes .lds from .LDS:


    $(PRODUCT).lds: $(PRODUCT).LDS
        @$(CPP) -P -DIMAGE_LOAD_ADDR=$(IMAGE_LOAD_ADDR) $(PRODUCT).LDS -o $(PRODUCT).lds

So now our project redesign is complete. Our build system compiles the files, the linker resolves addresses into correct values, _start enables the SMP feature, _crt0_main sets a separate stack pointer for each of the four ALU cores, and barium_main() initializes the SoC and starts the remaining ALUs at their entry point. The new stage is set up and ready to be run.

ALU start-up

The only thing left to look at today is the most platform- and hardware-specific routine: starting the ALU cores. This is done in the gpc.c/gpc.h files (General Power Controller), where we perform the ALU start-up sequence. This routine is not architectural and is pretty straightforward, so we'll not spend much time on it. You'll see that the first step is to set the entry point address for the core. After that comes the reset and power-up routine. The lines of code have comments referencing page numbers of the datasheet.

This is how we go from UP (uniprocessor) to SMP (symmetric multiprocessing) on the iMX8MP SoC.

The results

It's time to play, get results, review them and draw conclusions. Let's see what we get:


    Barium No-Boot V0.3 (iMX8MP)
    Build: 03:22:03, May 13 2025
    Running at: 2200MHz

    ALU Core №: 0
    Reset VBAR: 0000000000000000
    Initial PC: 0000000000920020
    Initial SP: 0000000000920000
    Current SP: 000000000091FFC0

    ALU Core №: 1
    Reset VBAR: 0000000000920020
    Initial PC: 0000000000920020
    Initial SP: 000000000091F000
    Current SP: 000000000091EFC0

    ALU Core №: 2
    Reset VBAR: 0000000000920020
    Initial PC: 0000000000920020
    Initial SP: 000000000091E000
    Current SP: 000000000091DFC0

    ALU Core №: 3
    Reset VBAR: 0000000000920020
    Initial PC: 0000000000920020
    Initial SP: 000000000091D000
    Current SP: 000000000091CFC0

    Awaiting symbols from UART:

We see four similar outputs, one from each core. First, let's check how our stack pointers are calculated and set: 920000h for the zeroth core, 91F000h for the first, 91E000h for the second and 91D000h for the third. So, everything is fine here: our stack pointers are calculated correctly, going down from 920000h in steps of 1000h per core. Also notice that the last 12 bits of the current SP of each ALU are the same: FC0h. That's because we read the current SP in the same code for each ALU, and the «consumption», or use, of the stack (up to the line where we read it) is identical for each core in our project. Thus, the memory map of our application looks like this (built with the default values of IMAGE_LOAD_ADDR and STACK_SIZE):

Barium memory map
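The stack figures from the output above can be double-checked with a few lines of C. The table below simply copies the «Initial SP» and «Current SP» values printed by each core; the structure and function names are made up for this check.

```c
#include <assert.h>
#include <stdint.h>

struct sp_sample {
    uint64_t initial_sp;   /* value calculated in _crt0_main       */
    uint64_t current_sp;   /* value read later, inside the C-code  */
};

/* Values copied from the UART output, cores 0..3. */
const struct sp_sample sp_samples[4] = {
    { 0x920000, 0x91FFC0 },
    { 0x91F000, 0x91EFC0 },
    { 0x91E000, 0x91DFC0 },
    { 0x91D000, 0x91CFC0 },
};

/* Bytes of stack consumed up to the point where the current SP is read. */
uint64_t stack_used(uint32_t core_no)
{
    assert(core_no < 4);
    return sp_samples[core_no].initial_sp - sp_samples[core_no].current_sp;
}

/* Low 12 bits of the current SP. */
uint64_t sp_low12(uint32_t core_no)
{
    assert(core_no < 4);
    return sp_samples[core_no].current_sp & 0xFFF;
}
```

For every core the consumption comes out as 40h bytes and the low 12 bits as FC0h. The low bits match because every stack top is aligned to 1000h and every core has used the same amount of stack by that point.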

We've also moved the reading of the initial PC from _start to _crt0_main. That was done to be able to get it on all cores, because _start is executed by the zeroth core only and the remaining cores never run that code. So, the initial PC is no longer the address of the first instruction of our application. That's why it is not equal to IMAGE_LOAD_ADDR anymore, but to the address of the ALUs' entry point instead.

What else is remarkable here? We've added the _RVBAR parameter to our barium_main() function. You may have already noticed that cores one to three have the same Reset VBAR: 920020h. That is the actual address of our entry point (_crt0_main) for those cores. But the zeroth core has an RVBAR of all zeros. That value is set for the zeroth core by hardware, and it is the address where the BROM is located.

Another interesting point. You may have noticed that in previous posts, when we started and configured the zeroth ALU, the «Initial SP» we output was the stack pointer value before we set it, that is, the value set by the BROM. But this time we output the value we've calculated. You may be curious what the initial stack pointer values of the other ALUs are. I wondered about that too, and checked it. The reason this functionality is omitted in this post is that the initial stack pointer values of the remaining ALUs are actually garbage.

That's all for today. You can clone the final repository from Barium No-Boot (iMX8MP) (see «Stage III» directory).