DaftSoft: The real «Hello World» from embedder (3): result, play

Finally...

Well, we've discussed the goals and benefits of bare-metal development and gathered all the information we need to work on a specific SoC, so now we are ready to start writing code. In this post, we will practice. We will go through the whole process of bare-metal coding, compiling, linking and stripping, and finally get a real stand-alone application for a real, specific ARMv8-A SoC. Can you imagine: it will be only slightly bigger than 1kB in size (including all necessary headers)! This tiny application will perform a minimal task: output strings via UART. And even more, it will read bytes from UART and output them back! We will also add some functionality to our application, do some interesting stuff and perform some experiments.

The plan

The plan we need to carry out to reach the final result looks like this:
1. Organise our code — .s and .c files.
2. CPU start code — assembly code.
3. Initialise the stack pointer — assembly code.
4. Initialise PLLs, clocks, pads, UART hardware-block — C-code.
5. Write some payload code — write to and read from UART.
6. Compile and link our code into a binary file.
7. Make our binary acceptable to our SoC as a boot image.
8. Put our boot-block in a proper place.
9. Configure the board for booting from the media we need.
10. Power on and play!

Hands-on code

Let's get started.

1. Organise our code — .s and .c files.

This is bare-metal: we don't have a bootloader or a kernel behind us. Thus, there are some routines we have to do in assembly language before we can run any C code, so we will have assembly files. Two, in fact. They could be combined into a single .s file, but we will stick to tradition. Following the common naming on 64-bit platforms, these files are start_64.s and crt0_64.s.

But what exactly has to be done in assembly on bare-metal? The answer is: the things we can't do in C. We can access the registers of hardware blocks from C through volatile pointers, because they are memory-mapped (so-called memory-mapped input/output, MMIO). But we can't access CPU registers directly from C, so start_64.s contains the very first code, the part that cannot be written in C. This is CPU start-up code, such as applying errata workarounds, switching modes and exception levels, and setting up interrupt vector tables.

Now let's go on with crt0. You are probably familiar with it (or have at least heard of it): the so-called «C runtime zero». In user space it is usually an object file that the linker implicitly places before the object containing the main() function, whatever high-level language it is written in. There, crt0 prepares argc and argv, which are passed to the developer's main() function, sets up the pointer to the environment variables, and so on. As you can guess, this file is unnecessary if we write in assembly language only. Its job is to prepare the environment for C code (be it a user-space program or bare-metal code). In our case, that means setting the stack pointer to a valid address. After all the necessary assembly routines are done, we will branch out to C code and implement the application payload: reading from and writing to UART. The assembly files hold the start-up code and also remain our window into the ARMv8-A architecture, our sandbox, which is one of our goals.
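To make the MMIO point concrete, here is a minimal sketch of how C code typically touches a memory-mapped register. The helper names mmio_read32() and mmio_write32() are hypothetical and are not taken from the Barium sources. The volatile qualifier stops the compiler from caching, merging or dropping the accesses. Nothing like this exists for CPU registers such as SP or CurrentEL, which are only reachable through instructions like `mov sp, ...` or `mrs`. That is why the start-up code lives in .s files.

```c
#include <stdint.h>

/* Hypothetical helpers: 32-bit access to a memory-mapped register.
 * volatile forces every read and write to actually be performed. */
static inline void mmio_write32(uintptr_t addr, uint32_t val)
{
	*(volatile uint32_t *)addr = val;
}

static inline uint32_t mmio_read32(uintptr_t addr)
{
	return *(volatile uint32_t *)addr;
}
```

On the target, addr would be a register address taken from the reference manual. On a host, you can use the address of an ordinary uint32_t variable as a stand-in for a register when trying the helpers out.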

2. CPU start code — assembly code.

CPU start code. In this example we don't need to do anything here, but let's save some values for further research. Believe me, they will be interesting to explore. AAPCS64 (the Procedure Call Standard for the Arm 64-bit Architecture, i.e. the calling convention for 64-bit Arm machines) passes the first argument of a called function in x0, the second in x1, the third in x2, and so on. Let's save the initial program counter (PC), stack pointer (SP) and exception level (EL) at the very beginning of our code, while our code has not yet touched them. We will save them in the x0, x1 and x2 registers. Later you will see how they get into our main C function and are output to UART. So, start_64.s looks like this:


.arch armv8-a

.globl _start_64
_start_64:

	# Store the address of the very first instruction. This will be the
	# address BROM puts our code to. According to AAPCS64 x0 will be the
	# first parameter of called function. So, save current PC to x0:
	adr x0, .

	# Store initial stack pointer address as the second parameter:
	mov x1, sp

	# Save Current EL to x2, and it will get to third parameter:
	mrs x2, CurrentEL

	# Branch to crt0's main:
	b _crt0_main_64
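Because of AAPCS64, the C side can receive these three values as ordinary arguments. The sketch below assumes a C entry point that takes them as three uint64_t parameters. The real signature in the repository may differ, and collect_boot_regs() and struct boot_regs are hypothetical. One detail is worth knowing: CurrentEL does not hold the level as a plain number. The level sits in bits [3:2], so a raw read at EL3 gives 0xC.

```c
#include <stdint.h>

/* CurrentEL keeps the exception level in bits [3:2]; other bits are RES0. */
static unsigned current_el(uint64_t currentel_reg)
{
	return (unsigned)((currentel_reg >> 2) & 0x3u);
}

/* Hypothetical record of what start_64.s hands over in x0..x2. */
struct boot_regs {
	uint64_t pc;   /* x0: address of the first instruction (adr x0, .) */
	uint64_t sp;   /* x1: stack pointer left by the BootROM */
	unsigned el;   /* x2: CurrentEL, decoded to 0..3 */
};

static struct boot_regs collect_boot_regs(uint64_t x0, uint64_t x1, uint64_t x2)
{
	struct boot_regs r = { x0, x1, current_el(x2) };
	return r;
}
```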

3. Initialise the stack pointer — assembly code.

In our case, there is only one thing we have to do in crt0 to prepare for C code: set the stack pointer. The stack pointer is set simply by writing a memory address to the SP register. We want the values in x0, x1 and x2 to pass through this file to the C function untouched, so we avoid using these registers here. So, crt0_64.s looks like this:


.arch armv8-a

.globl _crt0_main_64
_crt0_main_64:
	# Set the stack pointer near the top of OCRAM (last byte at 97FFFFh),
	# see Internal ROM and RAM memory map (NXP IMX8MPRM, Figure 6-2, p.706).
	# AAPCS64 requires SP to be 16-byte aligned, so load 97FFF0h into x3
	# (the top address with its low 4 bits cleared):
	mov x3, 0xFFF0
	movk x3, 0x0097, lsl #16

	# Set stack pointer:
	mov sp, x3

	# Finally, break out to C-code:
	b barium_main
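Why two instructions? 97FFF0h does not fit into the single 16-bit immediate of one MOVZ. The code therefore builds it from two 16-bit halves: `mov` (assembled as MOVZ) writes the low half, and `movk` patches bits [31:16] while keeping the rest. The sketch below models both the alignment and that two-step load in C. It is purely an illustration and the helper names are hypothetical.

```c
#include <stdint.h>

#define OCRAM_LAST_BYTE 0x0097FFFFu  /* last OCRAM byte, per the comment in crt0_64.s */

/* AAPCS64 wants SP 16-byte aligned: clear the low 4 bits. */
static uint64_t align_down_16(uint64_t addr)
{
	return addr & ~(uint64_t)0xF;
}

/* Model of the two-instruction load used in crt0_64.s:
 *   mov  x3, #lo16           (MOVZ: writes lo16, zeroes everything else)
 *   movk x3, #hi16, lsl #16  (MOVK: replaces bits [31:16], keeps the rest) */
static uint64_t movz_movk16(uint16_t lo16, uint16_t hi16)
{
	uint64_t x3 = lo16;
	x3 = (x3 & ~((uint64_t)0xFFFF << 16)) | ((uint64_t)hi16 << 16);
	return x3;
}
```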

From here on, you will see links to documents in brackets. The format is as follows: (document name, section, table, figure name or number, page number).

4. Initialise PLLs, clocks, pads, UART hardware-block — C-code.

After that, we have to initialise the PLLs, clocks, pads and the UART controller.

We begin with the ccm_plls_init() function. The only thing we have to do here is to enable the div-5 output of PLL2, because we decided to clock the UART hardware block from that output. Therefore, this function looks like this:


  void ccm_plls_init(void)
  {
  	/* Enable the div 5 output of PLL2, as we decided to clock UART via this output.
  	 * Actually, this is not needed: the output appears to be enabled already. */
  	setbits_le32(CCM_ANALOG_SYS_PLL2_GEN_CTRL, PLL_DIV5_CLKE);

  	return;
  }
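setbits_le32() is a classic read-modify-write on a register. U-Boot has helpers with this name, but the implementation used in the Barium sources is not shown here. The version below is a hypothetical sketch. It takes a pointer instead of an integer address, so an integer register constant would need a cast. On little-endian AArch64, "le32" needs no byte swapping.

```c
#include <stdint.h>

/* Hypothetical read-modify-write helpers. Not atomic, which is fine for
 * single-core bare-metal code with interrupts off. */
static inline void setbits_le32(volatile uint32_t *reg, uint32_t set)
{
	*reg = *reg | set;     /* read, OR in the bits, write back */
}

static inline void clrbits_le32(volatile uint32_t *reg, uint32_t clear)
{
	*reg = *reg & ~clear;  /* read, mask the bits out, write back */
}
```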

After that, we proceed with the uart_init() function:


	uart_init(uart_base, UART_BAUD_RATE);

The most interesting part of this routine is the initialisation of pads, which looks like:


	...

	/* Set UART2 TXD Alternative function */
	iomux_set_alt(UART2_TXD_ALT, UART2_TXD_MUX);
	/* Set UART2 TXD Pad properties */
	iomux_set_pad(UART2_TXD_PAD, UART2_TXD_MUX + ALT_PAD_STEP);

	/* Set UART2 RXD Alternative function */
	iomux_set_alt(UART2_RXD_ALT, UART2_RXD_MUX);
	/* Set UART2 RXD Pad properties */
	iomux_set_pad(UART2_RXD_PAD, UART2_RXD_MUX + ALT_PAD_STEP);

	/* Select UART2 RXD input (Daisy Chain) */
	iomux_select_input(UART2_RXD_DSY, UART2_RXD_SEL);

	...

As we discussed earlier, we set the alternative functions of the pads, their properties, and the input selection of the UART2 RXD pad, its so-called «daisy chain».
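All three pad operations boil down to plain 32-bit stores into IOMUXC registers. The sketch below is hypothetical. It keeps the argument order of the snippet above (value first, register second) and uses byte offsets into a host array that stands in for the IOMUXC register file. The real helpers, register addresses and the value of ALT_PAD_STEP come from the repository and the reference manual.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical IOMUXC model. On the target, iomuxc_regs would point at the
 * IOMUXC base address; here a host array stands in for the register file. */
#define IOMUXC_NREGS 256
static volatile uint32_t iomuxc_sim[IOMUXC_NREGS];
static volatile uint32_t *iomuxc_regs = iomuxc_sim;

/* Every pad helper is one 32-bit store at a byte offset. */
static void iomux_write(uint32_t value, size_t byte_off)
{
	iomuxc_regs[byte_off / sizeof(uint32_t)] = value;
}

/* MUX register: selects which peripheral signal drives the pad (ALTn). */
static void iomux_set_alt(uint32_t alt, size_t mux_off)
{
	iomux_write(alt, mux_off);
}

/* PAD register: electrical properties (drive strength, pull, slew, ...). */
static void iomux_set_pad(uint32_t pad_cfg, size_t pad_off)
{
	iomux_write(pad_cfg, pad_off);
}

/* Daisy-chain (input select) register: tells a peripheral input which of
 * several possible pads it should listen to. */
static void iomux_select_input(uint32_t sel, size_t daisy_off)
{
	iomux_write(sel, daisy_off);
}
```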

The rest of the uart_init() function initialises the UART hardware block. It is not interesting enough to list here; you can find it in the code.

5. Write some payload code — write to and read from UART.

Payload code. Since we have initialised all the hardware and have some high-level functions, the code is self-explanatory. The only things I want to explain are the uart_chars_counter variable and the UART_FIFO_MAX define. On each character output we increment this counter. When it reaches the FIFO length, we wait for the transfer to actually finish by polling the TXDC bit in the UART Status Register. We need this check because without it the UART can skip characters being output. This is done in the uart.c file. You can see the payload functionality in the barium_main() function located in barium.c.
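The counter logic can be sketched as follows. It is a simplified model, not the code from uart.c. The register layout, the FIFO depth and the TXDC bit position are hypothetical placeholders; take the real values from the UART chapter of the reference manual. Resetting the counter after the wait is also an assumption of this sketch.

```c
#include <stdint.h>

#define UART_FIFO_MAX   32u        /* hypothetical TX FIFO depth */
#define UART_STAT_TXDC  (1u << 3)  /* hypothetical "transmission complete" bit */

/* Hypothetical, simplified view of the two registers involved. */
struct uart_regs {
	volatile uint32_t txd;   /* transmit data register */
	volatile uint32_t stat;  /* status register containing TXDC */
};

static unsigned uart_chars_counter;

/* Busy-wait until everything queued so far has really left the UART. */
static void uart_wait_txdc(struct uart_regs *u)
{
	while (!(u->stat & UART_STAT_TXDC))
		;
}

static void uart_putc(struct uart_regs *u, char c)
{
	/* The FIFO may be full: drain it before queueing another character. */
	if (uart_chars_counter >= UART_FIFO_MAX) {
		uart_wait_txdc(u);
		uart_chars_counter = 0;
	}
	u->txd = (uint8_t)c;
	uart_chars_counter++;
}
```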

6. Compile and link our code into a binary file.

Compile and link. Compilation, even cross-compilation, is nothing new. As usual, we compile separate source files, link them into a single ELF file, and then perform the only extra step: dumping its pure code into a binary file. You'll find a linker script in the repository. Actually, you can build the project without it; I've used it just to remove some unneeded sections that GCC adds to the ELF. But we should bear in mind that this is bare-metal, which means the machine starts from the very first instruction it finds in our code: the BootROM simply branches to the address it has loaded our application to. Thus, we have to put our start function in the very first place. We do this by respecting the order of object files when linking; in our case, it is controlled by the PRODUCT_SRC variable in our Makefile. This way, the file containing our start code comes first, and the function we want the SoC to start with is the first function in that file.
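A cheap way to confirm that the ordering worked is to look at the first 4 bytes of the stripped binary. They should be the encoding of `adr x0, .`, the first instruction of _start_64. This host-side check is a hypothetical sketch and not part of the project's tooling. It assumes start_64.s is assembled exactly as listed, and it should run on the raw binary before mkbb_imx8 processes it.

```c
#include <stddef.h>
#include <stdint.h>

/* Encoding of "adr x0, .": op=0, immlo=0, 0b10000, immhi=0, Rd=0. */
#define ADR_X0_DOT 0x10000000u

/* Read a little-endian 32-bit word (AArch64 instructions are stored LE). */
static uint32_t le32_at(const unsigned char *p)
{
	return (uint32_t)p[0] | (uint32_t)p[1] << 8 |
	       (uint32_t)p[2] << 16 | (uint32_t)p[3] << 24;
}

/* Returns 1 if the raw binary starts with the first instruction of _start_64. */
static int starts_with_start64(const unsigned char *bin, size_t len)
{
	return len >= 4 && le32_at(bin) == ADR_X0_DOT;
}
```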

7. Make our binary acceptable to our SoC as a boot image.

Making the binary acceptable to the SoC is a very important step, but I won't bother you with all the details: what information is needed, and where and how to find the fields we have to fill in for our SoC. I've made a host tool named mkbb_imx8, derived from NXP's mkimage_imx8. It gathers some information, fills in the necessary structures, writes them into a proprietary header and adds that header to the specified binary. The only interesting thing here is the load address. This field tells the BootROM where to put the loaded application (and where to jump after loading it). In my tool it is defined as DEFAULT_LOAD_ADDR in the source code, or it can be passed as the third, optional parameter to mkbb_imx8. The command-line parameter takes priority over the define. If you ever need to develop such a tool for a particular SoC, you can follow the same path I did. There are no tricks here: all you need is to find the structure of the proprietary header for the specific SoC. Usually this information can be found in the vendor's BSP build system, in a host tool often named mkimage_XXX. You will find the mkbb_imx8 code in a separate directory in the repository.
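The "command line overrides the define" rule can be sketched like this. It is a hypothetical fragment, not mkbb_imx8's actual code. The default value 0x920000 is assumed, and so are the roles of the first two parameters (input and output file). Only the fact that the load address is the optional third parameter comes from the text.

```c
#include <errno.h>
#include <stdlib.h>

#define DEFAULT_LOAD_ADDR 0x920000UL  /* assumed default value */

/* Pick the BootROM load address: an optional third command-line parameter
 * (argv[3]) overrides the compiled-in default. Returns 0 on success. */
static int pick_load_addr(int argc, char *argv[], unsigned long *out)
{
	char *end;
	unsigned long addr;

	if (argc < 4) {
		*out = DEFAULT_LOAD_ADDR;
		return 0;
	}
	errno = 0;
	addr = strtoul(argv[3], &end, 0);   /* base 0: accepts 0x..., 0..., decimal */
	if (errno != 0 || end == argv[3] || *end != '\0')
		return -1;                  /* reject malformed input */
	*out = addr;
	return 0;
}
```

Using base 0 in strtoul() lets the address be written as 0x930000, the form used throughout the reference manual.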

8. Put our boot-block in a proper place.

The result of mkbb_imx8 is a binary file that the iMX8MP BootROM will accept as a boot image. The last thing we have to do on our host machine is to put our boot-block in the proper place. The iMX8MP expects it at an offset of 32kB on the boot media. Thus, we write our application to the SD card with the following command:


  dd if=./barium of=/dev/sdX bs=1k seek=32 ; sync
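If you would rather do this from a host tool than with dd, the same write is a positioned write at 32 * 1024 bytes. The POSIX C sketch below is hypothetical. Opening /dev/sdX (which usually needs root) and reading the image file are left out.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

#define BOOT_OFFSET (32 * 1024)  /* iMX8MP boot image offset on SD: 32kB */

/* Same effect as: dd if=image of=dev bs=1k seek=32 ; sync
 * Writes len bytes at BOOT_OFFSET, retrying on short writes. 0 on success. */
static int write_boot_image(int fd, const void *buf, size_t len)
{
	const unsigned char *p = buf;
	off_t off = BOOT_OFFSET;

	while (len > 0) {
		ssize_t n = pwrite(fd, p, len, off);
		if (n <= 0) {
			if (n < 0 && errno == EINTR)
				continue;   /* interrupted by a signal: retry */
			return -1;
		}
		p += n;
		off += n;
		len -= (size_t)n;
	}
	return fsync(fd);  /* flush this device, similar to the trailing sync */
}
```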

I said it would be about 1kB in size. With my compiler (ARM GNU Toolchain 14.3.Rel1), the code with its data is 980 bytes, and 1044 bytes with the iMX headers. Maybe we've made one of the most beautiful kilobytes in the world...

9. Configure the board for booting from the media we need.

If you are reading this, perhaps you have a board (proprietary or a kit) with an iMX8MP SoC. If so, it may already be configured to boot from an SD card, or you may already know how to do it. If not, you need to configure the board to boot from the SD card. From the iMX8MP SoC perspective, boot device selection is described in section 6.1.5, «Boot devices (internal boot)» (p.713). Refer to your product manual to see how it is done on your particular board.

10. Power on and play!

Insert the SD card into your board, connect your UART converter to the board's UART2 interface, start your favourite serial communication program and power on (or reset) the board. You'll see:


  Barium No-Boot V0.1 (iMX8MP)
  Build: 18:00:00, Feb 12 2025
  Initial PC: 0000000000920000
  BootROM SP: 0000000000916ED0
  Current EL: 0000000000000003
  Awaiting commands from UART:
  Received: 20h
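These values are printed as fixed-width hex. One way to produce that exact format without printf is shown below. It assumes no C library is linked, and u64_to_hex16() is a hypothetical helper, not the repository's function.

```c
#include <stdint.h>

/* Format v as 16 upper-case hex digits plus NUL, as in
 * "Initial PC: 0000000000920000". No libc needed. */
static void u64_to_hex16(uint64_t v, char out[17])
{
	static const char digits[] = "0123456789ABCDEF";

	for (int i = 15; i >= 0; i--) {
		out[i] = digits[v & 0xF];  /* lowest nibble goes rightmost */
		v >>= 4;
	}
	out[16] = '\0';
}
```

Each character of the buffer would then be sent through the UART output routine one by one.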

What about the naming? «Barium» stands for «bare»: besides sounding like «bare», barium is a metal element. Thus, «bare-metal» becomes «Barium». «No-Boot» means that it does not boot any kernel or anything at all (but runs when and where a bootloader usually does).

Well, what do we see here? First, the initial PC is 920000h. This is our DEFAULT_LOAD_ADDR (in our Makefile, the address is passed as a command-line argument). You can play with it and set it here and there according to the OCRAM memory map (p.707). The datasheet claims that the free OCRAM area starts at 918000h, but the lowest address I've managed to load the application to is 920000h. The BootROM does not allow putting it at 918000h.

But what is more interesting is the stack pointer the BootROM leaves for us. It does not allow us to place our application in its reserved area, yet it leaves SP at 916ED0h. This can be used: we can leave SP intact and have a stack from 916ED0h down to 900000h, that is 16ED0h = 93,904 bytes (about 91.7kB). That is quite enough space, and the BootROM keeps our code out of this area anyway, so it would be unusable otherwise.

And we see that we run at exception level 3 (EL3), the highest one. We have carte blanche to do whatever we want on this machine.

What conclusion can be drawn from this? The First Platform Loader (BootROM) of the iMX8MP leaves the SoC in a state where we can omit the initialisation that is usually done in assembly language. As we saw above, the stack pointer is already set to a valid address. Since setting the stack pointer is the only crucial thing we have to do in assembly for such a simple example, we can skip this step by excluding all .s files. Nothing else needs to change. Just don't forget to put the main .c file (in our case, barium.c) first in your object list, and make the main function (in our case, barium_main()) the first function in this file. Keep in mind that this is bare-metal, and the first function in your code is the entry point regardless of what the ENTRY() linker directive points to (you won't even find ENTRY() in our .lds file). Although our goal was the exact opposite, the whole bare-metal project can work on the iMX8MP without a single .s file! You can try it. Thus, we can call the iMX8MP a very high-level SoC, a kind of «C-SoC».
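If you try the no-.s variant, keep in mind that GCC does not promise to keep functions in source order when optimising (see its -fno-toplevel-reorder option). An alternative to relying on file and function order is to give the entry function its own section and have the linker script place that section first. This is a different technique from the one used in the project, and the section name `.text.boot`, the linker rule in the comment, and the body are all hypothetical. Also note that without start_64.s, x0-x2 simply hold whatever the BootROM left in them, not the PC/SP/EL values prepared in assembly.

```c
#include <stdint.h>

/* Hypothetical snapshot of the first three argument registers. */
struct entry_args {
	uint64_t x0, x1, x2;
};
volatile struct entry_args entry_args_seen;

/* Hypothetical: put the entry function in its own input section so that a
 * linker script rule such as
 *     .text : { *(.text.boot) *(.text*) }
 * places it first, whatever the object order or compiler reordering. */
__attribute__((section(".text.boot"), used))
void barium_main(uint64_t x0, uint64_t x1, uint64_t x2)
{
	/* Without start_64.s these are whatever the BootROM left in x0..x2. */
	entry_args_seen.x0 = x0;
	entry_args_seen.x1 = x1;
	entry_args_seen.x2 = x2;
}
```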

You can clone the final repository from Barium No-Boot (iMX8MP) (see the «Stage I» directory).

That's all. Finally, we have a good window for learning ARMv8 (aka ARM64, aka AArch64) on real hardware. We have assembly files and C files to play with, a build system for this project (Makefile and linker script), a host tool to make a proper boot image, and a kind of debug interface: UART. That looks like a sufficient setup for further study.

DaftSoft

12/02/2025
