Preface
Any learning process eventually comes down to examples and practice. In software development, practice starts with the so-called «Hello World». «Hello World» is the very first code that every software developer writes when taking the first steps on a new platform, API, or framework. Embedded software engineers are no exception.
There are plenty of «Hello Worlds» on the Internet and in books, written in all kinds of programming languages and for all kinds of APIs and frameworks. But the vast majority of them are pure software: Java, JS, C++, etc. Thus, embedded software developers who want to study a machine architecture, get in touch with it and play with it, suffer from a lack of examples. They have no starting point.
You could argue with that: there are examples in assembly language for all types of architectures, and assembly language should let us study a machine architecture and its behaviour. Yes, there are a lot of examples in assembly language, but they target the very highest levels of the runtime environment: Linux and Mac user-space. This approach to learning a machine architecture doesn't lead to an understanding of how the machine works and how it is organised.
Actually, it gives us only a very narrow slit into the machine architecture. Even in kernel-space, our ability to learn the machine is very limited. That's because all the hardware initialisation has already been done for us (and before us) by the bootloader and the kernel itself. The second problem is the so-called «concepts and abstractions of operating systems». The «abstractions» hide a lot of the machine architecture from us, while the «concepts» give rise to a lot of software that serves the OS itself and makes it what it is from the user's point of view. Thus, even when working in kernel-space, we learn more about those concepts and principles than about the machine.
Let's go on and discuss our last chance: MCUs (Micro-Controller Units), which are cheap, popular, easy to start with, and let us get as close to the machine as possible. MCUs have always differed from CPUs and SoCs (Systems on Chip) in a crucial way: they have no MMU and usually have a single core. With the beginning of the ARM64 era, when not only phones and tablets but also laptops, desktops and even servers run on ARM64, MCUs gained one more drawback: ARM32 (the most common architecture among MCUs worth anything at all) became outdated even as a single-core platform for studying. We now live in days when learning the ARM32 architecture is almost useless. It's more a waste of time, because the difference between the current, almost omnipresent ARM64 and the rapidly ageing ARM32 is enormous.
Thus, we come to the point where we need a comfortable, big enough window (or better, a door or a nice gate) into a machine that represents the modern ARM64 architecture. Usually this is done by programming on an emulator. Developing for an emulator is like a game because of its abstraction. Games (and gaming) give you minimal or even zero risk at the price of insignificant wins. To get really valuable practice and results, we need a real «Hello World» on real hardware.
Learning a machine architecture (and its behaviour) is done through so-called «Bare Metal» programming. «Bare» means that you work on a machine without an OS, without a kernel, and even without a bootloader. In fact, you develop code that works in place of a bootloader, or runs where and when the very first part of the user bootloader (the SPL, Secondary Program Loader) usually does. «Metal» stands for hardware, or the machine, and means that you work on real hardware, not on a virtual machine or an emulator.
In these topics we'll cut a good window for studying a modern ARM64 machine. This will be a «Hello World» from an embedded engineer and a nice toy for an embedded developer.
Theory
Let's start with a short overview of a computing machine, specifically of its heart: the CPU or SoC. A SoC provides a set of functions, and each function is provided by a corresponding, separate hardware-block. Thus, a SoC is a set of hardware-blocks. Each hardware-block is driven by a clock; each clock is derived from a PLL (Phase-Locked Loop), a small circuit that generates one or more frequencies from an input clock source. Each PLL is driven by an oscillator. So, turning a hardware-block on, besides powering it, just means enabling the clock it is connected to. Actually, you will not find an element called «clock» on your board. A «clock» as an element is just a concept used by embedded developers. In reality, a «clock» is one of the outputs of a PLL, which can be enabled or disabled by software. The real clock sources are oscillators and PLLs. You can see the clock generating scheme below.
Figure: Clock generating scheme
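The chain described above (oscillator, PLL, gated clock) can be sketched in C. This is only an illustration under stated assumptions: the multiplier/divider model of a PLL is a common simplification, the names `pll_cfg`, `pll_output_hz` and `clock_gate_enable` are hypothetical, and the «register» is whatever 32-bit location the caller passes in, so the code can be tried on a host PC with an ordinary variable. On a real SoC the register addresses and bit positions come from the datasheet.

```c
#include <stdint.h>

/* ASSUMPTION: a simplified PLL model where f_out = f_in * mult / div. */
struct pll_cfg {
    uint32_t mult;  /* hypothetical multiplier field */
    uint32_t div;   /* hypothetical divider field */
};

/* Frequency on the PLL output, in Hz, for a given oscillator frequency.
 * Returns 0 for an invalid (zero) divider. */
uint64_t pll_output_hz(uint64_t osc_hz, struct pll_cfg cfg)
{
    if (cfg.div == 0)
        return 0;
    return osc_hz * cfg.mult / cfg.div;
}

/* "Enabling a clock" is setting one gate bit in a clock-control register.
 * The bit position is hypothetical; real values are SoC-specific. */
void clock_gate_enable(volatile uint32_t *gate_reg, unsigned bit)
{
    *gate_reg |= (uint32_t)1 << bit;
}

/* Clearing the same bit stops the clock and so turns the block off. */
void clock_gate_disable(volatile uint32_t *gate_reg, unsigned bit)
{
    *gate_reg &= ~((uint32_t)1 << bit);
}

/* Returns 1 if the gate bit is set, 0 otherwise. */
int clock_gate_is_enabled(const volatile uint32_t *gate_reg, unsigned bit)
{
    return (int)((*gate_reg >> bit) & 1u);
}
```

With a 24 MHz oscillator, a multiplier of 50 and a divider of 1, `pll_output_hz` returns 1200000000, i.e. 1.2 GHz.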
But enabling the clock is not enough to start using a hardware-block. In most cases two more things are needed. First, SoCs usually provide more functions than they have pins (or pads, as we more often say in embedded software). This leads to the concept of the so-called «alternative function» (AF). It means that some pads can be configured by software for a specific (alternative) function: I2C SDA or UART TX, for example. AFs are configured through the Input/Output Multiplexing Controller (IOMUXC) or, as it is more commonly called, the Pin Multiplexer (PINMUX). PINMUX is a built-in hardware-block, so it has to be turned on (clocked) too, like other hardware-blocks. From the hardware point of view, PINMUX simply routes signals between the SoC's pads and the inputs/outputs of the specified hardware-block: the I2C bus controller or the UART controller in the example above. In other words, PINMUX switches signals between the hardware-blocks inside the SoC on one end and the SoC's balls (or pads) that we can see on the chip package on the other. You can see the alternative function multiplexing scheme below.
Figure: Alternative functions multiplexing scheme
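In software, selecting an alternative function usually comes down to writing a small bit-field that belongs to one pad. The sketch below assumes a purely hypothetical layout (4 bits of function select per pad, 8 pads per 32-bit register); the function names are also made up for illustration. A real SoC defines its own layout in the datasheet.

```c
#include <stdint.h>

/* ASSUMPTION: 4 bits of function select per pad, 8 pads per register. */
#define MUX_BITS_PER_PAD 4u
#define MUX_PADS_PER_REG 8u
#define MUX_FIELD_MASK   0xFu

/* Return reg_value with the field of the given pad replaced by func.
 * Fields of all other pads are left untouched. */
uint32_t pinmux_select(uint32_t reg_value, unsigned pad, unsigned func)
{
    unsigned shift = (pad % MUX_PADS_PER_REG) * MUX_BITS_PER_PAD;

    reg_value &= ~(MUX_FIELD_MASK << shift);                  /* clear old function */
    reg_value |= ((uint32_t)func & MUX_FIELD_MASK) << shift;  /* set new function */
    return reg_value;
}

/* Read back which function a pad is currently routed to. */
unsigned pinmux_current(uint32_t reg_value, unsigned pad)
{
    unsigned shift = (pad % MUX_PADS_PER_REG) * MUX_BITS_PER_PAD;

    return (reg_value >> shift) & MUX_FIELD_MASK;
}
```

The important detail is the read-modify-write pattern: one register holds the settings of several pads, so only the bits of the pad being changed may be touched.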
The second thing is the configuration of the hardware-block, which is done by code. After a hardware-block is clocked (enabled), it stays in some default state. We need to write specific values to specific registers to make it function according to our needs.
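Writing «specific values to specific registers» looks roughly like this in C. The helper names are hypothetical; the only real requirement is the `volatile` qualifier, which tells the compiler that every access has a side effect and must not be removed or merged. The sketch takes a pointer, so on a host PC it can be exercised with an ordinary variable instead of a real register address.

```c
#include <stdint.h>

/* Write a 32-bit value to a memory-mapped register. */
void reg_write32(volatile uint32_t *reg, uint32_t value)
{
    *reg = value;
}

/* Read a 32-bit value from a memory-mapped register. */
uint32_t reg_read32(const volatile uint32_t *reg)
{
    return *reg;
}

/* Read-modify-write: replace only the bits selected by mask with the
 * corresponding bits of value; all other bits keep their current state. */
void reg_update32(volatile uint32_t *reg, uint32_t mask, uint32_t value)
{
    uint32_t v = reg_read32(reg);

    v = (v & ~mask) | (value & mask);
    reg_write32(reg, v);
}
```

On real hardware the pointer would be a fixed address taken from the datasheet, cast to `volatile uint32_t *`.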
Clocks for hardware-blocks and the corresponding PLLs are enabled by code. Hardware-blocks are configured by code too. But no program can run on a non-clocked CPU. So who starts the main PLLs and the clocks that drive the CPU core, and how? This functionality is hardcoded into the so-called BootROM, or BROM. BROM is the firmware of the SoC itself; it is located inside the SoC. BROM initialises the minimum of hardware required to start the user's machine code. It starts up a minimal number of hardware-blocks and after that acts as the FPL (see below). BROMs differ greatly in functionality: some just start code from built-in memory without even loading it into RAM (as the BROMs of MCUs do), while others initialise a bunch of hardware-blocks and can even work with filesystems (as the BROM of the Raspberry's Broadcom SoC does). The most common state in which BROM leaves a SoC is: one CPU core is started, SRAM (or OCRAM, On-Chip RAM) is initialised, and one of the boot sources is initialised. Initialisation of the rest of the functionality is left to user code.
Every SoC uses its own, proprietary FPL (First Platform Loader). The FPL is a part of the BROM we've discussed earlier. To run our code on a SoC we need a so-called «boot image», which is made out of our code.
We need to say a few words about the contents of a boot image. It is not the regular file we get from a compiler, and the architecture mismatch is not the only reason. Usually the toolchain builds an ELF (Executable and Linkable Format) file. An ELF file contains a lot of extra information that is used by an OS or some other environment. It can also include debug information, symbol names, etc. But when working on bare metal we have to get rid of all the OS-specific, debug and other environmental information, because the SoC executes raw code only and treats any binary data as a straight, linear stream of code (with some data mixed in). If we try to load a whole ELF file as a bare-metal application, we will most probably get some kind of illegal instruction exception, because the extra information in the ELF file does not match valid instruction encodings. Here we call the process of making raw code (a raw binary) out of an ELF file stripping. With GNU tools this is typically done by «objcopy -O binary» rather than by the «strip» utility, which only removes symbols and leaves the ELF container in place.
Now, let's proceed with the FPL. Any FPL expects the boot image in a specific format and in a specific place. Thus, after we've compiled our source code into a set of object files, linked those objects into a single binary, and stripped it down to pure machine code, we have to pack it so that it complies with the specific requirements of the particular SoC and its BROM. This is done by a tool that is usually named «mkimage», or sometimes something like «mkimage_XXX», where XXX is the name of the SoC or of a family of SoCs. Usually, mkimage adds some headers, which BROM reads, and sometimes CRCs to the machine code. BROM uses this information to verify the boot image. Then we have to put this image in a particular place, usually on an SD card or eMMC at some offset from zero. The offset is needed to make the bootable media usable as storage media too: it leaves space for a partition table and filesystems.
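To make the idea of packing more concrete, here is a toy packer and verifier in C. Everything about the format is an assumption made for illustration: the magic value, the three-field header, storing the header in host byte order, and the plain byte sum used in place of a CRC. Real BROM formats are SoC-specific and are described in the vendor documentation. The only non-hypothetical detail is the ELF magic (0x7F followed by "ELF"), which the sketch uses to refuse a file that was not converted to a raw binary.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define IMG_MAGIC 0x424D4731u  /* hypothetical magic value */

/* Hypothetical header, stored in host byte order for simplicity. */
struct img_header {
    uint32_t magic;
    uint32_t length;    /* payload size in bytes */
    uint32_t checksum;  /* plain sum of payload bytes, standing in for a CRC */
};

/* Returns 1 if the buffer starts with the ELF magic, i.e. is not raw code. */
int looks_like_elf(const uint8_t *buf, size_t len)
{
    return len >= 4 && buf[0] == 0x7F && buf[1] == 'E' &&
           buf[2] == 'L' && buf[3] == 'F';
}

/* Sum of all bytes, truncated to 32 bits. */
uint32_t byte_sum(const uint8_t *buf, size_t len)
{
    uint32_t sum = 0;

    for (size_t i = 0; i < len; i++)
        sum += buf[i];
    return sum;
}

/* Write header + payload into out. Returns the total image size, or 0 if
 * the payload is still an ELF file or does not fit into out. */
size_t pack_image(const uint8_t *payload, size_t len,
                  uint8_t *out, size_t out_cap)
{
    struct img_header h;

    if (looks_like_elf(payload, len) || len > UINT32_MAX)
        return 0;
    if (out_cap < sizeof h || out_cap - sizeof h < len)
        return 0;

    h.magic = IMG_MAGIC;
    h.length = (uint32_t)len;
    h.checksum = byte_sum(payload, len);
    memcpy(out, &h, sizeof h);
    memcpy(out + sizeof h, payload, len);
    return sizeof h + len;
}

/* What a loader might do before jumping to the code: check the magic,
 * the length and the checksum. Returns 1 if the image looks valid. */
int verify_image(const uint8_t *img, size_t len)
{
    struct img_header h;

    if (len < sizeof h)
        return 0;
    memcpy(&h, img, sizeof h);
    if (h.magic != IMG_MAGIC || h.length != len - sizeof h)
        return 0;
    return byte_sum(img + sizeof h, h.length) == h.checksum;
}
```

The resulting buffer is what would then be written to the boot media at the offset the BROM expects.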
That is what we have about hardware. Let's proceed to software. There's not too much to do and discuss here. We could work in assembly language forever; this is interesting and may be useful. By limiting ourselves to assembly language only, we can avoid using the stack. But at some point we will want to move on to C code, and then we'll need a stack, because the C compiler uses it intensively. Thus, we have to initialise it; that's all we need to know at the moment. The stack is initialised by setting the stack pointer register to a value that is a valid memory address. The stack (most often) grows down, so we have to choose its address with that in mind: the stack must not run into the bottom of RAM, and if it is placed above our application it must not grow down into it and destroy it. And, of course, we can't set it higher than the top of accessible RAM.
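The constraints on the stack address can be written down as a small check. This is a host-side sketch with hypothetical names and example addresses; the 16-byte alignment reflects what the AArch64 procedure call standard expects of the stack pointer. On the target, the chosen value would be loaded into SP by a few assembly instructions before the first C function is called.

```c
#include <stdint.h>

#define SP_ALIGN 16u  /* AArch64 procedure call standard: SP is 16-byte aligned */

/* Highest usable initial stack pointer: the top of RAM, rounded down
 * to the required alignment. */
uint64_t stack_top(uint64_t ram_base, uint64_t ram_size)
{
    return (ram_base + ram_size) & ~(uint64_t)(SP_ALIGN - 1);
}

/* Returns 1 if a descending stack of stack_size bytes starting at sp is
 * aligned, does not start above the top of RAM, and does not grow down
 * into the application image (which ends at image_end) or below the
 * bottom of RAM. Returns 0 otherwise. */
int stack_fits(uint64_t sp, uint64_t stack_size, uint64_t image_end,
               uint64_t ram_base, uint64_t ram_size)
{
    if (sp % SP_ALIGN != 0)
        return 0;
    if (sp > ram_base + ram_size)
        return 0;
    if (sp < stack_size)
        return 0;
    return sp - stack_size >= image_end && sp - stack_size >= ram_base;
}
```

Placing the stack at the very top of RAM and the application at the bottom gives both the largest possible distance from each other.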
Now let's proceed to our specific task. Let's say we want to make our SBC (Single-Board Computer) output a «Hello World». How can we do that? Let's admit that drawing strings on a display is a somewhat complicated task for a bare-metal beginner. So we'll output our «Hello World» via the most common debug interface in the world: UART. UART functionality is provided by a particular hardware-block, so we need to start clocking that hardware-block, set up its pads and configure it.
First, to enable a clock for UART we have to know exactly which clock we need to enable on the particular SoC. As we mentioned earlier, every clock is derived from a particular PLL, so we need to find out which PLL provides the clock we need. All this information is given in the datasheet and is rigidly tied to a certain SoC. The second part is to set the AF, that is, to configure the pads for UART TX and RX. Here we need to turn on PINMUX (its PLL and clock) and configure the pads we need for the functions we need. The last part is configuration, which is done by writing values to the memory locations that the registers of a hardware-block are mapped to: PINMUX (pads) and UART (baud rate, parity, etc.) in our case. What configuration must be written is described in the datasheet and, again, is rigidly tied to a certain SoC.
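The baud rate is a good example of why the UART clock frequency has to be known before the block can be configured. The sketch below assumes a UART that samples each bit 16 times, as 16550-style UARTs do, so the value written to the divisor register is the clock frequency divided by 16 times the baud rate. Whether the chosen SoC's UART works this way is an assumption to be checked against its datasheet; the function names are hypothetical.

```c
#include <stdint.h>

/* ASSUMPTION: the UART oversamples each bit 16 times (16550-style). */
#define UART_OVERSAMPLING 16u

/* Divisor for the requested baud rate, rounded to the nearest integer.
 * Returns 0 if baud is 0. */
uint32_t uart_divisor(uint32_t uart_clk_hz, uint32_t baud)
{
    uint64_t denom = (uint64_t)UART_OVERSAMPLING * baud;

    if (baud == 0)
        return 0;
    return (uint32_t)((uart_clk_hz + denom / 2) / denom);
}

/* Baud rate actually produced by a given divisor (integer part).
 * It usually differs slightly from the requested one.
 * Returns 0 if divisor is 0. */
uint32_t uart_actual_baud(uint32_t uart_clk_hz, uint32_t divisor)
{
    if (divisor == 0)
        return 0;
    return (uint32_t)(uart_clk_hz / ((uint64_t)UART_OVERSAMPLING * divisor));
}
```

For a 24 MHz UART clock and 115200 baud this gives a divisor of 13 and an actual rate of about 115384 baud, an error of roughly 0.16 %, which a UART link tolerates easily.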
After writing this code we have to compile and link it, strip the resulting file down to raw machine code, and make a boot image by forming a BROM header and adding it to the machine code from the previous stage. Then we put our boot image in a specific place on specific media and power on the SBC.
Theory ends at this point. In the next part, we'll choose a specific SoC and gather the information needed to design the «Hello World» for it. With this information we will draw up a plan of steps for the third part.