12/02/2025

The real «Hello World» from embedder (3): result, play!

Finally...

Well, we've discussed the goals and benefits of bare-metal development, gathered all the information we need to work on a specific SoC, and are ready to start writing code. In this post we will practice: we will go through the whole process of bare-metal coding, compiling and linking, and, finally, get a real stand-alone application for a real ARM64/ARMv8-A SoC. Can you imagine — it will be less than 1 kB in size (including all necessary headers)! Yet this tiny application will perform a minimal task — output strings via UART. And even more — it will read bytes from UART and output them back! We will also add some functionality to our application, do some interesting stuff and perform some experiments.

The plan

The plan we need to carry out to reach the final result looks like:
1. Organise our code — .s and .c files.
2. CPU start code — assembly code.
3. Initialise the stack pointer — assembly code.
4. Initialise PLLs, clocks, pads, UART hardware-block — C code.
5. Write some payload code — write to and read from UART.
6. Compile and link our code into a binary file.
7. Make our binary acceptable to our SoC as boot-code.
8. Put our boot-block in the proper place.
9. Configure the board for booting from the media we need.
10. Power on and play!

Hands-on code

Let's get it started.

1. This is bare metal — we have no bootloader and no kernel behind us. Thus, there are some routines we have to do in assembly language prior to any C code, so we will have assembly files. Two, actually. They could be combined into a single .s-file, but we will stick to tradition. These files are (by their common names on 64-bit platforms) start_64.s and crt0_64.s. But what exactly should be done in assembly on bare metal? The answer: the things we can't do in C. We can access registers of hardware-blocks via volatile pointers, since that is done through so-called memory-mapped input/output. But we can't access ALU registers directly in C, so start_64.s contains the very first code that cannot be written in C — specifically, CPU start-up code: applying errata, switching modes and exception levels, setting interrupt vector tables, etc. Now let's go on with crt0. You are probably familiar with it (or have at least heard of it) — the so-called «C-runtime zero». In user-space this is usually an object file which the linker implicitly adds to our code before the file containing the main() function written in a high-level language. In user-space crt0 prepares argv and argc, which are passed to the developer's main() function, sets up pointers to the environment variables the code starts with, etc. As you can guess, this file is unnecessary if we are about to write in assembly language only. Its job is to prepare the environment (be it a user-space program or bare-metal code) for C code — in our case, to set the stack pointer to some valid address. After all necessary assembly routines are done, we can break out to C code. Today everything we've discussed in the previous post will be done in C code. The assembly files are for start-up code and remain our window into the ARMv8 architecture, our sandbox — one of our goals.

2. CPU start code. In this example we don't need to do anything here, but let's save some values for further research — (believe me) it'll be interesting to explore them. According to the ABI, x0 will be the first parameter of a called function, x1 the second, x2 the third. Let's save the initial program counter, stack pointer and exception level while they are intact — at the very beginning of our code — by storing them in the x0, x1 and x2 registers. Later you will see how they get into our main C function and are output to UART. So, start_64.s will look like:


.arch armv8-a

.globl _start

_start:
	# Store the address of the very first instruction. This will be the
	# address BROM puts our code to. According to ABI x0 will be the
	# first parameter of called function. Save current PC to x0:
	adr	x0, .

	# Store initial stack pointer address as the second parameter:
	mov	x1, sp

	# Save Current EL to x2, and it will get to third parameter:
	mrs	x2, CurrentEL

	# Branch to crt0's main:
	b	_main
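On the C side these three registers simply arrive as the first three parameters of the main C function. Here is a minimal sketch of the receiving end — the function name is the one from this post, and the decoding follows the standard ARMv8 layout of CurrentEL, which keeps the exception level in bits [3:2] (so EL3 arrives as the raw value Ch and has to be shifted before it prints as «3»):

```c
#include <stdint.h>

/* CurrentEL holds the exception level in bits [3:2] (ARMv8). */
static uint64_t decode_el(uint64_t current_el_raw)
{
    return (current_el_raw >> 2) & 0x3u;   /* 0..3 */
}

/* Sketch of the C entry: x0, x1, x2 from start_64.s land here per the ABI. */
void barium_main(uint64_t initial_pc, uint64_t initial_sp, uint64_t current_el)
{
    (void)initial_pc;                      /* later printed to UART */
    (void)initial_sp;
    (void)decode_el(current_el);
}
```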

3. In our case there is only one thing we have to do in crt0 to get prepared for C code — set the stack pointer. The stack pointer is set by simply writing a memory address value to the SP register. We want our x0, x1 and x2 registers to pass through this file to the C function, so we avoid using them here. So, crt0_64.s will look like:


.arch armv8-a

.globl _main

_main:
	# Set stack pointer at the top of OCRAM 97FFFFh (p.706):
	# Move it to x3 without last Fh (aligned to 16):
	mov	x3, 0xFFF0
	movk	x3, 0x0097, LSL #16

	# Set stack pointer:
	mov	sp, x3

	# Finally, break out to C-code:
	b	barium_main
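A side note on the mov/movk pair: mov loads the low 16 bits, and movk inserts another 16-bit chunk at the given shift while keeping the rest of the register. A quick host-side check of that arithmetic (just a sketch verifying the composed value, not target code):

```c
#include <stdint.h>

/* Emulates: mov x3, 0xFFF0 ; movk x3, 0x0097, LSL #16 */
static uint64_t compose_sp(void)
{
    uint64_t x3 = 0xFFF0u;                                   /* mov: low 16 bits */
    x3 = (x3 & ~0xFFFF0000ull) | ((uint64_t)0x0097 << 16);   /* movk LSL #16 */
    return x3;
}
```

The result is 97FFF0h — the OCRAM top without the last Fh, i.e. 16-byte aligned as the comment in crt0_64.s requires.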

4. After that, we have to initialise PLLs, clocks, pads and the UART controller. I believe the way this is done on our SoC was explained well enough in the previous post, so I will not describe it further here. You will see it in the code.

5. Payload code. Same as p. 4 — since we have initialised all the hardware and have some high-level functions, the code becomes self-explanatory. The only thing I want to explain is uart_chars_counter and UART_FIFO_MAX. On each output we increment a counter, and when it reaches the maximum length of the FIFO we wait for the actual transfer to finish by polling the TXDC bit in the UART Status register.
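A minimal sketch of that counter logic, with assumed names: in the real code the two pointers are volatile pointers into the UART2 MMIO window (TX data register and the TXDC bit of the status register), and the FIFO depth here is an assumption — take the real one from the datasheet:

```c
#include <stdint.h>

#define UART_FIFO_MAX 32u   /* assumed TX FIFO depth; check the RM for UART2 */

/* Bytes pushed since we last waited for the FIFO to drain. */
static unsigned uart_chars_counter = 0;

/* txd: TX data register; stat_txdc: reads non-zero when TXDC is set. */
static void uart_putc(volatile uint32_t *txd,
                      const volatile uint32_t *stat_txdc, char c)
{
    *txd = (uint32_t)c;                    /* push byte into the TX FIFO */
    if (++uart_chars_counter >= UART_FIFO_MAX) {
        while (!*stat_txdc)                /* wait for transfer complete */
            ;
        uart_chars_counter = 0;
    }
}
```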

6. Compile and link. Compilation, and even cross-compilation, is nothing new — we compile separate source files, link them into a single ELF file, then dump its pure code into a binary file. You'll see a linker script in this repository. Actually, you can build the project without it — I've used it just to remove some unneeded sections that GCC adds to the binary. But we should bear in mind that this is bare metal. That means the machine starts from the very first instruction it finds in our code: BootROM branches to the address it loads our application to. Thus, we have to put our start function in the very first place. This is done by respecting the order of object files when linking — we put the file containing our start code first, and the function we want the SoC to start with as the first function in that file.

7. Making the binary acceptable to a SoC is a very important step, but I'll not bother you with all the details. I've made a host tool named mkbb_imx8, derived from NXP's mkimage_imx8. It gathers some information, fills the necessary structures and writes them into a proprietary header, adding it to the specified binary. The only interesting thing here is the address to load the image to. This field tells BootROM where to put the loaded application (and jump to). In my tool it is defined as DEFAULT_LOAD_ADDR in the source code, or can be passed as a third, optional parameter; the command-line parameter takes priority over the define. If you ever need to develop such a tool for a particular SoC, you can follow the same path I did. No tricks here — all you need is to find the structure of the proprietary header for your specific SoC. Usually this information can be found in the vendor's BSP build system — a host tool, often named mkimage_XXX. You will find the mkbb_imx8 code in a separate repository (https://gitlab.com/daftsoft/mkbb_imx8).

8. The result of mkbb_imx8 is a binary file which will be accepted by the iMX8MP BootROM. The last thing we have to do on our host machine is to put our boot-block in the proper place. iMX8MP expects it on the media at a 32 kB offset. Thus, we will write our application to the SD-card with this command:


dd if=./barium of=/dev/sdX bs=1k seek=32 ; sync

I said it would be less than 1 kB in size. The code (with its data) is 949 bytes, and it came out to 1013 bytes with the iMX headers. Maybe we've made one of the most beautiful kilobytes in the world.

9. If you are reading this, you probably have some board (proprietary or a kit) with the iMX8MP SoC. If so, you may already have it configured to boot from SD-card, or already know how to do it. If not, you need to configure the board to boot from SD-card. From the iMX8MP SoC perspective, boot device selection is described in section 6.1.5 «Boot devices (internal boot)» (p. 713). Refer to your product manual to see how it is done on your particular board.

10. Insert the SD-card into your board, connect your UART converter to the board's UART2 interface, start your favourite serial communication program and power on (or reset) the board. You'll see:


Barium No-Boot (iMX8MP)
Build: 18:00:00, Feb 12 2025
Initial PC: 0000000000920000
BootROM SP: 0000000000916ED0
Current EL: 0000000000000003

What about the naming? «Barium» stands for «bare», because barium, besides its consonance with «bare», is a metal element — thus «bare-metal» becomes «Barium». «No-Boot» means it does not boot a kernel or anything else at all.

Well, what do we see here? First, the initial PC is 920000h. This is our DEFAULT_LOAD_ADDR (in my Makefile the address is passed via a command-line argument). You can play with it, moving it here and there according to the OCRAM memory map (p. 707). The datasheet claims that the OCRAM free area starts at 918000h, but the lowest address I could load the application to is 920000h — BootROM refuses to put it at 918000h.

But what is more interesting is the stack pointer BootROM leaves for us. It does not allow placing our application in its reserved area, but leaves SP at 916ED0h. This can be used — we can leave SP intact and have a stack from 916ED0h down to 900000h, which is plenty of space and is restricted from our code by BootROM anyway.

And we see that we run at the third exception level, EL3 — the highest one. We have carte blanche to do whatever we want on this machine.

What conclusion can be drawn from this? The First Platform Loader (BootROM) of iMX8MP leaves the SoC in a condition where we can omit any initialisation that is usually done in assembly language — as we mentioned earlier, the stack pointer is already set to a valid address. Since setting the stack pointer is the only crucial thing we have to do in assembly for such a simple example, we can skip this step by excluding all .s-files. Nothing else needs to change — just don't forget to put the main .c-file (in our case barium.c) first in your objects list, and the main function (in our case barium_main()) first in that file. Keep in mind — this is bare metal, and the first function in your code is the entry point regardless of what the ENTRY() linker directive points to. (You won't even see it in my .lds-file.) Although our goal was the exact opposite, the whole bare-metal project can work without a single .s-file on iMX8MP! You can try it. Thus, we can call iMX8MP a very high-level SoC, or some kind of «C-SoC».

You can clone the final repository from https://gitlab.com/daftsoft/barium-no-boot

That's all. We finally have a good window for learning ARMv8 aka ARM64 aka AArch64 on real hardware — we have assembly and C files to play with, a build system for the project (Makefile and linker script), a host tool to make a proper boot image, and some kind of debug interface — UART. That looks like a sufficient setup for further study.

16/01/2025

The real «Hello World» from embedder (2): practice, prepare

Between theory and code

In the previous post we reviewed bare-metal development: what it is, its theory and workflow, what information we need and where to get it, its profits, bottlenecks and limitations. How a SoC is organised and how it works, most of us somehow know from theory (school or university) and/or practice (working at a high level or via some HAL). The next post will describe the process well known to all of us — coding. But what lies between those areas of theory and practice? What is between knowing how PLLs and clocks work, knowing that UART is configured by writing some values to some registers, and the process of writing device-tree nodes and calling HAL functions? How does the magic of making a certain SoC function actually arise? This post describes that process of preparing to code — the practice of getting information, gathering it and planning the job. Excuse me for not feeling sorry for you — not a single line of code will be written in this post, but I will describe this (middle) part of the job down to the single bit, to make it clear enough how it is done.

The plan

The plan we need to carry out to reach our goals looks like:
1. Choose the SoC we will work on and get its documentation.
2. Find out the condition BROM leaves the SoC in, and see what is initialised for us and what is not.
3. Find out what exactly PLLs and clocks we have to configure to start up hardware-blocks.
4. Configure UART TX and RX pads.
5. Configure UART hardware-block.
6. Output some string.
7. Finish with infinite loop outputting characters received from UART.
8. Make proper boot image and put it in the place BROM of our SoC expects it to be.

The practice

Let's get it started.

1. We will work on the NXP i.MX 8M Plus (iMX8MP). This is a multi-core (mine is quad-core) ARM Cortex-A53 (ARMv8-A) SoC (with an additional Cortex-M7 core) — just the kind of thing we need and interesting to play with. Remember and bear in mind what we said in the previous post — bare-metal development is strongly tied to a certain SoC. Thus, if you are about to develop a stand-alone application for any other SoC, the practice we will do in this post is not for you; you can read it as an example of the workflow only. Once we have chosen the SoC, we download the datasheet describing it. In our case it is the «i.MX 8M Plus Applications Processor Reference Manual» (IMX8MPRM.pdf; I have Rev. 3, 08/2024).

2. To find out the condition BROM leaves the SoC in, we look for the section which describes what BROM enables and what it does not. This is section 6.1.4.2 «Boot block activation» (p. 706). This section states that the BROM of iMX8MP activates (in addition to some others) these blocks: the Clock Control Module (CCM) and the Input/Output Multiplexer Controller (IOMUXC). We will boot from SD-card, so the Ultra-Secure Digital Host Controller (USDHC) will be enabled as well (but this is obvious). Let's proceed to the clocks BROM has initialised for us. In the next section, 6.1.4.3 «Clocks at boot time» (p. 707), Table 6-3 «PLL setting by ROM», we see which PLLs are enabled: ARM PLL at 1 GHz, System PLL1 at 800 MHz and System PLL2 at 1 GHz. Let's remember the PLLs we have enabled: System PLL1 and System PLL2. We don't care about ARM PLL at this point, as it clocks the ALU only and is already enabled and configured. Proceed to Table 6-6 «CCGR setting by ROM» (p. 708) to see which clocks are enabled and which are not. Scrolling down to the UARTs (p. 710), we see that BROM enables none of them. iMX8MP boards usually use the second UART for debug, so let's remember that its clock gate number is 74.
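As a preview of what «clock gate number 74» will mean in code: on the i.MX8M family the CCGR registers form a table indexed by that number. A hedged host-side sketch of the address arithmetic — the CCM base address (30380000h) and the table base and stride (4000h, 10h) are my assumptions to verify against the memory map in the RM, not values quoted above:

```c
#include <stdint.h>

#define CCM_BASE   0x30380000u  /* assumed iMX8MP CCM base; verify in the RM */
#define CCGR_UART2 74u          /* gate number remembered from Table 6-6 */

/* Assumed layout: CCGRn at CCM_BASE + 0x4000 + n * 0x10. */
static uint32_t ccgr_addr(uint32_t n)
{
    return CCM_BASE + 0x4000u + n * 0x10u;
}
```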

3. From the previous paragraph we see that none of the UART hardware-blocks is enabled by BROM, so we have to find out how to enable one ourselves. Let's start with the CCM structure, described in section 5.1 «Clock Control Module (CCM)» (p. 227). Looking at Figure 5-1 «CCM Block Diagram» (p. 228), we see that clock ticks pass from the clock generators (on the left side) through the PLLs (or bypass them), then into the CCM's Clock Root Generator with its clock slices; the clock slices form clock roots which, finally, come out to the hardware-blocks (on the right side). This scheme looks more complicated than the one we discussed in the previous post. The idea of clock roots becomes clearer if we look at 5.1.2 «Clock Root Selects» (p. 228). Let's scroll down to Slice Index №95 (p. 241); 95 is the slice of the UART2 Clock Root. In the «Source Select» column we see that it can be driven by a few outputs. We will drive our UART from SYSTEM_PLL2_DIV5 — let's remember its select value, 010b. As we already know, System PLL2 is enabled at 1 GHz. Here we need to configure its outputs — ensure its DIV5 output (1 GHz divided by 5 is 200 MHz — we'll need this value later) is enabled. This is done by configuring the System PLL2 General Function Control Register, described in section 5.1.8.32 «SYS PLL2 General Function Control Register» (p. 509): we will set all the PLL_DIVx_CLKE bits and PLL_CLKE. The address of this register is ANAMIX base + 104h. After we have enabled the PLL outputs, we have to select the proper clock root for the UART2 hardware-block. This is done by configuring CCM_TARGET_ROOT №95, described in section 5.1.7.10 «Target Register (CCM_TARGET_ROOTn)» (p. 412): we need to set the enable bit (bit 28) to 1 and MUX (bits 24-26) to the value we remembered earlier, 010b. The address of this register is CCM base + 8000h + 95 (the slice index we need) * 80h. The resulting value we have to write to the register is 12000000h.
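The address and value arithmetic above can be sketched and checked on the host. The bit layout is the one from section 5.1.7.10; the CCM base address (30380000h) is an assumption of mine — verify it against the memory map:

```c
#include <stdint.h>

#define CCM_BASE 0x30380000u     /* assumed iMX8MP CCM base; check the RM */

/* CCM_TARGET_ROOTn lives at CCM_BASE + 0x8000 + n * 0x80. */
static uint32_t target_root_addr(uint32_t slice)
{
    return CCM_BASE + 0x8000u + slice * 0x80u;
}

/* Bit 28 = ENABLE, bits 26:24 = MUX (source select). */
static uint32_t target_root_val(uint32_t mux)
{
    return (1u << 28) | ((mux & 0x7u) << 24);
}
```

With mux = 010b this reproduces the 12000000h value derived in the text.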

4. Well, PLLs and clocks are configured and enabled. Now we have to find out how to configure the UART pads. First, let's set the proper AF for our UART. Alternative functions are described in table 8.1.1.1 «Muxing Options» (p. 1287); let's scroll to UART2 (p. 1307). Here we see that the UART2_RX port can be routed to one of these pads: UART2_RXD, SD2_DATA0, SD1_DATA3 and SAI3_TXFS — the first one is what we need. The UART2_TX port can be routed to one of these pads: UART2_TXD, SD2_DATA1, SD1_DATA2 and SAI3_TXC — again, the first one is what we need. On both UART2_TXD and UART2_RXD this mode is called ALT0. Let's proceed to section 8.2.4 «IOMUXC Memory Map/Register Definition» (p. 1344). In this table we need to find our UART2_RXD and UART2_TXD; they are at the bottom of page 1350 and at the top of page 1351 correspondingly. Here we see that their reset values are both 5h (remember that value for a while) and their absolute addresses are 30330228h for UART2_RXD and 3033022Ch for UART2_TXD. Then click on the link in the right column. From sections 8.2.4.134 «SW_MUX_CTL_PAD_UART2_RXD SW MUX Control Register» (p. 1540) and 8.2.4.135 «SW_MUX_CTL_PAD_UART2_TXD SW MUX Control Register» (p. 1542) we see that MUX_MODE is represented by the lowest 3 bits of these registers. We also see that 5h (the value we remembered recently) corresponds to 101b. That means both pads we need are routed to functions we don't need — GPIO5 24 and GPIO5 25 in this case. Thus, we have to configure those pads for our needs — set both to zero (ALT0). To achieve that, we write zeroes to 30330228h and 3033022Ch to set the proper alternative functions for those pads. But that's not all we have to do to make the UART pads function correctly. In addition to setting the AF, we need to configure the physical parameters of those pads. This is done by setting two SW_PAD_CTL registers: the UART2 RXD pad control register, section 8.2.4.286 «SW_PAD_CTL_PAD_UART2_RXD SW PAD Control Register» (p. 1837), and the UART2 TXD pad control register, section 8.2.4.287 «SW_PAD_CTL_PAD_UART2_TXD SW PAD Control Register» (p. 1839). After inspecting these descriptions, we conclude that zero is a good value for both of them. And there is one last step. UART RX is a little special because it works as an input. Thus, we need to select the input for it. This is done by configuring the DAISY register, presented in section 8.2.4.376 «UART2_UART_RXD_MUX_SELECT_INPUT DAISY Register» (p. 1922). Here we see that 110b «SELECT_UART2_RXD_ALT0 — Selecting Pad: UART2_RXD for Mode: ALT0» (p. 1923) is what we need. Thus, we'll write 6h to 303305F0h.
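Collected together, the pad setup from this step boils down to three register writes. A sketch of the resulting table — the addresses and values are exactly the ones derived above (in the real code each entry is written through a volatile pointer):

```c
#include <stdint.h>

struct reg_write { uint32_t addr; uint32_t val; };

/* The three IOMUXC writes for the UART2 pads. */
static const struct reg_write uart2_pad_cfg[] = {
    { 0x30330228u, 0x0u },  /* SW_MUX_CTL_PAD_UART2_RXD: MUX_MODE = ALT0 */
    { 0x3033022Cu, 0x0u },  /* SW_MUX_CTL_PAD_UART2_TXD: MUX_MODE = ALT0 */
    { 0x303305F0u, 0x6u },  /* UART2 RXD DAISY: select pad UART2_RXD, ALT0 */
};
```

(The two SW_PAD_CTL registers are left at zero here, since zero was found to be a good value for both.)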

5. The last thing left is to configure the UART hardware-block itself. This is done by familiar steps: reading register values (optional), modifying those values (optional) or forming them from scratch, writing the values to the registers, and waiting for condition flags (optional). UART is a simple hardware-block, so I will not explain the specific process of configuring it — you will see it in the code, which will be presented in the next post.
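The «read, modify, write» steps mentioned here follow the usual MMIO read-modify-write pattern. A generic sketch, not tied to any particular UART register:

```c
#include <stdint.h>

/* Clear the bits in `clear`, then set the bits in `set`, in one register. */
static inline void mmio_update(volatile uint32_t *reg, uint32_t clear, uint32_t set)
{
    uint32_t v = *reg;     /* read  */
    v &= ~clear;           /* modify: drop the old field bits */
    v |= set;              /* modify: insert the new field value */
    *reg = v;              /* write back */
}
```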

6. After UART is configured and running, the game starts. Now we are prepared and ready to output some strings. This is also done by writing a byte to some address (a UART register) and checking the TX-empty flag to avoid buffer overrun. I'll skip the detailed description of this process too — see it in the code.

7. We will finish with an infinite loop outputting characters received from UART. This is done by polling the RX-empty flag and reading the received byte (from a UART register) when the flag becomes unset.

8. To make our application load and run on the SoC we've chosen, we have to prepare a proper boot image and put it in the place the BROM of our SoC expects it to be. This is done by a tool (mkbb_imx8) derived from NXP's standard mkimage_imx8. I will not explain how it works or how it was developed, but I will show how to use it to generate the boot block (and how and where to place it) for our SoC in the next post.

Conclusion

We've made it. We have gathered all the information we need to start writing code for the iMX8MP SoC and are ready to proceed. And now you know what lies between the theory and the daily routine of an embedded developer. In the next post we will develop the application — we will write assembly and C code, compile it, link the objects into a binary, make a bootable image of it, and put it in the right place on the storage media. It'll be a very small program that will run on the iMX8MP and on this SoC only. But it will give us a platform for learning the ARM64 machine. You'll be able to play with the ARMv8 machine from the ground up, both in assembly language and in C — start (kick) its cores or not, switch exception levels or not, output register values, and so on. Finally, we'll have a wide-open window to the ARM64 machine!

17/12/2024

The real «Hello World» from embedder (1): preface, theory

Preface

Any learning process eventually comes down to examples and practice. In the case of software development, practice starts with the so-called «Hello World». «Hello World» is the very first code every software developer on every platform, API, or framework writes in his very first days or steps. Embedded software engineers are no exception.

There are a lot of «Hello Worlds» out there on the Internet and in books, in all kinds of programming languages, APIs, and frameworks. But the vast majority of them are in pure software — Java, JS, C++, etc. Thus, embedded software developers — those of them who want to study machine architecture, get in touch and play with it — suffer from a lack of examples and have no starting point.

You can argue with that — there are examples in assembly language for all types of architectures, and assembly language should let us study machine architecture and its behaviour. Yes, there are a lot of examples in assembly language, but those examples are for very high (let's say — top) levels of the runtime environment — Linux and Mac user-space. This approach to learning machine architecture doesn't lead to an understanding of how it works and how it is organised.

Actually, it gives us a very narrow slit onto the machine architecture. Even within kernel-space, we are very limited in our ability to learn the machine. That's because all the hardware initialisation is already done for (and before) us by the bootloader and the kernel itself. The second problem is the so-called «conceptions of operating systems». Those «conceptions» hide a lot of the machine architecture from us and add a lot of software on top. Thus, working even in kernel-space, we learn more of those «conceptions» and principles than the machine itself.

Let's go on and discuss our last chance — MCUs, which are cheap, popular, easy to start up and do let us get as close to the machine as possible. MCUs have always had a crucial difference from CPUs and SoCs — they have no MMU and usually are single-core ALUs. After the beginning of the ARM64 era — the times when not only phones and tablets but also desktops (and even servers) and laptops run on ARM64 — MCUs got yet another drawback: ARM32 (the most common architecture among MCUs worth anything at all) became outdated even as a single-core platform for studying. We now live in days when it's almost useless to learn ARM32; it's rather a waste of time, because the difference between the actual, almost omnipresent ARM64 and the rapidly obsolescing ARM32 is enormous.

Thus, we come to the point where we need a comfortable, big enough window (let's say — better a door, or a nice gate) onto a machine that represents the modern ARM64 architecture. Usually this is done by programming on an emulator. Developing for an emulator is like a game because of its abstraction: games (and gaming) give you minimal or even zero risk at the price of insignificant wins. To get real, valuable practice and results, we need a real «Hello World» on real hardware.

Learning a machine architecture (and its behaviour) is done by so-called «Bare Metal» programming. «Bare» means that you work on a machine without an OS, without a kernel, and even without a bootloader. Actually, you develop code that works instead of a bootloader, or that runs where and when the very first part of the user bootloader (SPL — Second Platform Loader) usually does. «Metal» stands for hardware, or the machine, and means that you work on real hardware — not a virtual machine or an emulator.

In these posts we'll cut a good window for studying the modern ARM64 machine. This will be the «Hello World» from an embedded engineer and a nice toy for an embedded developer.

Theory

Let's start by summarising what we already know about a computing machine, specifically about its heart — the CPU or SoC. We know that a SoC provides a set of functions (functionalities); each functionality is provided by a corresponding, separate hardware-block. Thus, a SoC consists of hardware-blocks, or represents a set of hardware-blocks. Each hardware-block is driven by a clock; each clock is derived from a PLL; each PLL is driven by an oscillator. So, turning a hardware-block on, besides powering it, is just enabling the clock it is connected to. Actually, you will not find an element called «clock» on your board. A «clock» as an element is just a conception of embedded developers; really, a «clock» is one of the outputs of a PLL, which can be enabled or disabled by software. The real clocking sources are oscillators and PLLs.

But enabling the clock is not enough to start using the functionality of a hardware-block. In most cases there are two more things. First: usually SoCs provide more functionalities than they have pins — or, in terms of embedded software, pads. This leads to the so-called «alternative function» (AF) conception. It means that some pads can be configured to a specific (alternative) function — I2C SDA or UART TX, for example. Configuring AFs is done by the Input/Output Multiplexing Controller (IOMUXC). IOMUXC is a built-in hardware-block, so it has to be turned on (clocked) too, like other hardware-blocks. From the hardware point of view, IOMUXC simply routes signals from the SoC's balls to the specified hardware-block inputs/outputs — to the I2C bus controller or to the UART controller in the example above. In other words, IOMUXC switches signals between hardware-blocks on one end and the SoC's balls on the other. The second thing is the configuration of the hardware-block itself. This is done by code. After a hardware-block is clocked (enabled), it stays in some default condition; we need to write specific values to specific registers to make it function according to our needs.

Clocks for hardware-blocks and the corresponding PLLs are enabled by code. Hardware-blocks are configured by code too. But no program can run on a non-clocked CPU. So who starts the main PLLs and clocks that drive the ALU, and how? This functionality is hardcoded into the so-called BootROM, or BROM. BROM is microcode of the SoC itself; it is located inside the SoC. BROM initialises the essential hardware minimum required to start user machine code: it starts up a minimal set of hardware-blocks and after that performs as the FPL (see below). BROMs differ greatly in functionality — some of them just start code from built-in memory, not even loading it into RAM (like the BROMs of MCUs do); some initialise a bunch of hardware-blocks and the DDRC, and can even operate filesystems (like Raspberry's Broadcom does). The most common condition BROM leaves a SoC in is: one ALU core started, SRAM (or OCRAM — On-Chip RAM) initialised, and one of the boot sources initialised. Initialisation of the rest of the functionality is left to user code.

Any SoC uses its own proprietary FPL (First Platform Loader). The FPL is the part of BROM we've discussed earlier. To run our code on a SoC we need a so-called «boot image» made out of our code; any FPL expects it in a specific format and in a specific place. Thus, after we've compiled, linked, and stripped our source code into pure machine code, we have to pack it to comply with the specific requirements of the particular SoC and its BROM. This is done by a tool usually named «mkimage», sometimes something like «mkimage_XXX», where XXX is the name of the SoC or of a family of SoCs. Usually mkimage adds some headers, which BROM reads, and sometimes CRCs, to the machine code; BROM uses this information to verify the boot image. Then we have to put this image in a particular place — usually on an SD-card or eMMC at some offset from zero. The offset is needed to make the bootable media usable as storage media too — it leaves space for the MBR and the filesystem table.

This is what we have on the hardware side. Let's proceed to software. There's not too much to do and discuss here. We can work in assembly language forever — this is interesting and may be very useful — but at some point we want to dive out into C code. The machine does not need the conceptions humans use, but writing in C requires one, called the «stack». We have to initialise it — that's all we need to know at the moment. Initialisation of the stack is done by setting the stack pointer register. The stack grows down, so we have to choose an address for it with that behaviour in mind — to prevent it from touching the bottom of RAM and destroying our application. And, of course, we can't set it higher than the top of accessible RAM.

Now let's proceed to our specific task. Let's say we want to make our SBC output a «Hello World». How can we do that? Let's admit that outputting (drawing) strings on a display is a bit too complicated a task for a bare-metal beginner. So we'll output our «Hello World» via the most common debug interface in the world — UART. UART functionality is provided by a particular hardware-block, so we need to start clocking that hardware-block, set its pads and configure it.

First, to enable a clock for UART we have to know which clock exactly we need to enable on our exact SoC. As we mentioned earlier, every clock is derived from a particular PLL; thus, we need to find out which PLL provides the clock we need. All this information is presented in the datasheet and is rigidly tied to a certain SoC. The second part is to set the AF — to configure the pads for UART, its TX and RX. Here we need to turn on the IOMUXC (PLL, clock) and configure the pads we need to the functions we need. And the last part — configuration — is done by writing values to memory locations mapped to the addresses of hardware-blocks: IOMUXC (pads) and UART (baud rate, parity, etc.) in our case. What configuration must be written is described in the datasheet and is rigidly tied to a certain SoC.

The theory ends at this point. In the next part we'll choose a specific SoC and gather the information needed to design the «Hello World» for it. From this information we will form the plan of jobs for the third part.

28/07/2024

Kernel-Space: asm _ko Hello!

Long time no see, confreres.

This time we will do some crazy (or maybe even mad) things.

As we all know, the Linux kernel is cross-platform software. It supports a very large variety of hardware platforms; thus it’s written in the most “inter-platform” language in the world — plain C. The secret of this cross-platformness is the compiler used to build the kernel itself and its modules. The exceptions are small parts that provide the hardware-specific functionality that cannot be written in C — SoC/CPU start-up and configuration routines. These small parts of the kernel are written in assembly languages.


But today I’ll present a skeleton of a whole kernel module written in assembly language — not a single .c-file is used. I had this idea for a long time; I googled many times for examples, or at least a starting point or a discussion. But I had no luck — no examples for ARM, x86, or any other architecture. This fact proves the craziness of my idea (and possibly explains why you should not do this in your practice). But I’ve managed to do it.


This module’s functionality will be limited to a kind of “Hello World” — it can be loaded and unloaded correctly and does only one thing — prints messages on these events. I’ve mentioned that no .c-files are used in this module, but it looks like a usual kernel module (except for the language it is written in) and is built and works like any other module — thus it is regular code which you can work with as your everyday routine — no magic and no discomfort. We will talk about ARM64 (or AArch64), but it can easily be rewritten for any other architecture.


I said that there is no magic in this module, but… we know that any object file must have certain sections and information, and has to be built — compiled and linked — according to strict rules. A kernel module is no exception. That was the “magic” I had to reveal to achieve the goal of my idea — a regular-looking and developer-friendly kernel module in assembly language.


There is a lot of “magic” in the kernel build system. But my idea was to write a template of a kernel module that will look, feel, act and work like a regular one, as a regular part of the kernel source tree. Thus today we will not talk about the kernel’s build system or the differences between linking user-space and kernel-space objects — but only about the template of a kernel module in assembly language (and the Makefile for it, of course). Building — compiling and linking — will be done by the standard kernel build system.


A kernel module project consists of two parts — a Makefile and source code. Let's start with the Makefile. Everything here is clear enough — you just set the source file name to your assembly file and set the obj-m variable. That's all — the rest of the job will be done by the kernel build system. Here is our Makefile:


ifeq ($(KERNEL_SRC),)
$(error Specify KERNEL_SRC directory)
endif

export ARCH := arm64
export CROSS_COMPILE ?= aarch64-linux-gnu-
PWD := $(shell pwd)

PROJECT_NAME := asm_ko_hello
$(PROJECT_NAME)-src := $(PROJECT_NAME).S
obj-m += $(PROJECT_NAME).o
AFLAGS_$(PROJECT_NAME).o := -DPROJECT_NAME=$(PROJECT_NAME)

all:
	$(MAKE) -C $(KERNEL_SRC) M=$(PWD) modules

clean:
	$(MAKE) -C $(KERNEL_SRC) M=$(PWD) clean


Now, let's have a look at asm_ko_hello.S:

#include "linux/kern_levels.h"

Here you can see something familiar from kernel modules you have worked with, and you may guess that we will have the standard output levels — and you are right: all the standard KERN_XXXX output levels are available in our assembly code.

#if !defined (PROJECT_NAME)
	#error You must define project name for this template. Stopping build.
#endif

#define MAKE_FN_NAME(x, y) x##_##y
#define FN_NAME(project, func) MAKE_FN_NAME(project, func)


The above lines serve the project's template and are not related to today's topic.

.section .text
FN_NAME(PROJECT_NAME, init):
	stp x29, x30, [sp, -16]!
	adrp x0, .loaded
	mov x29, sp
	add x0, x0, :lo12:.loaded
	bl _printk
	mov w0, 0
	ldp x29, x30, [sp], 16
	ret

FN_NAME(PROJECT_NAME, exit):
	stp x29, x30, [sp, -16]!
	adrp x0, .unloaded
	mov x29, sp
	add x0, x0, :lo12:.unloaded
	bl _printk
	ldp x29, x30, [sp], 16
	ret


This is the code of a common "Hello World" kernel module. It is put into the standard .text section. The FN_NAME macro produces the function names, followed by two function bodies in usual ARM64 assembly. As you can see, we preserve registers, load the string address and call (in ARM assembly it's called "branch with link") the printk() function, restore registers and, in the case of _init(), return a value (which actually is an abstraction).

The code looks clear and familiar, but this is a kernel module and there is a small difference from a user-space program. We should let the system know where the entry and exit points (the load/unload functions) of our module are. In C this is done by two macros, module_init() and module_exit(). In our case it would look like:

module_init(asm_ko_init);
module_exit(asm_ko_exit);


But how should we specify these functions in assembly language? What's hidden under those macros? Actually, nothing too complicated. We just need to declare global symbols init_module and cleanup_module. To give them a proper payload (a symbol by itself is just a name in the object file), we make them aliases of our _init() and _exit() functions with the .set directive. The whole part of this code is in the snippet below:

.global init_module
.global cleanup_module
.set init_module, FN_NAME(PROJECT_NAME, init)
.set cleanup_module, FN_NAME(PROJECT_NAME, exit)


We can't omit the .data section. This one is absolutely standard. Here is our section with the strings used for output:

.section .data
.unloaded:
	.string KERN_INFO MODULE_NAME": unloaded.\n"

.loaded:
	.string KERN_INFO MODULE_NAME": successfully loaded.\n"


There is something without which the kernel will not accept our module. It is new for developers working in user-space and familiar to kernel-space developers. For the Linux kernel we have to specify one mandatory parameter that cannot be omitted — the license. In C it is done by the MODULE_LICENSE() macro and would look like:

MODULE_LICENSE("GPL");


Let's see how it is done in assembly language. Maybe you expected something serious here — special codes or sequences. But it's easy too — this is just the .modinfo section containing information about the module in a very simple (and unexpectedly ridiculous) format. See the self-explanatory snippet below:

#define MODULE_NAME	"Kernel Module in Assembly"
#define MODULE_VER	"1.0"
#define MODULE_AUTHOR	"Timofey Chernigovskiy, 2024"

.section .modinfo, "a"
	.string "author=" MODULE_AUTHOR
	.string "version=" MODULE_VER
	.string "description=" MODULE_NAME
	.string "license=GPL"


That's all about the code. The link is here. According to our Makefile, if your cross-compiler is aarch64-linux-gnu-, you build the module as follows:

make KERNEL_SRC="PATH/TO/YOUR/KERNEL/lib/modules/VERSION/build" clean all


Test it:

modinfo ./asm_ko_hello.ko  
filename:       ./asm_ko_hello.ko 
author:         Timofey Chernigovskiy, 2024
version:        1.0
description:    Kernel Module in Assembly
license:        GPL
srcversion:     48259120F9222D4D7B9D8E7
depends:         
name:           asm_ko_hello
vermagic:       6.1.55 SMP preempt mod_unload modversions aarch64

insmod ./asm_ko_hello.ko  
Kernel Module in Assembly: successfully loaded. 

lsmod 
Module                  Size  Used by 
asm_ko_hello           16384  0 

rmmod asm_ko_hello
Kernel Module in Assembly: unloaded.



P.S.
High-level programming languages do a lot of the job for the developer and prevent a lot of mistakes. Actually not so much prevent as disallow — you simply don't have the tools to do most of the dumb things. Writing in assembly language in user-space is a risk, but it's a funny walk compared to kernel-space. You have to be extremely cautious working in assembly language in kernel-space, because any typo of yours will be compiled and executed. For example, a slight shift of the stack pointer may lead to huge troubles — a broken file system is one such case. You have been warned — it's your decision how hard you want to play.

18/10/2022

Code optimization (1): On function calls

Recently I saw the following code-refactoring recommendation, aimed at optimization:

Original variant:

    return pow (base, exp/2) * pow (base, exp/2);

Optimized variant:

    let result = pow (base, exp/2); return result * result;

The book claimed that the result of this refactoring would be a speed-up of the code fragment shown. But why should such a refactoring make this code faster? Most authors/advisers/specialists do not explain the essence of their advice and give no understanding of why something should be done one way or another. If you wonder whether a speed-up really happened here and what produces it, read on — this note is for you. In it I will examine what happens on the machine in this situation and what the result comes from; I will explain how exactly the optimization works and give some sense of its "scale". The explanations in this note reveal the essence of this kind of optimization not only for this case, but in general. You will be able to apply it in many similar situations in your practice.

Judging by the syntax, the code fragment shown is written in JavaScript. I, however, will work with the common languages of computing — C, x86_64 assembly, and the standard x86_64 ABI.

To begin with, we see that the author, at the cost of adding one more variable, halved the number of calls to pow (). Let's look at what happens when pow () is called. Its prototype looks like this:

double pow (double x, double y);

From the prototype it is clear that the function takes floating-point parameters and returns a floating-point result, which means it will use the coprocessor (FPU).

Reducing the number of pow () calls really does yield an optimization, and here is why. Any function call incurs noticeable overhead. Let's look at what happens on the CPU when a function is called in general, and when pow () is called in particular. In parentheses next to the processor instructions is the approximate number of processor "cycles" (micro-operations) per instruction. We take an x86 machine as the example:

1. (If necessary) computing, or obtaining in some other way, all the input parameters passed to the function (in this example, evaluating the expression exp/2). Memory accesses (which are slow) and calls to other, nested functions are possible here. If there are any, walk through each of them starting from this very step.

2. "Laying out" the parameters computed or obtained at the previous step according to the ABI. After being computed, they may end up in "places" (memory, registers) that do not match the signature of the function in question. This is some number of mov instructions (~1).

3. Executing the call instruction (~10), which in turn saves the return address on the stack and then transfers the flow of execution to the (specified address of the) function in question.

4. pow () itself, like any other function, has a "preamble" (or "prologue"). In this section of code the general-purpose registers are saved by several push-family instructions, and stack space is reserved for the function's internal needs by a few manipulations of the stack pointer. Recall that all of this already happens once in our calling function.

5. Since pow () works with floating-point numbers, it must save the coprocessor's registers and state — the so-called "FPU environment", which includes the FPU stack and its status, control and tag registers, and the data and instruction pointers. The function must also reset the FPU state on entry, since it is unknown what state the FPU will be in by that moment. Fortunately we have simple instructions for saving the whole FPU environment — fsave/fnsave (120-150 cycles and more) — and finit/fninit (15-20 and more) for resetting its state. Both these and the FPU-environment restore instructions described below are rather heavy, as they deal with a fairly large amount of information — the whole coprocessor environment — and with memory (which, don't forget, is slow).

6. pow () probably has many checks and local-variable initializations in its implementation, which we don't need here.

7. The function's actual computation. When working with the FPU, an fwait instruction may be needed — waiting for the FPU operation to complete (on x86 the FPU runs in parallel with the CPU).

8. After finishing its work, the function needs to restore the FPU registers and state (which, again, is possibly already done in our function). This is the frstor instruction (~100).

9. At the end of the function there is an "epilogue", which includes restoring the stack state — some manipulations of the stack pointer — and restoring the CPU's general-purpose registers on exit — several pop-family instructions.

10. Finally, the ret instruction (15-25) is executed; it pops the return address off the stack and sets the processor's current-instruction pointer to that address. Thus, after ret completes, the x86 instruction-fetch unit starts loading and executing instructions from the address held in the current-instruction pointer.

On ARM the overhead is similar.

Thus we see that the main overhead, whose elimination yields the optimization in this example, lies in controlling the flow of execution and in skipping certain checks built into the called function (which, in this situation, are needed only to compute a value we already have).

But there is one more nuance. We never know how many times the operating system will switch our application's (here — our function's) context while it runs. And any modern operating system works on the principle of preemptive multitasking. Consequently, the shorter the instruction sequence required to obtain the result, the higher the probability that our code will be interrupted fewer times — our task will be split into fewer time quanta. In time-critical computations, where the result depends on external circumstances — for example, a stream of data arriving over the network — such an optimization can globally improve the application's results (or even save the code/algorithm from being declared unfit).

From this we can conclude that such an optimization is especially desirable in critical places — in the "top halves" of kernel-space (where it is better not to do any computation at all, but we are considering optimization through reducing the number of function calls in general) and in interrupt handlers on bare-metal/RTOS.

Now let's describe what happens in the optimized variant. Multiplying the number by itself, in this situation, without a function call, we write just one instruction from the fmul family, which in the simplified case (when used correctly) multiplies the two top FPU registers and puts the result into the topmost one — ST(0). (A note for accuracy: under the x86_64 SysV ABI a double return value actually travels in XMM0; the x87 ST(0) is used for long double. The idea stands either way: one multiply instruction leaves the result right where the ABI expects the return value.) And if we go further and write the code of these operations ourselves, we will arrange for the operands to already be in the two top FPU registers, in the right order, by the moment of this multiplication. That is the optimization at work — the processor working for us practically automatically.

Someone may object that the x86 machine has many optimizations of its own and that, thanks to them, many operations from the fragment in question will be executed in parallel, ahead of time, or skipped altogether. Yes, that is true. But why load the machine with extra operations when we can avoid them at code-writing time? Let it use its capabilities to optimize something else.

P.S.
For the reasons described above, I will warn you against a fairly obvious manoeuvre with this piece of code — namely, replacing the multiplication with yet another call to the same pow (), with result as the first parameter and two as the second. Multiplying a number by itself is squaring it. Mathematically and ideologically this may seem more correct or "prettier" to some, but technically the way we examined is the right one. Otherwise it would be "an example of bad code".

P.P.S.
On the other hand, reducing the number of machine operations by passing the function an exp/2 expression computed beforehand, outside, is a good option. First, no extra stack space will be taken inside the function (for this operation). Second, with multiple calls to such a function with the same value of some parameter, computation time shrinks — it will be computed once in the calling function and then reused. And if our exp/2 happens to equal two, it is even simpler to multiply base by itself twice and return that result.

05/01/2022

System concepts (1): BSS — long forgotten but ever-present

In this post we'll take a look at one interesting concept of modern operating systems — BSS. Maybe some of you have not heard of it at all; some may think of it as some sort of ancient thing and suppose it is not used these days. In the first part of this post we will examine the purpose it was invented for. In the second part we'll show how it is used ubiquitously these days, even if you don't know about it.

Historically, BSS stands for "Block Started by Symbol" or "Block Starting Symbol". But we will not delve into history, because these days the acronym's expansion is no longer meaningful.

Technically, BSS is a data section in an (object or executable) file. Since it is a section, you may suppose we can declare it in assembly language with the .section directive. That's right. Let's do it.

        .section .bss
        .lcomm var, 1 # 1000

        .global main
        .text
main:
        xorq %rax, %rax
        retq

By the way, modern assembly translators do not require the .section keyword — .bss alone is enough.

Let's explain what is done here. We declared a .bss section with a variable named var in it, with a parameter of 1 (which may look like a value, but it is not). Compile and look at the resulting a.out; its size in my case (Linux/GCC) is 16496 bytes. Now change the parameter 1 to, say, 1000. Compile and look at the size of a.out — it remains the same. "What kind of magic is that?!" It's "white" magic, and now it's time to explain the whole thing.

BSS was invented to save space on disk (or other storage, or network bandwidth). And yes, it really is that "ancient" an invention — it originates in the 1950s. You can use it when you don't need to specify the values of a storage area (variable), for example to declare buffers which you will write to later, at runtime.

You see that variables in .bss are declared in a different manner. There is no .globl directive — .lcomm/.comm is used instead to specify visibility: .lcomm stands for local (module-level) visibility, while .comm — for global (a sort of .globl). The parameter (1, 1000 in our case, or whatever you want) is the size of the buffer. How does it work? The linker writes the symbol with the variable's name, address and byte count into the resulting module. At runtime, .bss variables are expanded in memory to the specified size and initialized with zeroes (guaranteed on modern systems, though historically not universal).

The opposite way to declare a zero-filled array is to use the .fill directive on a regular variable in the .data section — in this case all the zeroes will be written to the resulting module, increasing its size. I'll omit this example here, but you may check it yourself:

        .data
    big_var:
	.fill 1000000

At this moment you may think: "This is a good idea. But could I use this knowledge to optimize my high-level-language programs?" I had the same thoughts and checked. The answer is "yes"; moreover — you already do this often. Here starts the second part of our little research.

The short recipe is — just declare your variables as global arrays and initialize them as { 0 }. Let's prove it:

    char lBSSVar[1] = { 0 };
    int main ()
    {
        return 0;
    }

Compile and check the resulting file size. On my machine a.out is 16496 bytes (the same size as the a.out I got from the assembly-language code). Now change the size of lBSSVar to 1000, recompile, and see that the size of a.out has not changed.

Let's check whether it is BSS or something else by examining the assembly code generated from our C code:

	.globl	lBSSVar
	.bss
	.align 32
	.type	lBSSVar, @object
	.size	lBSSVar, 1000
    lBSSVar:
	.zero	1000

We see in this listing (I've posted only the part we are interested in) that the compiler made a .bss section from our C code. The syntax differs from my raw assembly example, but the technical idea is the same.

P.S.
The idea of saving space in object files that I've demonstrated in this post is really old. But as we saw in the last listing, syntax may differ. For example, clang uses the .zerofill directive to describe uninitialized buffers. Thus different compilers may use different syntax. So, in my opinion, if you code your application in raw assembly language you can use the syntax I've shown in the first part of this post — it is a short way to use the BSS idea. The second part of this post shows how to declare buffers you don't need to initialize in high-level languages, saving some extra space on storage devices and a little loading time.

16/09/2021

A little system hack (2): making the processor execute a variable

In the previous article (A little system hack (1): at last you can modify char* String strings) we looked at how to write to a memory area where writing is forbidden. Today we will develop this topic and try to execute a variable. Let's start with the preparatory work. As last time, we will have to change the memory access mode so that the memory can be executed, i.e. so that a call or jmp to it is possible. To do this, declare a variable as follows:

uint8_t lMagicCode [] = { };

We'll get to its contents below. For now, let's set the access modes we need on this variable. Here we repeat the steps from the example in the previous article:

void* lPageBoundary = (void*) ((long) lMagicCode & ~(getpagesize () - 1));
mprotect (lPageBoundary, 1, PROT_EXEC | PROT_WRITE);

We added the PROT_EXEC access mode, which is what will let us execute the variable. As you've noticed, I kept the PROT_WRITE mode. This is because the MMU works with pages. On our machines the page size will most likely be 4kB, which is quite a lot, and the marked page will most likely also cover the area following the variable we are interested in. And we need read and write access for the scaffolding of our experiment. So, to avoid catching a SIGSEGV after the mprotect call, we set a combined mode — write (which on the x86 MMU never comes without read) and execute.

Next we need to somehow tell the machine how and when to execute this variable. Just in case, let me note: this variable will never execute by itself; the mprotect function does not run it, it only marks it as executable. Let's move to the last stage of the high-level (C-level) preparatory work, which is to declare a function referring to the address of this variable, which will later let us execute it.

unsigned int (*NewExit)(unsigned long _ExitCode) = (void*)lMagicCode;

Here we declared a pointer to a function named NewExit, taking one parameter _ExitCode and returning a value. Now we can execute our variable and get its return value with a simple line:

unsigned int lResult = NewExit (1);

Don't do this yet — trying to execute an empty variable, i.e. doing a call to the address of an empty variable, you will essentially "fall through" further (as in the previous article's string-output example), and there may be anything there — most likely data that doesn't even resemble a byte sequence forming a valid machine instruction with operands. Which will most likely lead to an Illegal instruction error.

Next we need to fill our variable with valid code. Let's start with, say, a return — i.e. prevent the execution flow from "falling through" when this function is called. Looking it up in the reference manuals, the opcode of ret is very simple: C3h.

uint8_t lMagicCode [] = { 0xC3 };

Actually this is retn — return near, i.e. a near, intra-segment return — a return within a single code segment.

Now we can execute our function without fearing undefined consequences, because the following will happen: the call instruction pushes onto the stack a return address equal to the address of the instruction immediately following it (the address of call plus the length of the call instruction and its operand), then jumps to the specified address (the address of our variable), which holds the opcode of ret, which in turn pops the return address off the stack and jumps to it.

printf ("The result of NewExit is: %d.\n", lResult);

For now we would see "garbage" as the function's result. To get something meaningful, let's return a value. By the rules of the x86_64 ABI, a value is returned from a function via a register of the AX "family". We declared our function as returning unsigned int, i.e. 32 bits. So we need to write the result into EAX — a 32-bit register. The opcode for loading a constant into EAX is B8h. When assembling the instruction we must take the operand size into account. Here we cannot just write 0xB8, 0x04 — we must spell out the whole value.

In high-level languages the compiler does this for us. GAS likewise emits padded values depending on the suffix of the instruction mnemonic: for example, movl $0x04, %eax produces the actual code B804000000h, padding our four with zeroes to the length of a long. This is not the long we talked about in memory models (LP64); it is a long from the x86 machine's point of view — 32 bits wide. Some translators even pick the specific opcode from a generic mnemonic and pad the operand according to the declared sizes of the destination and source.

Otherwise, if we don't spell out all the bytes making up the 32-bit value, the subsequent code that we write as the next instruction or operand will be consumed as data for loading into EAX during the processor's instruction fetch. So we assemble the code 0xB8, 0x07, 0x00, 0x00, 0x00, and our variable takes the following form:

uint8_t lMagicCode [] =
{
  0xB8, 0x07, 0x00, 0x00, 0x00, // movl $0x7, %eax
  0xC3,                         // retn
};

Now, calling this function as

printf ("The result of NewExit is: %d.\n", NewExit (0));

we get:

The result of NewExit is: 7

The function is named NewExit. Let's give it that meaning. For this we will use system call number 60/3Ch and invoke it with syscall, whose opcode is 0F05h. When user processes invoke system calls, the operating system takes the call number in the EAX register. We already know how to write there. Our variable with machine code takes the following form (we no longer need a return from this function):

uint8_t lMagicCode [] =
{
   0xB8, 0x3C, 0, 0, 0, // movl $0x3C, %eax, 3Ch/60 - system call exit()
   0x0F, 0x05,          // syscall
};

Our function has one parameter. This parameter ends up as the process's exit code (as if returned from main); that is, the code in our variable is equivalent to the exit function. You can verify this by running the program and checking the exit code with echo $?. You will see the number you passed to NewExit.

Although we pass an int as the parameter, the value returned from main is always reduced by the system to one byte, so there is no point in passing values greater than 255.

A question may arise here: "Why is that? We wrote so much code for the simplest actions, yet the return value reaches the system without any operations at all." The thing is that, by the rules of the x86_64 ABI, the first parameter of a function is placed in the RDI register. And the exit system call, 60/3Ch, hands the value of RDI to the system as the exit code. It just lined up — our value "fell" straight through into the shell, into the $? variable, and we really didn't have to write anything for that.

P.S.
It is interesting to note that debugging parts of a program represented as variables, with code formed as their contents, will be problematic. This is because this code ends up in the .data section, which the debugger does not treat as code. Even if we look at the compiler's assembly output, we will see just a variable holding numeric values — the list of our bytes in the declared array.

        .size   lMagicCode, 7
lMagicCode:
        .byte   -72
        .byte   60
        .byte   0
        .byte   0
        .byte   0
        .byte   15
        .byte   5

The whole program is below:

#include <stdio.h>
#include <stdint.h>
#include <unistd.h>
#include <sys/mman.h>

int main (void)
{
  uint8_t lMagicCode [] =
  {
//     0xB8, 0x3C, 0, 0, 0,          // movl $0x3C, %eax # 3Ch/60 - system call exit()
//     0x0F, 0x05,                   // syscall
    0xB8, 0x07, 0x00, 0x00, 0x00, // movl $0x7, %eax
    0xC3,                         // retn
  };
  
  void* lPageBoundary = (void*) ((long) lMagicCode & ~(getpagesize () - 1));
  
  mprotect (lPageBoundary, 1, PROT_EXEC | PROT_WRITE);
  
  unsigned int (*NewExit)(unsigned long _ExitCode) = (void*)lMagicCode;
  
  unsigned long lResult = NewExit (2);
  
  printf ("The result of NewExit is: %d. Or will never be printed...\n", NewExit (0));
  
  return 0;
}