DaftSoft: ARMv8

16/12/2025

Barium No-Boot (6): exceptions (II)

Generate Data Abort exception

It's time to play. Let's generate a Data Abort exception. Both the write and the read abort are raised in the init_board() function (barium.c file), exactly as described in the first part of this post:


	/* Raise Data Abort Exceptions (Write and Read) */
	raw_writel(0, INVALID_ADDRESS);
	raw_readl(INVALID_ADDRESS);

And check the output:


	Exception: 04, ALU Core №: 0
	Class: DAF: FFFFFFFFFFFFFFFF
	Memory operation attempt: WR
	Adjusted ELR_EL3 forward (4)

	Exception: 04, ALU Core №: 0
	Class: DAF: FFFFFFFFFFFFFFFF
	Memory operation attempt: RD
	Adjusted ELR_EL3 forward (4)

We see the exception group, the ALU core number, the exception class (DAF: Data Abort Fault), the address the access was attempted at (FFFFFFFFFFFFFFFF) and the operation type (WR or RD).

Generate Data Alignment exception

To achieve this goal we'll write a small routine in assembly language, _fetch_mem (see the newly added file low_level.s), which looks like:


.globl _fetch_mem
_fetch_mem:
	# According to ABI x0 is the first parameter and 
	# the return value, thus all we have to do is just:
	ldr x0, [x0]
	ret

and is declared in barium.c file as:


	extern uint64_t _fetch_mem(uint64_t _Addr);

Let's test this function with a valid input value; our IMAGE_LOAD_ADDR is a good one:


	lVal = _fetch_mem(IMAGE_LOAD_ADDR);

And see what we get:


	Read Value: 24F239D584007AB2

We know that ARMv8 opcodes should be located at this address. Let's check that by decoding the value we've just got:


	mrs x4, s3_1_c15_c2_1
	orr x4, x4, #0x40

This looks very familiar. Let's recall where we've seen this before — our very first instructions from start_64.s file:


.globl _start_64
_start_64:
	# Load CPU Extended Control Register:
	mrs x4, S3_1_C15_C2_1
	# Enable SMP:
	orr x4, x4, #(1 << 6)
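The comparison above can also be checked by hand. The sketch below is only an illustration: it assumes that the UART helper printed the eight bytes of «Read Value» in memory order (which is what the decoded instructions suggest), and the names insn_from_bytes and read_value are hypothetical, not taken from the Barium sources.

```c
#include <stdint.h>

/* The eight bytes of "Read Value", in the order they were printed
 * (assumed to be memory order). */
const uint8_t read_value[8] = {
	0x24, 0xF2, 0x39, 0xD5, 0x84, 0x00, 0x7A, 0xB2
};

/* A64 instructions are stored little-endian: byte 0 is the lowest byte
 * of the 32-bit instruction word. */
uint32_t insn_from_bytes(const uint8_t *p)
{
	return (uint32_t)p[0] |
	       ((uint32_t)p[1] << 8) |
	       ((uint32_t)p[2] << 16) |
	       ((uint32_t)p[3] << 24);
}
```

Under that assumption, insn_from_bytes(read_value) gives D539F224h and insn_from_bytes(read_value + 4) gives B27A0084h, which are the encodings of the mrs and orr instructions listed above.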

This means that our _fetch_mem routine works properly and we can proceed to raising an Alignment Fault exception. We'll do this with:


	lVal = _fetch_mem(0xDEAD);

And see what we get:


	Exception: 04, ALU Core №: 0
	Class: DAL: 000000000000DEAD
	Memory operation attempt: RD
	Adjusted ELR_EL3 forward (4)

	Read Value: ADDE000000000000

We see the exception group, the ALU core number, the exception class (DAL: Data Alignment Fault), the address the access was attempted at (000000000000DEAD) and the operation type (RD in this case).
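Why is 0xDEAD a bad address for _fetch_mem? The routine performs a 64-bit load, and when alignment checking applies to the access, the address of such a load has to be a multiple of 8. A minimal sketch of that check (is_aligned is a hypothetical helper, not part of the Barium sources):

```c
#include <stdint.h>
#include <stdbool.h>

/* True when addr is a multiple of size; size must be a power of two. */
bool is_aligned(uint64_t addr, uint64_t size)
{
	return (addr & (size - 1)) == 0;
}
```

For example, is_aligned(0xDEAD, 8) is false because the three low bits of 0xDEAD are 101b, while is_aligned(0xDEA8, 8) is true.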

Generate Invalid Instruction exception

For this purpose we have a small routine _udf (low_level.s):


.globl _udf
_udf:
	udf #0
	ret

The output we see is:


	Exception: 04, ALU Core №: 0
	Class: UNHANDLED: EC: 000000
	ELR Values: 0000000000921820
	ESR Values: 0000000002000000
	Adjusted ELR_EL3 forward (4)

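Here we output some information, but it's almost useless: EC 000000 is the «Unknown reason» class, so the syndrome tells us nearly nothing about the cause, and ELR only shows where the offending instruction is located.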

Generate exception by executing SMC instruction

After setting VBAR and implementing a handler for the SMC exception, we can place an SMC instruction in our code. It will generate an exception, and the ALU will select the corresponding handler and run it by simply branching to its address. Let's review the SMC instruction itself. Its format is:


	smc #imm16

#imm16 means that the instruction takes a so-called 16-bit «immediate value»: a value that must be known at assembly time, either a number itself or a #define. We cannot use a register as a parameter of such an instruction. ARMv8 instructions are 32 bits long and consist of an opcode and its parameters. While translating assembly code into machine code, the assembler builds the code of this instruction using the specified immediate value. That's what we have about the SMC instruction and its nature.

We can obtain all 16 bits of this value later in the handler from the exception syndrome, as was shown in our exception_handler() in the previous part of this post. But how can we specify this parameter when we need to pass different values? For example, we want to pass some information to our handler via this parameter. It looks like this parameter is designed exactly for that purpose, but it is hard to use that way because it is an «immediate value». Of course, we can implement a big switch/case or if/else block which would look like:


	if (a)
		smc #1
	else
	if (b)
		smc #2

	...

	else
		smc #0xFFFF

But this would be an enormous block of boring code. We don't like big blocks of boring code, and we are used to doing something exceptional in our posts. Okay, let's not make an exception today. We will present a method to pass a variable argument to the SMC instruction at run time, and it will be a small piece of code.

How can we do that? We'll form the SMC instruction itself with the specified immediate value, write it to some memory address and execute it. In other words, we will implement a function that does a part of the assembler's work, but at run time. Whenever we need to execute an SMC instruction, we will branch to a function that forms the instruction and then executes it. This is done in smc.s, in the _smc function.

What is actually done here? First we form the SMC 0 instruction, which is the base for the one we need. Its opcode is D4000003h. Then we keep the low 16 bits of the parameter of the _smc function (x0), shift them left by 5 (this is the exact offset of #imm16 in the SMC instruction) and combine the result with the base opcode we've formed above. That's all about forming the opcode of an SMC instruction with a given #imm16. After that we store this opcode at the address of the _smc_instruction label.

That could look like a complete solution: just branch to (or fall through to) _smc_instruction. But it would not work, and that is because of caches. We remember that we have turned them on already and that our application is so tiny that it fits in the caches entirely, so the code runs completely from cache. The store goes through the data cache, while instructions are fetched through the instruction cache, which may still hold the old opcode. Thus we need to force the ALU to refetch our newly generated instruction. This is done with cache maintenance: we clean the data cache line that holds the new opcode, so that it becomes visible to instruction fetches, and invalidate the corresponding instruction cache line, marking it as outdated. And now the last part: after the cache maintenance we fall through to our newly formed instruction (_smc_instruction) without any branch, because it is located right after the _smc function. Below you can see the code:


.globl _smc
_smc:
	# Form instruction - smc with given immediate value.
	# Form instruction SMC 0 - the base for desired one,
	# its opcode is D4000003h:
	mov x1, 0x0003
	movk x1, 0xD400, lsl 16

	# Ensure we have exact amount of bits we need - immediate is 16 bit long:
	and x0, x0, 0xFFFF

	# Shift the immediate value to position it takes in instruction:
	lsl x0, x0, #5

	# Put the immediate value into instruction code by orring:
	orr x0, x0, x1

	# Obtain the address we want to modify:
	ldr x1, =_smc_instruction

	# Write new instruction to destination address:
	str w0, [x1]

	# As we have caches enabled, we have to make the newly generated
	# instruction visible to instruction fetches and mark its stale
	# copy as outdated to force ALU to refetch the instruction:
	adr x1, _smc_instruction
	# Clean D-cache line to the Point of Unification:
	dc cvau, x1
	dsb ish
	# Invalidate I-cache line to the Point of Unification:
	ic ivau, x1
	dsb ish
	isb

Now let's review the template of the _smc_instruction section. Here we have an SMC instruction with the base immediate value, which we took as 0. What we have to keep in mind is that we are still inside the _smc function at this moment: we didn't branch to _smc_instruction, we've fallen through to it from the code above, which has just overwritten it. _smc_instruction contains our SMC instruction. After executing it, the ALU will run the exception handler and, after returning from it, will run the instruction immediately following SMC. Thus we have to put a ret instruction there to complete the _smc function.

A function can have a return value. What could we return from _smc? It could be interesting to return the opcode of the generated instruction. That's exactly what we'll do. We don't need to move any data for that, because we already have our SMC opcode in x0, which is the register that holds the return value of a function (according to the ABI). You can see this section below:


	# Fall to newly formed instruction:
_smc_instruction:
	# smc with default immediate value:
	smc 0x0000
	# We keep in mind that we are still in function (_smc), which
	# is called from C-code. Thus, we have to put ret here. Also we use
	# the return value for reviewing the instruction code we've generated.
	# At this moment it is stored in w0, thus we do not move any values.
	ret

Now we have code that causes an exception: SMC with an immediate value given as a parameter. So it's time to call this routine from somewhere with some parameter (immediate value). What could be interesting? How can we put it all together to get a nice result? We'll use the symbols we get from UART as immediate values for SMC and output the resulting opcode:


loop_uart:
	/* Read from UART and output received bytes in a loop */
	lVal = uart_get_char(uart_base);
	/* Form exception and raise it */
	lVal = _smc(lVal);

	uart_output_string(uart_base, "Instruction opcode: ");
	uart_output_hex(uart_base, (uint8_t*)&lVal, 0, 4);
	uart_output_string(uart_base, "\r\n\r\n");
	goto loop_uart;

Here's what we have as a result:


	Exception: 04, ALU Core №: 3
	Class: SMC, #imm Value: 0020
	Instruction opcode: 030400D4

The first two lines are from the exception handler. The first of them outputs the exception group and the number of the core on which the exception occurred. The second outputs the exception type (class) and its immediate value. The third line is from the barium_main() function: it outputs the return value of the _smc function, which is the opcode of the instruction we've generated.
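The opcode arithmetic performed by _smc can be mirrored in C to double-check the output above. This is a minimal sketch, not code from the repository; the names smc_opcode, smc_imm and the two macros are hypothetical.

```c
#include <stdint.h>

#define SMC_BASE_OPCODE 0xD4000003u /* SMC #0 */
#define SMC_IMM_SHIFT   5           /* imm16 occupies bits [20:5] */

/* Build the opcode of "smc #imm", the same way _smc does:
 * keep 16 bits, shift by 5, combine with the base opcode. */
uint32_t smc_opcode(uint64_t imm)
{
	return SMC_BASE_OPCODE | ((uint32_t)(imm & 0xFFFF) << SMC_IMM_SHIFT);
}

/* Reverse operation: extract imm16 from an SMC opcode. */
uint16_t smc_imm(uint32_t opcode)
{
	return (uint16_t)((opcode >> SMC_IMM_SHIFT) & 0xFFFF);
}
```

For the space character (0x20) received from UART, smc_opcode(0x20) is D4000403h. Printed byte by byte in memory (little-endian) order, this becomes 03 04 00 D4, which matches the «Instruction opcode» line.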

We have collected initial VBARs from all ALU Cores. Let's see what we have:


	ALU Core №: 0
	Vector BAR: 0000000000000000

	ALU Core №: 1
	Vector BAR: 0000000000000000

	ALU Core №: 2
	Vector BAR: 0000000000000000

	ALU Core №: 3
	Vector BAR: 0000000000000000

What do we see here? VBAR is initially set to zero on all cores, as stated by the datasheet (p.706, Figure 6-2. Internal ROM and RAM memory map): the vectors are located at zero.

That's all for today. We've reviewed the basics of the exception model of the ARMv8 machine with templates and some practice. As we've mentioned in the first part of this post, there is much more to exceptions and it's up to you to learn further.

You can clone the final repository from Barium No-Boot (iMX8MP) (see «Stage IV» directory).

26/09/2025

Barium No-Boot (6): exceptions (I)

Preface

Well, today we are going to continue to learn the ARMv8 machine.

We once started with a «one-dimensional» product: one core executing straight, linear code. In the previous post we added a second «dimension» by making the product multi-core. Now it has four ALU cores running independent code simultaneously, in parallel. This time we will add one more dimension (or at least half a dimension) to our product. We will review one important concept of all computing machines, and ARMv8 is not an exception: exceptions.

Theory

Let's start with the theory. The concept of exceptions consists of three parts: the cause (in terms of ARMv8, the «syndrome»), the handler (code that processes the raised exception) and an entity that binds cause and handler together (in terms of ARMv8, the «Vector Table»).

The handlers, although they are divided into groups, are absolutely identical for all exceptions from the technical point of view. The Vector Table, the entity that associates exceptions with their corresponding handlers, is just 2kB of code aligned in a certain way and divided into 16 equal-sized blocks. Each of those blocks holds 32 ARMv8 instructions. As we can see, the Vector Table is a binding entity by its structure (strictly regulated by the ARMv8 standard) and a set of handlers by its content.

The causes or, generally speaking, the exceptions themselves are of two types: asynchronous and synchronous. Asynchronous exceptions are those that take place when some «external» event occurs. An interrupt is a good example of an asynchronous exception: while writing code, you never know when an interrupt will occur. Synchronous exceptions are those that arise immediately after some instruction is executed or, in some cases, somewhere in the middle of its execution. In other words, a synchronous exception is just a reaction to an instruction.

A synchronous exception can be raised by an instruction that caused some error, a situation in which the machine can't continue to function normally without handling the error. An example of such an exception is an illegal instruction: a fetched instruction that could not be decoded (and executed). An attempt to access an invalid memory address is another example. «Handling» such errors implies analysing the state of the machine and attempting to fix it before allowing the code to continue or, in the worst case, preventing the machine from running further code by (usually) leaving it in an infinite loop.

Finally, there is a set of synchronous exceptions that are designed not to handle errors, but to serve regular, normal duties on the developer's behalf. Synchronous exceptions are the topic of today's post: we'll review Data Abort and Data Alignment exceptions as examples of errors/faults, and the so-called Secure Monitor Call (the SMC instruction) as an example of an exception raised on purpose by the developer.

Let's review theory of the exact exceptions we are about to practice on.

The first is Data Abort. This exception arises on an attempt to access a non-existent memory address. A Data Abort exception can be generated easily with the following:


	raw_writel(0, INVALID_ADDRESS);


	raw_readl(INVALID_ADDRESS);

where INVALID_ADDRESS can be defined, for example, as:


	#define INVALID_ADDRESS	(UINT64_MAX)

The ARMv8 architecture lets us determine whether it was an attempt to read or to write, as well as the address the access was attempted at. We'll review this functionality in code later.

A Data Alignment exception can be caused by an attempt to load from or store to an unaligned memory address, for example:


	mov x0, #0xDEAD
	ldr x0, [x0]

The simplest way to generate an Illegal Instruction exception is:


	udf 0

Alternatively, we can assemble an illegal instruction and try to execute it. In the practical part of this post we will assemble the DEADBEEFh opcode, which does not represent any valid ARMv8-A instruction.

SMC raises one more synchronous exception. From the developer's point of view it looks like a regular call (in terms of ARMv8, a branch) to a normal function, because its handler is executed immediately after this instruction and before the next one, and there is nothing to fix (it is not a reaction to any kind of error or fault). It is a good exception to learn ARMv8 on and to play with.

Thus, here is our plan for this post:

Construct Vector Table
Configure ALU to use our Vector Table
Design handler for Data Abort exception
Design handler for Data Alignment exception
Design handler for Invalid Instruction exception
Design handler for SMC exception
Generate Data Abort exception
Generate Data Alignment exception
Generate Invalid Instruction exception
Generate exception with SMC instruction

Let's get it started and run through the plan (in two parts).

Construct Vector Table

The Vector Table is a block of regular ARMv8 code 2kB in size, split into 16 sections of 128 bytes each. The Vector Table must also be aligned to a 2kB boundary. We will place our Vector Table in a separate file, vectors_64.s. It is aligned to 2kB and consists of 16 sections aligned to 128 bytes. A minimal template of an ARMv8 Vector Table could consist of a header like:


.arch armv8-a

.balign 0x800
.globl _vectors_el3
_vectors_el3:

and 16 equal blocks of handlers that look like:



.balign 0x80
	bl some_exception_handler
	eret

In our case, we have default handler as shown below:


	stp x0, x1,   [sp, #-16]!
	stp x2, x3,   [sp, #-16]!
	stp x29, x30, [sp, #-16]!
	bl _exception_entry
	bl exception_handler
	ldp x29, x30, [sp], #16
	ldp x2, x3,   [sp], #16
	ldp x0, x1,   [sp], #16
	eret

Here we preserve the registers that we'll use, then call the local function _exception_entry and the global routine exception_handler() (located in exceptions_64.c file). _exception_entry is defined as:


_exception_entry:

	# Calculate Exception group No. as first parameter of default handler:
	# Exception Group No. = (instruction address - _vectors_el3) / 80h
	# We get instruction address from the Link Register as it points to 
	# address of instruction next to bl, which lead to this function, thus 
	# it represents address inside of an exception group in Vector Table:
	mov x0, lr
	ldr x1, =_vectors_el3
	sub x0, x0, x1
	mov x1, 0x80
	udiv x0, x0, x1

	# Get core number from MPIDR and store it as second parameter:
	mrs x1, MPIDR_EL1
	and x1, x1, 0xFF

	# Pass Exception Link Register as third parameter:
	mrs x2, ELR_EL3

	# Pass Exception Syndrome Register as fourth parameter:
	mrs x3, ESR_EL3

	ret

In this function we prepare some data for the real exception handler. We gather information for later review: calculate the number of the exception group, get the number of the ALU core on which the exception was raised, and read the exception link register and the exception syndrome register. exception_handler() will be reviewed later.
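The group calculation done by _exception_entry, together with the meaning of each of the 16 groups as defined by the ARMv8 vector table layout, can be sketched in C. This is only an illustration; the function and macro names below are hypothetical and do not come from the Barium sources.

```c
#include <stdint.h>

#define VECTOR_ENTRY_SIZE  0x80u /* 32 instructions * 4 bytes */
#define VECTOR_ENTRY_COUNT 16u   /* 16 * 0x80 = 0x800 (2kB) */

/* Same arithmetic as _exception_entry: (address - table base) / 80h. */
unsigned vector_group(uint64_t table_base, uint64_t addr_in_table)
{
	return (unsigned)((addr_in_table - table_base) / VECTOR_ENTRY_SIZE);
}

/* Where the exception was taken from: one source per four groups. */
const char *vector_source(unsigned group)
{
	static const char *const sources[4] = {
		"Current EL, SP_EL0",
		"Current EL, SP_ELx",
		"Lower EL, AArch64",
		"Lower EL, AArch32",
	};
	return sources[(group / 4) % 4];
}

/* Exception type inside each source. */
const char *vector_type(unsigned group)
{
	static const char *const types[4] = {
		"Synchronous", "IRQ", "FIQ", "SError",
	};
	return types[group % 4];
}
```

With this layout, group 4 (offset 200h in the table) is a synchronous exception taken from the current exception level while it uses its own stack pointer (SP_ELx).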

Configure ALU to use our Vector Table

The Vector Table on ARMv8 is set up by writing its address to the VBAR_EL3 register. VBAR stands for Vector Base Address Register. This is done in crt0_64.S, in a newly added function _set_vbar_el3 which is called from the _crt0_main_64 function. Here we store the initial value of the VBAR_EL3 register as the fourth parameter of barium_main() for later review. The rest is simple and clear:


_set_vbar_el3:
	# Store initial VBAR_EL3 as fourth parameter:
	mrs x3, VBAR_EL3

	ldr x7, =_vectors_el3
	msr VBAR_EL3, x7
	dsb sy
	isb
	ret

Design handlers for Data Abort, Data Alignment, Invalid Instruction and SMC exceptions

The whole exception_handler() routine is shown in the listing below:



void exception_handler(uint64_t _EG, uint64_t _CoreNo, uint64_t _ELR, uint64_t _ESR)
{
	uint64_t lVal;
	uint16_t lEC;
	uint8_t lDFSC;

	uart_output_string(uart_base, "Exception: ");
	uart_output_hex(uart_base, (uint8_t*)&_EG, 0, 1);
	uart_output_string(uart_base, ", ALU Core №: ");
	uart_output_dec(uart_base, _CoreNo);	
	uart_output_string(uart_base, "\r\nClass: ");

	/*
	 * Get Exception Class (EC) field - it is [31:26] bits of ESR_EL3:
	 * EC, bits [31:26], ARM DDI0601 (ID092025), p.693
	 */
	lEC = (_ESR >> EC_SHIFT) & EC_MASK;

	if ((lEC & EC_SMC) == EC_SMC)
	{
		uart_output_string(uart_base, "SMC, #imm Value: ");

		/*
		 * Get immediate value - it is [15:0] bits of ESR_EL3:
		 * imm16, bits [15:0], ARM DDI0601 (ID092025), p.712
		 */
		lVal = bswap_64((_ESR & SMC_IMM_MASK));
		uart_output_hex(uart_base, (uint8_t*)&lVal, 6, 2);
		uart_output_string(uart_base, "\r\n");
	} else
	/* Data Abort & Data Alignment Faults */
	if ((lEC & EC_DATA_ABORT) == EC_DATA_ABORT)
	{
		lDFSC = _ESR & DFSC_MASK;
		if (lDFSC == ISS_DAB_FAULT)
			uart_output_string(uart_base, "DAF: ");
		else
		if (lDFSC == ISS_DAL_FAULT)
			uart_output_string(uart_base, "DAL: ");
		else
		{
			uart_output_hex(uart_base, (uint8_t*)&lDFSC, 0, 1);
			uart_output_string(uart_base, "h: ");
		}
		/*
		 * Data Abort Exception leaves Exception Link Register pointing to 
		 * instruction that caused Data Abort. For testing purposes we adjust
		 * ELR_EL3 to make it pointing to next instruction
		 */
		lVal = bswap_64(_get_far_el3());
		uart_output_hex(uart_base, (uint8_t*)&lVal, 0, 8);
		uart_output_string(uart_base, "\r\nMemory operation attempt: ");
		uart_output_string(uart_base, _ESR & (1 << ISS_DA_WNR_BIT) ? "WR" : "RD");		
		uart_output_string(uart_base, "\r\nAdjusted ELR_EL3 forward (4)\r\n");
		_adjust_elr_el3(4);
		uart_output_string(uart_base, "\r\n");
	} else
	/* Unhandled Exception */
	{
		uart_output_string(uart_base, "UNHANDLED: EC: ");
		uart_output_bin (uart_base, _ESR, EC_SHIFT, EC_LENGTH);
		uart_output_string(uart_base, "\r\nELR Values: ");
		lVal = bswap_64(_ELR);
		uart_output_hex(uart_base, (uint8_t*)&lVal, 0, 8);
		uart_output_string(uart_base, "\r\nESR Values: ");
		lVal = bswap_64(_ESR);
		uart_output_hex(uart_base, (uint8_t*)&lVal, 0, 8);
		uart_output_string(uart_base, "\r\nAdjusted ELR_EL3 forward (4)\r\n");
		_adjust_elr_el3(4);
		uart_output_string(uart_base, "\r\n");
	}

	return;
}

Let's take a closer look at what this routine does. First, it outputs the Exception Group: the number of the Vector Table section our exception arrived at (or from, depending on the point of view). Second, it outputs the number of the ALU core on which the exception was raised. After that it parses the data we've gathered in our _exception_entry routine and outputs information about the exception. The information used in this analysis is stored in ESR, the Exception Syndrome Register. We get the Exception Class (EC) field of ESR. This is the top-level starting point in handling any exception: all exceptions are divided into classes and distinguished from each other by EC.

The first exception that is processed is SMC. We output its immediate value. We'll examine the nature of the SMC instruction a little later. The second exception being parsed is Data Abort. If you've skimmed through the text of the exception_handler() routine, you may have noticed that there is no separate Data Alignment class in EC. That's right: Data Abort and Data Alignment both belong to one EC. We decide whether it is an Access or an Alignment fault by another field of ESR, DFSC, which stands for Data Fault Status Code. What do we want to see when we get a Data Access or Alignment fault? Two things: the address the access was attempted at and the type of access, read or write. Thus, here we read FAR, the Fault Address Register, which contains the exact memory address the software attempted to access. And the Write Not Read bit field gives us the type of the operation that led to the exception.

Usually, both Data Abort exceptions are used to correct things, for example to get a page from swap. Because of that, the ARMv8 ALU leaves the Exception Link Register (the register containing the address where the program flow will continue after the exception handler exits) pointing at the exact instruction that was trying to access memory, not at the next instruction. The handler is expected to fix the problem with memory and return from the exception, and the ALU then re-executes the same instruction. But in our case, for our small learning and research purposes, we'll solve this problem in a different manner. As ARMv8 has a fixed instruction length, we just move ELR forward by 4 bytes (_adjust_elr_el3 is a simple routine in assembly language). After that the flow continues (with whatever value the memory access operation left behind), while without adjusting ELR we would get into an infinite loop of the same instruction and the same exception.

But where is the Invalid Instruction? On ARMv8 it falls into the so-called unhandled exception branch, and we have little to display and to fix in this case. Just to let our code make its way ahead, we adjust ELR forward. That's all about the exception handler. There is a lot more to explore: you can see titles of documents and page numbers in the comments, and it's up to you to learn and play onward.
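The ESR fields mentioned above can be decoded with a few shifts and masks. The sketch below uses the field positions and class codes from the Arm architecture documentation (EC in bits [31:26], imm16 in bits [15:0], WnR in bit 6, DFSC in bits [5:0]); the macro and function names are hypothetical and are not the ones used in exceptions_64.c. Note that this sketch compares the whole EC field for equality, so classes that merely share some set bits are not confused with each other.

```c
#include <stdint.h>
#include <stdbool.h>

#define ESR_EC_SHIFT     26
#define ESR_EC_MASK      0x3Fu
#define EC_UNKNOWN       0x00u /* Unknown reason */
#define EC_SMC_A64       0x17u /* SMC executed in AArch64 state */
#define EC_DABT_LOWER_EL 0x24u /* Data Abort from a lower EL */
#define EC_DABT_SAME_EL  0x25u /* Data Abort from the same EL */
#define DABT_WNR_BIT     6     /* Write not Read */
#define DABT_DFSC_MASK   0x3Fu
#define DFSC_ALIGNMENT   0x21u /* Alignment fault */

unsigned esr_ec(uint64_t esr)
{
	return (unsigned)((esr >> ESR_EC_SHIFT) & ESR_EC_MASK);
}

bool esr_is_smc(uint64_t esr)
{
	return esr_ec(esr) == EC_SMC_A64;
}

bool esr_is_data_abort(uint64_t esr)
{
	unsigned ec = esr_ec(esr);
	return ec == EC_DABT_LOWER_EL || ec == EC_DABT_SAME_EL;
}

/* Valid only when esr_is_smc() is true. */
unsigned esr_smc_imm16(uint64_t esr)
{
	return (unsigned)(esr & 0xFFFF);
}

/* The two helpers below are valid only when esr_is_data_abort() is true. */
bool esr_dabt_is_write(uint64_t esr)
{
	return ((esr >> DABT_WNR_BIT) & 1u) != 0;
}

bool esr_dabt_is_alignment(uint64_t esr)
{
	return (esr & DABT_DFSC_MASK) == DFSC_ALIGNMENT;
}
```

For example, an ESR value of 0000000002000000h gives EC 0 (Unknown reason), while a hand-built value of 5E000020h decodes as an SMC with immediate 20h.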

That's all for today. In next post we'll continue to learn and practice exceptions on ARMv8 machine.

14/05/2025

Barium No-Boot (5): from UP to SMP: go multi-core!

Intro

Well, we have started our SoC, configured the necessary PLLs and clocks, set up a minimal set of hardware blocks (GPIO, UART), made it output information through a simple debug interface, enabled caches and learned how to set the ALU speed. In the first post of this series we reviewed the benefits of learning a modern ARMv8-A machine. Among others, we mentioned one significant thing to learn these days: multi-core ALUs. In this post we will start up all cores of our SoC (iMX8MP) and run independent code on each of them.

So what do we have at this moment, at Stage II? An educational platform that has entry points for working in C and assembly language (to learn the ARMv8 architecture) and works in single-core mode. Today we want to rework it into a multi-core platform which, in addition to meeting the same technical requirements (entry points for working in C and assembly language), will be able to execute separate code on each ALU core. We want to go SMP!

Redesign

We are talking about the SMP feature today, but the core start-up routine is small compared to the rework we have to do this time to go multi-core. Thus, we'll cover it later and start our lesson with an (almost complete) project redesign for multi-core development, which is actually the biggest part of today's work.

We will go from the top level to the bottom because this makes today's work easier to explain and understand. We'll redesign our main C-function (barium.c) and C-runtime (crt0_64.s), and modify the startup code (start_64.s) and even the linker script (barium.lds).

C-code

Let's start from the top level: the organization of the payload C-code in our main C-file, barium.c. We want a multi-core platform that has entry points in C (as well as in assembly language, but we'll cover this later) for each ALU core. From the architecture point of view, we could implement a separate routine for each ALU, from barium_main_0() to barium_main_N(), and the code would look pretty clear to a human. But this would look more like asymmetric multiprocessing (AMP). Thus, we'll keep a single main function consisting of parts that are common to all cores and parts that are specific to the core executing the function. This lets us feel what it is like to write code that is executed in parallel in a multi-core environment. Hence, barium_main() will stay in the project and get a new parameter, _CoreNo, to determine the number of the core it is running on.

When we were working in single-core mode, one ALU was running: it initialised the entire system and executed our payload code. That was the zeroth ALU. This time we'll leave this part of the project intact: the zeroth ALU will initialise the system, including a new step, starting the other ALUs. After that, just like at the previous stage, it will fall into the measure loop. The third core will run the loop that reads from and outputs to UART. The remaining two will fall into a dead loop at this stage.

Let's say we want each ALU to output information about itself, like our zeroth did at previous stages. It will be interesting to review our initialisation routine and some aspects of the CPU start-up process. But we can assume that we will get a mess in the UART if we start all cores in parallel. And, believe me, we will have a mess of symbols in the UART. This is because we do not have any synchronization and/or locking mechanisms yet. Hence, how should we output information about (and from) each ALU core when they run this code almost simultaneously? We'll do some desynchronisation. For that reason we will have an array of close but different values, one for each core, called uint32_t desync_loops []:


   uint32_t desync_loops [] = { 0, 250000, 350000, 450000 };

These values are the numbers of empty loop iterations every core performs before proceeding to its payload, specifically to outputting initial information to UART. There are no statistics or calculations behind these values: I've just chosen them by experiment. But this idea gives us correct initial output from all cores, from zeroth to third. Of course, the output is done once by each core except the third, which continues to use UART in both directions in an infinite loop after outputting its initial information.

So, our barium_main() function performs the desynchronization loop and after that, in the case of the zeroth ALU core, executes the newly added init_board() routine which initialises the hardware of our SoC (PLL, clocks, GPIO, UART), sets the ALU speed, outputs general information about Barium and starts the remaining three cores. We have a new routine for the last step:


    void start_cpu(uint32_t _Core, uintptr_t _EntryPoint);

And use it as shown below:


    start_cpu(1, (uintptr_t)_crt0_main_64);
    start_cpu(2, (uintptr_t)_crt0_main_64);
    start_cpu(3, (uintptr_t)_crt0_main_64);

where the entry point is declared as:


    extern void _crt0_main_64(void);

After that comes a part of the code which is executed by all ALU cores (including the zeroth): each core outputs information about itself. And, finally, there are two core-specific parts. One is for the zeroth core and generates measure impulses; the other is for the third core and is a loop that outputs symbols read from UART.

That's all here. You can see the rest in the code. start_cpu() routine itself will be reviewed later.

C-runtime

Now let's dive a little deeper. The next level is the C-runtime, our crt0_64.S file. As far as machine code is concerned (the program itself and the process of writing it), all ALU cores of a homogeneous multi-core system (or all homogeneous ALU cores of a heterogeneous system) work in exactly the same way. Do you remember what the C-runtime is for? C-code uses a stack, and the C-runtime is there to set the Stack Pointer (SP). A stack is used by any separate execution flow, be it code running on a whole ALU core or a separate process, a so-called «task» or «thread», which runs in a time slot given to it. And every execution flow needs a separate stack because stacks are used independently. SP is a per-core, so-called special purpose register. That means that today we have to do the same thing in the C-runtime as we did at previous stages, that is, prepare our system for C-code, but this time for each of its cores. Thus, we have to set the stack pointer for each ALU core. And, of course, this must be done in assembly language.

But how can we organise as many as four stacks? Let's just put the remaining three stacks below the zeroth one. Let's say each ALU core will have a stack of equal size. Hence, the formula for each SP will be: Stack Base Top - (Stack Size * Core No.).
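The same formula can be written down in C as a quick check of the layout. This is only a sketch: the function names are hypothetical, and the numbers used in the example afterwards are made up for illustration, not the real STACK_SIZE and SP_BASE_TOP of the project.

```c
#include <stdint.h>

/* Lowest 8 bits of MPIDR_EL1 (Aff0), used here as the core number,
 * just like "and x2, x2, 0xFF" does in assembly. */
unsigned core_number(uint64_t mpidr)
{
	return (unsigned)(mpidr & 0xFF);
}

/* SP = Stack Base Top - (Stack Size * Core No.) */
uint64_t core_stack_top(uint64_t sp_base_top, uint64_t stack_size, unsigned core)
{
	return sp_base_top - stack_size * core;
}
```

For instance, with a base top of 980000h and a stack size of 4000h, core 0 gets 980000h and core 3 gets 974000h, so the stacks go downwards one after another without overlapping as long as no core uses more than its stack size.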

Maybe you've noticed that the extension of the crt0 assembly file is capitalised now: .S. By GCC convention, a capitalised extension means that the file is passed through the preprocessor before being assembled. Yes, we want our code to be a little configurable via environment variables at this stage; in particular, we want the stack top and the stack size to be configurable.

As shown above, we use _crt0_main_64 as entry point for all our newly powered up cores. And here is our SP calculation routine:


    # Get core number from MPIDR and store it as third parameter:
    mrs x2, MPIDR_EL1
    and x2, x2, 0xFF
    
    # Set stack pointer (per core) saving it as first parameter to
    # barium_main function. Calculate stack pointer for current core.
    # SP = SP base top - (core stack size * core No.):
    mov x5, STACK_SIZE
    mov x6, SP_BASE_TOP
    mul x1, x2, x5
    sub x1, x6, x1

ARMv8 has a per-core system register called Reset Vector Base Address Register (RVBAR). Let's pass it as the fourth parameter of barium_main() to examine later:


    # Store the RVBAR as fourth parameter:
    mrs x3, RVBAR_EL3

After setting the stack pointer we are ready for C-code and branch to our redesigned barium_main() routine. This is what our multi-core entry point in assembly language looks like. If needed, we can now add assembly code that performs different routines depending on the core number (the x2 register in our case). That's all we have to do at this level.

Startup

Let's proceed to the lowest-level startup routine, the _start_64 function. There is not much to rework here. This is the smallest part of the work: we just enable the SMP feature of ARMv8:


    # Load CPU Extended Control Register:
    mrs x4, S3_1_C15_C2_1
    # Enable SMP:
    orr x4, x4, #(1 << 6)
    # Set CPU Extended Control Register:
    msr S3_1_C15_C2_1, x4

That's all we have to do at this level today. As far as the code is concerned, the project redesign is complete. But it's hard for a human to perceive the architecture of a multi-core application when it is described in text and written in code in a direct, single-flow manner. To give a better overview, we provide an architecture scheme which shows the routines executed by all cores and the order of those routines:

[Figure: Barium multi-core architecture]

Linker script

As we pass the entry point address to our start_cpu() routine, we have to be able to get this address at runtime. Who (or what) is responsible for resolving addresses? The linker: it collects all symbols from all object files being linked and resolves all addresses in the resulting executable file. We are working on bare metal, and each machine (a SoC and its BROM) loads the executable to a different address. Hence, we have to inform the linker about the memory layout of our application and, specifically, where the memory starts; this will be the base address of our .text section. We also have to describe this section. If we omit the memory layout description, the linker will assume that memory (and our code) starts from zero, which is wrong on probably all modern machines, and the addresses will be resolved incorrectly. As we want our project to be a little configurable, we will pass our linker script through the preprocessor (as we did with the C-runtime assembly language file), so that the memory start address can be specified via the environment. The memory start should always be equal to the address BROM loads our image to. We already have that variable in our environment: IMAGE_LOAD_ADDR. We used it to build the boot image header at previous stages. This time we add the description of our memory (on-chip RAM) to the linker script:


    MEMORY
    {
        .OCRAM : ORIGIN = IMAGE_LOAD_ADDR,
        LENGTH = (576 * 1024)
    }

And describe the location of .text section:


    .text (READONLY) :
    {
        . = ALIGN(8);
        *(.text*)
    } >.OCRAM

The linker does not pass its script through the preprocessor, regardless of the file name. We have to do it ourselves. Thus, we rename our linker script to .LDS and pass this file through the preprocessor manually. Below is the Makefile rule that makes .lds from .LDS:


    $(PRODUCT).lds: $(PRODUCT).LDS
        @$(CPP) -P -DIMAGE_LOAD_ADDR=$(IMAGE_LOAD_ADDR) $(PRODUCT).LDS -o $(PRODUCT).lds

So now our project redesign is complete. Our build system compiles the files, the linker resolves addresses into correct values, _start_64 enables the SMP feature on ARMv8, _crt0_main_64 sets a separate stack pointer for each of the four ALUs, and barium_main() initialises the SoC and starts all ALUs at their entry points. The new stage is set up and ready to be run.

ALU start-up

The only thing left to review today is the most platform- and hardware-specific routine: starting the ALU cores. This is done in the gpc.c/gpc.h files (General Power Controller), where we perform the ALU start-up sequence. This routine is not architectural and is pretty straightforward, so we'll not bother too much with it. You'll see that the first step is to set the entry point address for the core. After that come the reset and power-up routines. Lines of code have comments referencing page numbers of the datasheet.

This is how we go from UP (uniprocessor) to SMP (symmetric multiprocessing) on the iMX8MP SoC.

The results

It's time to play, get results, review them and make conclusions. Let's see what we get:


    Barium No-Boot V0.3 (iMX8MP)
    Build: 03:33:33, May 13 2025
    Running at: 2200MHz

    ALU Core №: 0
    Reset VBAR: 0000000000000000
    Initial PC: 0000000000920020
    Initial SP: 0000000000920000
    Current SP: 000000000091FFC0

    ALU Core №: 1
    Reset VBAR: 0000000000920020
    Initial PC: 0000000000920020
    Initial SP: 000000000091F000
    Current SP: 000000000091EFC0

    ALU Core №: 2
    Reset VBAR: 0000000000920020
    Initial PC: 0000000000920020
    Initial SP: 000000000091E000
    Current SP: 000000000091DFC0

    ALU Core №: 3
    Reset VBAR: 0000000000920020
    Initial PC: 0000000000920020
    Initial SP: 000000000091D000
    Current SP: 000000000091CFC0

    Awaiting symbols from UART:

We see four similar outputs, one from each core. First, let's check how our stack pointers are calculated and set: 920000h for the zeroth core, 91F000h for the first, 91E000h for the second and 91D000h for the third. So everything is fine here: our stack pointers are calculated correctly, going down from 920000h in steps of 1000h per core. Also notice that the last 12 bits of the current SP of each ALU are the same: FC0h. That's because we read the current SP in the same code for each ALU, and the «consumption», or use, of the stack (up to the line where we read it) is identical for each core in our project. Thus, the memory map of our application looks like this (built with the default values of IMAGE_LOAD_ADDR and STACK_SIZE):

Barium memory map
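
The observation about the identical low 12 bits can be double-checked with a few lines of host-side C. This is a sketch that only replays the numbers printed in the log; the struct and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct sp_sample {
    uint64_t initial_sp;   /* "Initial SP" from the log */
    uint64_t current_sp;   /* "Current SP" from the log */
};

/* Values copied from the log output, cores 0..3. */
static const struct sp_sample samples[] = {
    { 0x920000, 0x91FFC0 },
    { 0x91F000, 0x91EFC0 },
    { 0x91E000, 0x91DFC0 },
    { 0x91D000, 0x91CFC0 },
};

/* Bytes of stack consumed between setting SP and reading it back. */
static inline uint64_t stack_used(const struct sp_sample *s)
{
    assert(s != NULL && s->initial_sp >= s->current_sp);
    return s->initial_sp - s->current_sp;
}

/* Returns 1 when every core has consumed the same amount of stack. */
static inline int same_stack_use_on_all_cores(void)
{
    for (size_t i = 1; i < sizeof samples / sizeof samples[0]; i++)
        if (stack_used(&samples[i]) != stack_used(&samples[0]))
            return 0;
    return 1;
}
```

Each core has consumed 40h (64) bytes by the time the current SP is read, and because every initial SP is aligned to 1000h, the low 12 bits come out as 1000h - 40h = FC0h on all of them.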

We've also moved the reading of the initial PC from _start_64 to _crt0_main_64. That was done to be able to get it on all cores, because _start_64 is executed by the zeroth core only and the remaining cores never run that code. So now the initial PC is not the address of the first instruction of our application, and that's why it is no longer equal to IMAGE_LOAD_ADDR but to the address of the ALUs' entry point instead.

What else is remarkable here? We've added the _RVBAR parameter to our barium_main() function. You may have already noticed that cores one to three have the same Reset VBAR: 920020h. That is the actual address of our entry point (_crt0_main_64) for those cores. But the zeroth core has an RVBAR of all zeros. That is the value set for the zeroth core by hardware, and it is the address where BROM is located.

Another interesting point. You may have noticed that in previous posts, when we started and configured the zeroth ALU, the «Initial SP» we output was the stack pointer value before we set it, that is, the value set by BROM. This time we output the value we've calculated. You may be curious what the stack pointer values of the other ALUs are. I wondered about that too, and checked it. This functionality is omitted in this post because the initial stack pointer values of the remaining ALUs are actually garbage.

That's all for today. You can clone the final repository from Barium No-Boot (iMX8MP) (see «Stage III» directory).

12/03/2025

Barium No-Boot (4): measurements, rev up iMX8MP

Introduction

In previous posts we've made a basic application, minimal but sufficient for studying and playing, which is able to output debug information. It sets some PLLs and clocks, except the one that is probably the most interesting for most of us: the ARM PLL, the PLL which controls the speed of the ALU. This is what we will research in this part. This announcement is deliberately partial: the post will be more interesting to read without knowing its complete plan.

The research plan

We can't measure the ALU clock directly because it has no outputs outside of the SoC. So, how should we act if we want to know whether it has at least changed? We need to do something that depends on the ALU speed and that we can measure. What could it be? UART? No. We configure UART to output data at a particular baud rate, but that has nothing in common with the ALU speed. UART configuration affects its TX and RX pins, but the impulses UART makes while transmitting and receiving data are unrelated to the ALU clock. Thus, we need a different «interface» to the SoC which gives us the possibility to measure its clock.

Let's turn to theory for a little. The so-called ALU «cycle» is one tick of the ALU clock, but a single clock tick is not equal to an instruction: an instruction may take from one to thousands of cycles. (Instructions that take zero cycles are out of the scope of this post.) But we know that any particular instruction takes the same number of cycles at any ALU speed. Thus, the real execution time of any block of instructions varies in inverse proportion to the ALU speed. We just need some measurable points around some «reference» block of instructions. The most obvious block of instructions is an empty loop, which translates into a branch back to itself (even leaving the link register intact) until the countdown variable reaches zero.

What could be the rest of our setup? The clearest way to measure something on a SoC is GPIO. Hence, in this post we will initialise the IOMUX Controller and one of the pads of a particular GPIO bank, and put a GPIO toggling function along with an empty loop (which sets the measured impulse length) into an infinite loop. After that we'll be able to measure the impulses with a scope, change the ARM PLL speed while leaving the reference code intact, and measure the impulses again.

As we've reviewed the workflow on embedded systems without any BSP, bootloader or kernel (bare metal) in previous posts, I'll not provide such details in this and further parts. You'll see what is done and how in the code. Some code lines have comments referencing datasheet page numbers. If you ever need to change, for example, the GPIO number, you can look in the datasheet around the page numbers you see in the code and make the necessary changes according to the appropriate sections.

The practice

This time we need a GPIO. As we've noticed in previous posts, GPIO is one of AFs (alternative functions) of a pad or ball. AFs are configured by IOMUX Controller. Therefore, we start with IOMUXC. First, as we did in previous posts, we check the section 6.1.4.2 — «Boot block activation» (p.706) and check if IOMUXC is configured by BROM or not. And we see that we are lucky today because it is initialised by BROM. Thus, we omit initialisation and configuration of this hardware block and proceed straight to GPIO.

Now we have to see which GPIOs are available and choose the one we will use. I'm working on a DEBIX Model A SBC. In the (rather unlikely) case that you have the same SBC, you can use the picture below and/or the same GPIO as in this post. If you have another SBC, you have to find a similar picture for it. What you need is called a «pinout», «header pinout» or «gpio header». This information is usually included in your SBC's datasheet or user manual, but sometimes it is easier and faster to search the Internet for something like «_your_sbc_name_ gpio pinout». Let's see what pinout we have:

DEBIX Model-A GPIO header pinout

There is a handy GPIO on this board — GPIO number 11 on bank 1 (pin 29). Today we will add two new files to our project — gpio.c and gpio.h. So, we add GPIO initialisation and set its direction to barium_main() function:


	/* Init GPIO and set 11th of first bank as output */
	gpio_init();
	gpio_set_dir(GPIO_BANK1, 11, GPIO_OUT);

After that we add our measurement code — the reference code we have mentioned earlier:


	/* Generate impulses for measurements */
loop_meas:
	gpio_toggle_val(GPIO_BANK1, 11);
	for (lVal = 0; lVal < IMPULSE_LEN; lVal++);
	goto loop_meas;

IMPULSE_LEN can have any value, but I've chosen 1 000 000 to fit the loop to comfortable scope view settings.
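
One caveat about empty counting loops: an optimising compiler is allowed to remove a loop that has no side effects. The post does not show how lVal is declared or which optimisation level is used, so the following is only a host-side sketch of one way to keep such a loop alive. It assumes a volatile counter; the function name is hypothetical, and the int stands in for the GPIO pad.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define IMPULSE_LEN 1000000UL

/*
 * Toggle a simulated pin, then spin for `count` iterations.
 * Returns the number of iterations actually executed.
 */
static inline uint64_t toggle_and_wait(int *pin_level, uint64_t count)
{
    volatile uint64_t i;   /* volatile: the compiler must keep every increment */

    assert(pin_level != NULL);
    *pin_level = !*pin_level;          /* stands in for gpio_toggle_val() */
    for (i = 0; i < count; i++)
        ;                              /* the reference "impulse length" delay */
    return i;
}
```

Note that a volatile counter forces a load and a store on every iteration, so its timing differs from a register-only loop; whichever variant is used, it has to stay the same between measurements.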

Build, write to SD-card, connect scope, start the board and see:

Measure at default ARM PLL settings

Here we see that the period is approximately 570ms. The period itself is uninformative: it is just how much time our SoC spends executing our reference loop at the default ARM PLL settings. From «PLL setting by ROM» we remember that the ARM PLL is set to 1GHz. Let's reconfigure it and see what measurements we get. The ARM cores of iMX8MP are driven by the so-called PLL1416x. I could not find any documentation describing it, so we will make do with the SoC's datasheet (ARM PLL), see which registers we need to write values to, and generate the exact values using a spreadsheet I've created from information gathered here and there and from my experiments. You'll find the spreadsheet file in the repository and can play with it. The methodology is simple: you change the Main divider (the biggest number in the leftmost column) in steps of 25 and watch the «Clock (MHz)» column. The rest is done by the spreadsheet formulas. The table displays the register value in hexadecimal and decimal, plus two more values: Pre div and Scaler. We can configure the iMX8MP ARM PLL either by using the three values (Main div, Pre div and Scaler) in a formula, or by writing the exact value from the rightmost columns.

iMX8MP ARM PLL Settings

The whole code is added to the files clocks.c and clocks.h. I use the three-value formula instead of a single exact value, simply because that is how I did it during my research. Let's rev our SoC up to 1.8GHz, its stated maximum (p. 95). According to the spreadsheet we will use the coefficients 225, 3, 0:


        raw_writel((225 << 12) | (3 << 4) | (0), CCM_ANALOG_ARM_PLL_FDIV_CTL0);
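
For readers without the spreadsheet at hand, here is a small host-side sketch of the arithmetic. The bit packing is copied from the raw_writel() call above. The frequency relation, Fout = Fref * Main div / (Pre div * 2^Scaler) with a 24MHz reference, is an assumption on my side and is not taken from the post's spreadsheet; it does reproduce 1800MHz for the coefficients 225, 3, 0. The function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define PLL_REF_MHZ 24U   /* assumption: 24 MHz reference clock */

/* Register value, packed exactly as in the raw_writel() call. */
static inline uint32_t arm_pll_fdiv_ctl0(uint32_t main_div, uint32_t pre_div,
                                         uint32_t scaler)
{
    return (main_div << 12) | (pre_div << 4) | scaler;
}

/* Assumed relation: Fout = Fref * main_div / (pre_div * 2^scaler). */
static inline uint32_t arm_pll_mhz(uint32_t main_div, uint32_t pre_div,
                                   uint32_t scaler)
{
    assert(pre_div != 0 && scaler < 8);
    return PLL_REF_MHZ * main_div / (pre_div << scaler);
}
```

Under this assumption, a Main div step of 25 with Pre div 3 and Scaler 0 changes the clock by 200MHz, and the value written for 1.8GHz is E1030h.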

Now build, write to SD-card, start the board and see:

Measure at 1.8GHz

And what do we see here? 507ms. The period has narrowed, but not as much as we expected: 570ms at 1GHz should become 570/1.8, which is about 316ms at 1.8GHz! What is going on here? To make a long story short (this is still not the most interesting part of this post), I've measured different speeds, and all periods turned out to be non-proportional to the ARM clock. A short summary is shown in the following table, along with a few calculations:

Non-linearity of ARM ALU clock speed

We see the clocks we set, the periods we measured, and two rows with the values as they should be if the reference (the true value) were taken at the slowest or at the fastest clock. Calculated in either direction, the calculated values scale linearly with the clock speed. But the measured values are non-linear and look barely related to the clock speed at all. The table shows that something is really unclear or even wrong here.
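
The expected figure quoted above (570ms at 1GHz becoming about 316ms at 1.8GHz) follows from assuming that the reference code takes the same number of cycles at every clock. A minimal sketch of that calculation, with a hypothetical function name:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Period expected at new_mhz if the same number of cycles took ref_ms
 * at ref_mhz. The result is in whole milliseconds, truncated.
 */
static inline uint32_t expected_period_ms(uint32_t ref_ms, uint32_t ref_mhz,
                                          uint32_t new_mhz)
{
    assert(new_mhz != 0);
    return (uint32_t)((uint64_t)ref_ms * ref_mhz / new_mhz);
}
```

For example, expected_period_ms(570, 1000, 1800) gives 316, while the scope showed 507ms: the measured speed-up is only about 1.12 times instead of 1.8 times.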

What could go wrong? The NXP PLL1416x is undisclosed, so we don't know exactly how it works and can't be sure what speeds it really sets. Another possible reason for such results would be a modified reference code. But we can't examine the PLL, and we are sure our code stays intact between builds. So let's look for a different cause of this ARM ALU behaviour. We remember that at this stage the ALU works from OCRAM, the slowest of all RAMs. Let's assume that the non-linearity of our clock speed and period measurements is caused by OCRAM, namely its access speed. What can we do now? We can use caches. According to the specification, iMX8MP has 32kB of instruction cache and the same amount of data cache. Therefore our tiny 1-2kB application will fit in the SoC's cache entirely. Let's turn on both caches by adding this code to our start_64.s file:


	# Load System Control Register (EL3):
	mrs x3, SCTLR_EL3
	# Turn on Data Cache:
	orr x3, x3, #(1 << 2)
	# Turn on Instruction Cache:
	orr x3, x3, #(1 << 12)
	# Set System Control Register (EL3):
	msr SCTLR_EL3, x3

Let's check what we get on default 1GHz now:

ARM ALU at 1GHz with caches on

The changes turned out to be so dramatic that I had to retune my scope to get a pretty-looking picture! Now our reference code executes in 36ms at 1GHz instead of 570ms with caches turned off at the same ALU clock speed! And here is the new resulting table, showing that not only is everything faster, but it is also linear in both directions:

Linearity of ARM ALU clock speed (caches on)

Now everything is in order and clear: whether calculated from the lowest clock (big to small) or from the highest (small to big), the values agree with each other and the correlation is absolutely linear. 600MHz is three times faster than 200MHz, so 180ms / 3 = 60ms; 1800MHz is nine times faster than 200MHz, so 180ms / 9 = 20ms; and so on. The hypothesis is confirmed: OCRAM timings were distorting the ARM ALU speed measurements. Now we have a formula to set the ARM ALU clock speed, a measurement tool, and nice and clean results.
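
Another way to look at the same linearity: if the reference code always takes the same number of cycles, then the product of period and clock must stay constant. Here is a sketch of that check using only the figures quoted in this post (the function name is hypothetical):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Clock cycles that fit into one measured period:
 * ms * MHz = 1e-3 s * 1e6 1/s = thousands of cycles.
 */
static inline uint64_t cycles_per_period(uint32_t period_ms, uint32_t clock_mhz)
{
    assert(clock_mhz != 0);
    return (uint64_t)period_ms * clock_mhz * 1000ULL;
}
```

With caches on, 180ms at 200MHz, 60ms at 600MHz, 36ms at 1GHz and 20ms at 1.8GHz all give 36,000,000 cycles. With caches off, 570ms at 1GHz gives 570,000,000 cycles but 507ms at 1.8GHz gives 912,600,000: the apparent cycle count grows with the clock, which is what one would expect when the ALU mostly waits for memory.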

But there is more to explore. As I mentioned earlier, we have a tool for calculating ARM PLL speeds: the spreadsheet. And maybe you noticed that the picture above («iMX8MP ARM PLL Settings») shows some clock speeds far above 1.8GHz. Let's check them; here is the result:

Linearity of ARM ALU clock speed (caches on)

The table deliberately contains columns for 1.2GHz and 1.6GHz. The thing is that NXP's documentation claims that even 1.8GHz is a so-called «overdrive» of this SoC's ALU. And my Linux system (NXP's BSP) confirms that: it works at two speeds only, 1.2GHz and 1.6GHz. These speeds are marked green, as normal or standard. I've measured both standard speeds and the highest one I've managed to make the SoC work at (2.2GHz), to show the difference between the clock speeds actually used and the maximum I've got out of this SoC's ALU in single-core mode.

The result


	Barium No-Boot V0.2 (iMX8MP)
	Build: 22:00:00, Mar 12 2025
	Initial PC: 0000000000920000
	BootROM SP: 0000000000916ED0
	Current EL: 0000000000000003
	Running at: 2200MHz

In previous posts we were reading, planning, preparing and learning. Today we've learned how to use GPIO on this SoC and how to configure the ARM PLL speed. But today is a special day: we have a nice and exceptional result, a real practical one. We've significantly revved up (overclocked) iMX8MP and have maybe made the only iMX8MP running at 2.2GHz! Disclosing undocumented features is one more benefit of bare-metal studying (or exploring).

You can clone the final repository from Barium No-Boot (iMX8MP) (see «Stage II» directory).

16/01/2025

The real «Hello World» from embedder (2): practice, prepare

Chapter 2 Practice, prepare

Between theory and code

In the previous post we reviewed bare-metal development: what it is, its theory, its workflow, what information we need and where to get it, its benefits, bottlenecks and limitations. Most of us somehow know how a SoC is organised and how it works, from theory (school or university) and/or practice (working at a high level or via some HAL). The next post will describe the process well known to all of us: coding. But what lies between those areas of theory and practice? What is between knowing how PLLs and clocks work, knowing that UART is configured by writing some values to some registers, and the process of writing device-tree nodes and calling HAL functions? How does the magic of making a certain SoC function really arise? This post describes the process of preparing to code: the practice of getting information, gathering it and planning tasks. Excuse me for not feeling sorry for you: not a single line of code will be written in this post, but I will describe this (middle) part of the job down to the single bit, so that you know clearly enough how it is done.

The plan

The plan we need to carry out to reach our goals looks like:
1. Choose the SoC we will work on and get its documentation.
2. Find out the condition BROM leaves SoC in, and see what is initialised for us and what is not.
3. Find out what exactly PLLs and clocks we have to configure to start up hardware-blocks.
4. Configure UART TX and RX pads.
5. Configure UART hardware-block.
6. Output some string.
7. Finish with infinite loop outputting characters received from UART.
8. Make proper boot image and put it in the place BROM of our SoC expects it to be.

The practice

Let's get it started.

1. Choose the SoC we will work on and get its documentation.

We will work on the NXP i.MX 8M Plus (iMX8MP). This is a multi-core (mine is quad-core) ARM Cortex-A53 (ARMv8-A) SoC with an additional Cortex-M7 core: just the kind of thing we need, and interesting to play with. Bear in mind what we discussed in the previous post: bare-metal development is strongly tied to a particular SoC. Thus, if you are about to develop a stand-alone application for any other SoC, the practice in this post is not directly applicable; you can read it as an example of the workflow only. Once we have chosen the SoC, we download the datasheet describing it. In our case it is the «i.MX 8M Plus Applications Processor Reference Manual» (IMX8MPRM.pdf, I have Rev. 3, 08/2024). And, to entertain you a little, here is the SoC itself:

NXP iMX 8M Plus

2. Find out the condition BROM leaves SoC in, and see what is initialised for us and what is not.

To find out the condition BROM leaves the SoC in, we look for the section which describes what BROM enables and what it does not. This is section 6.1.4.2, «Boot block activation» (p. 748). This section states that the BROM of iMX8MP activates (in addition to some others) these blocks: Clock Control Module (CCM) and Input/Output Multiplexer Controller (IOMUXC). We will boot from an SD card, thus the Ultra-Secure Digital Host Controller (USDHC) will be enabled as well (but this is obvious). Let's proceed to the clocks BROM has initialised for us. In the next section, 6.1.4.3, «Clocks at boot time» (p. 749), table 6-3, «PLL setting by ROM», we see which PLLs are enabled: ARM PLL at 1GHz, System PLL1 at 800MHz and System PLL2 at 1GHz. Let's remember the PLLs we have enabled: System PLL1 and System PLL2. We don't care about the ARM PLL at this point, as it controls the ALU only and is already enabled and configured by BROM. Proceed to Table 6-6, «CCGR setting by ROM» (p. 750), and see which clocks are enabled and which are not. Scrolling down to the UARTs (p. 752), we see that BROM enables none of them. iMX8MP boards usually use the second UART (UART2) for debug, so let's remember that its clock number is 74 (CCM_CCGR74).

3. Find out what exactly PLLs and clocks we have to configure to start up hardware-blocks.

From the previous paragraph we see that none of the UART hardware blocks is enabled by BROM, so we have to find out how to enable one ourselves. Let's start with the CCM structure, which is described in section 5.1, «Clock Control Module (CCM)» (p. 227). Looking at Figure 5-1, «CCM Block Diagram» (p. 228), we see that clock ticks pass from the clock generators (on the left side) via the PLLs (or bypassing them) to the CCM's Clock Root Generator, which has clock slices; the clock slices form clock roots, which finally come out to the hardware blocks (on the right side). This scheme looks more complicated than the one we discussed in the previous post. The idea of clock roots becomes clearer if we look at section 5.1.2, «Clock Root Selects» (p. 228). Let's scroll down to Slice Index №95 (p. 241): 95 is the clock root slice of UART2. In the column «Source Select» we see that it can be driven by several outputs. We will drive our UART by SYSTEM_PLL2_DIV5. Let's remember its value: 010b. As we already know, System PLL2 is enabled at 1GHz. Here we need to configure its outputs and ensure that its DIV5 output is enabled (1GHz divided by 5 is 200MHz; we'll need this value later). This is done by configuring the System PLL2 General Function Control Register, which is described in section 5.1.8.32, «SYS PLL2 General Function Control Register» (p. 509). We will set all PLL_DIVx_CLKE bits and PLL_CLKE. The address of this register is ANAMIX base + 104h. After we have enabled the PLL outputs, we have to select the proper clock root for the UART2 hardware block. This is done by configuring CCM_TARGET_ROOT №95, described in section 5.1.7.10, «Target Register (CCM_TARGET_ROOTn)» (p. 412). We see that we need to set the enable bit (bit 28) to 1 and the MUX bits (bits 24-26) to the value we've remembered earlier, 010b. The address of this register is CCM base + 8000h + 95 (the slice index we need) * 80h. The resulting value we have to write to the register is 12000000h.
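
The offset and value arithmetic for CCM_TARGET_ROOT №95 can be sketched in C as follows. The macro and function names are hypothetical; only the numbers come from the text above, and the CCM base address is deliberately left out because it is not quoted here.

```c
#include <assert.h>
#include <stdint.h>

#define CCM_TARGET_ROOT_FIRST  0x8000U      /* offset of slice 0 from CCM base */
#define CCM_TARGET_ROOT_STRIDE 0x80U        /* distance between slices */
#define UART2_ROOT_SLICE       95U
#define TARGET_ROOT_ENABLE     (1U << 28)   /* enable bit */
#define TARGET_ROOT_MUX_SHIFT  24           /* MUX field: bits 24-26 */
#define MUX_SYSTEM_PLL2_DIV5   0x2U         /* 010b */

/* Offset of CCM_TARGET_ROOTn from the CCM base address. */
static inline uint32_t ccm_target_root_offset(uint32_t slice)
{
    return CCM_TARGET_ROOT_FIRST + slice * CCM_TARGET_ROOT_STRIDE;
}

/* Register value: clock root enabled, source selected by `mux`. */
static inline uint32_t ccm_target_root_value(uint32_t mux)
{
    assert(mux <= 0x7U);   /* MUX is a 3-bit field */
    return TARGET_ROOT_ENABLE | (mux << TARGET_ROOT_MUX_SHIFT);
}
```

For slice 95 the offset is 8000h + 2F80h = AF80h, and selecting SYSTEM_PLL2_DIV5 gives 10000000h | 02000000h = 12000000h, the value stated above.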

4. Configure UART TX and RX pads.

Well, PLLs and clocks are configured and enabled. Now we have to find out how to configure the UART pads. First, let's set the proper AF for our UART. Alternative functions are described in table 8.1.1.1, «Muxing Options» (p. 1287). Let's scroll to UART2 (p. 1307). Here we see that the UART2_RX port can be routed to one of these pads: UART2_RXD, SD2_DATA0, SD1_DATA3 and SAI3_TXFS. The first one is what we need. The UART2_TX port can be routed to one of these pads: UART2_TXD, SD2_DATA1, SD1_DATA2 and SAI3_TXC. Again, the first one is what we need. Both UART2_TXD and UART2_RXD have a mode called ALT0. Let's proceed to section 8.2.4, «IOMUXC Memory Map/Register Definition» (p. 1344). In this table we need to find our UART2_RXD and UART2_TXD: they are at the bottom of page 1350 and at the top of page 1351 respectively. Here we see that their reset values are both 5h (remember that value for a while) and their absolute addresses are 30330228h for UART2_RXD and 3033022Ch for UART2_TXD. Then click on the link in the right column. From section 8.2.4.134, «SW_MUX_CTL_PAD_UART2_RXD SW MUX Control Register» (p. 1540), and section 8.2.4.135, «SW_MUX_CTL_PAD_UART2_TXD SW MUX Control Register» (p. 1542), we see that MUX_MODE is represented by the lowest 3 bits of these registers. We also see that 5h (the value we've remembered recently) corresponds to 101b. That means that after reset both pads we need are routed to functions we don't need, GPIO5 24 and GPIO5 25 in this case. Thus, we have to configure those pads for our needs and set both to zero (ALT0): we write zeroes to 30330228h and 3033022Ch.

But that's not all we have to do to make the UART pads function correctly. In addition to setting the AF, we need to configure the physical parameters of those pads. This is done by setting two SW_PAD_CTL registers: the UART2 RXD pad control register, section 8.2.4.286, «SW_PAD_CTL_PAD_UART2_RXD SW PAD Control Register» (p. 1837), and the UART2 TXD pad control register, section 8.2.4.287, «SW_PAD_CTL_PAD_UART2_TXD SW PAD Control Register» (p. 1839). After inspecting these descriptions, we conclude that zero is a good value for both of them.

And there is one last step left. UART RX is a little special, because it works as an input function. Thus, we need to select an input for it. This is done by configuring the DAISY register, which is described in section 8.2.4.376, «UART2_UART_RXD_MUX_SELECT_INPUT DAISY Register» (p. 1922). Here we see that 110b, «SELECT_UART2_RXD_ALT0 — Selecting Pad: UART2_RXD for Mode: ALT0» (p. 1923), is what we need. Thus, we'll write 6h to 303305F0h.
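
The register writes collected in this step can be summarised as a small table. This is a sketch, not the code from the next post: the struct, macro and function names are hypothetical, and the two pad control registers are not listed because their addresses are not quoted above.

```c
#include <assert.h>
#include <stdint.h>

struct reg_write {
    uint32_t addr;    /* absolute register address */
    uint32_t value;   /* value to write */
};

#define MUX_MODE_MASK 0x7U   /* MUX_MODE: lowest 3 bits */
#define MUX_ALT0      0x0U

/* Addresses and values as derived in the text. */
static const struct reg_write uart2_pad_setup[] = {
    { 0x30330228U, MUX_ALT0 },   /* SW_MUX_CTL_PAD_UART2_RXD: ALT0 */
    { 0x3033022CU, MUX_ALT0 },   /* SW_MUX_CTL_PAD_UART2_TXD: ALT0 */
    { 0x303305F0U, 0x6U },       /* UART2 RXD input select (DAISY): 110b */
};

/* Extract the MUX_MODE field from a SW_MUX_CTL register value. */
static inline uint32_t mux_mode(uint32_t reg_value)
{
    return reg_value & MUX_MODE_MASK;
}
```

mux_mode(5h) returns 5, the reset routing that we have to replace with ALT0. On the target, each table entry would be passed to a write helper in the same (value, address) order as the raw_writel() calls used elsewhere in this series.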

5. Configure UART hardware-block.

The last thing left is to configure the UART hardware block. This is done by the familiar steps: reading register values (optional), modifying those values (optional) or forming them from scratch, writing values to registers, and waiting for condition flags (optional). Actually, UART is a simple hardware block, so I will not explain the specific process of configuring it; you will see it in the code presented in the next post.

6. Output some string.

After UART is configured and running, the game starts. Now we are ready to output some strings. This is done by writing a byte to some address (a UART register) and checking the TX empty flag to avoid a buffer overrun. Here I'll skip the detailed description of this process too; see it in the code.

7. Finish with infinite loop outputting characters received from UART.

We will finish with an infinite loop outputting characters received from UART. This is done by checking the RX empty flag and reading the received byte (from a UART register) when the flag becomes unset.
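
Steps 6 and 7 boil down to polling two flags. The sketch below shows the idea on a purely hypothetical register layout: the struct, the flag bits and the function names are invented for illustration and do not match the real iMX8MP UART registers, which will appear in the code of the next post.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical register view, NOT the real iMX8MP UART layout. */
struct uart_regs {
    volatile uint32_t rx_data;
    volatile uint32_t tx_data;
    volatile uint32_t status;
};

#define ST_TX_EMPTY (1U << 0)   /* hypothetical: transmitter can take a byte */
#define ST_RX_EMPTY (1U << 1)   /* hypothetical: nothing has been received */

/* Wait until the transmitter is empty, then send one byte. */
static inline void uart_putc(struct uart_regs *u, uint8_t c)
{
    assert(u != NULL);
    while (!(u->status & ST_TX_EMPTY))
        ;   /* busy-wait to avoid overrunning the TX buffer */
    u->tx_data = c;
}

/* Return the received byte, or -1 while the RX empty flag is set. */
static inline int uart_try_getc(struct uart_regs *u)
{
    assert(u != NULL);
    if (u->status & ST_RX_EMPTY)
        return -1;
    return (int)(u->rx_data & 0xFFU);
}

/* One pass of the echo loop: forward a received byte, if there is one. */
static inline int uart_echo_once(struct uart_regs *u)
{
    int c = uart_try_getc(u);

    if (c >= 0)
        uart_putc(u, (uint8_t)c);
    return c;
}
```

On the target, the final infinite loop would simply call uart_echo_once() forever on a pointer to the UART register block; on a host the same functions can be exercised with an ordinary struct in RAM.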

8. Make proper boot image and put it in the place BROM of our SoC expects it to be.

To make our application load and run on the SoC we've chosen, we have to prepare a proper boot image and put it in the place the BROM of our SoC expects it to be. This is done by a tool (mkbb_imx8) derived from NXP's standard mkimage_imx8. I will not explain how it works or how it was developed, but in the next post I will show how to use it to generate the boot block for our SoC (and how and where to place it).

Conclusion

We've made it. We have gathered all the information we need to start writing code for the iMX8MP SoC, and we are ready to proceed. And now you know what lies between the theory and the daily routine of an embedded developer. In the next post we will develop the application: we will write assembly language and C code, compile it, link the objects into a binary, strip it, make a bootable image of it, and put it in the right place on a storage medium. It'll be a very small program that runs on the iMX8MP and on this SoC only. But it will give us a platform for learning the ARM64 machine. You'll be able to play with the ARMv8 machine from the ground up, both in assembly language and in C: start (kick) its cores or not, switch exception levels or not, output register values, and so on. Finally, we'll have a wide-open window to the ARM64 machine!