In the previous posts we built a minimal application, just enough for studying and playing, that is able to output debug information. It sets up a lot of PLLs and clocks, except the one that is probably the most interesting for most of us: the ARM PLL, the PLL which controls the speed of the ALU. That is what we will research in this part. The announcement of this post is deliberately partial, because it will be more interesting to read without knowing the complete plan.
The research plan
We can't measure the ALU clock directly because it is not routed to any pin outside the SoC. So how can we tell whether it has changed at all? We need something that depends on ALU speed and that we can measure. What could it be? UART? No. We configure the UART to output data at a particular baud rate, but that has nothing to do with ALU speed; the UART configuration affects only its TX and RX pins. Thus we need a different «interface» to the SoC that lets us measure its clock.
Let's turn to theory for a moment. A so-called ALU «cycle» is one tick of the ALU clock, but one tick is not one instruction: an instruction may take from one to thousands of cycles. (Instructions that take zero cycles are out of scope of this post.) What we do know is that any particular instruction takes the same number of cycles at any ALU speed. Thus the real execution time of any block of instructions is inversely proportional to the ALU clock. We just need some measurable points around a «reference» block of instructions. The most obvious such block is an empty infinite loop, which translates into an unconditional branch back to itself and leaves even the link register intact.
What about the rest of our setup? The clearest way to measure something on a SoC is GPIO. Hence in this post we will initialise the GPIO controller, one of its banks and one pad of that bank, and put a GPIO toggling function into an infinite loop. After that we can measure the pulses with a scope, change the ARM PLL speed while leaving the reference code intact, and measure the pulses again.
We have already reviewed the workflow for embedded systems without any BSP, bootloader or kernel (bare-metal) in previous posts, so I will not repeat those details in this and further parts. You will see what is done, and how, in the code. Some code lines have comments referencing datasheet page numbers. If you ever need to change the GPIO number, for example, look in the datasheet around the page numbers you see in the code.
The practice
Today we will add two files to our project: gpio.c and gpio.h. I have a handy GPIO on my board, GPIO number 16 on bank 4. So we add GPIO initialisation and set its direction in the barium_main() function:
/* Init GPIO and set 16th of fourth bank as output: */
gpio_init();
gpio_set_dir(GPIO_BANK4, 16, GPIO_OUT);
After that we add our measurement code, the reference code mentioned earlier:
/* Generate impulses for measurements: */
loop_meas:
gpio_toggle_val(GPIO_BANK4, 16);
for (lVal = 0; lVal < IMPULSE_LEN; lVal++);
goto loop_meas;
IMPULSE_LEN can be any value; I chose 1000000 so that the pulses fit comfortable scope view settings.
Build, write to the SD card, connect the scope, start the board and see:
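For readers who don't have the repository at hand, here is a rough idea of what such GPIO helpers can look like. This is a minimal sketch, not the actual gpio.c: the struct gpio_bank layout (data register first, direction register second, as on other i.MX GPIO blocks) and the names sketch_gpio_set_output(), sketch_gpio_toggle() and sketch_delay() are assumptions made for illustration. The bank is passed in as a pointer, so the code can be tried on a PC against an ordinary variable before it is pointed at real hardware.

```c
#include <stdint.h>

/* Hypothetical register view of one GPIO bank. Only the first two
 * registers (data and direction) are modelled; the layout is an assumption. */
struct gpio_bank {
    volatile uint32_t dr;    /* data register: one bit per pad */
    volatile uint32_t gdir;  /* direction register: 1 = output */
};

/* Configure one pad of the bank as an output. */
void sketch_gpio_set_output(struct gpio_bank *bank, unsigned pad)
{
    bank->gdir |= (UINT32_C(1) << pad);
}

/* Flip the output level of one pad. */
void sketch_gpio_toggle(struct gpio_bank *bank, unsigned pad)
{
    bank->dr ^= (UINT32_C(1) << pad);
}

/* Busy-wait for a fixed number of iterations. The counter is volatile,
 * so an optimising compiler cannot delete the empty loop. */
void sketch_delay(uint32_t iterations)
{
    for (volatile uint32_t i = 0; i < iterations; i++) {
        /* intentionally empty */
    }
}
```

Note the volatile counter in sketch_delay(): with optimisation enabled, a compiler is allowed to throw away an empty loop over a plain variable, which would ruin the measurement.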
Here we see that the period is 568ms. The period by itself is uninformative: it is just how much time our SoC spends executing our reference code with the default ARM PLL settings. From «PLL setting by ROM» we remember that the ARM PLL is set to 1GHz. Let's reconfigure it and see what measurements we get.
The ARM cores of the iMX8MP are driven by the so-called PLL1416x. I could not find any documentation describing it, so we will make do with the SoC's datasheet (ARM PLL), find the registers we need to write to, and generate the exact values using the spreadsheet I put together from information gathered from various sources. You will find the spreadsheet file in the repository and can play with it. The method is simple: change the Main divider (the biggest number in the leftmost column) in steps of 25 and watch the «Clock (MHz)» column. The spreadsheet's formulas do the rest. The table displays the register value in hex and decimal, plus two values, Pre div and Scaler. We can configure the ARM PLL either by combining three values (Main div, Pre div and Scaler) in a formula, or by writing the exact value from the rightmost columns.
All of this code goes into clocks.c/clocks.h. I use the three-value formula instead of a single exact value, simply because that is how I did it during my research. Let's rev our SoC up to 1.8GHz, its claimed maximum (p. 95). According to the spreadsheet we use the coefficients 225, 3, 0:
raw_writel((225 << 12) | (3 << 4) | (0), CCM_ANALOG_ARM_PLL_FDIV_CTL0);
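If you want to check the coefficients without opening the spreadsheet, the relation can be sketched in a few lines of C. Treat this as an assumption rather than NXP's documented formula: it is the relation commonly given for this PLL family (reference clock times Main div, divided by Pre div times two to the power of Scaler), with the reference assumed to be the 24MHz oscillator. The function names arm_pll_mhz() and arm_pll_fdiv_ctl0() are made up for this example. It does reproduce the 1.8GHz figure for 225, 3, 0.

```c
#include <stdint.h>

/* Assumption: the PLL reference is the 24 MHz oscillator. */
#define PLL_REF_MHZ 24u

/* Output clock in MHz: Fref * main_div / (pre_div * 2^scaler). */
uint32_t arm_pll_mhz(uint32_t main_div, uint32_t pre_div, uint32_t scaler)
{
    return (PLL_REF_MHZ * main_div) / (pre_div << scaler);
}

/* Pack the three coefficients the same way the raw_writel() call does. */
uint32_t arm_pll_fdiv_ctl0(uint32_t main_div, uint32_t pre_div, uint32_t scaler)
{
    return (main_div << 12) | (pre_div << 4) | scaler;
}
```

With these helpers, arm_pll_mhz(225, 3, 0) gives 1800 and arm_pll_fdiv_ctl0(225, 3, 0) gives 0xE1030, the same value that the raw_writel() call writes.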
Build, write to the SD card, start the board and see:
And what do we see here? 500ms. The period has shrunk, but not as much as we expected: 568ms at 1GHz should become 568/1.8, which is about 316ms at 1.8GHz! What is going on here? To make a long story short (this is still not the most interesting part), I measured several speeds and none of the periods turned out to be proportional to the ARM clock. A short summary is shown in the following table, together with a few calculations:
We see the clocks we set, the periods we measured, and two rows showing what the values should be if the reference (the real, true value) were the measurement at the slowest or at the fastest clock. Calculated in either direction, the expected values scale linearly with clock speed. The measured values, however, are non-linear and look as if they were only slightly related to clock speed. The table shows that something is really unclear or even wrong.
What could have gone wrong? The NXP PLL1416x is undocumented, so we don't know exactly how it works and can't be sure what speeds it really sets. Another possible reason for such results is that our reference code was modified. But we can't examine the PLL, and we are sure our code stays intact between builds. So let's look for a different cause of this ARM ALU behavior. We remember that at this stage the ALU executes out of OCRAM, which is slow relative to the ALU clock. Let's assume that the non-linearity between clock speed and measured period is caused by OCRAM. What can we do about it? We can use caches. According to the specification, the iMX8MP has 32kB of instruction cache and the same amount of data cache. Therefore our tiny 1-2kB application will fit in the SoC's cache entirely. Let's turn on both caches by adding this code to our start_64.s file:
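The expectation used here is plain inverse scaling: the same number of cycles, executed at a higher clock, takes proportionally less time. As a small sketch (the helper name expected_period_ms() is mine, not from the repository):

```c
/* Period expected at new_mhz, given a period measured at ref_mhz,
 * assuming execution time scales inversely with the clock. */
double expected_period_ms(double ref_period_ms, double ref_mhz, double new_mhz)
{
    return ref_period_ms * ref_mhz / new_mhz;
}
```

For the numbers above, expected_period_ms(568.0, 1000.0, 1800.0) comes out at roughly 315.6, which is the 316ms we hoped for instead of the measured 500ms.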
# Load System Control Register (EL3):
mrs x3, SCTLR_EL3
# Turn on Data Cache:
orr x3, x3, #(1 << 2)
# Turn on Instruction Cache:
orr x3, x3, #(1 << 12)
# Set System Control Register (EL3):
msr SCTLR_EL3, x3
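For those who read C more easily than assembly, the bit manipulation above boils down to the following. This is only an illustration of which bits get set: the macro names SCTLR_C and SCTLR_I and the function sctlr_with_caches() are chosen for this example, and on the real target the value still has to be read and written back with mrs/msr as shown above.

```c
#include <stdint.h>

#define SCTLR_C (UINT64_C(1) << 2)   /* data cache enable bit */
#define SCTLR_I (UINT64_C(1) << 12)  /* instruction cache enable bit */

/* Return the given SCTLR value with both cache enable bits set;
 * all other bits are left untouched. */
uint64_t sctlr_with_caches(uint64_t sctlr)
{
    return sctlr | SCTLR_C | SCTLR_I;
}
```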
Let's check what we get at the default 1GHz now:
The change turned out to be so dramatic that I had to retune my scope to get a good-looking picture! Now our reference code executes in 34ms at 1GHz instead of 568ms with caches turned off at the same ALU clock speed, roughly 17 times faster! And here is the new table, showing that not only is everything faster, but it is also linear in both directions:
Now everything is in order and clear, and the hypothesis is confirmed: OCRAM timings were distorting the ARM ALU speed measurements. But there is more to explore. As I mentioned earlier, we have a tool for calculating ARM PLL speeds, the spreadsheet. And maybe you noticed that the picture above («iMX8MP ARM PLL Settings») shows some clock speeds far above the standard 1.8GHz. Yes, I have tried them, and here is the result I got:
The table contains columns for 1.2GHz and 1.6GHz, and this is on purpose. The thing is that NXP's documentation claims that even 1.8GHz is a so-called «overdrive» mode for this SoC's ALU. My Linux system (NXP's BSP) confirms that: it runs at two speeds only, 1.2GHz and 1.6GHz. I measured both standard speeds and the highest speed I managed to make the SoC work at, to show the difference between the clock speeds actually used and the maximum I got out of this SoC's ALU.
The result
Barium No-Boot V0.2 (iMX8MP)
Build: 22:00:00, Mar 12 2025
Initial PC: 0000000000920000
BootROM SP: 0000000000916ED0
Current EL: 0000000000000003
Running at: 2200MHz
In previous posts we were reading, planning, preparing and learning. Today we have learned how to use GPIO on this SoC and how to configure the ARM PLL speed. But today is a special day: we have a nice and unexpected result, and a really practical one, because we have significantly revved up the iMX8MP. Uncovering undocumented features is one more benefit of bare-metal study (or exploration).
You can clone the final repository from Barium No-Boot (iMX8MP) (see the «Stage II» directory).