DaftSoft: AArch64

12/03/2025

Barium No-Boot (4): measurements, rev up iMX8MP

Introduction

In the previous posts we built a minimal application that is nevertheless enough for studying and playing: it can output debug information, and it sets up some PLLs and clocks. The one it leaves alone is probably the most interesting for most of us: the ARM PLL, the PLL that controls the speed of the ALU. That is what we will research in this part. This announcement is deliberately partial: the post will be more interesting to read if you don't know the complete plan in advance.

The research plan

We can't measure the ALU clock directly because it has no outputs outside the SoC. So how can we tell whether it has changed at all? We need something that depends on ALU speed and that we can measure. What could it be? UART? No. We configure the UART to output data at a particular baud rate, but that has nothing to do with ALU speed. The UART configuration affects its TX and RX pins, but the pulses the UART produces while transferring data are unrelated to the ALU clock. Thus, we need a different «interface» to the SoC that lets us measure its clock.

Let's turn to theory for a moment. An ALU «cycle» is one tick of the ALU clock, but a single clock tick is not equal to an instruction: an instruction may take from one to thousands of cycles. (Instructions that take zero cycles are out of scope of this post.) But we know that any particular instruction takes the same number of cycles at any ALU speed. Thus, the real execution time of any block of instructions scales inversely with ALU speed. We just need measurable points around some «reference» block of instructions. The most obvious such block is an empty loop, which translates into a branch back to itself (even leaving the link register intact) until the countdown variable reaches zero.

What is the rest of our setup? The clearest way to measure something on a SoC is GPIO. Hence, in this post we will initialise the IOMUX Controller and one pad of a particular GPIO bank, and put a GPIO toggle together with an empty loop (which sets the length of the measured pulse) into an infinite loop. After that we can measure the pulses with a scope, change the ARM PLL speed while leaving the reference code intact, and measure the pulses again.

As we covered the bare-metal workflow (no BSP, no bootloader, no kernel) in previous posts, I won't repeat those details in this and further parts. You'll see what is done, and how, in the code. Some code lines have comments referencing datasheet page numbers. If you ever need to change, for example, the GPIO number, you can look in the datasheet around the page numbers given in the code and make the necessary changes according to the appropriate sections.

The practice

This time we need a GPIO. As we noted in previous posts, GPIO is one of the AFs (alternative functions) of a pad or ball. AFs are configured by the IOMUX Controller. Therefore, we start with IOMUXC. First, as in previous posts, we open section 6.1.4.2 «Boot block activation» (p. 706) and check whether IOMUXC is configured by the BROM. We are lucky today: it is initialised by the BROM. Thus, we skip initialisation and configuration of this hardware block and proceed straight to GPIO.

Now we have to see which GPIOs are available and choose the one we will use. I'm working on a DEBIX Model A SBC. In the (most probably fantastic) case that you have the same SBC, you can use the picture below and the same GPIO as in this post. If you have another SBC, you have to find a similar picture for it. What you need is called a «pinout», «header pinout» or «gpio header». This information is usually included in the SBC's datasheet or user manual, but sometimes it is easier and faster to search the Internet for something like «_your_sbc_name_ gpio pinout». Let's see what pinout we have:

DEBIX Model-A GPIO header pinout

There is a handy GPIO on this board: GPIO number 11 on bank 1 (pin 29). Today we add two new files to our project, gpio.c and gpio.h. Then we add GPIO initialisation and direction setup to the barium_main() function:


	/* Init GPIO and set 11th of first bank as output */
	gpio_init();
	gpio_set_dir(GPIO_BANK1, 11, GPIO_OUT);

After that we add our measurement code — the reference code we have mentioned earlier:


	/* Generate impulses for measurements */
loop_meas:
	gpio_toggle_val(GPIO_BANK1, 11);
	for (lVal = 0; lVal < IMPULSE_LEN; lVal++);
	goto loop_meas;

IMPULSE_LEN can have any value, but I've chosen 1 000 000 to fit the loop to comfortable scope view settings.
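
The contents of gpio.c are not shown in this post, so here is a minimal sketch of what the direction and toggle helpers might look like. The register layout (a data register followed by a direction register), the struct, and all `sketch_` names are assumptions made for illustration and are not the code from the repository; check the GPIO chapter of the reference manual for the real offsets. The helpers take a pointer to the register block so they can be tried against an ordinary variable on a PC. The delay helper uses a volatile counter because an optimising compiler is otherwise free to remove an empty loop.

```c
#include <stdint.h>

/* Assumed register layout of one GPIO bank (illustrative only). */
struct gpio_regs {
	volatile uint32_t dr;   /* data register: one bit per pin */
	volatile uint32_t gdir; /* direction register: 1 = output, 0 = input */
};

enum { SKETCH_GPIO_IN = 0, SKETCH_GPIO_OUT = 1 };

/* Set the direction of a single pin, leaving the other pins untouched. */
void sketch_gpio_set_dir(struct gpio_regs *bank, unsigned pin, int dir)
{
	if (dir == SKETCH_GPIO_OUT)
		bank->gdir |= (UINT32_C(1) << pin);
	else
		bank->gdir &= ~(UINT32_C(1) << pin);
}

/* Invert the output level of a single pin. */
void sketch_gpio_toggle_val(struct gpio_regs *bank, unsigned pin)
{
	bank->dr ^= (UINT32_C(1) << pin);
}

/* Busy-wait; 'volatile' keeps the compiler from deleting the empty loop. */
void sketch_delay(uint32_t count)
{
	for (volatile uint32_t i = 0; i < count; i++)
		;
}
```

On real hardware the `bank` pointer would be the base address of the GPIO bank taken from the reference manual; here it can simply point at a local struct.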

Build, write to SD-card, connect scope, start the board and see:

Measure at default ARM PLL settings

Here we see that the period is approximately 570ms. The period by itself is uninformative: it is just how much time our SoC spends executing our reference loop at the default ARM PLL settings. From «PLL setting by ROM» we remember that the ARM PLL is set to 1GHz. Let's reconfigure it and see what measurements we get.

The ARM cores of the iMX8MP are driven by the so-called PLL1416x. I could not find any documentation describing it, so we will make do with the SoC's datasheet (ARM PLL), find the registers we need to write to, and generate the exact values using a spreadsheet I put together from information gathered here and there and from my own experiments. You'll find the spreadsheet file in the repository and can play with it. The method is simple: you change the Main divider (the biggest number in the leftmost column) in steps of 25 and watch the «Clock (MHz)» column. The rest is done by the spreadsheet formulas. The table displays the register value in hexadecimal and decimal, plus two more values, Pre div and Scaler. We can configure the iMX8MP ARM PLL either by combining the three values (Main div, Pre div and Scaler) in a formula, or by writing the exact value from the rightmost columns.

iMX8MP ARM PLL Settings

The whole code is added to the files clocks.c and clocks.h. I use the three-value formula instead of a single exact value, simply because that is how I did it during my research. Let's rev our SoC up to 1.8GHz, its stated maximum (p. 95). According to the spreadsheet, we use the coefficients 225, 3, 0:


        raw_writel((225 << 12) | (3 << 4) | (0), CCM_ANALOG_ARM_PLL_FDIV_CTL0);
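
The spreadsheet formulas themselves are not reproduced in this post. As a rough sketch, one formula that agrees with the numbers quoted here (Main div 225, Pre div 3, Scaler 0 giving 1800MHz) is Fout = Fref * Main / (Pre * 2^Scaler) with a 24MHz reference. Both the formula and the 24MHz reference are assumptions on my side, as are the `sketch_` names; only the bit positions (12, 4, 0) are taken from the raw_writel() line above.

```c
#include <stdint.h>

/* Assumption: the PLL reference input is a 24 MHz oscillator. */
#define SKETCH_PLL_REF_MHZ 24u

/*
 * Output clock in MHz: Fref * main_div / (pre_div * 2^scaler).
 * The caller must pass a non-zero pre_div.
 */
uint32_t sketch_pll_mhz(uint32_t main_div, uint32_t pre_div, uint32_t scaler)
{
	return (SKETCH_PLL_REF_MHZ * main_div) / (pre_div << scaler);
}

/* Pack the three fields the same way as the raw_writel() call above. */
uint32_t sketch_pll_fdiv_ctl0(uint32_t main_div, uint32_t pre_div, uint32_t scaler)
{
	return (main_div << 12) | (pre_div << 4) | scaler;
}
```

Under these assumptions, sketch_pll_mhz(225, 3, 0) gives 1800 and sketch_pll_fdiv_ctl0(225, 3, 0) gives 0xE1030, and changing the Main divider by 25 moves the clock by 200MHz.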

Now build, write to SD-card, start the board and see:

Measure at 1.8GHz

And what do we see here? 507ms. The period has shrunk, but not as much as expected: 570ms at 1GHz should become 570/1.8, which is about 317ms at 1.8GHz! What is going on here? To make a long story short (this is still not the most interesting part of this post), I measured several different speeds and none of the periods turned out to be proportional to the ARM clock. A short summary is shown in the following table, together with a few calculations:

Non-linearity of ARM ALU clock speed

We see the clocks we set, the periods we measured, and two rows showing what the values should be if the reference (the true value) were taken at the slowest or at the fastest clock. Whichever direction we calculate from, the calculated values scale linearly with clock speed. The measured values, however, are non-linear and look barely related to clock speed at all. The table shows that something is really unclear or even wrong here.

What could go wrong? The NXP PLL1416x is undocumented, so we don't know exactly how it works and can't be sure what speeds it really sets. Another possible reason is that our reference code was modified. But we can't examine the PLL, and we are sure our code stays intact between builds. So let's look for a different cause of this behaviour.

We remember that at this stage the ALU runs from OCRAM, the slowest of all the RAMs. Let's assume that the non-linearity between clock speed and measured period is caused by OCRAM access speed. What can we do now? We can use caches. According to the specification, the iMX8MP has 32kB of instruction cache and the same amount of data cache. Therefore our tiny 1-2kB application fits entirely in the SoC's cache. Let's turn on both caches by adding this code to our start_64.s file:


	# Load System Control Register (EL3):
	mrs x3, SCTLR_EL3
	# Turn on Data Cache:
	orr x3, x3, #(1 << 2)
	# Turn on Instruction Cache:
	orr x3, x3, #(1 << 12)
	# Set System Control Register (EL3):
	msr SCTLR_EL3, x3

Let's check what we get on default 1GHz now:

ARM ALU at 1GHz with caches on

The change turned out to be so dramatic that I had to re-tune my scope to get a pretty picture! Now our reference code executes in 36ms at 1GHz, instead of 570ms with caches off at the same ALU clock speed! And here is the new table, showing that not only is everything faster, it is also linear in both directions:

Linearity of ARM ALU clock speed (caches on)

Now everything is in order and clear. Whether calculated from the lowest clock (big to small) or from the highest (small to big), the values agree with each other and the relation is linear: 600MHz is three times faster than 200MHz, so 180ms/3 = 60ms; 1800MHz is nine times faster than 200MHz, so 180ms/9 = 20ms; and so on. The hypothesis is confirmed: OCRAM timings were distorting the measurements. Now we have a formula to set the ARM ALU clock speed, a measurement tool, and nice, clean results.
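
The linear check used throughout this post can be written down as a tiny helper: if execution time depends only on the clock, a period measured at one frequency predicts the period at any other. This is just a sketch of that arithmetic; the function name is made up for illustration.

```c
/*
 * Expected period at new_mhz, assuming execution time scales inversely
 * with the clock. Inputs are a measured period and the clock it was
 * measured at.
 */
double sketch_expected_ms(double measured_ms, double measured_mhz, double new_mhz)
{
	return measured_ms * measured_mhz / new_mhz;
}
```

Feeding it the caches-on measurement (36ms at 1000MHz) gives 180ms at 200MHz, 60ms at 600MHz and 20ms at 1800MHz. Feeding it the caches-off measurement (570ms at 1000MHz) gives roughly 317ms at 1800MHz, which is the value the first experiment failed to reach.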

But there is more to explore. As I mentioned earlier, we have a tool for calculating ARM PLL speeds: the spreadsheet. And maybe you noticed that the picture above («iMX8MP ARM PLL Settings») shows some clock speeds far above 1.8GHz. Let's check those too. Here is the result:

Linearity of ARM ALU clock speed (caches on)

The table deliberately contains columns for 1.2GHz and 1.6GHz. The thing is that NXP's documentation describes even 1.8GHz as an «overdrive» mode of this SoC's ALU. My Linux system (NXP's BSP) confirms that: it runs at only two speeds, 1.2GHz and 1.6GHz. These speeds are marked green, as normal or standard. I measured both standard speeds and the highest speed at which I managed to make the SoC work (2.2GHz), to show the difference between the clock speeds actually used and the maximum I got out of this SoC's ALU in single-core mode.

The result


	Barium No-Boot V0.2 (iMX8MP)
	Build: 22:00:00, Mar 12 2025
	Initial PC: 0000000000920000
	BootROM SP: 0000000000916ED0
	Current EL: 0000000000000003
	Running at: 2200MHz

In previous posts we were reading, planning, preparing and learning. Today we've learned how to use GPIO on this SoC and how to configure the ARM PLL speed. But today is a special day: we have a real, practical and exceptional result. We've significantly revved up (overclocked) the iMX8MP and maybe created the only iMX8MP running at 2.2GHz! Uncovering undocumented features is one more benefit of bare-metal study (or exploration).

You can clone the final repository from Barium No-Boot (iMX8MP) (see «Stage II» directory).

17/12/2024

The real «Hello World» from embedder (1): preface, theory

Chapter 1 Preface, theory

Preface

Any learning process eventually comes down to examples and practice. In software development, practice starts with the so-called «Hello World». «Hello World» is the very first code that every software developer writes as the first step on any platform, API, or framework. Embedded software engineers are no exception.

There are a lot of «Hello Worlds» out there on the Internet and in books, in every kind of programming language, API, and framework. But the vast majority of them are pure software: Java, JS, C++, etc. Thus, embedded software developers who want to study machine architecture, get in touch with it and play with it suffer from a lack of examples. They have no starting point.

You can argue with that: there are examples in assembly language for all types of architectures, and assembly language should let us study machine architecture and its behaviour. Yes, there are a lot of examples in assembly language, but those examples are for the very highest (more precisely, top) level of the runtime environment: Linux and Mac user-space. This approach to learning machine architecture doesn't lead to an understanding of how it works and how it is organised.

Actually, it gives us only a very narrow slit into machine architecture. Even in kernel-space, our ability to learn the machine is very limited. That's because all the hardware initialisation has already been done for (and before) us by the bootloader and the kernel itself. The second problem is the so-called «concepts and abstractions of operating systems». Those «abstractions» hide a lot of the machine architecture from us, while the «concepts» give rise to a lot of software that serves the OS itself and makes it what it is from the user's point of view. Thus, even working in kernel-space, we learn more about these concepts and principles than about the machine.

Let's go on and discuss our last chance: MCUs (Micro-Controller Units), which are cheap, popular, easy to start with and do let us get as close to the machine as possible. MCUs have always had a crucial difference from full-fledged CPUs and SoCs (Systems on Chip): they have no MMU (Memory Management Unit) and are usually single-core. With the beginning of the ARM64 era, when not only phones and tablets but also laptops, desktops and even servers run on ARM64, MCUs acquired one more drawback: ARM32, the most common architecture among MCUs worth anything at all, became outdated even as a single-core platform for studying. We now live in days when learning the ARM32 architecture is almost useless. It's more a waste of time, because the difference between the current, almost omnipresent ARM64 and the rapidly obsolescing ARM32 is enormous.

Thus, we come to the point where we need a comfortable, big enough window (or better, a door or a nice gate) into a machine that represents the modern ARM64 architecture. Usually this is done by programming for an emulator. Developing for an emulator is like a game because of its abstraction. Games (and gaming) give you minimal or even zero risk at the price of insignificant wins. To get really valuable practice and results, as well as satisfaction, we need a real «Hello World» on real hardware.

Learning a machine architecture (and its behaviour) is done by so-called «Bare Metal» development. «Bare» means that we are working on a machine without an OS, without a kernel, and even without a bootloader. Actually, we develop code that works instead of a bootloader, or runs where and when the very first part of the user bootloader (SPL, the Secondary Program Loader) usually does. «Metal» stands for hardware or machine and means that we work on real hardware, not a virtual machine or an emulator.

In these posts we'll cut a good window for studying a modern ARM64 machine. This will be the «Hello World» from an embedded engineer and a nice toy for an embedded developer.

Theory

Let's start with a little overview of a computing machine, specifically its heart: the CPU or SoC. A SoC provides a set of functions (functionalities); each functionality is provided by its own separate hardware-block. Thus, a SoC consists of, or represents, a set of hardware-blocks. Each hardware-block is driven by a clock; each clock is derived from a PLL (Phase-Locked Loop), a small circuit that generates one or more frequencies from an input clock source, an oscillator. So turning a hardware-block on, besides powering it, is just enabling the clock it is connected to.

Actually, you will not find an element called «clock» on your board. «Clock» as an element is just a concept used by embedded developers. In reality, a «clock» is one of the outputs of a PLL that can be enabled or disabled by software. The real clocking sources are oscillators and the outputs of PLLs. You can see an overview of the clock generating scheme below.

Clock generating scheme

But enabling a clock is not enough to start using the functionality of a hardware-block. In most cases there are two more things. First, SoCs usually provide more functionalities than they have pins, or, as we more often say in embedded software, pads. This leads to the so-called «alternative function» (AF) concept: a pad can be configured for a specific (alternative) function, I2C SDA or UART TX, for example. AFs are configured by the Input/Output Multiplexing Controller (IOMUXC), more commonly called the Pin Multiplexer (PINMUX). PINMUX is a built-in hardware-block, so it has to be turned on (clocked) too, like other hardware-blocks. From the hardware point of view, PINMUX simply routes signals from the SoC's pads to the inputs/outputs of the specified internal hardware-block: to the I2C bus controller or to the UART controller in the example above. In other words, PINMUX switches signals between hardware-blocks inside the SoC on one end and the SoC's pins/balls (or pads) that we can see on the chip package on the other. You can see an example of an alternative functions multiplexing scheme below.

Alternative functions multiplexing scheme

The second thing is configuration of the hardware-block. This is done by code. After a hardware-block is clocked (enabled), it stays in some default state. We need to write specific values to specific registers to make it function according to our needs.

Clocks for hardware-blocks and the corresponding PLLs are enabled by code. Hardware-blocks are configured by code too. But no program can run on a non-clocked CPU. So who starts the main PLLs and clocks that drive the ALU, and how? This functionality is hardcoded into the so-called BootROM, or BROM. BROM is the firmware of the SoC itself; it is located inside the SoC. BROM initialises the essential hardware minimum required to start user code. It starts up a minimal number of hardware-blocks and after that acts as the FPL (see below). BROMs differ greatly in functionality: some just start code from built-in memory without even loading it into RAM (as the BROMs of MCUs do), while others initialise a bunch of hardware-blocks and can even work with filesystems (as the BROM of Raspberry's Broadcom does). The most common state a BROM leaves a SoC in is: one ALU core started, SRAM (or OCRAM, On-Chip RAM) initialised, and one of the boot sources initialised. Initialisation of the remaining functionality is left to user code.

Any SoC uses its own, proprietary FPL (First Platform Loader). The FPL is the part of BROM we discussed earlier. To run our code on a SoC we need a so-called «boot image», which is made out of our code. We need to say a few words about the contents of a boot image. It is not the regular file we get from the compiler, and architecture mismatch is not the only reason. Usually the compiler toolchain produces an ELF (Executable and Linkable Format) file. An ELF file contains a lot of extra information used by the OS or whatever other environment the code runs in. For example, an ELF file can include debug information, symbol names, etc. But when working on bare-metal we have to get rid of all OS-specific, debug and other environmental information. This is because the SoC executes raw code only and treats whatever it is given as a straight, linear stream of code (with some data mixed in). Bare-metal needs raw code to function correctly. If we try to load a whole ELF file as a bare-metal application, most probably we'll get some kind of illegal instruction exception, because the extra information in the ELF file will not match valid instruction encodings. The process of making raw code, or a raw binary, from an ELF file is called stripping.
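
To make the difference between an ELF file and a raw binary concrete, here is a small sketch that tells them apart by the first four bytes. An ELF file always starts with the magic bytes 0x7F 'E' 'L' 'F', while a raw binary starts directly with machine code. The function name is made up for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if the buffer starts with the ELF magic bytes 0x7F 'E' 'L' 'F'. */
int sketch_is_elf(const uint8_t *buf, size_t len)
{
	return len >= 4 && buf[0] == 0x7F && buf[1] == 'E' &&
	       buf[2] == 'L' && buf[3] == 'F';
}
```

A core that starts executing an unstripped file at offset zero would try to decode those magic bytes and the rest of the ELF header as instructions, which is exactly the failure described above. A check like this could be used in a packing script to refuse an image that was not converted to raw code first.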

Now, let's proceed with the FPL. Any FPL expects a boot image in a specific format and in a specific place. Thus, after we've compiled our source code into a set of object files, linked those objects into a single binary, and stripped it down to pure machine code, we have to pack it to comply with the specific requirements of a particular SoC and its BROM. This is done by a tool usually named «mkimage», or sometimes «mkimage_XXX», where XXX is replaced by the name of a SoC or a SoC family. Usually, mkimage adds some headers, which the BROM reads, and sometimes CRCs, to the machine code. The BROM uses this information to verify the boot image. Then we have to put this image in a particular place, usually on an SD card or eMMC at some offset from zero. The offset is needed to make the bootable media also usable as storage media: it leaves space for the partition table and filesystems.

That is what we have on the hardware side. Let's proceed to software. There's not much to do or discuss here. We can work in assembly language forever, which is interesting and can be useful. By limiting ourselves to assembly language we can avoid using a stack. But at some point we will want to move on to C code. At this point we'll need a stack, because the C compiler uses it intensively. Thus, we have to initialise it, and that's all we need to know at the moment. The stack is initialised by setting the stack pointer register to a valid memory address. The stack (usually) grows down, so we have to choose its address accordingly: not above the top of accessible RAM, and with enough room below it that it neither runs off the bottom of RAM nor overwrites our application if the application sits below it.
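
The rules for choosing a stack address can be sketched as a small check. The function below is an illustration only: its name, parameters and the example numbers are hypothetical. It places the stack at the top of RAM, aligns it down to 16 bytes (AArch64 requires a 16-byte aligned stack pointer when it is used for memory access) and refuses layouts where the stack would run into the application.

```c
#include <stdint.h>

/*
 * Pick an initial stack pointer for a downward-growing stack placed at
 * the top of RAM. app_end is the first free address after the
 * application. Returns 0 if no valid address exists.
 */
uint64_t sketch_stack_top(uint64_t ram_base, uint64_t ram_size,
                          uint64_t app_end, uint64_t stack_size)
{
	uint64_t ram_top = ram_base + ram_size;
	uint64_t sp = ram_top & ~UINT64_C(0xF);  /* align down to 16 bytes */

	if (app_end < ram_base || app_end > ram_top)
		return 0;                            /* application is outside RAM */
	if (sp < app_end || sp - app_end < stack_size)
		return 0;                            /* stack would overwrite the app */
	return sp;
}
```

With a hypothetical 64kB RAM at 0x900000, an application ending at 0x902000 and a 4kB stack, the function returns 0x910000. In a real project the same decision is usually made once, by hand or in the linker script, and the result is loaded into SP in the start-up assembly.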

Now let's proceed to our specific task. Let's say we want to make our SBC (Single-Board Computer) output a «Hello World». How can we do that? Let's admit that outputting (drawing) strings on a display is a somewhat complicated task for a bare-metal beginner. So we'll output our «Hello World» via the most common debug interface in the world: UART. UART functionality is provided by a particular hardware-block. So we need to start clocking that hardware-block, set up its pads and configure it.

First, to enable a clock for UART we have to know exactly which clock we need to enable on the exact SoC. As mentioned earlier, every clock is derived from a particular PLL. Hence, we need to find out which PLL provides the clock we need. All this information is given in the datasheet and is rigidly tied to a certain SoC. The second part is to set the AF, that is, to configure the pads for UART TX and RX. Here we need to turn on PINMUX (PLL, clock) and configure the pads we need for the functions we need. The last part, configuration, is done by writing values to memory locations mapped to the registers of a hardware-block: PINMUX (pads) and UART (baud rate, parity, etc.) in our case. What configuration must be written is described in the datasheet and, again, is rigidly tied to a certain SoC.
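
«Writing values to a memory location mapped to a hardware-block» usually boils down to a volatile pointer access. Here is a minimal sketch of such helpers; the `sketch_mmio_` names are made up for illustration, and on a PC they can be exercised against an ordinary variable instead of a real register address.

```c
#include <stdint.h>

/* Write a 32-bit value to a memory-mapped register. */
void sketch_mmio_write32(uintptr_t addr, uint32_t value)
{
	*(volatile uint32_t *)addr = value;
}

/* Read a 32-bit value from a memory-mapped register. */
uint32_t sketch_mmio_read32(uintptr_t addr)
{
	return *(volatile uint32_t *)addr;
}

/* Read-modify-write: change only the bits selected by mask. */
void sketch_mmio_update32(uintptr_t addr, uint32_t mask, uint32_t value)
{
	uint32_t reg = sketch_mmio_read32(addr);

	reg = (reg & ~mask) | (value & mask);
	sketch_mmio_write32(addr, reg);
}
```

The `volatile` qualifier tells the compiler that every access matters and must not be merged, reordered away or dropped, which is what a hardware register needs. The read-modify-write helper is handy for registers where only one field (a pad's function selector, for example) should change.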

After writing this code we have to compile it, link it and strip the resulting file to raw machine code, then make a boot image by forming the BROM header and adding it to the machine code from the previous stage. Then we put our boot image in the specific place on the specific media and power on the SBC.

Theory ends at this point. In the next part, we'll choose a specific SoC and gather the information needed to design the «Hello World» for it. With this information we will lay out the plan of steps for the third part.

28/07/2024

Kernel-Space: asm _ko Hello!

Long time no see, confreres.

This time we will do some crazy (or maybe even mad) things.

As we all know, the Linux kernel is cross-platform software. It supports a very large variety of hardware platforms. Thus it's written in the most «inter-platform» language in the world: plain C. The secret of its portability is the compiler used to build the kernel itself and its modules. The exceptions are small parts that provide hardware-specific functionality that cannot be written in C: SoC/CPU start-up and configuration routines. These small parts of the kernel are written in assembly language.

But today I'll present a skeleton of a whole kernel module written in assembly language, with not a single .c file used. I had this idea for a long time and searched the Internet many times for examples, or at least a starting point or a discussion. But I had no luck: no examples for ARM, x86 or any other architecture. This fact proves the craziness of my idea (and possibly explains why you should not do this in practice). But I've managed to do it.

This module's functionality will be limited to a kind of «Hello World»: it can be loaded and unloaded correctly and does only one thing, printing messages on these events. I've mentioned that no .c files are used in this module, but it looks like a usual kernel module (except for the language it is written in) and is built and works like any other module. It is regular code that you can work with as part of your everyday routine, with no magic and no discomfort. We will talk about ARM64 (or AArch64), but it can easily be rewritten for any other architecture.

I said that there is no magic in this module, but… we know that any object file must have certain sections and information, and has to be built (compiled and linked) according to strict rules. A kernel module is no exception. That was the «magic» I had to uncover to achieve my goal: a regular-looking, developer-friendly kernel module in assembly language.

There is a lot of «magic» in the kernel build system. But my idea was to write a template of a kernel module that looks, feels, acts and works like a regular one, as a regular part of the kernel source tree. Thus today we will not talk about the kernel's build system or the differences between linking user-space and kernel-space objects, only about a template of a kernel module in assembly language (and a Makefile for it, of course). Building (compiling and linking) will be done by the standard kernel build system.

A kernel module project consists of two parts: a Makefile and source code. Let's start with the Makefile. Everything is clear enough here: you just set the source file name to your assembly (.s or .S) file and set the obj-m variable. That's all; the rest of the job is done by the kernel build system. Here is our Makefile:

ifeq ($(KERNEL_SRC),)
$(error Specify KERNEL_SRC directory)
endif

export ARCH := arm64
export CROSS_COMPILE ?= aarch64-linux-gnu-
PWD := $(shell pwd)

PROJECT_NAME := asm_ko_hello
$(PROJECT_NAME)-src := $(PROJECT_NAME).S
obj-m += $(PROJECT_NAME).o
AFLAGS_$(PROJECT_NAME).o := -DPROJECT_NAME=$(PROJECT_NAME)

all:
	$(MAKE) -C $(KERNEL_SRC) M=$(PWD) modules

clean:
	$(MAKE) -C $(KERNEL_SRC) M=$(PWD) clean

Now, let's have a look at asm_ko_hello.S (the capitalised extension means that this file will be passed through the C preprocessor):

#include "linux/kern_levels.h"

Here you can see something familiar from kernel modules you have worked with, and you may guess that we will have the standard output levels. You are right: all the standard KERN_XXXX output levels are available in our assembly code.

#if !defined (PROJECT_NAME)
	#error You must define project name for this template. Stopping build.
#endif

#define MAKE_FN_NAME(x, y) x##_##y
#define FN_NAME(project, func) MAKE_FN_NAME(project, func)

The above lines serve the project template and are not related to today's topic.
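
For the curious, here is a short sketch of why the template uses two macros instead of one. The `##` operator pastes its operands without expanding them first, so the extra FN_NAME level is needed to get PROJECT_NAME replaced by its value before pasting. The project name `demo` and the function bodies are hypothetical; the two macros are the ones from the template.

```c
#define PROJECT_NAME demo                 /* hypothetical project name */

#define MAKE_FN_NAME(x, y) x##_##y
#define FN_NAME(project, func) MAKE_FN_NAME(project, func)

/* Two-level expansion: this defines int demo_init(void). */
int FN_NAME(PROJECT_NAME, init)(void)
{
	return 0;
}

/* Direct pasting skips expansion: this defines int PROJECT_NAME_direct(void). */
int MAKE_FN_NAME(PROJECT_NAME, direct)(void)
{
	return 1;
}
```

The same preprocessor rules apply in a .S file, which is why the labels in the assembly source end up named after the project.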

.section .text
FN_NAME(PROJECT_NAME, init):
	stp x29, x30, [sp, -16]!
	adrp x0, .loaded
	mov x29, sp
	add x0, x0, :lo12:.loaded
	bl _printk
	mov w0, 0
	ldp x29, x30, [sp], 16
	ret

FN_NAME(PROJECT_NAME, exit):
	stp x29, x30, [sp, -16]!
	adrp x0, .unloaded
	mov x29, sp
	add x0, x0, :lo12:.unloaded
	bl _printk
	ldp x29, x30, [sp], 16
	ret

This is the code of a common «Hello World» kernel module. It is put into the standard .text section. The FN_NAME macro produces the function names, followed by two function bodies in standard ARM64 assembly language. As you can see, we preserve registers, load the string address, call (in ARM terms, «branch with link») the printk() function, restore registers and, in the case of _init(), return a value (which is actually an abstraction).

The code looks clear and familiar, but this is a kernel module, and there is a small difference from a user-space program. We have to let the system know where the entry and exit points (the load/unload functions) of our module are. In C this is done by two macros, module_init() and module_exit(). In our case it would look like:

module_init(asm_ko_hello_init);
module_exit(asm_ko_hello_exit);

But how do we specify these functions in assembly language? What is hidden under those macros? Actually nothing too complicated. We just need to declare the global functions (symbols) init_module and cleanup_module. To give them a proper payload (a symbol by itself is just a symbol in the object file) we make them aliases of our _init() and _exit() functions with the .set directive. The whole of this code is in the snippet below:

.global init_module
.global cleanup_module
.set init_module, FN_NAME(PROJECT_NAME, init)
.set cleanup_module, FN_NAME(PROJECT_NAME, exit)

We can't omit the .data section. This one is absolutely standard. Here is our section with strings used for our output:

.section .data
.unloaded:
	.string KERN_INFO MODULE_NAME": unloaded.\n"

.loaded:
	.string KERN_INFO MODULE_NAME": successfully loaded.\n"

There is something without which the kernel will not accept our module. It is new for developers working in user-space and familiar to kernel-space developers. For the Linux kernel we have to specify one mandatory parameter that cannot be omitted: the license. In C it is done by the MODULE_LICENSE() macro and would look like:

MODULE_LICENSE("GPL");

Let's see how it is done in assembly language. Maybe you expected something serious here, some special codes or sequences. But it's easy too: this is just a .modinfo section containing information about the module in a very simple (and unexpectedly ridiculous) format. See the self-explanatory snippet below:

#define MODULE_NAME	"Kernel Module in Assembly"
#define MODULE_VER	"1.0"
#define MODULE_AUTHOR	"Daft Soft, 2024"

.section .modinfo, "a"
	.string "author=" MODULE_AUTHOR
	.string "version=" MODULE_VER
	.string "description=" MODULE_NAME
	.string "license=GPL"
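
To show how simple this format is, here is a sketch of a lookup over a .modinfo-style buffer: a run of NUL-terminated «key=value» strings laid out one after another, which is what the .string directives above produce. The function name is made up for illustration and this is not how the kernel or modinfo is implemented; it only demonstrates the layout.

```c
#include <stddef.h>
#include <string.h>

/*
 * Look up a key in a buffer of consecutive NUL-terminated "key=value"
 * strings. Returns a pointer to the value, or NULL if the key is absent.
 */
const char *sketch_modinfo_get(const char *buf, size_t len, const char *key)
{
	size_t klen = strlen(key);
	size_t pos = 0;

	while (pos < len) {
		const char *entry = buf + pos;
		const char *nul = memchr(entry, '\0', len - pos);

		if (nul == NULL)
			break;                        /* unterminated tail: stop */
		if ((size_t)(nul - entry) > klen &&
		    memcmp(entry, key, klen) == 0 && entry[klen] == '=')
			return entry + klen + 1;
		pos += (size_t)(nul - entry) + 1; /* skip entry and its NUL */
	}
	return NULL;
}
```

Given a buffer holding the author, version and license entries from the snippet above, looking up «license» returns «GPL».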

That's all about the code. The repository is here. According to our Makefile, if your cross-compiler is aarch64-linux-gnu-, you build the module as follows:

make KERNEL_SRC="PATH/TO/YOUR/KERNEL/lib/modules/VERSION/build" clean all

Test it:

modinfo ./asm_ko_hello.ko  
filename:       ./asm_ko_hello.ko 
author:         Daft Soft, 2024
version:        1.0
description:    Kernel Module in Assembly
license:        GPL
srcversion:     48259120F9222D4D7B9D8E7
depends:         
name:           asm_ko_hello
vermagic:       6.1.55 SMP preempt mod_unload modversions aarch64

insmod ./asm_ko_hello.ko  
Kernel Module in Assembly: successfully loaded. 

lsmod 
Module                  Size  Used by 
asm_ko_hello           16384  0 

rmmod asm_ko_hello
Kernel Module in Assembly: unloaded.

P.S.
High-level programming languages do a lot of work for the developer and prevent a lot of mistakes. Actually, they don't so much prevent mistakes as not allow them: you simply don't have the tools to do the dumbest things. Writing assembly language in user-space is a risk, but it's a pleasant walk in comparison to working in kernel-space. So you have to be extremely cautious when working in assembly language in kernel-space, because any typo will be assembled and executed. For example, a slight shift of the stack pointer may lead to huge trouble; a broken file system is one possible outcome. You have been warned. It's your decision how hard you wanna play.