The Linux Kernel#

Data Structures#

The Linux kernel provides a number of generic data containers to enable code reuse and prevent developers from rolling their own (potentially dodgy) implementations.

Covered here are linked lists and queues (kfifos).

This entry aims to have good coverage of the discussed data structures; however, it is not expected to be complete in terms of available structures or each structure’s entire functionality.

Future additions to this entry should include, but not be limited to, the following:

  • Maps

  • Binary Trees

Linked Lists#

Implemented in <linux/list.h>

Implementation#

The primary implementation of a linked list in the kernel is a circular doubly linked list. Traditionally, this would look like the following:

// Using C++ templates to demonstrate that the linked list can hold arbitrary types
template<typename T>
struct my_data {
    T data;
    struct my_data *next;
    struct my_data *prev;
};

/*
  ┌──────────────────────────────────────────────────┐
  └─>┌──────┐<─┐  ┌────>┌──────┐<─┐  ┌────>┌──────┐<─│─┐
     │ data │  │  │     │ data │  │  │     │ data │  │ │
     ├──────┤  │  │     ├──────┤  │  │     ├──────┤  │ │
     │ next ├──│──┘     │ next ├──│──┘     │ next ├──┘ │
     ├──────┤  │        ├──────┤  │        ├──────┤    │
  ┌──┤ prev │  └────────┤ prev │  └────────┤ prev │    │
  │  └──────┘           └──────┘           └──────┘    │
  └────────────────────────────────────────────────────┘
*/

Rather than converting your type into a linked list by adding pointers to the next and previous elements, the Linux approach is to instead embed a linked list into your type. This is achieved by adding a member of type struct list_head (shown below) to your struct.

struct list_head {
    struct list_head *next, *prev;
};

This approach (effectively) results in the following implementation:

// Using C++ templates to demonstrate that the linked list can hold arbitrary types
template<typename T>
struct my_data {
    T data;
    struct list_head list;
};

/*
     ┌──────┐           ┌──────┐           ┌──────┐
     │ data │           │ data │           │ data │
     ├──────┤           ├──────┤           ├──────┤
  ┌─>│ list │<─────────>│ list │<─────────>│ list │<─┐
  │  └──────┘           └──────┘           └──────┘  │
  └──────────────────────────────────────────────────┘
*/

However, this leads to an issue: how do we access the data member when we get a pointer to the next struct list_head? We do this with the list_entry macro, which is simply defined as the container_of macro, which is in turn defined as:

#define container_of(ptr, type, member) ({           \
    void *__mptr = (void *)(ptr);                    \
    ((type *)(__mptr - offsetof(type, member))); })

#define list_entry(ptr, type, member) \
    container_of(ptr, type, member)

This macro allows you to access the struct my_data as follows:

static LIST_HEAD(my_list); // See Definition and Initialisation section below
// A number of elements are added to the list...
struct list_head *next = my_list.next;
struct my_data *next_list_element = container_of(next, struct my_data, list);
/*                                               └┬─┘  └─────┬──────┘  └┬─┘
 * The pointer you have ──────────────────────────┘          │          │
 * The type you want, that your pointer is embedded in ──────┘          │
 * The member within your type that your pointer points to ─────────────┘
 */

Definition and Initialisation#

As mentioned above, a linked list is created by embedding a struct list_head within your structure:

struct my_data {
    int data;
    struct list_head list;
};

and this can be initialised as:

// Allocate a new element
struct my_data *element;
element = kmalloc(sizeof(*element), GFP_KERNEL);

// Initialise the data within the element
element->data = 0;

// Initialise the linked list node
INIT_LIST_HEAD(&element->list);

or, alternatively, at compile time:

struct my_data element = {
    .data = 0,
    .list = LIST_HEAD_INIT(element.list)
};

If you would like a list head that is not itself embedded in an element (for example, to act as the entry point to the list), you can define one as static LIST_HEAD(my_list).

Insertion#

Two functions are available to add nodes to a linked list:

  • void list_add(struct list_head *new, struct list_head *head)

  • void list_add_tail(struct list_head *new, struct list_head *head)

These functions will add the given new element to the list after and before the head node respectively.

Note that as this implementation of a linked list is circular, there is no special head node and therefore any element may be passed to the above functions. In practice, inserting with list_add gives last-in-first-out (stack) behaviour, while inserting with list_add_tail gives first-in-first-out (queue) behaviour, as sketched below.
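
As a minimal sketch (reusing the hypothetical struct my_data from earlier, with hypothetical helper names), both behaviours fall directly out of these two functions:

static LIST_HEAD(my_stack_or_queue);

// Stack: push inserts at the front, so the newest element is reached first
void push(struct my_data *element) {
    list_add(&element->list, &my_stack_or_queue);
}

// Queue: enqueue inserts at the back, so the oldest element is reached first
void enqueue(struct my_data *element) {
    list_add_tail(&element->list, &my_stack_or_queue);
}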

Deletion#

Deleting an element from a linked list is carried out with the list_del(struct list_head *entry) function. This simply rewires the pointers:

void __list_del(struct list_head *prev, struct list_head *next) {
    next->prev = prev;
    prev->next = next;
}

void list_del(struct list_head *entry) {
    __list_del(entry->prev, entry->next);
}

so the user is therefore responsible for freeing any memory owned by the node; otherwise it is leaked.
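
A minimal sketch of deleting and freeing a node, assuming entry points at a struct list_head embedded in a kmalloc'd struct my_data as above:

struct my_data *element = list_entry(entry, struct my_data, list);

list_del(&element->list); // Unlink: only the neighbours' pointers change
kfree(element);           // Reclaiming the memory is up to us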

Traversing#

Lists are traversed by using the following macros:

  1. list_for_each/list_for_each_prev

  2. list_for_each_entry/list_for_each_entry_reverse

  3. list_for_each_entry_safe/list_for_each_entry_safe_reverse

The first pair of macros is used to iterate through the list forwards and backwards respectively. As mentioned above, the pointer to the list_head is usually not especially useful on its own, so it is paired with the list_entry macro defined above.

struct list_head *p;
struct my_data *entry;

list_for_each(p, &my_list) {
    entry = list_entry(p, struct my_data, list);
    //...
}

The second pair of macros give an alternative approach to traversal which removes the need for the temporary, unused list_head:

struct my_data *entry;

list_for_each_entry(entry, &my_list, list) {
    //...
}

In the above snippet, the element within the list is directly accessible via entry.

Finally, the last pair of macros is used to safely traverse a list, meaning that you are allowed to remove elements from the list while iterating through it. This is achieved by saving a pointer to the next entry, which prevents a use-after-free bug from occurring when moving to the next node within the list:

struct my_data *entry, *next;

list_for_each_entry_safe(entry, next, &my_list, list) {
    // You may safely list_del() and kfree() entry here
}

Misc.#

Other functions that may prove useful are:

  • Replacing an entry – void list_replace(struct list_head *old, struct list_head *new)

    • replace the old entry by the new one

  • Swapping entries – void list_swap(struct list_head *entry1, struct list_head *entry2)

    • replace entry1 with entry2 and re-add entry1 at entry2’s position

  • Checking if a list is empty – int list_empty(struct list_head *head)

    • returns nonzero if the list is empty and zero otherwise

  • Moving – void list_move(struct list_head *list, struct list_head *head)/void list_move_tail(struct list_head *list, struct list_head *head)

    • removes list from its list and adds it after or before head respectively

  • Splicing – void list_splice(struct list_head *list, struct list_head *head)

    • adds the list pointed to by list after head

Notes#

Due to the implementation of linked lists, list manipulation (addition, deletion, splicing) is constant time (O(1)) while list traversal is of order O(n).

Queues#

Implemented in <linux/kfifo.h>

Implementation#

The Linux implementation of a queue (also known as a FIFO for First-In-First-Out) is the generic kfifo.

The kfifo is implemented as a regular FIFO[1] using a buffer and two offsets to keep track of the data in the FIFO. The two offsets are:

  1. in offset - Where the next enqueued item should go

  2. out offset - Where the next item to be dequeued is

Given these offsets, we can infer that the queue is empty exactly when the two offsets are equal, as sketched after the diagram below.

Two main operations are provided:

  1. Enqueue - Add something to the kfifo

  2. Dequeue - Remove something from the kfifo

enqueue
 ┌───┐
 │   │
 └─┬─┘     ┌─────────────── buffer ────────────────┐
   │       ─────────────────────────────────────────
   │           ┌───┐┌───┐┌───┐   ┌───┐┌───┐┌───┐             ┌───┐
   └── in ──>  │   ││   ││   │...│   ││   ││   │  ── out ──> │   │
               └───┘└───┘└───┘   └───┘└───┘└───┘             └───┘
           ─────────────────────────────────────────
             ^                               ^
             └──────────── in offset         │
                           out offset ───────┘
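
The following is a conceptual sketch (not the kernel's actual code) of how the two offsets describe the queue's state, assuming free-running unsigned offsets and a power-of-2 buffer size:

struct toy_fifo {
    unsigned int in;    // total bytes ever enqueued
    unsigned int out;   // total bytes ever dequeued
    unsigned int size;  // buffer size in bytes, a power of 2
    unsigned char *buf;
};

// The queue is empty exactly when the two offsets coincide
int toy_empty(const struct toy_fifo *f) {
    return f->in == f->out;
}

// Unsigned wrap-around keeps this correct even after the counters overflow
unsigned int toy_len(const struct toy_fifo *f) {
    return f->in - f->out;
}

// A power-of-2 size lets a mask replace a modulo when indexing the buffer
unsigned int toy_index(const struct toy_fifo *f, unsigned int offset) {
    return offset & (f->size - 1);
}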

Definition and Initialisation#

Similarly to the linked list implementation, kfifos can be created dynamically:

#include <linux/kfifo.h>

const long size = 32 * sizeof(long);
struct kfifo fifo;

// Will return 0 on success
int ret = kfifo_alloc(&fifo, size, GFP_KERNEL);

/* Or if you have a preallocated buffer:
long* buf = kmalloc(size, GFP_KERNEL);

int ret = kfifo_init(&fifo, buf, size);
*/

or statically:

#include <linux/kfifo.h>

// The size must be a compile-time constant for the static declaration
#define FIFO_SIZE (32 * sizeof(long))

DECLARE_KFIFO(fifo, FIFO_SIZE);
INIT_KFIFO(fifo);

but note that in both cases, the size must be a power of 2. It is also worth pointing out the use of kmalloc and the GFP_KERNEL flag here; these will be discussed in an entry on kernel memory management.

Enqueuing and Dequeuing#

Adding data to a kfifo is easily carried out using the following macro: kfifo_in(fifo, buf, n) where

  • fifo is the address of the kfifo

  • buf is the data to be added

  • n is the number of bytes to copy into the kfifo

On success, the macro will return the number of bytes copied into the fifo, which may range from 0, if nothing could be copied, to n, if all data was copied successfully.

To retrieve data from a kfifo, the following macro is used: kfifo_out(fifo, buf, n) where, like above,

  • fifo is the address of the kfifo

  • buf is where the data is to be copied to

  • n is the number of bytes to copy from the kfifo

Note that kfifo_out may copy fewer bytes than requested so it is important to check its return value.

As with normal usage of a queue or fifo, popping data renders it inaccessible to future reads. In the case where you want to inspect what is on the queue without consuming it, you can use kfifo_out_peek(fifo, buf, n, offset), where the first three arguments are identical to kfifo_out. The final argument, offset, specifies the index in the queue to read from; an offset of 0 therefore views exactly the data that would be copied by kfifo_out.
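
A usage sketch following the macro signatures described above (fifo is assumed to be an initialised struct kfifo):

long values[4] = { 1, 2, 3, 4 };
long readback[4];
unsigned int copied;

// Enqueue: check how many bytes actually made it in
copied = kfifo_in(&fifo, values, sizeof(values));
if (copied < sizeof(values))
    printk(KERN_WARNING "fifo nearly full: only %u bytes enqueued\n", copied);

// Peek at the data kfifo_out would return, without consuming it
copied = kfifo_out_peek(&fifo, readback, sizeof(readback), 0);

// Dequeue: again, the return value may be less than requested
copied = kfifo_out(&fifo, readback, sizeof(readback));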

kfifo Inspection#

A number of utility functions are provided to inspect a given kfifo; a short usage sketch follows the list:

  • kfifo_size(struct kfifo*) - return the size of the queue in bytes

  • kfifo_len(struct kfifo*) - return the total number of enqueued bytes

  • kfifo_avail(struct kfifo*) - return the number of bytes free in the queue

  • kfifo_is_empty(struct kfifo*) - return nonzero if the queue is empty

  • kfifo_is_full(struct kfifo*) - return nonzero if the queue is full
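
A sketch combining these helpers, under the same assumptions as above:

long value = 42;

// Only enqueue when there is room for a complete element
if (kfifo_avail(&fifo) >= sizeof(value))
    kfifo_in(&fifo, &value, sizeof(value));

printk(KERN_INFO "fifo holds %u of %u bytes\n",
       kfifo_len(&fifo), kfifo_size(&fifo));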

Freeing#

A given kfifo can be emptied using kfifo_reset(struct kfifo*).

To deallocate a fifo that has been allocated using kfifo_alloc, one must use kfifo_free(struct kfifo*).

io_uring#

With modern hardware, such as PCIe-based storage, offering incredibly low latency and high IOPS, and the kernel itself gaining the ability to take advantage of such hardware (such as the addition of the multiqueue block layer, blk-mq), current I/O interfaces are showing their limitations. In addition, mitigations put in place to counter side-channel attacks have greatly increased the cost of system calls.

io_uring is a new Linux I/O interface, introduced in Linux 5.1, that aims to address these issues.

Background#

The aim of io_uring is to offer a modern and efficient I/O interface that addresses the issues with Linux’s current native async I/O offering: aio.

Issues with aio include:

  • It only supports files opened with O_DIRECT, and therefore does not utilise the page cache.

  • It may block depending on a number of factors.

  • The API is inefficient, requiring 104 bytes of memory traffic per request and two system calls (one to submit, one to reap completions).

The goals of io_uring are:

  • To be easy to use and hard to misuse (like any well designed API).

  • To be extendable (not just aimed at block I/O but network I/O, etc.)[1].

  • To be useful, and not just to a subset of applications.

  • To be efficient and minimise per-request overhead.

  • To be scalable.

For the author, Jens Axboe, efficiency was of paramount importance. To quote him:

Efficiency isn't something that can be an afterthought, it has to be designed in
from the start - you can't wring it out of something later on once the interface
is fixed.

Implementation#

In order to achieve efficiency and minimise data copies, it is necessary to share the data structures that define I/O between userspace and kernel space. This is achieved through the use of shared (between user and kernel space) memory and single-producer single-consumer ring buffers with the use of memory ordering and barriers.

Data Structures#

On the completion side, we are presented with the Completion Queue Event (CQE):

struct io_uring_cqe {
    __u64 user_data;
    __s32 res;
    __u32 flags;
};

The fields of the io_uring_cqe are quite straightforward:

  • user_data - copied, untouched by the kernel, from the submission event. This may be a pointer to a buffer for example.

  • res - this is the result of the I/O operation (similar to read/write, etc.), such as the number of bytes copied or a negative error code.

  • flags - may carry metadata on the I/O operation but is unused for now.

On the submission side, the Submission Queue Event (SQE) is a bit more complicated, as it has to convey much more information while also being extensible:

struct io_uring_sqe {
    __u8 opcode;
    __u8 flags;
    __u16 ioprio;
    __s32 fd;
    __u64 off;
    __u64 addr;
    __u32 len;
    union {
        __kernel_rwf_t rw_flags;
        __u32 fsync_flags;
        __u16 poll_events;
        __u32 sync_range_flags;
        __u32 msg_flags;
    };
    __u64 user_data;
    union {
        __u16 buf_index;
        __u64 __pad2[3];
    };
};

The most relevant fields in this structure are (for a full overview, see here):

  • opcode - the operation described by this sqe, e.g. IORING_OP_READV for a vectored read (like readv).

  • fd - the file descriptor associated with the request.

  • off - the offset at which the operation should take place.

  • addr - the address at which the operation should perform I/O. For vectored I/O (readv, etc.) this may be a pointer to an array of struct iovec.

  • len - either the byte count for standard I/O or the length of the iovec array for vectored I/O.

One thing worth noting is that the size of this struct is 64 bytes, which is often the size of a CPU cache line.

Communication Channels#

As mentioned above, io_uring makes use of single-producer single-consumer ring buffers, which are visible in both user and kernel space. In the case of CQEs, the producer is the kernel (posting completed I/O) and the consumer is the user (acknowledging the I/O), while for SQEs the roles are reversed.

For the completion queue ring, only the kernel modifies the backing data (an array of CQEs). When a new event is posted by the kernel (I/O is complete and a CQE has been filled in), the kernel need only update the tail of the ring buffer. When a userspace application consumes a CQE, it need only update the head of the ring. In this regime, it is simple for a userspace program to probe whether CQEs are ready for consumption by checking that the head and tail differ. Also, as the head is represented as a free-running unsigned type, the index of a given CQE is easily computed as head & cqring->mask, where cqring->mask is simply the size of the ring minus 1[2].
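
A sketch of the consumer side based on the description above; cqring is a hypothetical struct holding pointers into the mmap'd completion ring, and read_barrier/write_barrier are covered in the Memory Ordering section below:

unsigned head = *cqring->head;

read_barrier(); // ensure the CQEs published before the tail are visible
while (head != *cqring->tail) {
    struct io_uring_cqe *cqe = &cqring->cqes[head & *cqring->mask];
    // ... handle cqe ...
    head++;
}

*cqring->head = head; // publish our progress back to the kernel
write_barrier();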

Roles are reversed for the submission queue ring, with the userspace application updating the tail and the kernel updating the head. There is a slight difference in the submission queue ring as there is an additional layer of indirection between the submission ring and the shared array of SQEs.

One thing to note is that when the kernel consumes an SQE, that SQE slot may be reused while the I/O operation is still in flight. It is therefore possible to have more requests in flight than the submission ring holds, and thus to overflow the completion queue. In an effort to offset this, the completion queue ring is twice the size of the submission queue ring. A user should be aware of this and take care not to produce the overflow condition, which is tracked by the completion queue.

Low Level Interface#

A brief overview of the low-level interface will be covered here, but it is suggested that users prefer liburing.

io_uring currently exposes three syscalls:

  1. int io_uring_setup(unsigned entries, struct io_uring_params *params);

  2. int io_uring_enter(unsigned int fd, unsigned int to_submit, unsigned int min_complete, unsigned int flags, sigset_t *sig);

  3. int io_uring_register(unsigned int fd, unsigned int opcode, void *arg, unsigned int nr_args);

io_uring_setup#

This system call is used to set up an io_uring instance. It accepts two arguments: the number of entries (which must be a power of 2 in the range [1, 4096]) and an input/output parameter that will be updated by the kernel. Without going into too much detail here (see here for further detail), the io_uring_params struct is used to communicate to userspace the information necessary for setting up communication with kernel space. The file descriptor returned by this function can be used in conjunction with the returned io_uring_params to mmap the submission and completion rings into userspace. These details are not covered here as it is recommended that a user use liburing to carry out these tasks.

io_uring_enter#

This system call is used to tell the kernel that SQEs are ready for consumption (a usage sketch follows the argument list). The important arguments are:

  • fd - the file descriptor of the ring.

  • to_submit - the number of SQEs ready for consumption.

  • min_complete - allows you to request that the kernel wait until this many I/O requests are complete before returning (IORING_ENTER_GETEVENTS must be passed in the flags for this to hold).
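
A sketch of a combined submit-and-wait call; io_uring_enter is assumed here to be a thin wrapper over the raw syscall (glibc does not expose one directly):

// Submit one SQE and wait until at least one completion is available
int ret = io_uring_enter(ring_fd, 1 /* to_submit */, 1 /* min_complete */,
                         IORING_ENTER_GETEVENTS, NULL);
if (ret < 0)
    perror("io_uring_enter");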

Miscellaneous Features#

Three further features are worth mentioning:

  1. SQE Ordering

It is possible to submit an I/O barrier and drain the submission queue ring. This may be desirable if you would like to run an fsync after a series of writes. This is achieved by setting IOSQE_IO_DRAIN in the flags field of an SQE.

  2. Linked SQEs

For many cases, a full submission queue barrier as described above may be overkill. To facilitate cases where finer-grained control is desired, it is possible to link SQEs so that the next SQE will not begin until the current one succeeds. If an SQE does not succeed, the chain is broken and linked SQEs are cancelled with the -ECANCELED error code set in the corresponding CQE. To do this, pass IOSQE_IO_LINK in the flags field of an SQE. An example usage could be an implementation of the cp program, where a read into a buffer must complete before the corresponding write may start (see the sketch following this list).

  3. Timeouts

Waits on the completion ring may be manipulated with the IORING_OP_TIMEOUT opcode. Two timeout triggers are available. The first is an explicit timeout, requested by passing a struct __kernel_timespec in the addr field of an SQE. The second triggers after a specific number of completions, passed in the off field of an SQE. It is possible to supply both triggers, in which case whichever condition is met first will cause the timeout to trigger.
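
A sketch of the linked read/write pair mentioned under Linked SQEs, using liburing helpers for brevity (src_fd, dst_fd, buf, and block_size are assumed to be set up already):

struct io_uring_sqe *sqe;

sqe = io_uring_get_sqe(&ring);
io_uring_prep_read(sqe, src_fd, buf, block_size, 0);
io_uring_sqe_set_flags(sqe, IOSQE_IO_LINK); // the write below waits on this

sqe = io_uring_get_sqe(&ring);
io_uring_prep_write(sqe, dst_fd, buf, block_size, 0);

// If the read fails, the write completes with -ECANCELED in its CQE
io_uring_submit(&ring);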

Memory Ordering#

Before covering memory ordering concerns, it is worth noting that this information is not necessary when using liburing; it only becomes important when using the raw, low-level interface.

The two important primitives are:

  1. read_barrier() - Ensure all previous writes are visible before doing subsequent reads.

  2. write_barrier() - Do not reorder previous writes after this.

The primary reason for these barriers is that the compiler and/or processor may reorder the operations written in your source code. A particular issue would be the generation and submission of SQEs: an SQE is submitted by incrementing the tail of the submission queue ring, and if that store is reordered before the SQE is fully written, the kernel may begin working with an incomplete SQE, which may have a serious impact on system stability.
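
A sketch of correctly ordered submission, modelled on the description above; sqring is a hypothetical struct holding pointers into the mmap'd submission ring and fill_sqe is a hypothetical helper:

unsigned tail = *sqring->tail;
unsigned index = tail & *sqring->mask;
struct io_uring_sqe *sqe = &sqring->sqes[index];

fill_sqe(sqe);                // generate the SQE in place
sqring->array[index] = index; // the indirection layer mentioned earlier

write_barrier();              // the SQE must be visible before the new tail
*sqring->tail = tail + 1;
write_barrier();              // make the tail update visible to the kernel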

Advanced Usage#

Fixed files and buffers#

When a file descriptor is passed to io_uring, the kernel must retrieve a reference to the underlying file and when the I/O is complete, that reference is dropped. This can incur high overhead in certain situations due to the underlying data structures being reference counted within the kernel. To avoid this issue, io_uring offers functionality to register a file descriptor such that the reference is only retrieved once:

  • int io_uring_register(unsigned int fd, unsigned int opcode, void *arg, unsigned int nr_args);

Here, fd is the file descriptor of the ring, opcode is the type of registration, arg is an array of file descriptors to register, and nr_args is the length of the arg array. The available opcodes are IORING_REGISTER_FILES and IORING_UNREGISTER_FILES.

To perform I/O on a registered file descriptor, a user must set IOSQE_FIXED_FILE in the flags field of the SQE. Then, rather than passing a file descriptor in the SQE, the user should set fd to the index of that file descriptor in the arg array passed to io_uring_register. Regular I/O (via plain file descriptors) is still permitted on registered files.

When using O_DIRECT for I/O, the kernel must pin userspace pages so that they are not migrated during the I/O operation. This is an expensive operation and, if performed frequently, can become quite costly. io_uring therefore also allows a user to register fixed buffers using the IORING_REGISTER_BUFFERS opcode. In this case, the arg and nr_args arguments to io_uring_register must be an array of struct iovec describing the buffer(s) and the length of that array. To use these fixed buffers, the user must use the IORING_OP_READ_FIXED or IORING_OP_WRITE_FIXED opcodes (or the corresponding liburing functions to prep the SQE). When doing fixed-buffer I/O, the addr and len fields of the SQE must contain an address and corresponding length within a registered buffer. However, it is not necessary to fill an entire registered buffer on an I/O request; subsets are permitted.
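
A sketch of buffer registration and a fixed-buffer read, using liburing helpers; buf, buf_len, fd, and offset are assumed to be set up already:

struct iovec iov = { .iov_base = buf, .iov_len = buf_len };
struct io_uring_sqe *sqe;

// Wraps io_uring_register with the IORING_REGISTER_BUFFERS opcode
io_uring_register_buffers(&ring, &iov, 1);

sqe = io_uring_get_sqe(&ring);
// The final argument is the index of the registered buffer to use
io_uring_prep_read_fixed(sqe, fd, buf, buf_len, offset, 0);
io_uring_submit(&ring);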

Polled I/O#

io_uring also supports polled I/O, which may be preferred to interrupt-driven I/O as I/O latencies become ever smaller and the cost of handling an interrupt outweighs simply busy-waiting. This can also be beneficial for high-IOPS applications, where a high number of interrupts can degrade system performance.

Polled I/O can be set at ring creation time by adding the IORING_SETUP_IOPOLL flag to io_uring_setup or io_uring_queue_init (see below).

When using polled I/O, it is no longer possible to check for CQEs via the completion queue ring, as no interrupt will be triggered to update it. Therefore, a user must explicitly look for completed I/O by calling io_uring_enter with the IORING_ENTER_GETEVENTS flag set and min_complete set to the number of CQEs desired.

Note that it is illegal to issue a non-pollable operation on a polled ring; doing so will result in -EINVAL being returned from io_uring_enter.

Kernel-side polling#

Reducing the number of system calls required to perform I/O is one of the goals of io_uring. This can be pushed to the extreme by setting up your rings with the IORING_SETUP_SQPOLL flag set in the flags field of the io_uring_params. Doing this will launch a kernel thread that polls the submission queue of your ring, facilitating I/O submission without the need to call io_uring_enter at all; the kernel automatically picks up newly submitted I/O. Note that this is a privileged operation (a standard user cannot spawn a kernel thread), so initialisation may fail with -EPERM.

In cases of long I/O inactivity, the kernel thread may sleep automatically. In doing so, it will set a flag in the submission queue ring that can inform userspace that the thread will need to be woken. This is carried out by calling io_uring_enter with an added flag as shown below:

// Check if the kernel polling thread is sleeping
if ((*sqring->flags) & IORING_SQ_NEED_WAKEUP)
    io_uring_enter(ring_fd, to_submit, min_complete, IORING_ENTER_SQ_WAKEUP);

The idle time before sleeping (in milliseconds) can be tuned at ring creation time via the sq_thread_idle field of the io_uring_params.

liburing#

liburing is provided by the io_uring author to offer a much simpler API for users. The GitHub repo includes not only example usages but also a large number of tests, many of which have been added following bug reports and issues raised through GitHub.

Note that while liburing may not expose the full feature set of io_uring currently, it is entirely possible to mix the two interfaces.

Setup and Teardown#

Rather than worrying about making the syscalls mentioned in the low-level interface section, mmapping the returned file descriptor, etc., liburing exposes a single function to set up the rings:

struct io_uring ring;
io_uring_queue_init(QUEUE_DEPTH, &ring, 0);

This sets up the submission and completion rings with a depth of QUEUE_DEPTH, which must be a power of 2 in the range [1, 4096]. The last argument is the flags argument; no flags are passed here, but this is where a flag such as IORING_SETUP_IOPOLL (see above) would go.

In order to tear down the rings, the io_uring struct can be passed to the teardown function:

io_uring_queue_exit(&ring);

Submission and Completion#

An example of submission and completion is shown below with comments where necessary.

struct io_uring_sqe *sqe;
struct io_uring_cqe *cqe;

// Get an SQE and prepare it with a read of file (fd) at offset, to buf, that 
// is nr_bytes long
sqe = io_uring_get_sqe(&ring);
io_uring_prep_read(sqe, fd, buf, nr_bytes, offset);

// Tell the kernel we have a read prepared
io_uring_submit(&ring);

// Wait for a completion (CQE) to arrive
io_uring_wait_cqe(&ring, &cqe);

// Use the CQE
handle_cqe(cqe);
// Inform the kernel that we're finished with the CQE
io_uring_cqe_seen(&ring, cqe);

Note that the call to io_uring_prep_read may be swapped out with whatever is necessary. Some examples include:

  • io_uring_prep_readv - for a scatter/gather read.

  • io_uring_prep_read_fixed - for a read to an io_uring registered I/O buffer.

  • io_uring_prep_sendmsg - send a message over a socket.

  • io_uring_prep_fsync - sync a file to disk.

  • etc.

For a full list, see here.

Performance#

The GitHub user frevib has created two simple echo servers, one io_uring based and the other epoll based, to demonstrate the performance benefits of using io_uring:

  1. io_uring

  2. epoll

In all benchmarks, where more than a single client is connected to the server, io_uring is shown to outperform epoll in terms of the number of requests handled per second. The full results of this benchmark, including the hardware used, can be found here.

Further benchmarks, including ones carried out with fio, can be found here. It is worth noting that these were carried out using kernel 5.6, so they are likely not 100% representative of today's performance.

Finally, the author has carried out benchmarks that were included in the kernel mailing list. Of note here is the comparison with spdk, a set of tools and libraries aimed specifically at high performance, user-mode storage applications.

Kernel Modules#

Linux kernel modules are sections of code that may be dynamically loaded into, and unloaded from, the kernel to extend functionality without the need to either reboot or rebuild the kernel. Their uses range from small modules that access privileged information to entire device drivers.

Modules give you unrestricted, ring-0[1] access to the machine you're running on and therefore must be approached cautiously. As code here runs in kernel context, you have no access to the familiar C libraries, limited stack space, and no memory protection. The last point is the most critical: the lack of this safety net allows you to hang a system very easily, risking data loss and system corruption.

/* Here be dragons. */

Example#

A hello world kernel module will be described below. The module is very simple, only printing log messages when the module is loaded and unloaded.

Source#

/* hellomodule.c */
#include <linux/module.h>
#include <linux/kernel.h>

/* 1. Module info */
MODULE_LICENSE("GPL")
MODULE_AUTHOR("Kenneth Hanley")
MODULE_DESCRIPTION("Hello world module")
MODULE_VERSION("0.1")

/* 2. Data and parameters */
static char* name = "world";
/* Accept insmod parameters */
module_param(name, charp, S_IRUGO);
MODULE_PARM_DESC(name, "An example parameter");

/* 3. Initialization and exiting */
static int __init hello_init(void) {
    printk(KERN_INFO "hellomodule: Hello, %s!\n", name);
    return 0;
}

static void __exit hello_exit(void) {
    printk(KERN_INFO "hellomodule: Goodbye, %s!\n", name);
}

/* 4. Registration */
module_init(hello_init);
module_exit(hello_exit);

The main points of this module (with headings referring to the numbered comments in the code snippet above) are:

  1. Module info

    • Some information regarding the module: the author, the version, etc.

    • The important thing here is the license as certain kernel functions are only callable by GPL licensed code and therefore a non-GPL module cannot use that functionality.

  2. Data and parameters

    • Some data is set (the name string) that is used within the module

    • A module parameter is set using module_param that takes as arguments: the variable which stores the parameter, the type (charp is char pointer) and the permissions.

    • The permission is set to read-only, which is applied to the sysfs entry: /sys/module/hellomodule/parameters/name

  3. Initialization and exiting

    • These are the entry and exit points that initialize the module and clean up as it is unloaded.

  4. Registration

    • The module entry and exit points are set to facilitate dynamic module loading.

One thing worth noting is the use of printk instead of printf. As the module runs in kernel space, we have no access to the C standard library and therefore must use kernel functions instead, such as printk. The usage is almost identical to printf with the addition of the log level being prepended to the format string. Notice that there is no comma between the log level and the format string. The available log levels (in descending order of severity) are:

  • KERN_EMERG

  • KERN_ALERT

  • KERN_CRIT

  • KERN_ERR

  • KERN_WARNING

  • KERN_NOTICE

  • KERN_INFO

  • KERN_DEBUG

Building#

The module can then be built with a simple Makefile that leverages the kernel module build system:

obj-m += hellomodule.o

all:
    make -C /lib/modules/$(shell uname -r)/build M=$(PWD) modules
clean:
    make -C /lib/modules/$(shell uname -r)/build M=$(PWD) clean

On older kernels, you may see SUBDIRS=$(PWD) instead of M=$(PWD).

Here $(shell uname -r) is used to inspect the current running kernel version. Note that the build process requires that the headers for your current kernel are installed:

  • Arch: pacman -S linux-headers

  • CentOS: yum install kernel-devel

  • Ubuntu: apt install linux-headers-$(uname -r)

Finally, to build the module, just run make. This will compile the .ko, or “kernel object”, that is the final loadable module.

To inspect the module and its parameters, modinfo can be used. For example, running modinfo hellomodule.ko gives:

filename:       /home/khanley/modules/hello/hellomodule.ko
version:        0.1
description:    Hello world module
author:         Kenneth Hanley
license:        GPL
srcversion:     669962B2BA76EEE18D040CA
depends:
retpoline:      Y
name:           hellomodule
vermagic:       5.3.0-40-generic SMP mod_unload
parm:           name:An example parameter (charp)

Loading and unloading#

After compilation the module can be loaded with insmod, passing any desired parameters. Note that depmod and modprobe should be preferred in more elaborate settings as those tools handle module dependencies, among other things.

khanley@kernmod:~/modules/hello $ make
khanley@kernmod:~/modules/hello $ sudo insmod hellomodule.ko name="ICHEC"
khanley@kernmod:~/modules/hello $ dmesg | tail -n1
[ 1262.084180] hellomodule: Hello, ICHEC!
khanley@kernmod:~/modules/hello $ sudo rmmod hellomodule
khanley@kernmod:~/modules/hello $ dmesg | tail -n2
[ 1262.084180] hellomodule: Hello, ICHEC!
[ 1267.031789] hellomodule: Goodbye, ICHEC!

Finally, the module can be removed with rmmod hellomodule.

Ring 3 corresponds to user mode and ring 0 to kernel mode. Ring 0 has full access to the CPU and is not limited in the instructions it can execute. Ring 3 has limitations such that it cannot access certain registers, modify page tables, and perform device I/O.

Kernel Development and CI#

GitLab’s Continuous Integration (CI) allows for automation of building and testing of a project when pushed to the remote repository. This is particularly useful as a validation step before merging a feature branch to devel or master.

GitLab CI works by spinning up a Docker container (on a GitLab Runner) and cloning your repository. Then, each of your configured jobs is executed and reported as succeeding or failing. Docker containers leverage the kernel of their host system, the GitLab Runner in this case. Therefore, CI is not directly compatible with kernel development because the container cannot load kernel modules on the host system.

In this document, we explore one (possibly of many) means of reconciling these opposing aspirations.

Solution#

We would like to maintain the convenient automation of GitLab CI while also providing an environment that permits kernel development. One way of achieving this is to delegate to some other service from within the CI. Here we will use AWS for this purpose. The idea is to use GitLab CI as an automated wrapper for launching AWS instances and running our kernel-based jobs there. Of course, this comes with a few challenges we will need to address.

YAML File#

image: somelinuxdistro:version

tags:
    - docker

stages:
    - Launch
    - Build
    - Test
    - Cleanup

Our unconventional arrangement requires some additional, equally unconventional stages in our CI. Namely, I believe it is ideal to have a dedicated job for launching AWS instances and another dedicated job for terminating them.

Helpful Anchors#

.environment_variables:
    aws_variables: &aws_variables
        InstanceType: t3a.small
        KeyName: ${AWS_KEY_NAME}
        KeyPath: ${CI_PROJECT_DIR}/keys
        LaunchTemplate: [some-template]

.snippets:
    - &getInstanceDNS
        INSTANCE_DNS=$(aws ec2 describe-instances --filters "Name=instance-type,Values=${InstanceType}" "Name=instance-state-name,Values=running" --query "Reservations[0].Instances[0].PublicDnsName")
        && INSTANCE_DNS=${INSTANCE_DNS//\"}
    - &prepare_key
        mkdir ${KeyPath} && touch ${KeyPath}/${KeyName}.pem && echo "${AWS_KEY}" >> ${KeyPath}/${KeyName}.pem && chmod 400 ${KeyPath}/${KeyName}.pem
        
.run_script_on_instance: &run_script_on_instance
    script:
        - *getInstanceDNS
        - ssh centos@${INSTANCE_DNS} -i ${KeyPath}/${KeyName}.pem chmod +x build/${scriptFileName}
        - ssh centos@${INSTANCE_DNS} -i ${KeyPath}/${KeyName}.pem sudo ./build/${scriptFileName}

AWS uses keys for access to instances. To run CI jobs involving AWS instances, the following environment variables must be set:

  • AWS_KEY (from .pem file)

  • AWS_ACCESS_KEY

  • AWS_SECRET_KEY

We need to store these keys for use in the CI jobs, but we also do not want the keys to be explicitly stored in the repository. To this end, we make use of GitLab CI’s environment variables. The particular configuration used here is to accommodate both remote and local CI (see local development).

The easiest way to execute code on the AWS instance is to prepare scripts for building, testing, etc.

Launch#

launch_AWS_instance: &launch_AWS_instance
    stage: Launch 
    variables:
        <<: *aws_variables
    before_script:
        - *prepare_key
    script:
        - aws ec2 run-instances
            --launch-template LaunchTemplate=${LaunchTemplate}
            --key-name ${KeyName}
            --instance-type ${InstanceType}
        - sleep 90s # We must wait for the instance to be fully running
        - *getInstanceDNS
        - ssh-keyscan -H ${INSTANCE_DNS} >> ~/.ssh/known_hosts
        - scp -i ${KeyPath}/${KeyName}.pem -r ./* [user]@${INSTANCE_DNS}:~/

As part of the launch phase, we wish to clone the repo on the instance or scp the project directory to the instance. Here we have scp’d but cloning may be preferable.

Build & Test#

Now that we have a running AWS instance, we can run our desired jobs. The following serves as a skeleton of an example job:

job:
    <<: *run_script_on_instance
    stage: Build/Test
    dependencies:
        - launch_AWS_instance
    variables:
        <<: *aws_variables
        scriptFileName: /path/to/desired/script
    before_script:
        - *prepare_key

Cleanup#

Proper CI should leave the world in the same state as when it began. An AWS instance forgotten for centuries can accrue costs reaching the tens of Euro - so we must take great care in their destruction. A dedicated job, running after the others have completed, is enough to fulfil this requirement.

terminate_instance:
    stage: Cleanup
    dependencies:
        - launch_AWS_instance
    script:
        - aws ec2 terminate-instances --instance-ids
            $(aws ec2 describe-instances --filters
            "Name=instance-state-name,Values=pending,running,stopped,stopping"
            "Name=instance-type,Values=${InstanceType}"
            --query "Reservations[].Instances[].[InstanceId]"
            --output text | tr '\n' ' ')
    when: always

Usually, if an antecedent job fails, the runner will stop there; this can occur after the AWS instance has launched. It is important to specify when: always so that the cleanup phase is guaranteed to execute on both success and failure.

The above technique is also possible using Vagrant to configure instances by using the Vagrant-AWS plugin.

Local development#

For some, myself included, the ability to run CI jobs locally is important to their development process.

First, we must provide keys to the CI in a way that is also acceptable to the remote CI. We create an environment variable for each of the GitLab CI environment variables listed above. The local runner is then executed as follows:

gitlab-runner exec docker \
    --env "AWS_KEY=$AWS_KEY" \
    --env "AWS_ACCESS_KEY=$AWS_ACCESS_KEY" \
    --env "AWS_SECRET_KEY=$AWS_SECRET_KEY" \
    --env "AWS_KEY_NAME=$AWS_KEY_NAME" \
    job_name

To limit the possibility of financially crippling the organisation, it is preferable to create a single job that handles launching the AWS instance, running the build and test jobs, and cleaning up the instance.

Future Entries#

Future entries should include some of the following as an example:

  • device_drivers.md - How the kernel interacts with hardware

  • synchronization.md - A reference of synchronization primitives in the kernel: spinlocks, mutexs, etc.

  • vfs.md - The Virtual File System