invalid elf header magic / rook-ceph / k8s
Yesterday, we noticed some previously unseen messages in the logs:
Invalid ELF header magic: != \x7fELF
Of course an ELF binary should start with a valid header. Who or what was trying to run something else?
These messages appeared on Kubernetes nodes that were recently installed and not yet fully in use for production workloads. Coincidentally, these particular nodes were running Ubuntu 24.04 (Noble Numbat), while the rest of this cluster was still on older versions.
Yes, we're already well into 2025. Upgrading from Jammy (22.04) to Noble (24.04) has been less pain-free than the upgrade from Focal (20.04) to Jammy (22.04), so we are a bit behind. If others have been slow to upgrade as well, that might explain why so many new application bugs turn up this late.
In dmesg the “Invalid ELF” message was clear and readable. In journalctl it reads [33B blob data]:
# journalctl -S '2025-05-30 11:00' --grep 'ELF'
mei 30 11:00:25 noble.example.com kernel: [33B blob data]
mei 30 11:05:30 noble.example.com kernel: [33B blob data]
mei 30 11:10:41 noble.example.com kernel: [33B blob data]
journalctl makes sure no unprintable garbage goes to the screen, in this case the \x7f. But it does not replace just that one byte: it hides the entire message, including the printable ASCII.
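If you do want journalctl to print the full message anyway, the -a/--all flag shows fields even when they contain unprintable characters, so something like this should reveal the raw line:
# journalctl -a -S '2025-05-30 11:00' --grep 'ELF'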
Now that I knew what log line to look for, I checked the surrounding log lines. There, in the middle, there's the kernel: [33B blob data].
Before it, containerd shows that a container is started:
kubelet[1605]: I0530 11:05:30.172379 1605 scope.go:117]
"RemoveContainer" containerID="<id>"
containerd[1717]: time="2025-05-30T11:05:30.173674804+02:00" level=info
msg="CreateContainer within sandbox \"<id>\" for container &ContainerMetadata{Name:csi-rbdplugin,Attempt:853,}"
containerd[1717]: time="2025-05-30T11:05:30.182002680+02:00" level=info
msg="CreateContainer within sandbox \"<id>\" for &ContainerMetadata{Name:csi-rbdplugin,Attempt:853,} returns container id \"<id>\""
containerd[1717]: time="2025-05-30T11:05:30.182178318+02:00" level=info
msg="StartContainer for \"<id>\""
ctre[1581]: C 1748595930 k8s.io quay.io/cephcsi/cephcsi:v3.8.0
systemd[1]: Started cri-containerd-<id>.scope - libcontainer container <id>.
containerd[1717]: time="2025-05-30T11:05:30.218713573+02:00" level=info
msg="StartContainer for \"<id>\" returns successfully"
ctre[1581]: T 1748595930 k8s.io quay.io/cephcsi/cephcsi:v3.8.0
kernel: [33B blob data]
systemd[1]: cri-containerd-<id>.scope: Deactivated successfully.
containerd[1717]: time="2025-05-30T11:05:30.258058437+02:00" level=info
msg="received exit event container_id:\"<id>\" id:\"<id>\" pid:1942336 exit_status:255 exited_at:{seconds:1748595930 nanos:257930298}"
ctre[1581]: t 1748595930 k8s.io quay.io/cephcsi/cephcsi:v3.8.0
systemd[1]: run-containerd-io.containerd.runtime.v2.task-k8s.io-<id>-rootfs.mount: Deactivated successfully.
containerd[1717]: time="2025-05-30T11:05:31.547087240+02:00" level=info
msg="shim disconnected" id=<id> namespace=k8s.io
containerd[1717]: time="2025-05-30T11:05:31.547113810+02:00" level=warning
msg="cleaning up after shim disconnected" id=<id> namespace=k8s.io
containerd[1717]: time="2025-05-30T11:05:31.547119691+02:00" level=info
msg="cleaning up dead shim" namespace=k8s.io
kubelet[1605]: I0530 11:05:32.203748 1605 scope.go:117] "RemoveContainer" containerID="<id>"
ctre sidenote
There's also a ctre application log in there:
ctre[1581]: T 1748595930 k8s.io quay.io/cephcsi/cephcsi:v3.8.0
ctre is our custom tool that keeps track of the starting and stopping of images in containerd.
Docker and Kubernetes tend to accumulate lots of old images in containerd. These unused images aren't just disk space hogs; they can also degrade the performance of the underlying ZFS filesystem. We needed a smarter way to prune.
While common methods like crictl rmi --prune aggressively clear everything they can (barring currently in-use images), this isn't always ideal. Imagine a periodic cron job relying on a large image; if it's pruned every time the job isn't running, the image has to be downloaded anew for the next execution.
ctre addresses this by recording image usage timestamps. Our nightly ctre prune then selectively keeps images likely to be reused, while purging those that haven't seen recent activity.
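Very roughly, the idea behind the prune is something like this sketch (the timestamp file, its path, and the 14-day cutoff are hypothetical; ctre's real implementation differs):
# keep images that were seen in use during the last 14 days, remove the rest
cutoff=$(date -d '14 days ago' +%s)
while read -r last_used image; do
    if [ "$last_used" -lt "$cutoff" ]; then
        crictl rmi "$image"
    fi
done < /var/lib/ctre/last-used.txt  # hypothetical "epoch image" list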
Seeing this line provided an additional hint as to the problem's source: the container involved was running the quay.io/cephcsi/cephcsi image.
bpftrace to find the execve
The “Invalid ELF” errors were appearing in the logs as regular as clockwork, so a good place to start seemed to be looking for execve() calls. Maybe a non-zero return value would point us right to the culprit.
bpftrace deserves a blog post of its own. One thing it can do is provide strace-like functionality, but for all processes at once. Instead of strace -e execve on a single application, I do this:
# bpftrace -e '
tracepoint:syscalls:sys_enter_execve
{
printf("%s.%d [%d] execve (%s) ", strftime("%H:%M:%S", nsecs), nsecs % 1000000000, pid, comm);
join(args->argv);
}'
tar (child): xz: Cannot exec: No such file or directory
...
11:46:40.231050035 [2826375] execve (zfs-driver) zfs list -d 1 -s name -o name,guid,available,used -H -p
11:46:40.231597246 [2826376] execve (zfs) chroot /host /sbin/zfs list -d 1 -s name -o name,guid,available,used -H -p
...
This will show all execve() calls happening on the system.
You can safely ignore that xz: Cannot exec message. If you want to get rid of it, see the tips in the bpftrace-deb README.
Another thing to note is the comm, the argument between parentheses like (zfs-driver), which looks like the parent process. That is not because it is the parent, but because the parent forked/cloned itself before becoming the new process: when your bash shell invokes a new program, it first fork()s, so there is a bash parent and a bash child. The child then calls execve() to become the new program. The bpftrace log shows the moment just before that transition, which explains why comm looks like the parent.
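You can see this for yourself with a one-liner along the same lines (assuming bpftrace is available): trace in one terminal, start a program from a shell in another.
# bpftrace -e 'tracepoint:syscalls:sys_enter_execve { printf("%s: ", comm); join(args->argv); }'
Then, in the other terminal, run something like ls /tmp. The trace line reads roughly "bash: ls /tmp": at the moment of the execve, the process making the call is still the forked bash, not yet ls.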
The log messages were generally about 5 minutes apart, so I didn't have to wait long. Indeed, shortly afterwards the kernel log listed the next error:
[vr mei 30 11:46:40 2025] Invalid ELF header magic: != \x7fELF
And around that time, these suspicious-looking execve() calls were made:
11:46:40.231597246 [2826376] execve (sh) modprobe rbd
11:46:40.231597246 [2826376] execve (sh) umount
csi-rbdplugin in CrashLoopBackOff
That reminds me, I did see some csi-rbdplugin rook-ceph Kubernetes pods in CrashLoopBackOff state. That was no coincidence. All failing pods were on the new Noble nodes. The pods running on Jammy were fine.
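A quick way to confirm which nodes the crashing pods were on (assuming the default rook-ceph namespace):
# kubectl -n rook-ceph get pods -o wide | grep csi-rbdplugin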
I ran modprobe rbd manually on the node and waited for the next cycle.
11:51:44.533278002 [2826376] execve (sh) modprobe nbd
I loaded the nbd module as well. Now all the pods started successfully, and the ELF error disappeared from the logs.
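For reference, the manual workaround on each Noble node boils down to loading the modules and checking that they stuck:
root@noble.example.com# modprobe rbd && modprobe nbd
root@noble.example.com# lsmod | grep -E '^(rbd|nbd)'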
Old modprobe on new system
Now everything fell into place: during initialization, the Ceph CSI pods try to load the rbd and nbd kernel modules, but on the newer systems they couldn't.
First, let's look at the modprobe version on the host system:
root@noble.example.com# modprobe --version
kmod version 31
+ZSTD +XZ -ZLIB +LIBCRYPTO -EXPERIMENTAL
Next, the modprobe version inside the quay.io/cephcsi/cephcsi:v3.8.0
image:
sh-4.4# modprobe --version
kmod version 25
+XZ +ZLIB +OPENSSL -EXPERIMENTAL
Here's the critical difference. Previously, on the older Jammy systems, the kernel modules were not compressed:
root@jammy.example.com# find /lib/modules/$(uname -r) -name rbd*
/lib/modules/5.15.0-116-generic/kernel/drivers/block/rbd.ko
But now, on the Noble systems, they were compressed with ZSTD:
root@noble.example.com# find /lib/modules/$(uname -r) -name rbd*
/lib/modules/6.8.0-60-generic/kernel/drivers/block/rbd.ko.zst
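The same difference shows up in the kernel build configuration (assuming Ubuntu's usual /boot/config-* files are present):
root@noble.example.com# grep MODULE_COMPRESS /boot/config-$(uname -r)
On Noble this should list CONFIG_MODULE_COMPRESS_ZSTD=y, while the Jammy 5.15 kernel leaves its modules uncompressed.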
To confirm the kmod incompatibility directly from within the container, I did a quick test:
sh-4.4# cat /etc/os-release
NAME="CentOS Stream"
VERSION="8"
ID="centos"
ID_LIKE="rhel fedora"
sh-4.4# find /lib/modules/6.8.0-60-generic/ -name ipta*
/lib/modules/6.8.0-60-generic/kernel/net/ipv4/netfilter/iptable_raw.ko.zst
/lib/modules/6.8.0-60-generic/kernel/net/ipv4/netfilter/iptable_security.ko.zst
/lib/modules/6.8.0-60-generic/kernel/net/ipv4/netfilter/iptable_filter.ko.zst
/lib/modules/6.8.0-60-generic/kernel/net/ipv4/netfilter/iptable_nat.ko.zst
/lib/modules/6.8.0-60-generic/kernel/net/ipv4/netfilter/iptable_mangle.ko.zst
sh-4.4# modprobe iptable_nat
modprobe: ERROR: could not insert 'iptable_nat': Exec format error
[vr mei 30 12:36:09 2025] Invalid ELF header magic: != \x7fELF
Yes, that test confirms that modprobe was the culprit: the older kmod version (25) inside the container cannot cope with the ZSTD-compressed kernel modules found on the Noble host. Manually preloading the rbd and nbd modules on the nodes was the easiest fix for now.
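To make that preload survive a reboot, one option (assuming a standard systemd setup; the file name is arbitrary) is a modules-load.d drop-in:
root@noble.example.com# printf 'rbd\nnbd\n' > /etc/modules-load.d/rook-ceph.conf
root@noble.example.com# systemctl restart systemd-modules-load.service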
Technical nuance
While I was initially looking to blame an execve call for directly triggering the “Invalid ELF header magic” error, it turns out that's not strictly possible. An execve call attempting to run something that is not a valid binary will typically fail with ENOEXEC (errno 8, “Exec format error”) returned to the calling process; no kernel error would be logged.
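That is easy to check (the file name is just an example; strace is used because an interactive shell may fall back to interpreting the file as a script, hiding the raw error):
$ printf 'not an ELF' > /tmp/notelf && chmod +x /tmp/notelf
$ strace -e trace=execve /tmp/notelf
The execve should fail with -1 ENOEXEC (Exec format error), and dmesg stays silent.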
The kernel error we observed is actually only triggered internally by the kernel during a module load failure. This can be seen in the kernel source code for module loading:
/*
 * Check userspace passed ELF module against our expectations, and cache
 * useful variables for further processing as we go.
 * ...
 */
static int elf_validity_cache_copy(struct load_info *info, int flags)
{
    // ...
    if (memcmp(info->hdr->e_ident, ELFMAG, SELFMAG) != 0) {
        pr_err("Invalid ELF header magic: != %s\n", ELFMAG);
        goto no_exec;
    }
    // ...
However, tracing the execve calls was still crucial as it directly pointed me to modprobe as the culprit process. So, while the initial assumption about the direct trigger mechanism was slightly off, the debugging tactic was ultimately effective.