invalid elf header magic / rook-ceph / k8s
Yesterday, we noticed some previously unseen messages in the logs:
Invalid ELF header magic: != \x7fELF
Of course an ELF binary should start with a valid header. Who or what was trying to run something else?
These messages appeared on Kubernetes nodes that were recently installed and not yet fully in use for production workloads. Coincidentally, these particular nodes were running Ubuntu 24.04 (Noble Numbat), while the rest of this cluster was still on older versions.
Yes, we're already well into 2025. Upgrading from Jammy (22.04) to Noble (24.04) has been less pain-free than the upgrade from Focal (20.04) to Jammy (22.04), so we are a bit behind. If others have been slow to upgrade as well, that might explain why so many new application bugs turn up this late.
In dmesg the “Invalid ELF” message was clear and readable. In journalctl it reads [33B blob data]:
# journalctl -S '2025-05-30 11:00' --grep 'ELF'
mei 30 11:00:25 noble.example.com kernel: [33B blob data]
mei 30 11:05:30 noble.example.com kernel: [33B blob data]
mei 30 11:10:41 noble.example.com kernel: [33B blob data]
journalctl makes sure no unprintable garbage goes to the screen, in this case the \x7f. But it does not replace just that one byte: it hides the entire message, including the printable ASCII.
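If you do want journalctl to print the full message anyway, the -a/--all flag shows fields even when they contain unprintable characters, so something like this should reveal the raw line:
# journalctl -a -S '2025-05-30 11:00' --grep 'ELF'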
Now that I knew what log line to look for, I checked the surrounding log lines. There, in the middle, there's the kernel: [33B blob data].
Before it, containerd shows that a container is started:
kubelet[1605]: I0530 11:05:30.172379 1605 scope.go:117]
"RemoveContainer" containerID="<id>"
containerd[1717]: time="2025-05-30T11:05:30.173674804+02:00" level=info
msg="CreateContainer within sandbox \"<id>\" for container &ContainerMetadata{Name:csi-rbdplugin,Attempt:853,}"
containerd[1717]: time="2025-05-30T11:05:30.182002680+02:00" level=info
msg="CreateContainer within sandbox \"<id>\" for &ContainerMetadata{Name:csi-rbdplugin,Attempt:853,} returns container id \"<id>\""
containerd[1717]: time="2025-05-30T11:05:30.182178318+02:00" level=info
msg="StartContainer for \"<id>\""
ctre[1581]: C 1748595930 k8s.io quay.io/cephcsi/cephcsi:v3.8.0
systemd[1]: Started cri-containerd-<id>.scope - libcontainer container <id>.
containerd[1717]: time="2025-05-30T11:05:30.218713573+02:00" level=info
msg="StartContainer for \"<id>\" returns successfully"
ctre[1581]: T 1748595930 k8s.io quay.io/cephcsi/cephcsi:v3.8.0
kernel: [33B blob data]
systemd[1]: cri-containerd-<id>.scope: Deactivated successfully.
containerd[1717]: time="2025-05-30T11:05:30.258058437+02:00" level=info
msg="received exit event container_id:\"<id>\" id:\"<id>\" pid:1942336 exit_status:255 exited_at:{seconds:1748595930 nanos:257930298}"
ctre[1581]: t 1748595930 k8s.io quay.io/cephcsi/cephcsi:v3.8.0
systemd[1]: run-containerd-io.containerd.runtime.v2.task-k8s.io-<id>-rootfs.mount: Deactivated successfully.
containerd[1717]: time="2025-05-30T11:05:31.547087240+02:00" level=info
msg="shim disconnected" id=<id> namespace=k8s.io
containerd[1717]: time="2025-05-30T11:05:31.547113810+02:00" level=warning
msg="cleaning up after shim disconnected" id=<id> namespace=k8s.io
containerd[1717]: time="2025-05-30T11:05:31.547119691+02:00" level=info
msg="cleaning up dead shim" namespace=k8s.io
kubelet[1605]: I0530 11:05:32.203748 1605 scope.go:117] "RemoveContainer" containerID="<id>"
ctre sidenote
There's also a ctre application log in there:
ctre[1581]: T 1748595930 k8s.io quay.io/cephcsi/cephcsi:v3.8.0
ctre is our custom tool that keeps track of the starting and stopping of images in containerd.
Docker and Kubernetes tend to accumulate lots of old images in containerd. These unused images aren't just disk space hogs; they can also degrade the performance of the underlying ZFS filesystem. We needed a smarter way to prune.
While common methods like crictl rmi --prune aggressively clear everything they can (barring currently in-use images), this isn't always ideal. Imagine a periodic cron job relying on a large image; if it's pruned every time the job isn't running, the image has to be downloaded anew for the next execution.
ctre addresses this by recording image usage timestamps. Our nightly ctre prune then selectively keeps images likely to be reused, while purging those that haven't seen recent activity.
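Very roughly, the idea behind the prune is something like this sketch (the timestamp file, its path, and the 14-day cutoff are hypothetical; ctre's real implementation differs):
# keep images that were seen in use during the last 14 days, remove the rest
cutoff=$(date -d '14 days ago' +%s)
while read -r last_used image; do
    if [ "$last_used" -lt "$cutoff" ]; then
        crictl rmi "$image"
    fi
done < /var/lib/ctre/last-used.txt  # hypothetical "epoch image" list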
Seeing this line provided an additional hint as to the problem's source: the container involved was running the quay.io/cephcsi/cephcsi image.
bpftrace to find the execve
The “Invalid ELF” errors were appearing in the logs as regular as clockwork, so a good place to start seemed to be looking for execve() calls. Maybe a non-zero return value would point us right to the culprit.
bpftrace deserves a blog post of its own. One thing it can do is provide strace-like functionality, but for all processes at once. Instead of strace -e execve on a single application, I do this:
# bpftrace -e '
tracepoint:syscalls:sys_enter_execve
{
printf("%s.%d [%d] execve (%s) ", strftime("%H:%M:%S", nsecs), nsecs % 1000000000, pid, comm);
join(args->argv);
}'
tar (child): xz: Cannot exec: No such file or directory
...
11:46:40.231050035 [2826375] execve (zfs-driver) zfs list -d 1 -s name -o name,guid,available,used -H -p
11:46:40.231597246 [2826376] execve (zfs) chroot /host /sbin/zfs list -d 1 -s name -o name,guid,available,used -H -p
...
This will show all execve() calls happening on the system.
You can safely ignore that xz: Cannot exec message. If you want to get rid of it, see the tips in the bpftrace-deb README.
Another thing to note is the comm, the argument between parentheses like (zfs-driver), which looks like the parent process. That is not because it is the parent, but because the parent forked/cloned itself before becoming the new process: when your bash shell invokes a new program, it first fork()s, so there is a bash parent and a bash child. The child then calls execve() to become the new program. The bpftrace log shows the moment just before that transition, which explains why comm looks like the parent.
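You can see this for yourself with a one-liner along the same lines (assuming bpftrace is available): trace in one terminal, start a program from a shell in another.
# bpftrace -e 'tracepoint:syscalls:sys_enter_execve { printf("%s: ", comm); join(args->argv); }'
Then, in the other terminal, run something like ls /tmp. The trace line reads roughly "bash: ls /tmp": at the moment of the execve, the process making the call is still the forked bash, not yet ls.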
The log messages were generally about 5 minutes apart, so I didn't have to wait long. Indeed, shortly afterwards the kernel log listed the next error:
[vr mei 30 11:46:40 2025] Invalid ELF header magic: != \x7fELF
And around that time, these suspicious-looking execve() calls were made:
11:46:40.231597246 [2826376] execve (sh) modprobe rbd
11:46:40.231597246 [2826376] execve (sh) umount
csi-rbdplugin in CrashLoopBackOff
That reminds me, I did see some csi-rbdplugin rook-ceph Kubernetes pods in CrashLoopBackOff state. That was no coincidence. All failing pods were on the new Noble nodes. The pods running on Jammy were fine.
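A quick way to confirm which nodes the crashing pods were on (assuming the default rook-ceph namespace):
# kubectl -n rook-ceph get pods -o wide | grep csi-rbdplugin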
I ran modprobe rbd manually on the node and waited for the next cycle.
11:51:44.533278002 [2826376] execve (sh) modprobe nbd
I loaded the nbd module as well. Now all the pods started successfully, and the ELF error disappeared from the logs.
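For reference, the manual workaround on each Noble node boils down to loading the modules and checking that they stuck:
root@noble.example.com# modprobe rbd && modprobe nbd
root@noble.example.com# lsmod | grep -E '^(rbd|nbd)'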
Old modprobe on new system
Now everything fell into place: during initialization, the Ceph CSI pods try to load the rbd and nbd kernel modules, but on the newer systems they couldn't.
First, let's look at the modprobe version on the host system:
root@noble.example.com# modprobe --version
kmod version 31
+ZSTD +XZ -ZLIB +LIBCRYPTO -EXPERIMENTAL
Next, the modprobe version inside the quay.io/cephcsi/cephcsi:v3.8.0
image:
sh-4.4# modprobe --version
kmod version 25
+XZ +ZLIB +OPENSSL -EXPERIMENTAL
Here's the critical difference. Previously, on the older Jammy systems, the kernel modules were not compressed:
root@jammy.example.com# find /lib/modules/$(uname -r) -name rbd*
/lib/modules/5.15.0-116-generic/kernel/drivers/block/rbd.ko
But now, on the Noble systems, they were compressed with ZSTD:
root@noble.example.com# find /lib/modules/$(uname -r) -name rbd*
/lib/modules/6.8.0-60-generic/kernel/drivers/block/rbd.ko.zst
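The same difference shows up in the kernel build configuration (assuming Ubuntu's usual /boot/config-* files are present):
root@noble.example.com# grep MODULE_COMPRESS /boot/config-$(uname -r)
On Noble this should list CONFIG_MODULE_COMPRESS_ZSTD=y, while the Jammy 5.15 kernel leaves its modules uncompressed.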
To confirm the kmod incompatibility directly from within the container, I did a quick test:
sh-4.4# cat /etc/os-release
NAME="CentOS Stream"
VERSION="8"
ID="centos"
ID_LIKE="rhel fedora"
sh-4.4# find /lib/modules/6.8.0-60-generic/ -name ipta*
/lib/modules/6.8.0-60-generic/kernel/net/ipv4/netfilter/iptable_raw.ko.zst
/lib/modules/6.8.0-60-generic/kernel/net/ipv4/netfilter/iptable_security.ko.zst
/lib/modules/6.8.0-60-generic/kernel/net/ipv4/netfilter/iptable_filter.ko.zst
/lib/modules/6.8.0-60-generic/kernel/net/ipv4/netfilter/iptable_nat.ko.zst
/lib/modules/6.8.0-60-generic/kernel/net/ipv4/netfilter/iptable_mangle.ko.zst
sh-4.4# modprobe iptable_nat
modprobe: ERROR: could not insert 'iptable_nat': Exec format error
[vr mei 30 12:36:09 2025] Invalid ELF header magic: != \x7fELF
Yes, that test confirms that modprobe was the culprit: the older kmod version (25) inside the container cannot cope with the ZSTD-compressed kernel modules found on the Noble host. Manually preloading the rbd and nbd modules on the nodes was the easiest fix for now.
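To make that preload survive a reboot, one option (assuming a standard systemd setup; the file name is arbitrary) is a modules-load.d drop-in:
root@noble.example.com# printf 'rbd\nnbd\n' > /etc/modules-load.d/rook-ceph.conf
root@noble.example.com# systemctl restart systemd-modules-load.service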
Technical nuance
While I was initially looking to blame an execve call for directly triggering the “Invalid ELF header magic” error, it turns out that's not strictly possible. An execve call attempting to run something that is not a valid binary will typically fail with ENOEXEC (errno 8, “Exec format error”) returned to the calling process; no kernel error would be logged.
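That is easy to check (the file name is just an example; strace is used because an interactive shell may fall back to interpreting the file as a script, hiding the raw error):
$ printf 'not an ELF' > /tmp/notelf && chmod +x /tmp/notelf
$ strace -e trace=execve /tmp/notelf
The execve should fail with -1 ENOEXEC (Exec format error), and dmesg stays silent.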
The kernel error we observed is actually only triggered internally by the kernel during a module load failure. This can be seen in the kernel source code for module loading:
/*
 * Check userspace passed ELF module against our expectations, and cache
 * useful variables for further processing as we go.
 * ...
 */
static int elf_validity_cache_copy(struct load_info *info, int flags)
{
    // ...
    if (memcmp(info->hdr->e_ident, ELFMAG, SELFMAG) != 0) {
        pr_err("Invalid ELF header magic: != %s\n", ELFMAG);
        goto no_exec;
    }
    // ...
However, tracing the execve calls was still crucial as it directly pointed me to modprobe as the culprit process. So, while the initial assumption about the direct trigger mechanism was slightly off, the debugging tactic was ultimately effective.