the mystery of pod crashloops and etxtbsy

Jul 1

Written By R T

This is Part 1 of a Kubernetes debugging report. This part is focussed on primary issues that caused worker nodes to completely explode.
Part 2 came before this, but it’s available here.

Warning

I just found some awful stuff in this app, and thought that it was interesting enough to explain in more detail so that we might all learn from it together. However, I am not a Rust or Kernel engineer. As such, some of this might be wrong, or… ‘misguided’.

The problem

I run core-dump-handler for handling dumps in Kubernetes from containers that have crashed. There’s a handful of issues with this app, but the one we’re discussing today is this. When the container tries to start, occasionally I am greeted with the following:

Error: Text file busy (os error 26)

I can’t seem to find the log lines that preceded this, but there’s a similar open issue on github that shows similar logs:

[2022-04-26T20:58:18Z INFO  core_dump_agent] Copying the composer from ./vendor/default/cdc to /home/kubernetes/bin/cdc
Error: Text file busy (os error 26)

I had never seen “Text file busy” in my tenure as an Ops guy, so I thought it was worth digging into what this meant, and how it could be fixed.

The basics

Lets cover some terminology and definitions that we’ll need in this here post:

inode - Index Node - The thing that tells the kernel where a file actually lives on the disk.
dentry - Directory Entry - The thing that tells the kernel what inode lives on what path
fd - File Descriptor - The thing describes to the kernel what the file is, where it lives, pointers to operations, etc.

etxtbsy

ETXTBSY is a kernel-level error that you will get when you try to write to an inode that is currentlybeing executed by the kernel, the kernel will throw you this error. “OS Error 26” is ETXTBSY.

You tend not to see this in modern systems, because there are atomic functions provided by the filesystem that mean that you very rarely actually write to the very same file that you read in. (Spoiler: Most of the time, you will be writing to a completely new file, which is then shunted into the location where the old one was).

That’s why when you google “ETXTBSY”, the first article you will see is one entitled:
The Shrinking Role of ETXTBSY

How to copy a file

There’s functionally two ways to copy a file in Linux. Naive copies, or Atomic copies. I’ll explain these for you using some of the actual system calls that you might see along the way, and you can see the difference between the two.

Naive copy

open() FileA to get an fd for the file as fd_FileA
open() FileB to get an fd for the file as fd_FileB
write() the contents of read(fd_fileA) to fd_FileB
close() both file descriptors fd_FileA, and fd_FileB

That is it. No more, no less.

Atomic copy

open() FileA to get an fd for the file as fd_FileA
mkstemp() to get an fd for a temporary file as fd_tmpFile
write() the contents of read(fd_fileA) to fd_tmpFile
close() handles to fd_FileA and fd_tmpFile
rename() the tmpFile to FileB

As explained, rename() is an atomic function provided by the filesystem. This function uses write-ahead logging, essentially meaning that when it performs an action, it is statefully aware of what was and wasn’t finished if the system crashed mid-task.

What is pertitent to the topic at hand is that rename() does not preserve the inode of the destination, or try to write to it. What it does is unlink() the destination dentry from the inodes for those files, and updates the source dentry with the path of the destination. Now, if the kernel has an fd open to a particular file, the inode that the kernel is executing is never written to, so the kernel is free to finish what it’s doing with that inode and close it out. Only once it’s finished using it will the inode actually get garbage collected by the filesystem.

How the app works

Now you know what causes ETXTBSY errors, you’ll never guess how it works… Oh, yes, that’s right: It does a naive copy.

#...    
info!("Copying the composer from {} to {}", location, destination);
fs::copy(location, destination)?;
#...

fs::copy() just gets an fd for location and destination and copies the file’s contents. This is Naive. It is not using any atomic functions for this. If the host is currently using the cdc binary to write a core dump and your container restarts, the container will crash.

You’ll never guess what kubelet does when core-dump-handler crashes… yes, it writes a core dump using cdc, meaning that if Kubernetes has not added any crashLoopBackoff delays yet, it will immediately start again and crash… again.

Now, you combine that with this secondary issue with this chart, and you’ve got yourself a really really unhappy host in absolutely no time.

How to fix it

I’m yet to raise a PR for this, because it seems as though there’s been no PR approvals for the last two years, so I’m not in any particular hurry, but the fix really isn’t that hard. We just need to use at atomic functions, which are already available in Rust.

A Temporary file

A temporary file would allow us to write to a completely separate inode, and “replace” the file in place using tmp.persist() which uses the kernel’s atomic replace() function.

It’ll look something like this:

use tempfile::NamedTempFile;
use std::os::unix::fs::PermissionsExt;
# ...
fn copy_core_dump_composer_to_hostdir( #...
    let destination = format!("{host_location}/{CDC_NAME}");
    # ...
    let mut src = File::open(&location)?;
    let mut tmp = tempfile::NamedTempFile::new_in(host_location)?;
    std::io::copy(&mut src, &mut tmp)?;
    tmp.as_file().set_permissions(fs::Permissions::from_mode(0o755))?;
    tmp.as_file().sync_all()?;
    tmp.persist(&destination)?;
    #...

A not-so-temporary file

We could also do the below; albeit this is slightly more risky, because you risk potentially removing files that people have plonked on their file system themselves. This uses fs::rename(), which again uses Linux’s atomic rename() ‘under the hood’.

fn copy_core_dump_composer_to_hostdir( #...
    let destination = format!("{host_location}/{CDC_NAME}");
    let tmp_destination = format!("{host_location}/{CDC_NAME}.tmp");
    #...
    info!("Copying the composer from {} to {}", location, destination);
    fs::copy(&location, &tmp_destination)?;
    fs::rename(&tmp_destination, &destination)?;
    # ...

The lazybones way

Even more simply, knowing that the cdc binary file must be managed by ‘us’, we can just fs::remove_file() first. This isn’t even atomic, but this performs an unlink() of the dentry before we copy the file. This means that the kernel can continue to use any inodeit has open, but frees the dentry to be replaced by the fs::copy().

Granted, this is far less safe because there is a window where the file is gone, but not created yet which opens a window for missed core-dumps, or race conditions with other things creating the cdc file before we get to do it; but core-dumps aren’t all that important (at least to me), and a race condition here isn’t really a statistical probability, so this is a reasonable middle-ground between the risks and essentially rewriting the entire function.

# ...    
let _ = fs::remove_file(&destination);
fs::copy(location, destination)?;
# ...

How the maintainers fix this

According to the README.md in the repo, and on this issue, rather than actually fixing this, is to:

Delete the chart. Don't worry this won't impact the data stored in object storage
Ensure the persitent volume forhost-name are deleted before continuing
Install the chart using the same bucket name as per the first install but tell the chart not to creat[e] it

To me, this is just absurd.

My take

I really don’t like throwing shade on other developers who are just trying to do their best with the information they have, inside the guardrails their company is giving them… but I am genuinely flabbergasted (or am I?) that IBM want their name associated with this.

At the time of writing this all the workflow, on the open PRs that are pending, haven’t even worked for more than three months; and there’s not been a single commit merged into the trunk for over two years.

This isn’t the only issue with this repo, either. There’s another issue with this that I’ve covered in this post, where the container gets into a crash loop (caused by this issue), and that crash loop absolutely hoses the worker node with mounts due to mountPropagation craziness.

I mean, just look at the quality of this function: https://github.com/IBM/core-dump-handler/blob/c315d2d79494ab4dd8f62b460d23892d6138ac14/core-dump-agent/src/main.rs#L445

It’s the wettest code I’ve ever seen. What happens if we add more supported operating systems? Do we just copy the rhel7 part of the function again?

Honestly, I’m tempted to learn Rust just to rewrite this entire thing in a sane way. Time to log back into Udemy? Maybe another night… Watch this space.

R T