Kubernetes & exponential Bidirectional Mounts

Jul 1

Written By R T

This is Part 2 of a Kubernetes debugging report. This part is focussed on secondary issues caused by the crashLoop, and what actually causes it in the kernel.
Part 1 is available here.

WARNING

I am not a kernel engineer. I’m just an Ops guy who started with “why the hell are my mount tables exploding exponentially” and ended up reading through the witchcraft and voodoo of the Linux Kernel. Some of this may be incorrect or ‘misguided’.

This post is going to get very dense with information about Kubernetes mountPropagation and the kernel’s source code, functions and behaviors.
Engage the nerd-afterburners in your brain before you carry on reading.

The problem

My pod configuration is (was) using the following mount config:

- name: host-volume
  mountPath: /var/mnt/core-dump-handler
  mountPropagation: Bidirectional
- name: sub-volume
  mountPath: /var/mnt/core-dump-handler/cores
  mountPropagation: Bidirectional

Whenever the container restarted, I saw a fairly unpredictable increase (anywhere from one, to multiple thousands) in the number of mounts on the host. It took a while to figure out what was actually happening, but I managed to reduce this down to a simple replication method:

ps auxf | grep '[a]pp/core' && \
kill -9 $(ps auxf | grep '[a]pp/core' | awk '{print $2}') && \
echo "YASSSS";
sleep 10;
cat /proc/mounts | grep 'core' | wc -l

Plonk that in a little script, and this showed me that I actually had exponential growth:

[root@host ~]# ./killScript.sh
15
[root@host ~]# ./killScript.sh
31
[root@host ~]# ./killScript.sh
63

Given 12, 13, 14 iterations, I’m suddenly staring down the barrel of tens of thousands of mounts in the host’s mount table. This severely handicaps system performance, and you can see a dramatic increase in the sys CPU% on the host as it tries to process all of these mounts whilst performing its jobs.

Anything that touches the mount table (containerd or the kubelet, for example) becomes extremely slow, and struggles to even start/stop pods & containers on that host. Even logging into it to clean it up manually is nigh on impossible, and the instance often needs restarting or trashing for replacement. I wasn’t even able to drain these nodes in some cases.
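If you want to catch this before the node falls over, a small guard along these lines might help. This is only a sketch: count_mounts and check_mounts are names I made up, the limit is whatever you consider sane for your hosts, and the optional file argument exists purely so it can be pointed at a saved copy of a mount table.

```shell
# count_mounts PATTERN [FILE]
# Prints how many lines of FILE (default: /proc/self/mounts) contain PATTERN.
count_mounts() {
    # grep -c prints 0 but exits non-zero when nothing matches, so swallow the status.
    grep -cF -- "$1" "${2:-/proc/self/mounts}" || true
}

# check_mounts PATTERN LIMIT [FILE]
# Prints a warning and returns 1 when more than LIMIT mounts match PATTERN.
check_mounts() {
    count=$(count_mounts "$1" "${3:-/proc/self/mounts}")
    if [ "$count" -gt "$2" ]; then
        echo "WARNING: $count mounts match '$1' (limit $2)" >&2
        return 1
    fi
    echo "OK: $count mounts match '$1'"
}

# Example (hypothetical limit): check_mounts core 100
```

Run from cron or a node health check, something like this would flag the growth while the host is still responsive enough to act on it.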

What I thought was odd about this was that whilst I expected to see mounts leaking back to the host, what I didn’t expect to see was the number of mounts almost doubling with each iteration. That is to say, instead of:

$countMounts = $containerRestarts

What I actually see is:

$countMounts = (2 ^ $containerRestarts)-1

Which, at the time of discovering this, I thought was just… odd. And anyone that knows me knows that when I find something that looks ‘odd’, I cannot help myself but go and find out the cause of the odd thing.

The solution

… to this specific application’s issues, quite simply, is to stop using bidirectional mounts. They are entirely unnecessary, and yet, they are the default mountPropagation mode for this chart.

Anyway, that answer is really boring, and I know that’s not why you’re here, so let’s get back to the juicy bits…

Kubernetes mount modes

Kubernetes has a handful of mountPropagation modes.

‘None’ = Private

mountPropagation mode of “None” (or just unset) asks the kernel to provide a “Private” (MS_PRIVATE) mount. This is the typical kind of mount in Kubernetes, and provides the kind of isolation you want from a containerized environment.

‘HostToContainer’ = Slave

mountPropagation mode of “HostToContainer” asks the kernel to provide a “Slave” (MS_SLAVE) mount. This means that it will receive mount events from the host, but it won’t be able to leak mounts back down onto said host.

‘Bidirectional’ = Shared

mountPropagation mode of “Bidirectional” asks the kernel to provide a “Shared” (MS_SHARED) mount.

This means that mount events that take place inside the container’s mount namespace can propagate down to the host. This is helpful if you want to manage the mounts on the host from the container, but will very likely impact your security exposure somewhat drastically as a compromised container can, for all intents and purposes, trivially take control of the host.

There are very few use cases for bidirectional mounts, so if you find yourself thinking that you need this mount mode, you really need to consider your choices. And also, you should probably keep reading to make sure you understand the implications of this.
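The mapping above is small enough to write down as a lookup. This is just my own cheat sheet in shell form (the function name is made up), not anything Kubernetes or the kernel ships:

```shell
# k8s_to_kernel_propagation MODE
# Prints the kernel propagation type described above for a Kubernetes
# mountPropagation value. An unset/empty MODE is treated as "None".
k8s_to_kernel_propagation() {
    case "${1:-None}" in
        None)            echo "private (MS_PRIVATE)" ;;
        HostToContainer) echo "slave (MS_SLAVE)" ;;
        Bidirectional)   echo "shared (MS_SHARED)" ;;
        *)               echo "unknown mountPropagation mode: $1" >&2; return 1 ;;
    esac
}

k8s_to_kernel_propagation Bidirectional   # prints: shared (MS_SHARED)
```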

Kernel mount modes

MS_PRIVATE

A private mount is completely isolated from mount propagation.

It doesn't belong to any peer group and has no master. Mount events happening elsewhere never reach it, and mount events happening inside it never propagate anywhere else. It's the default mode for mounts.

MS_SLAVE

A slave mount has a master.

Specifically, the master is from a shared peer group that the slave is subordinate to. When a mount event occurs in the master's group, the slave receives a copy of that event (mounts appear inside it automatically). The reverse never happens. If something is mounted inside the slave, that event stays local (similar to private) and is never propagated back to the master's group or anywhere else. It's a one-way relationship: master → slave.

MS_SHARED

A shared mount is a member of a peer group.

When a mount event occurs on any peer in the group, the kernel propagates it to every other peer. Each peer gets a copy of the mount. It's fully bidirectional: Events flow in and out equally, and the kernel "walks" the mounts to propagate the events (subject to some checks on whether or not they are needed for that particular mount).

Any mount created inside a shared mount is broadcast to all peers in the group, i.e. any event from a peer is received by every other peer too.
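You can tell which of these modes a mount is in from the optional fields of a mountinfo line: a shared:N tag means it is in peer group N, a master:N tag means it is a slave of group N, and no tag at all means private. A rough sketch of reading that (propagation_of is my own helper name; it takes one mountinfo line as its argument):

```shell
# propagation_of LINE
# Prints the propagation type implied by the optional fields of a single
# /proc/<pid>/mountinfo line. The optional fields sit between the mount
# options (field 6) and the lone "-" separator.
propagation_of() {
    printf '%s\n' "$1" | awk '{
        type = "private"
        for (i = 7; i <= NF && $i != "-"; i++) {
            if ($i ~ /^shared:/)          shared = 1
            else if ($i ~ /^master:/)     master = 1
            else if ($i == "unbindable")  type = "unbindable"
        }
        if (shared && master) type = "shared+slave"
        else if (shared)      type = "shared"
        else if (master)      type = "slave"
        print type
    }'
}
```

If you'd rather not parse anything, findmnt -o TARGET,PROPAGATION reports the same information per mount.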

MS_SHARED, specifically

Now, we’re going to talk in detail about what happens with Shared mounts, and how they work. Understanding what happens with these is fundamental to this problem.

Event propagation

A shared mount group is a group of mounts that exist in a peer group. When a new mount successfully joins the group, the kernel walks every other peer in the group and asks: "is this mount event relevant to you?". This relevance check is simple and essentially does:

Is this mount event not for the peer that instigated it?
Is the mountpoint a sub directory of the root path of a peer?
Is this not an orphaned peer?

Let’s say you add the following mounts in a shared mount group in order:

- Root: /
- Mount1: /var/1 > /mnt/a
- Mount2: /var/2 > /mnt/a/sub_mount

This will create a mount table that looks like this:

/ /                       (The root mount)
/var/1 /mnt/a             (Mount1)
/var/2 /mnt/a/sub_mount   (Mount2 original. Parent: Mount1)
/var/2 /var/1/sub_mount   (Mount2 copy. Parent: '/')

As you can see, when you add the last mount, both peers in this ring receive the mount event, and mount it.

Under the hood

Now, let’s talk kernel functions.

The functions that matter here are the following:

propagate_mnt() - The function that copies mount events to peers.
need_secondary() - The relevancy check run for each peer.
is_subdir() - “Is the desired mountpoint a subdirectory of this peer’s mount root?” check.

When the mount runs and succeeds:

the event gets passed to propagate_mnt(), which is the thing that walks through the peers in the group.
propagate_mnt() runs need_secondary() for each peer to determine if the event should be run for that peer.
need_secondary() uses is_subdir() to check if the event’s mountpoint dentry (i.e. the thing you want to mount) is a subdir of the peer’s mount root… and this is where things start to get interesting…
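The steps above can be modelled with plain path strings. To be very clear, this is a toy of my own making and not how the kernel does it: real propagation works on dentries and mount structures, the orphaned-peer check is left out entirely, and the propagate function here (with its "ROOT MOUNTPOINT" input format) is invented for illustration.

```shell
# propagate DENTRY INSTIGATOR
# Reads the peers of one group from stdin as "ROOT MOUNTPOINT" lines and
# prints the path where each receiving peer would get a copy of the mount.
#   DENTRY     - source-side path that the new mount was placed on
#   INSTIGATOR - mountpoint of the peer the mount was actually made through
propagate() {
    while read -r root mountpoint; do
        # Check 1: the instigating peer already has the mount, skip it.
        [ "$mountpoint" = "$2" ] && continue
        # Check 2: the dentry must be at or below this peer's root.
        case "$1/" in
            "${root%/}/"*) ;;
            *) continue ;;
        esac
        # Path of the dentry relative to the peer's root, re-based on its mountpoint.
        rel=${1#"${root%/}"}
        echo "${mountpoint%/}$rel"
    done
}

# The example from earlier: Mount2 lands on /mnt/a/sub_mount, which is the
# dentry /var/1/sub_mount, and the root peer receives a copy.
printf '%s\n' '/ /' '/var/1 /mnt/a' | propagate /var/1/sub_mount /mnt/a
# prints: /var/1/sub_mount
```

Note that the "at or below" check in this toy also passes when the dentry and the peer's root are the same path.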

is_subdir()

The function description for is_subdir() is:
Returns true if new_dentry is a subdirectory of the parent (at any depth). Returns false otherwise. […]

The first few lines of is_subdir() are as follows:

bool is_subdir(struct dentry *new_dentry, struct dentry *old_dentry)
{
    if (new_dentry == old_dentry)
        return true;

    /* and then lots of other stuff */
    /* ... */
}

This means that if you run is_subdir() for two dentries that are the same dentry, you will get a response of true. For example (with subpaths instead of dentries):

is_subdir(</my/test/dir>, </my/test/dir>) # returns true

However, this is obviously implied, intentional, and expected; and thusly hinted at by the “at any depth [including depth 0]” in the function description.
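To make the “depth 0” point concrete, here is the same idea on path strings, next to a strict variant for contrast. Both functions are hypothetical helpers of mine: the kernel compares dentries rather than strings, and the strict version is not something that exists in the kernel.

```shell
# is_subdir_inclusive NEW OLD
# True when NEW is OLD itself or lies anywhere beneath it (depth 0 included).
is_subdir_inclusive() {
    case "$1/" in
        "${2%/}/"*) return 0 ;;
    esac
    return 1
}

# is_subdir_strict NEW OLD
# True only when NEW lies strictly beneath OLD (depth 1 or more).
is_subdir_strict() {
    [ "${1%/}" != "${2%/}" ] && is_subdir_inclusive "$1" "$2"
}

is_subdir_inclusive /my/test/dir /my/test/dir && echo "inclusive: true"
is_subdir_strict    /my/test/dir /my/test/dir || echo "strict: false"
```

The two only disagree when both arguments are the same path.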

Which group to join?

The group is inherited from the source. Let’s mount /var into /tmp/mnt:

68 1 259:1 / /              ... shared:1 ...
39 68 0:32 / /tmp           ... shared:17 ...
4831 39 259:1 /var /tmp/mnt ... shared:1 ...

Whilst /tmp/mnt’s parent ID is 39 (tmp), it has inherited the shared:1 group because its source is also shared:1.
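To see everything that currently sits in a given peer group, you can filter mountinfo on the shared:N tag. A minimal sketch, assuming the standard mountinfo layout (mount ID in field 1, mountpoint in field 5, optional fields from field 7 up to the lone "-"); peers_of is a name I made up:

```shell
# peers_of GROUP [FILE]
# Prints "MOUNT_ID MOUNTPOINT" for every mount tagged shared:GROUP in FILE
# (default: /proc/self/mountinfo).
peers_of() {
    awk -v tag="shared:$1" '{
        # Only look at the optional fields, which end at the "-" separator.
        for (i = 7; i <= NF && $i != "-"; i++)
            if ($i == tag) print $1, $5
    }' "${2:-/proc/self/mountinfo}"
}

# Example: peers_of 1
```

Counting the lines this prints for the group your pod's volume belongs to is a quick way to watch the peer group grow between container restarts.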

Tie it all together

Iteration One

When bidirectional (shared) mounts go onto our pod as:

/var/mnt/core-dump-handler/ (mount1)
/var/mnt/core-dump-handler/cores (mount2)

The container starts and mount1 is mounted into the container’s process as shared:1, which you can see in /proc/$pid/mountinfo:

4649 4637 259:1 /var/mnt/core-dump-handler /var/mnt/core-dump-handler ... shared:1 ...

Because mount2 is a subdir of mount1 it “passes” the relevancy checks and we mount it “on top” of mount1.

However, mount1 is part of the shared:1 peer group, and it just did a mount event… so the event that applied mount2 to mount1 also gets given to all other peers in that group. And one of those peers is… our host:

/ / ... shared:1 ...

This means that when your container runs for the first time, the mount2 mount will also run on that peer, and join the ring on the actual host as another peer of the shared:1 group. This causes the shared:1 peer group to look like (sans the /var/mnt/core-dump-handler source):

/  /
/var/mnt/core-dump-handler/cores /var/mnt/core-dump-handler/cores

Iteration Two

Container runs, mount1 mounts, mount2 mounts. mount1 passes the mount event to all of its peers, of which there are now two (see above).

However, one of those peers is /var/mnt/core-dump-handler/cores. According to the “relevancy” checks performed in need_secondary(), /var/mnt/core-dump-handler/cores is a subdirectory of /var/mnt/core-dump-handler/cores but with a “depth” of zero… meaning that it should mount this mount… directly on top of itself.

Thusly, as well as / running the mount event, so, too, does the /var/mnt/core-dump-handler/cores peer.

And so our peer group looks like this:

/  /
/var/mnt/core-dump-handler/cores /var/mnt/core-dump-handler/cores
/var/mnt/core-dump-handler/cores /var/mnt/core-dump-handler/cores
/var/mnt/core-dump-handler/cores /var/mnt/core-dump-handler/cores

Iteration Three

Container runs, blah blah blah… mount1 passes event to peers… need_secondary() for each of the ‘…/cores/’ mounts passes, and our mount table looks like this:

/  /
/var/mnt/core-dump-handler/cores /var/mnt/core-dump-handler/cores
/var/mnt/core-dump-handler/cores /var/mnt/core-dump-handler/cores
/var/mnt/core-dump-handler/cores /var/mnt/core-dump-handler/cores
/var/mnt/core-dump-handler/cores /var/mnt/core-dump-handler/cores
/var/mnt/core-dump-handler/cores /var/mnt/core-dump-handler/cores
/var/mnt/core-dump-handler/cores /var/mnt/core-dump-handler/cores

Iteration X

You’ll notice, now, that the mounts are doubling with each iteration. And they do so at a rate of

(2 ^ $iteration) - 1
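Another way to read that formula: on every restart, each existing ‘…/cores’ mount receives a copy of the event, and the host’s / peer receives one too, so the new total is (2 × previous) + 1. A quick sketch of that recurrence (project_mounts is just a name I picked):

```shell
# project_mounts ITERATIONS
# Applies "new = 2 * old + 1" once per container restart and prints the total.
project_mounts() {
    mounts=0
    i=1
    while [ "$i" -le "$1" ]; do
        mounts=$((2 * mounts + 1))
        i=$((i + 1))
    done
    echo "$mounts"
}

for n in 1 2 3 6 14; do
    echo "after $n iterations: $(project_mounts "$n") mounts"
done
# after 1 iterations: 1 mounts
# after 2 iterations: 3 mounts
# after 3 iterations: 7 mounts
# after 6 iterations: 63 mounts
# after 14 iterations: 16383 mounts
```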

The horse’s mouth

Realistically, I can barely read the source code as it is… I have no idea how these functions are used; how shared mounts are used around the globe; and ultimately, if this kind of behavior for the same dentry in MS_SHARED is intended by its author. As such, I’ve requested some information from the experts, the Linux FS devs:

https://lore.kernel.org/linux-fsdevel/LNXP123MB3691574893C2958E77C8FA40ACE82@LNXP123MB3691.GBRP123.PROD.OUTLOOK.COM/T/#u

My hope is that these chaps can give some insight into this behavior. I’m hoping for some explanation as to why it works the way it does, or if it’s just a simple case of “do stupid things, get stupid results”.

My take

Again, I have zero experience with Kernel development, and I have absolutely no idea how these processes, functions, and behaviors are used around the globe. That’s what the FSDev chaps are for - they have extreme depth in this particular niche of Linux, and frankly, we aren’t even in the same galaxy in terms of technical expertise.

But I’m going to give you my opinion anyway:

I think this is a bug.

It seems as though need_secondary() is using is_subdir() exactly how I would… “Is this a SUB-directory”, and not “is this a subdirectory OR the same directory”. This leads me to think the behaviour here is, whilst expected (i.e. what the code actually does), not intended (i.e. not what the author intended for it to do).

I have, in my fairly long time working in Operations and systems management, never come across a case where a directory would need to mount to… itself? What possible case exists for mounting /var/bob to /var/bob? However, the world is a remarkable place, and I am sure that there is someone out there that is using this “feature” in some absolutely bizarre way to solve problems I can’t even comprehend.

As such, I will impatiently refresh my emails every hour awaiting the reply of the clever chaps over at Linux FS Dev, and update here accordingly.

Replicating this yourself

You can run this small replication to see this in action yourself. If you run this 13, 14, 15 times… keep an eye on your sys CPU%, and watch it BURN:

mkdir -p /tmp/test/dir;

# First generation - 1 mount
mount --bind /tmp/test/dir /tmp/test/dir && \
mount --make-shared /tmp/test/dir && \
cat /proc/mounts | grep 'test/dir';

# Second generation - 3 mounts
mount --bind /tmp/test/dir /tmp/test/dir && \
cat /proc/mounts | grep 'test/dir' 

# Third generation - 7 mounts
mount --bind /tmp/test/dir /tmp/test/dir && \
cat /proc/mounts | grep 'test/dir' 

# … and so on
