@hlieberman, the researchers imply container escape == root access on the target host, that is why they used a setuid binary in order to demonstrate the whole exploit. What this article mentions is that while the container escape (as in the ability to modify a binary in memory that may be shared across multiple containers) is still present gaining root in the underlying host doesn't happen.
@isityettime, the vulnerability happens not because of file contents being modified on disk (think of a base image that is shared across multiple CI builds) rather because a binary in one base image shares the same inode (and thus the same address space in memory as an optimization) as the same binary in another container, meaning container B will execute a poisoned binary and that's where the "container escape" happens.
@amluto, fair points, I still consider this vulnerability extremely severe as yes, if you use rootless containers the attacker won't be able to get root on the host and with that injecting additional potential malware, but at the same time for containers re-using the same image layers it will give the ability to poison those binaries in memory and break to some extent the container to container isolation.
If we weren't looking into moving away from using containers completely into using ephemeral microVMs one area I'd invest in would be replicating what CargoWall does for GitHub actions in GitLab CI. At that point even if the attacker gained access to a container, modified a binary with some specific instructions (like reading env vars and sending them to an external server) it'd not be able to send credentials or fetch a malware remotely at all due to the DNS queries being intercepted by eBPF and being sent to a CoreDNS proxy.
I still think rootless containers increase the attack vector complexity in more than just a one liner into an attack scenario that, at that point, should also involve understanding additional details about the underlying host with information such as, as you correctly pointed out, what container images (and thus shared image layers) are present and also whether these images use setuid binaries which specific CI jobs explicitly call throughout the build process (kind of unusual to see anyone running a setuid binary in a CI pipeline anyway as that is generally an action that would result in a permission denied in normal conditions).
> for containers re-using the same image layers it will give the ability to poison those binaries in memory
Huh? Running as root in a container doesn't let you write to the image you're running as, right? Layers aren't writable from inside a running container; the mechanisms for writing the the image layer cache are separate from the mechanisms for running. You'd have to bind mount the daemon socket into the container to be vulnerable to that, wouldn't you?
1. I would hope the default seccomp policy blocks AF_ALG in these containers. I bet it doesn’t. Oh well.
2. The write-to-RO-page-cache primitive STILL WORKED! It’s just that the particular exploit used had no meaningful effect in the already-root-in-a-container context. If you think you are safe, you’re probably wrong. All you need to make a new exploit is an fd representing something that you aren’t supposed to be able to write. This likely includes CoW things where you are supposed to be able to write after CoW but you aren’t supposed to be able to write to the source.
So:
- Are you using these containers with a common image or even a common layer in an image to isolate dangerous workloads from each other. Oops, they can modify the image layers and corrupt each other. There goes any sort of cross-tenant isolation.
- What if you get an fd backed by the zero page and write to it? This can’t result in anything that the administrator would approve of.
- What if you ro-bind-mount something in? It’s not ro any more.
In fact, the authors specifically say on the very first line of their website that the copy/fail primitive can be used as a container escape. The entire premise of this article is flawed and irresponsible.
tl;dr - within the container, the exploit works, and elevates to root (uid 0) within the container - BUT because that namespace actually maps to uid 1000 (the user) outside the container, the escalation does not flow up to the host.
But… does this escape the container? If not (the author seems to indicate it does not) then does it matter if you are in Docker or rootless Podman, right, since the end result is always: you have elevated to root within the container. If the rest of the container filesystem isolation does its job, the end result is the same? Though I guess another chained exploit to escape the container would be worse in Docker? Do I have that right?
@isityettime, the vulnerability happens not because of file contents being modified on disk (think of a base image that is shared across multiple CI builds) rather because a binary in one base image shares the same inode (and thus the same address space in memory as an optimization) as the same binary in another container, meaning container B will execute a poisoned binary and that's where the "container escape" happens.
Couldn't you then simply re-run the exploit again as unprivileged podman user and gain root on the host?
If we weren't looking into moving away from using containers completely into using ephemeral microVMs one area I'd invest in would be replicating what CargoWall does for GitHub actions in GitLab CI. At that point even if the attacker gained access to a container, modified a binary with some specific instructions (like reading env vars and sending them to an external server) it'd not be able to send credentials or fetch a malware remotely at all due to the DNS queries being intercepted by eBPF and being sent to a CoreDNS proxy.
I still think rootless containers increase the attack vector complexity in more than just a one liner into an attack scenario that, at that point, should also involve understanding additional details about the underlying host with information such as, as you correctly pointed out, what container images (and thus shared image layers) are present and also whether these images use setuid binaries which specific CI jobs explicitly call throughout the build process (kind of unusual to see anyone running a setuid binary in a CI pipeline anyway as that is generally an action that would result in a permission denied in normal conditions).
Huh? Running as root in a container doesn't let you write to the image you're running as, right? Layers aren't writable from inside a running container; the mechanisms for writing the the image layer cache are separate from the mechanisms for running. You'd have to bind mount the daemon socket into the container to be vulnerable to that, wouldn't you?
1. I would hope the default seccomp policy blocks AF_ALG in these containers. I bet it doesn’t. Oh well.
2. The write-to-RO-page-cache primitive STILL WORKED! It’s just that the particular exploit used had no meaningful effect in the already-root-in-a-container context. If you think you are safe, you’re probably wrong. All you need to make a new exploit is an fd representing something that you aren’t supposed to be able to write. This likely includes CoW things where you are supposed to be able to write after CoW but you aren’t supposed to be able to write to the source.
So:
- Are you using these containers with a common image or even a common layer in an image to isolate dangerous workloads from each other. Oops, they can modify the image layers and corrupt each other. There goes any sort of cross-tenant isolation.
- What if you get an fd backed by the zero page and write to it? This can’t result in anything that the administrator would approve of.
- What if you ro-bind-mount something in? It’s not ro any more.
But… does this escape the container? If not (the author seems to indicate it does not) then does it matter if you are in Docker or rootless Podman, right, since the end result is always: you have elevated to root within the container. If the rest of the container filesystem isolation does its job, the end result is the same? Though I guess another chained exploit to escape the container would be worse in Docker? Do I have that right?