RunC is a Docker-created, low-level command-line interface tool that spawns and runs containers based on two Open Container Initiative specifications: the Image Specification and the Runtime Specification.
When an OCI image completes its download, it unpacks into an OCI Runtime filesystem bundle.
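As a rough illustration of what a runtime bundle looks like on disk, the sketch below checks a directory for the two pieces a Linux container bundle is expected to carry: a config.json file and the root filesystem directory that config.json names under root.path. The function name is_runtime_bundle is hypothetical, and the check is a minimal sanity test under those assumptions, not a full validation against the Runtime Specification.

```python
import json
from pathlib import Path


def is_runtime_bundle(bundle_dir: str) -> bool:
    """Return True if bundle_dir holds config.json and the rootfs it names.

    Hypothetical helper: it only checks the basic layout of a Linux
    container bundle and does not validate the rest of config.json.
    """
    bundle = Path(bundle_dir)
    config_path = bundle / "config.json"
    if not config_path.is_file():
        return False
    try:
        config = json.loads(config_path.read_text())
    except (OSError, json.JSONDecodeError):
        return False
    if not isinstance(config, dict):
        return False
    root = config.get("root")
    if not isinstance(root, dict) or not root.get("path"):
        return False
    # root.path may be absolute or relative to the bundle directory;
    # joining with an absolute path yields that absolute path.
    return (bundle / root["path"]).is_dir()
```

A tool such as runC is pointed at a directory with this layout when it creates a container.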
In February 2019, a serious vulnerability (CVE-2019-5736) was found in the core runC container code that could let an attacker gain root access to the cloud host operating system that runs the containers. Let's dive into this runC vulnerability, the existing problems that led to it and some ways to mitigate the issue.
The vulnerability could allow an attacker to overwrite the host runC binary by executing a command as root inside a container. Two types of container are exposed: a new container created from an attacker-controlled image, or an existing container to which the attacker already has write access.
If the attacker successfully completes these malicious write operations on the host, they could:
- Turn off the power on a cloud-based server.
- Kill other containers when called again.
- Prevent movement by other containers from one computing environment to another.
- Behave as a SUID cat. This would let an attacker gain the file owner's permissions and steal sensitive information in the file, such as usernames and passwords. However, the attacker doesn't get permissions from the user who runs the file.
An existing problem
Many workflows that are meant to run trusted workloads rely on a container image downloaded from the web, and they do not check whether the downloaded image is malicious. Because of this, runC can be attacked through a malicious container image from the web and, in turn, can be used to attack other containers. When an attack is complete, the malicious processes exit.
Aleksa Sarai, a runC maintainer and a Linux software engineer, explained in a patch email that Linux Containers (LXC) cannot be attacked in the same way through a malicious container image, because the monitor process never exits during the container lifecycle. The kernel prevents changes to running binaries, so an attacker cannot overwrite the binary while the monitor is alive. The monitor waits for the last container process to exit before it exits itself, so when the container shuts down or is killed, any attacking process is stopped before the attack can be carried out.
Red Hat, along with the U.S. National Security Agency, developed Security-Enhanced Linux (SELinux), which can mitigate the runC vulnerability. SELinux is a kernel security module, and it is enabled by default on Red Hat Enterprise Linux and some other Linux distributions. Furthermore, the proper use of user namespaces, which separate the user IDs and group IDs between containers, can block the runC vulnerability because the host root is not mapped into the container's user namespace.
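To make the user namespace point concrete, here is a minimal sketch that inspects the text of a uid_map file (as found under /proc/PID/uid_map on Linux) and reports whether host user ID 0 is mapped into the namespace. Each line of that file holds three numbers: the first ID inside the namespace, the first ID outside it and the length of the range. The function name maps_host_root is hypothetical, and the sketch assumes the map is read from the host's point of view.

```python
def maps_host_root(uid_map_text: str) -> bool:
    """Return True if any uid_map range covers host UID 0.

    Hypothetical helper: expects lines of the form
    "<inside-start> <outside-start> <length>".
    """
    for line in uid_map_text.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        _inside, outside, length = (int(field) for field in fields)
        # The range covers host IDs outside .. outside + length - 1.
        if outside <= 0 < outside + length:
            return True
    return False


# A container whose root is really host UID 100000 (unprivileged).
unprivileged = "0 100000 65536"
# An identity mapping, as seen without user namespace remapping.
privileged = "0 0 4294967295"
```

With these sample strings, maps_host_root(unprivileged) is False and maps_host_root(privileged) is True, which mirrors the difference between an unprivileged and a privileged container.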
In 2013, Linux containers were updated to support the use of unprivileged containers when user namespaces were merged into the kernel. Since then, privileged containers have been considered unsafe for running untrusted workloads, because an attacker could use a privileged container to write to the host Linux container binary.
Linux Containers developers released a patch to protect the containers. When the patched binary attaches to a container, it creates a temporary copy of itself and runs that copy instead of the original. The steps are:
- Create an anonymous, in-memory file using the memfd_create() system call.
- Copy itself into the temporary in-memory file, which is then sealed to prevent further modifications.
- Execute this sealed, in-memory file instead of the original on-disk binary.
For example, if an attacker attempts to use a privileged container to write to the host Linux container binary, the patched binary would direct any write operations to the temporary in-memory copy and not to the host Linux container binary. An administrator can use LXC_MEMFD_REXEC to avoid a downstream effect for users of the shared library.
However, workloads that place the Linux container binaries on a read-only filesystem, or that are prevented from running privileged containers, do not need this patch. In those cases, set disable-memfd-rexec during the configuration stage, before the Linux container binaries are compiled, to disable re-execution from the in-memory binary.