Securing Kubernetes with Smack — solving the problems of the PoC

Mihail Milev
Jul 5, 2021 · 8 min read


After I published the first post about my PoC, I decided to continue refining the idea and to turn it into a running, secured Kubernetes cluster. Here are the next challenges I tackled and the final solutions. To understand the following discussion, you should definitely go through the previous post first.

A dedicated label for each container, and some code refactoring

Let’s be honest: the first idea was good, but insufficient. All containers used the same well-known label, “container”. Although they (theoretically) couldn’t touch any files on the node, they could still touch each other’s files. There was also one small problem with the way containers work. Remember the image from the last post?

As I described in the first post, when runc is called, it does some final preparations to the rootfs and then (step 4 in the image) calls the “execve” syscall. This syscall essentially replaces the calling process in memory with the program to be executed. This is in contrast to “fork”, which spawns a new process, inside whose memory the called program is then loaded.

In the world of Smack there is a slight problem with the “execve” syscall: if the calling process and the called program have different SMACK64EXEC labels, which one should the final process have? By default, runc runs as floor (“_”). When runc replaces itself with the program which should run inside the container, and we have changed the execute label to “container” by invoking:

chsmack -a container -e container /path/to/executable/in/container

then “execve” will fail with -EPERM. The only case in which “execve” can work is if runc itself also has the label “container”. If this is always the same label, there is no problem, but if we want a dedicated label for each container, things get more complicated.

My simple solution to this problem was to copy runc into the container’s task folder. The task folder is created by containerd and holds the container’s rootfs. It is also deleted when the container dies, so everything we place there gets cleaned up later. For Kubernetes, this folder can by default be found under:

/run/containerd/io.containerd.runtime.v2.task/k8s.io/{container-id}

So in my solution, each time a container is run, runc is copied from its main location (e.g. /usr/bin/runc) into the corresponding container’s task folder, and the copy’s label is then changed to the label of the new container. Now “execve” executes without errors. But when should this copy happen? Let’s have a look at what the process tree looks like during container initialization:
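Roughly sketched (the original post shows this as an image), the tree looks like this:

containerd
└── containerd-shim-runc-v2
    └── runc create
        └── runc init (replaces itself with the container executable via “execve”)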

The runc process at the bottom, runc init, is the one that replaces itself with the container executable. It is also the process which must carry the container’s execute label in order for “execve” to work. Since the init process is a child of the runc create process, it makes sense to copy runc into the task folder from within runc create. The runc create process then deals with relabeling files anyway, so I also decided to move the whole rootfs-relabeling code into it. The solution with the fewest modifications turned out to be changing the “commandTemplate” function in the file “container_linux.go”. The new code looks something like this:
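The snippet itself is embedded from GitHub in the original post. As a stand-in, here is a minimal, self-contained sketch of the helpers involved, as they would be called from “commandTemplate”. All names (smackCopyRunc, smackLoadRule, smackRelabel) are invented for illustration; the real code is in the patch linked below.

package smack

import (
	"fmt"
	"io"
	"os"
	"path/filepath"

	"golang.org/x/sys/unix"
)

// smackCopyRunc copies the runc binary into the container's task folder and
// gives the copy the container's SMACK64 and SMACK64EXEC labels, so that the
// later "execve" into the container executable succeeds.
func smackCopyRunc(runcPath, taskDir, label string) (string, error) {
	dst := filepath.Join(taskDir, "runc")
	src, err := os.Open(runcPath)
	if err != nil {
		return "", err
	}
	defer src.Close()
	out, err := os.OpenFile(dst, os.O_WRONLY|os.O_CREATE|os.O_TRUNC, 0o755)
	if err != nil {
		return "", err
	}
	defer out.Close()
	if _, err := io.Copy(out, src); err != nil {
		return "", err
	}
	// Smack labels are stored as extended attributes in the security namespace.
	for _, attr := range []string{"security.SMACK64", "security.SMACK64EXEC"} {
		if err := unix.Setxattr(dst, attr, []byte(label), 0); err != nil {
			return "", err
		}
	}
	return dst, nil
}

// smackLoadRule writes one access rule ("subject object access") to load2.
func smackLoadRule(subject, object, access string) error {
	f, err := os.OpenFile("/sys/fs/smackfs/load2", os.O_WRONLY, 0)
	if err != nil {
		return err
	}
	defer f.Close()
	_, err = fmt.Fprintf(f, "%s %s %s", subject, object, access)
	return err
}

// smackRelabel recursively sets the container's label on its whole rootfs.
func smackRelabel(rootfs, label string) error {
	return filepath.Walk(rootfs, func(path string, info os.FileInfo, err error) error {
		if err != nil {
			return err
		}
		return unix.Setxattr(path, "security.SMACK64", []byte(label), 0)
	})
}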

The full runc patch file with all modifications can be found on GitHub. The code above shows the following steps:

1) determine if Smack should and can be used;
2) determine the host’s label;
3) determine if we’re preparing the runc init process, which indirectly means we’re in the runc create process;
4) copy runc to the task folder, make it executable and set its label to that of the new container;
5) use the “/sys/fs/smackfs/load2” interface to load the new rules for the container’s label;
6) recursively set the container’s label on all files in its rootfs folder.

Make Smack usage selective

In the above procedure, step 1 determines whether Smack should and can be used. The check whether Smack can be used is easy: if the smackfs filesystem is properly mounted and the “load2” interface exists, Smack can be used:
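A minimal sketch of such a check, in the same vein as the helpers above:

// smackAvailable reports whether Smack can be used on this host: smackfs must
// be mounted (by default under /sys/fs/smackfs) and its load2 interface must exist.
func smackAvailable() bool {
	_, err := os.Stat("/sys/fs/smackfs/load2")
	return err == nil
}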

But what if Smack is available and we still don’t want to use it? For this option, I decided to add two new command line parameters to the runc create command:
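The patch defines these as runc CLI flags; here is a sketch following runc’s urfave/cli conventions, with flag names that are only illustrative (the real ones are in the patch on GitHub):

// Two hypothetical flags added to the "runc create" command.
var smackFlags = []cli.Flag{
	cli.BoolFlag{
		Name:  "enable-smack",
		Usage: "label this container and its rootfs with a dedicated Smack label",
	},
	cli.StringFlag{
		Name:  "smack-host-label",
		Usage: "the host's Smack label, which no container may touch",
	},
}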

The first enables the Smack functionality; the second sets the label of the host, the one which no container should be able to touch. Since the runc create command is started by containerd-shim-runc-v2, a modification to this code was needed as well. The runc create command line is assembled in the vendored file “vendor/github.com/containerd/go-runc/runc.go”, specifically in the function “func (o *CreateOpts) args() (out []string, err error)”:
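A sketch of that modification; getSmackHostLabel is a simplified stand-in for the patch’s quick-and-dirty lookup, and the flag names match the illustrative ones above:

// Sketch of the modified args() in vendor/github.com/containerd/go-runc/runc.go.
func (o *CreateOpts) args() (out []string, err error) {
	// ... the existing argument handling stays unchanged (abbreviated here) ...
	if label := getSmackHostLabel("/etc/containerd/config.toml"); label != "" {
		out = append(out, "--enable-smack", "--smack-host-label", label)
	}
	return out, nil
}

// getSmackHostLabel scans containerd's config file line by line for
// `smack_host_label = "<label>"` and returns the label, or "" if unset.
func getSmackHostLabel(configPath string) string {
	f, err := os.Open(configPath)
	if err != nil {
		return ""
	}
	defer f.Close()
	sc := bufio.NewScanner(f)
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if strings.HasPrefix(line, "smack_host_label") {
			parts := strings.SplitN(line, "=", 2)
			if len(parts) == 2 {
				return strings.Trim(strings.TrimSpace(parts[1]), `"`)
			}
		}
	}
	return ""
}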

As can be seen from the snippet above, I’ve written a small function which checks a configuration parameter in the containerd configuration file. This is a very quick-and-dirty solution; the proper way would be to propagate this configuration from containerd down to the containerd-shim over protobuf gRPC. Since this was not the target of my PoC, I decided to go with the easier, simpler and less error-prone solution.

The full patch to containerd can be found on GitHub. The patch adds a new configuration parameter, smack_host_label, which can be added to containerd’s config.toml file. An example config.toml file can also be found on GitHub.

Networking

Smack can be used to block access not only to local resources, but also to remote ones. As described in the official Smack documentation, Smack relies on CIPSO. Shortly explained, this means that each outgoing IP packet receives a tag containing the label of the sending process. Each incoming IP packet either carries such a tag, or it is assigned the “ambient” label, which can be configured via “/sys/fs/smackfs/ambient”; by default this is the floor label (“_”). Smack allows a packet to reach a process only if the packet’s tag carries a label with “w” (write) access to the destination process’ label.

Effectively this means that processes with different labels, and no “w” rule between those labels, are unable to communicate. For example, a rule like “etcd apiserver w” (with illustrative labels) would be needed before packets from a process labeled “etcd” could reach one labeled “apiserver”. On a Kubernetes cluster this default deny is deadly: the etcd database and the api-server must be able to communicate with each other, and of course many other containers need to talk to each other and to the Internet.

The Smack documentation gives an easy solution (towards the end of the page): assign every IP address the “Internet” label. This happens with the following command:

echo "0.0.0.0/0 @" > /sys/fs/smackfs/netlabel

All packets going to 0.0.0.0/0 (thus going everywhere) are now automatically unlabeled, and every process can send or receive IP packets within this CIDR.

Smack in newer kernels ignores CIPSO and uses SECMARK instead, because “secmark is assumed to reflect policy better”. In order to “unlabel” all packets using SECMARK, execute the following two commands as root:

iptables -t mangle -A INPUT -j SECMARK --selctx _
iptables -t mangle -A OUTPUT -j SECMARK --selctx _

Now every incoming or outgoing packet carries the floor label “_”, and since our containers have “rwxat” on the floor label, they have no communication problems.

Later I will research how this technology can be used to limit communication to packets from the same Kubernetes namespace only.

Paths which should not be locked

While researching the PoC, I tried a very simple script in order to lock the whole host away from every container:

find / -type f -exec chsmack -a host {} +
find / -type d -exec chsmack -a host -t {} +

Suddenly the whole Kubernetes cluster stopped working. The reason was that some host paths are mounted into important Kubernetes pods such as the api-server, the controller, coredns, etc. I found these paths to be:

/etc/kubernetes
/usr/libexec/kubernetes/kubelet-plugins/volume/exec
/usr/share/ca-certificates
/var/lib/containerd/io.containerd.grpc.v1.cri/
/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs
/var/lib/kubelet/pods

Depending on the Kubernetes CNI plugin in use, the following paths (or others) must also remain open to containers:

/etc/cni/net.d
/etc/openvswitch
/opt/cni/bin
/run/openvswitch
/var/log/openvswitch

My “after-boot” script, which can be found on GitHub, now looks like this (in the full script, the CNI paths listed above are excluded in the same way):

find / -type f \
    -not -regex "^/etc/kubernetes.*" \
    -not -regex "^/usr/libexec/kubernetes/kubelet-plugins/volume/exec.*" \
    -not -regex "^/usr/share/ca-certificates.*" \
    -not -regex "^/var/lib/containerd/io.containerd.grpc.v1.cri.*" \
    -not -regex "^/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs.*" \
    -not -regex "^/var/lib/kubelet/pods.*" \
    -exec chsmack -a host {} +

find / -type d \
    -not -regex "^/etc/kubernetes.*" \
    -not -regex "^/usr/libexec/kubernetes/kubelet-plugins/volume/exec.*" \
    -not -regex "^/usr/share/ca-certificates.*" \
    -not -regex "^/var/lib/containerd/io.containerd.grpc.v1.cri.*" \
    -not -regex "^/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs.*" \
    -not -regex "^/var/lib/kubelet/pods.*" \
    -exec chsmack -a host -t {} +

I’ve prepared a oneshot systemd service which runs every time the machine starts and is a prerequisite for the kubelet and containerd services. Its code can also be found on GitHub.
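A minimal sketch of such a unit, with an illustrative script path (the real unit is in the GitHub repository):

[Unit]
Description=Apply Smack host labels after boot
Before=containerd.service kubelet.service

[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=/usr/local/bin/smack-after-boot.sh

[Install]
WantedBy=multi-user.target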

It is important to note that if a malicious person mounts a hostPath pointing at a sub-folder of one of these folders, they will succeed and will be able to modify everything there. My solution to this problem was to move all these “standard” folders to new locations by modifying the /etc/containerd/config.toml file. This makes it hard for malicious persons to “guess” sub-folders and mount them, and they don’t have the rights to traverse to them from the root folder, as they don’t have access to root’s sub-folders.
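For example, containerd’s two well-known top-level directories (under which the CRI, snapshotter and task folders live) can be relocated in config.toml; the values below are illustrative:

# /etc/containerd/config.toml: relocate containerd's well-known directories
# to hard-to-guess locations (illustrative values)
root = "/var/lib/containerd-x7qp"
state = "/run/containerd-x7qp"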

Conclusions

I have been running this “hardened” Kubernetes cluster on an ArchLinux VM for the last week. So far there have been no problems with it, and it has even survived several reboots (some of them intentional, others not). I will continue refining the PoC: proper configuration propagation from the config.toml file to the containerd-shim process, better configuration propagation inside the runc create process, usage of Smack as a network policy enforcer, and so on. I will keep you posted on further developments.
