Using Smack to secure Kubernetes containers and nodes — a proof of concept
TL;DR: if you’re here only for the solution, skip to the section “What is Smack?”
Prologue
Over the last couple of years, I spent a lot of my time managing on-premise cloud infrastructure. The advantage of on-prem is that infrastructure managers have full control of the stack, from the location through the hardware up to the running software. This allows them to enforce security rules on almost every aspect of the stack: availability zones, physical access rules, restricted hardware racks, SELinux, PodSecurityPolicies, and even limits on the UIDs containers may run as.
But nowadays everyone is moving to the public cloud. Old-school infrastructure managers start their transition in an “on-prem” manner: booting up virtual machines, which they can then manage much like their hardware machines. This path has a shallower learning curve, and the approach is entirely understandable. In my opinion it not only eases the adjustment to the new environment, but also serves a psychological purpose: human beings don’t like losing control. Yet suddenly part of the stack is gone; there are no availability zones and no hardware racks anymore. When using VMs, at least the operating system and the rules on top of it stay.
With time, more and more infrastructure managers get comfortable with their new environments and start trying the so-called “managed” services. Suddenly another part of the stack is gone: the VMs. Even parts of the higher layers are no longer manageable. In the case of an EKS or AKS cluster, the whole control plane is gone; in the case of ECS, there is no Kubernetes cluster at all. Gone are SELinux, the PodSecurityPolicies, the UID limitations.
Introduction
When I moved to the public cloud, I was surprised that some cloud providers do not offer any choice of operating system for the worker nodes of their managed Kubernetes clusters. Others do allow such a choice, but the default configuration lacks almost any security measures. So far I have seen only one cloud provider that does a good job of providing a hardened, custom cloud operating system for its managed Kubernetes service: Amazon, with its Bottlerocket OS. The others use only Ubuntu.
I am (theoretically speaking) not against Ubuntu, but in my opinion any operating system that does not natively support SELinux in enforcing mode is not ready to be a true cloud operating system. By “natively” I mean: shipping with a good set of SELinux rules, without administrators having to build their own rule sets.
I developed my own SELinux rules for Debian back in the day, but the difference between my rule set and a “native” one is that I am a single person and have surely missed something, while a “native” rule set is built by a group of people, by the community. Chances are they have covered almost everything. Furthermore, activating SELinux on a Debian derivative almost always causes some system failures, rendering systems unavailable for periods of time; not something infrastructure administrators and their managers brag about. There is therefore some resistance to doing it.
In this article I try to offer a simpler solution, based on the Smack Linux Security Module (LSM).
Needs analysis
The first and most important question is: why do I need a MAC (mandatory access control) system? The second, of course, is: isn’t AppArmor enough?
A MAC system is needed in the cloud world because a single worker instance runs multiple (sometimes thousands of) subsystems. Usually these are known as containers: processes with their own set of requirements (filesystem, libraries, packages, etc.) isolated in Linux namespaces. And usually these containers run as, you guessed it, root. Cluster administrators have four options:
1) deny containers that want to run as root;
2) use UID mappings;
3) limit the root user’s privileges;
4) just let everything run as it is and pray.
A lot of organizations rely on option 4, because these systems are usually run by small teams capable of taking care of most vulnerabilities, so things work well enough. But this model doesn’t scale. As soon as more team members work on the same clusters, and third parties (e.g. contractors) start pushing their own images, the chances of a supply chain attack rise rapidly.
Option 2 is still being worked on by the community. Although containerd already supports UID and GID mappings, Kubernetes is not there yet. Even with this option there are risks, because the mappings are usually the same for all containers. This means that every container running internally as root will have, for example, UID 1000 at node level. If a malicious person escapes a container running as UID 1000, they may not be able to move laterally at node level, but they will be able to move laterally across containers.
Option 1 is the standard used by some big, commercial enterprise solutions, including Red Hat’s OpenShift. Such systems block containers that try to run as root.
Option 3 is the one I would like to talk about here. MAC systems allow administrators to limit any user’s privileges based on rules, on top of the traditional Linux DAC system (-rwxr-xr-x). “On top” means that the Linux kernel first checks whether the action is allowed by the DAC system, and only if it is does the kernel ask the MAC system. If the MAC system also allows the action, the kernel performs it.
Using such a stacked approach, it is easy to limit root’s privileges. A rule can state: “if the path of process X is Y, then don’t allow X to write to file Z”. Or another: “if the label of process X is Y, then don’t allow X to read or write Z”. Even if root is the effective owner of process X, the process will never be able to perform those actions on Z. I’ve already written a longer blog post on how this works with labels in SELinux.
Imagine a system running 200 containers. We could assign each container a unique ID, let’s say a CID (container ID). Now a rule can be created stating: “allow processes belonging to CID X to read, write and/or execute files belonging to CID X”. This effectively limits each container to its own space. It won’t be able to touch anything outside of its CID: not on the node it’s running on, not even if it’s running as root. Care must also be taken to limit the execution of programs that could change such rules.
Isn’t AppArmor sufficient? In my humble opinion: no. AppArmor is a path-based MAC system, while containers run in their own namespaces with a pivoted root folder. This means that if container X, whose root filesystem is mounted at /var/containerd/containers/X/rootfs, executes the process /bin/xyz, the kernel will not see the executable path as /var/containerd/containers/X/rootfs/bin/xyz, but only as /bin/xyz. How should AppArmor then tell whether /bin/xyz is inside a container (and in which container), or on the worker node itself? Furthermore, if the container runs as root, malicious software could move the executable around in order to find a place not covered by AppArmor rules.
SELinux handles this better, but needs system and userspace support for it: it uses object labels. Every process, every user, every file has its own label. SELinux rules then define which label may perform which actions on which other label. If no rule covers a given pair of labels, the action is denied. The hard part about SELinux is that it needs userspace program support, it needs filesystem support, and it is highly complex. If an operating system does not run with a “native” set of SELinux rules, it is up to the administrator to create their own rule set, which is quite hard and error-prone. The answer to these problems is… Smack.
What is Smack?
Smack is a Mandatory Access Control (MAC) Linux Security Module (LSM) developed by Casey Schaufler and integrated into the Linux kernel since version 2.6.25. Its most prominent uses are in the Tizen operating system and in the Automotive Grade Linux project. Smack stands for “Simplified Mandatory Access Control Kernel”, but I like to call it “the little brother of SELinux”, because it works quite similarly: by setting labels on files and processes, administrators can control which process may perform which actions on which objects. It is just much simpler than SELinux, it is integrated into the kernel, and userspace programs do not need to be aware of it. This makes it easy to activate Smack without immediately breaking the whole system. There is a catch though: Smack is a major LSM, and the LSM framework currently supports only one major LSM at runtime, so when Smack is active, AppArmor and SELinux must be deactivated. (The Linux kernel community is working on stacking functionality.) This shouldn’t be a problem for our current target, as this blog post covers the cases where SELinux is not available and AppArmor is not sufficient.
Smack basics
I strongly encourage you to read the whole Smack documentation; it is not that long, not hard to understand, and sufficient for deep knowledge. Nevertheless, here are the basics needed for the rest of this post:
1. when the module is activated, (almost) every object in the system is labeled “_” (called “floor”);
2. objects with the same label can interact with each other without limitations;
3. every object can read or execute objects labeled “_”;
4. files carry labels in three categories: “access”, “execute” and “memory map”;
5. folders have one additional label, called “transmute”, which may be either unset (= false) or TRUE;
6. additional access rules can be defined in sub-files of the folder /etc/smack/accesses.d, and each line of each sub-file is one rule;
7. a rule line takes the form “subject object rights”. “Subject” is the label (up to 255 characters) of the process executing the action. “Object” is the label (up to 255 characters) of the target of the action. “Rights” is a string combining any of the letters “r” (read), “w” (write), “x” (execute), “a” (append), “t” (transmute) and “l” (lock). If a dash (“-”) is given instead of letters, all access is denied;
8. the labels are applied to files as extended attributes. The “execute” attribute determines which label a process started from a given file will run as (see the example below).
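To make points 4 and 8 concrete, here is what the underlying extended attributes look like. This is a minimal sketch with hypothetical paths and labels; it assumes the Smack utilities and the attr package are installed:

# touch /tmp/demo
# chsmack -a secret_file -e secret_exec /tmp/demo
# getfattr -d -m security.SMACK /tmp/demo
# file: tmp/demo
security.SMACK64="secret_file"
security.SMACK64EXEC="secret_exec"

The access label is stored in security.SMACK64 and the execute label in security.SMACK64EXEC; the memory-map and transmute labels live in security.SMACK64MMAP and security.SMACK64TRANSMUTE respectively.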
Activating Smack on the system
In order to check if the kernel has been configured with Smack support, execute the following command as root:
# cat /proc/config.gz | gunzip | grep -i smack
CONFIG_SECURITY_SMACK=y
CONFIG_SECURITY_SMACK_BRINGUP=y
CONFIG_SECURITY_SMACK_NETFILTER=y
CONFIG_SECURITY_SMACK_APPEND_SIGNALS=
Search for the line “CONFIG_SECURITY_SMACK=y”. If it doesn’t exist or says something else, a kernel recompilation is needed. Once the kernel is Smack-capable, Smack can be activated using the “lsm=” kernel boot parameter. Check the current configuration with the following command, executed as root:
# cat /sys/kernel/security/lsm
capability,lockdown,yama,apparmor
If you see SELinux, AppArmor or Tomoyo in the list, you have to remove it. Place “smack” at the end. Also remove the “capability” entry; it will be added back automatically by the kernel. The resulting string has to be prefixed with “lsm=” and added to the end of your kernel boot line. On Ubuntu or Arch this is done by modifying /etc/default/grub and calling update-grub afterwards. Here is what the config line looks like on one of my machines:
# cat /etc/default/grub | grep lsm=
GRUB_CMDLINE_LINUX_DEFAULT="quiet udev.log_priority=3 lsm=lockdown,yama,smack"
Smack also needs a special filesystem mounted under /sys. To configure this, place the following line in your /etc/fstab:
# cat /etc/fstab | grep smack
smackfs /sys/fs/smackfs smackfs defaults 0 0
The next time the system is rebooted, Smack will be active. Although not strictly needed, I strongly recommend downloading and using the Smack utilities. I had to compile them from source for Arch and Ubuntu, but this was very straightforward; autoconf and libtool are the required packages.
Once the system is rebooted and the tools are available, call the following command as root in order to check the Smack status:
# smackctl status
SmackFS is mounted to /sys/fs/smackfs/
To check the current labeling of files, call:
# chsmack /path/to/folder/*
/path/to/folder/test.txt access="_"
You will notice that by default all files report access="_" (floor). To set new labels, use:
# chsmack -a {access-label} -e {execute-label} -m {mmap-label} /path/to/folder/test.txt
To remove labels use:
# chsmack -A -E -M /path/to/folder/test.txt
Now that labels have been set, rules can be created. Create a file called:
/etc/smack/accesses.d/custom
and place the following line there:
secret_exec secret_file rwa
and then call
# smackctl apply
Smack will now allow only processes labeled secret_exec to read, write and append to files labeled secret_file. Processes with this label will also be able to read and execute files with the default “_” (floor) label. In order to grant such processes more privileges, add another line to the same file:
secret_exec _ rwaxtl
Don’t forget to execute
# smackctl apply
again.
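To see the rule in action, label a data file and a binary accordingly. The paths and labels below are hypothetical; setting labels requires root:

# chsmack -a secret_file /srv/data/secret.txt
# chsmack -e secret_exec /usr/local/bin/reader

Thanks to the “execute” attribute, /usr/local/bin/reader now runs with the label secret_exec and may read, write and append to /srv/data/secret.txt, while an ordinary unprivileged process running with the “_” label is denied, no matter what the file’s DAC permission bits say.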
Securing containers
Now that the basics of activating and using Smack have been covered, it is time to present the idea of this PoC. What if we define three Smack labels: “_”, “host” and “container”, and we allow full access between “_” and “host”, full access from “host” and “_” to “container”, and from “container” to “_”, but no access from “container” to “host”? Described as a table (rows are subjects, columns are objects; same-label access is always allowed), it looks like this:

subject \ object    _       host    container
_                   full    full    full
host                full    full    full
container           full    none    full
In terms of rules this looks like this:
# cat /etc/smack/accesses.d/host
host _ rwxatl
_ host rwxatl
host container rwxatl
# cat /etc/smack/accesses.d/containers
container _ rwxatl
_ container rwxatl
What this effectively means is that everything we label “host” is not accessible by anything labeled “container”. Let’s now imagine we have two containers and we give them the labels “container1” and “container2”. The table then looks like this:

subject \ object    _       host    container1    container2
_                   full    full    full          full
host                full    full    full          full
container1          full    none    full          none
container2          full    none    none          full
Not only can neither container touch anything labeled “host”, they also cannot touch each other’s files. The table can be extended indefinitely for as many containers as needed. But this also means there must be a system, a daemon, a process that constantly monitors which containers are running and updates the Smack configuration accordingly.
How do Kubernetes containers work?
Happily for us, there is such a system. Let’s take a 10,000-foot view of how Kubernetes controls containers:
1. when the kubelet wants to start a container, it talks to the Container Runtime Interface (CRI) plugin inside the containerd daemon;
2. containerd prepares the container by extracting the image, performing some basic mounts, setting properties, privileges, etc., and launches the shim, which is a “translator” between containerd and runc;
3. the container is actually run by a program called runc. Runc performs the final mounts and the final configuration of the namespaces;
4. runc then calls the execve syscall, which essentially replaces the runc process with the new process. That new process is the container executable the kubelet wanted to start. This chain is visible in the process tree, as shown below.
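On a containerd-based node the result looks something like this (the output is illustrative and abbreviated; PIDs, paths and the container hash will differ). The shim’s child here is the pod’s “pause” sandbox process:

# ps axf | grep -A1 containerd-shim
 4711 ?   Sl   0:00 /usr/bin/containerd-shim-runc-v2 -namespace k8s.io -id 0123456789deadbeef -address /run/containerd/containerd.sock
 4730 ?   Ss   0:00  \_ /pause

Note that runc itself does not appear in the tree: thanks to the execve call in step 4, it has already been replaced by the container’s process.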
Incorporating Smack into containerd
The containerd daemon knows about all Kubernetes containers running on a node. It keeps track of them, it monitors them, it knows their state. Furthermore, it takes care of “snapshotting”: the process of extracting an image and preparing its files to be mounted as part of a new container. Exactly at this point in time it could set the proper Smack labels, e.g.:
# chsmack -a container0123456789deadbeef -e container0123456789deadbeef -r /path/to/extracted/container/files
where “0123456789deadbeef” is just the generated container hash (which containerd also uses for other things).
Containerd can then create the appropriate accesses.d rule files and apply them, or it can even use the /sys/fs/smackfs/load2 interface for direct rule loading, as sketched below. Administrators can prepare the host systems by setting the “host” label on mission-critical items, e.g. all files in the /lib* folders, everything in /etc, parts of the /usr folder, etc.
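Loading a rule for such a per-container label directly through smackfs could look like this (the label is the hypothetical hash-based one from the chsmack example above, and the granted rights are just an illustration):

# echo "container0123456789deadbeef _ rwxatl" > /sys/fs/smackfs/load2

Rules written to load2 take effect immediately but do not survive a reboot; persistent rules belong in /etc/smack/accesses.d and are loaded with smackctl apply.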
When new pods are started, each of them will run with its own distinct Smack label and will only be able to touch its own files. Even if something escapes the container, it will not be able to touch other containers’ files, nor read or modify critical node files.
The proof of concept
In my proof of concept I went for a slightly different and easier approach: I opted for one common label, “container”, for all containers. And for ease of development I decided to move the Smack label setting code into runc. Here is how the code looks:
standard_init_linux.go:
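The snippet embedded in the original post is not reproduced here, so what follows is only a minimal sketch of the idea, with an assumed helper name: a process changes its own Smack label by writing to /proc/self/attr/current (which requires CAP_MAC_ADMIN), and runc’s init can do so just before the final execve, so the container’s entrypoint starts life with the “container” label.

// A minimal sketch, not the author's exact code. In runc, the init
// logic lives in libcontainer's standard_init_linux.go.
package libcontainer

import (
	"fmt"
	"os"
)

// setSmackLabel is a hypothetical helper that relabels the calling
// process. Smack exposes the current process label via procfs.
func setSmackLabel(label string) error {
	if err := os.WriteFile("/proc/self/attr/current", []byte(label), 0); err != nil {
		return fmt.Errorf("failed to set Smack label %q: %w", label, err)
	}
	return nil
}

// Inside (*linuxStandardInit).Init(), shortly before the exec of the
// container entrypoint, the init process would relabel itself, so the
// process image loaded by execve already runs as "container":
//
//	if err := setSmackLabel("container"); err != nil {
//		return err
//	}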
Now, if we start a pod with the following definition:
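The pod manifest embed is likewise not reproduced here; a minimal definition along the following lines matches the session below (the busybox image and the hostPath mount of the node’s root filesystem at /host are assumptions inferred from the output):

apiVersion: v1
kind: Pod
metadata:
  name: hello
spec:
  containers:
  - name: hello
    image: busybox
    command: ["sleep", "3600"]
    volumeMounts:
    - name: host-root
      mountPath: /host
  volumes:
  - name: host-root
    hostPath:
      path: /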
We can see the effects of Smack:
$ kubectl exec -ti hello -- /bin/sh
/ # id
uid=0(root) gid=0(root) groups=1(bin),2(daemon),3(sys),4(adm),6(disk),10(wheel),11(floppy),20(dialout),26(tape),27(video)
/ # ls /host
ls: /host/etc: Permission denied
bin lost+found run
boot mnt sbin
desktopfs-pkgs.txt opt srv
dev path sys
home proc tmp
lib root usr
lib64 rootfs-pkgs.txt var
/ # cat /host/etc/shadow
cat: can't open '/host/etc/shadow': Permission denied
/ #
I even experimented by executing as root on the host:
# find / -type f -exec chsmack -a host {} +
# find / -type d -exec chsmack -a host -t {} +
This labels every file and directory on the node with the “host” label, which in turn renders all files on the host system unreadable to the containers. Some files must still be left unlabeled, as they are needed by the etcd and Kubernetes API server pods:
- /var/lib/etcd
- /etc/kubernetes
Furthermore, files in the /run/containerd folder should not be relabeled, as the container mounts live there. A variant of the commands above that skips these paths is sketched below.
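Assuming only the exceptions listed above, the labeling commands can be rewritten with find’s -prune so those paths are never touched (a sketch; try it on a disposable node first):

# find / \( -path /var/lib/etcd -o -path /etc/kubernetes -o -path /run/containerd \) -prune -o -type f -exec chsmack -a host {} +
# find / \( -path /var/lib/etcd -o -path /etc/kubernetes -o -path /run/containerd \) -prune -o -type d -exec chsmack -a host -t {} +

In practice you would likely also want to prune virtual filesystems such as /proc and /sys.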
Final thoughts
Smack alone is not sufficient. It is only one layer of the huge onion Kubernetes cluster operators must take care of: seccomp profiles, Smack labels, admission hooks, UID and GID mappings, read-only root filesystems, and so on. But Smack is a step in the right direction.
Follow-up blog post
Hey everyone! I have already published the second post about Kubernetes and Smack, which offers not only insights on how to get close to a production-ready state, but also my code modifications. Go check it out!