Chapter 7: Security Boundaries
A container's security boundary is the assembled effect of several Linux subsystems — capabilities, seccomp, AppArmor or SELinux, user namespaces, noNewPrivileges, masked paths, and the device cgroup BPF program — each tunable independently.
Safety: most examples need root. The seccomp and capabilities demonstrations are harmless on a VM. The MAC examples assume the relevant LSM is already loaded and configured by the distribution. Use a disposable Linux VM. Examples were checked on Ubuntu 24.04 (AppArmor) and a Fedora 40 VM (SELinux).
What The Controls Are For
Three threat classes shape what the controls are for:
1. Container → host escape — a process inside the container gaining privileges or visibility on the host.
2. Container → container interference — one container reading another's data, signaling its processes, or starving its resources.
3. Container → external resource abuse — a container performing actions outside its intended scope.
Capabilities and seccomp limit (1) and (3). MAC (AppArmor, SELinux) covers (1) and (2). User namespaces strengthen (1) at the cost of operational complexity. Cgroups address resource starvation in (2). No single control covers everything; the next sections take each control in turn, then the chapter returns to what --privileged actually loosens.
Linux Capabilities
Capabilities split the historical "root or not" model into ~40 individually grantable units. Setting capabilities is the runtime's job; reading them back is yours.
# What capabilities does the current shell have?
grep ^Cap /proc/self/status
# CapInh: 0000000000000000
# CapPrm: 0000000000000000
# CapEff: 0000000000000000
# CapBnd: 000001ffffffffff
# CapAmb: 0000000000000000
Five hex bitmaps, one per set: Inheritable, Permitted, Effective, Bounding, Ambient. The bounding set is the upper bound on what the process can ever gain; for an unprivileged shell it is "all caps known to this kernel" because nothing has dropped it, and its job is to cap what a setuid-root binary could hand back across execve. For a runc-launched container the bounding set is restricted to the OCI process.capabilities.bounding list.
Decode the bitmaps with capsh:
capsh --decode=000001ffffffffff
# 0x000001ffffffffff=cap_chown,dac_override,...,cap_checkpoint_restore
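If capsh is not installed, the decode is just bit arithmetic. A small bash sketch — the has_cap helper is hypothetical (not part of libcap); the bit numbers come from capabilities(7), where CAP_NET_BIND_SERVICE is bit 10 and CAP_SYS_ADMIN is bit 21:

```shell
# has_cap MASK BIT -- succeeds when capability bit BIT is set in hex MASK.
# Hypothetical helper for illustration; bit numbers per capabilities(7).
has_cap() {
  [ $(( (16#$1 >> $2) & 1 )) -eq 1 ]
}

# Probe the default container bounding mask shown later in this section:
has_cap 00000000a80425fb 10 && echo "net_bind_service: present"
has_cap 00000000a80425fb 21 || echo "sys_admin: absent"
```

The `16#` prefix is bash's base-16 literal syntax inside arithmetic expansion, so the same one-liner works on any mask copied out of /proc/<pid>/status.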
To see the difference inside a container:
docker run --rm alpine:3.20 sh -c 'grep ^Cap /proc/self/status'
# CapInh: 0000000000000000
# CapPrm: 00000000a80425fb
# CapEff: 00000000a80425fb
# CapBnd: 00000000a80425fb
# CapAmb: 0000000000000000
docker run --rm alpine:3.20 sh -c '
apk add -q libcap
capsh --decode=$(grep ^CapBnd /proc/self/status | cut -f2)
'
# 0x00000000a80425fb = cap_chown,dac_override,fowner,fsetid,kill,setgid,
# setuid,setpcap,net_bind_service,net_raw,sys_chroot,mknod,audit_write,setfcap
Fourteen capabilities — the conventional default container set. Notably absent: CAP_SYS_ADMIN, CAP_NET_ADMIN, CAP_SYS_PTRACE, CAP_SYS_TIME, CAP_SYS_MODULE. The container's "root" cannot configure interfaces, load kernel modules, or set the clock.
Compare with --privileged:
docker run --rm --privileged alpine:3.20 sh -c 'grep ^CapBnd /proc/self/status'
# CapBnd: 000001ffffffffff <- everything
--privileged clears the bounding set, drops the seccomp profile, removes the AppArmor / SELinux profile, and gives the device cgroup a wildcard rule. It is the easiest way to make a container "work," and it removes most of what made it a container.
File Capabilities
Capabilities can also live on executables as the security.capability xattr:
sudo apt-get install -y libcap2-bin
getcap -r /usr/bin /usr/sbin 2>/dev/null | head
# /usr/bin/ping cap_net_raw=ep
# /usr/bin/newuidmap cap_setuid+ep
# /usr/bin/newgidmap cap_setgid+ep
When ping is exec'd, its file capabilities become permitted+effective on the new process. This is how a non-root user can ping despite needing CAP_NET_RAW — the binary brings the capability with it. Inside a container with noNewPrivileges set, file capabilities are ignored on execve; setuid bits and security.capability xattrs both stop working as escalation vectors.
noNewPrivileges
A one-bit prctl(2) that, once set, prevents the process from gaining privileges via execve:
sudo apt-get install -y libcap2-bin
# A non-privileged shell. ping works because of file capabilities.
ping -c1 127.0.0.1 > /dev/null && echo "ping ok"
# ping ok
# Set no_new_privs, then run ping. File capabilities are ignored.
setpriv --no-new-privs ping -c1 127.0.0.1
# ping: socktype: SOCK_RAW
setpriv --no-new-privs is the user-space wrapper around prctl(PR_SET_NO_NEW_PRIVS, 1). Once set, the bit cannot be cleared and is inherited across fork and execve. Docker does not enable it by default (--security-opt no-new-privileges turns it on); Kubernetes sets it when a container's securityContext has allowPrivilegeEscalation: false.
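The kernel exports the bit in /proc/<pid>/status, so you can confirm what setpriv did — assuming a Linux host with util-linux installed:

```shell
# The current shell: typically 0 on a plain host shell.
grep NoNewPrivs /proc/self/status

# Re-exec under setpriv and read it again: now guaranteed to be 1,
# because PR_SET_NO_NEW_PRIVS survives execve.
setpriv --no-new-privs grep NoNewPrivs /proc/self/status
# NoNewPrivs:     1
```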
Seccomp
Seccomp filters syscalls. To watch it work, compile a tiny program that installs a filter and then calls something the filter has just blocked.
A C example using libseccomp:
sudo apt-get install -y libseccomp-dev gcc
cat > /tmp/seccomp-demo.c <<'EOF'
#include <errno.h>
#include <seccomp.h>
#include <stdio.h>
#include <sys/utsname.h>
#include <unistd.h>

int main(void) {
    /* Default-allow filter with one deny rule: uname(2) returns EPERM. */
    scmp_filter_ctx ctx = seccomp_init(SCMP_ACT_ALLOW);
    seccomp_rule_add(ctx, SCMP_ACT_ERRNO(EPERM), SCMP_SYS(uname), 0);
    seccomp_load(ctx);
    seccomp_release(ctx);

    /* getpid(2) is not in the filter, so it still works. */
    printf("pid: %d\n", getpid());

    struct utsname u;
    if (uname(&u) < 0)
        perror("uname");
    else
        printf("uname: %s\n", u.sysname);
    return 0;
}
EOF
gcc /tmp/seccomp-demo.c -lseccomp -o /tmp/seccomp-demo
/tmp/seccomp-demo
# pid: <something>
# uname: Operation not permitted
getpid(2) passes through the default-allow action; uname(2) is forced to return EPERM. (gethostname() would be a poor choice for the "allowed" call here: glibc implements it on top of uname(2), so the filter would break it too.) The OCI spec's linux.seccomp field describes the same filter as JSON; runc compiles it to BPF before exec. Docker and containerd ship a default profile that blocks ~50 syscalls including kexec_load, keyctl, add_key, init_module, mount, umount2, swapon, clock_settime, and reboot. The full profile is at containerd/contrib/seccomp/seccomp_default.go (or moby/profiles/seccomp/default.json for the Docker copy).
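The same uname-blocking filter, expressed as the OCI linux.seccomp field, would look roughly like this — a minimal sketch, not the shipped default profile (errnoRet 1 is EPERM on Linux):

```json
{
  "linux": {
    "seccomp": {
      "defaultAction": "SCMP_ACT_ALLOW",
      "syscalls": [
        {
          "names": ["uname"],
          "action": "SCMP_ACT_ERRNO",
          "errnoRet": 1
        }
      ]
    }
  }
}
```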
To inspect the filter on a running container:
docker run --rm -d --name demo alpine:3.20 sleep 600
PID=$(docker inspect -f '{{.State.Pid}}' demo)
grep Seccomp /proc/$PID/status
# Seccomp: 2
# Seccomp_filters: 1
docker stop demo
Seccomp: 2 is SECCOMP_MODE_FILTER. Seccomp_filters: 1 is the number of attached BPF programs.
AppArmor (Ubuntu/Debian)
AppArmor is path-based MAC. Containers run inside a profile that the kernel enforces alongside DAC and capabilities.
# Confirm AppArmor is enabled.
sudo aa-status | head
# apparmor module is loaded.
# 70 profiles are loaded.
# Find the profile a running container is using.
docker run --rm -d --name demo alpine:3.20 sleep 600
PID=$(docker inspect -f '{{.State.Pid}}' demo)
sudo cat /proc/$PID/attr/current
# docker-default (enforce)
docker stop demo
Inside docker-default, writes to /proc/sys, /proc/sysrq-trigger, /sys/kernel, and most of /sys are denied. Mount operations are blocked except for the ones the runtime itself sets up before the profile attaches.
Try writing to a kernel parameter from inside a default-profile container:
docker run --rm alpine:3.20 sh -c 'echo 1 > /proc/sys/kernel/sysrq'
# sh: can't create /proc/sys/kernel/sysrq: Permission denied
Without AppArmor, this would be allowed if the container had CAP_SYS_ADMIN (it doesn't by default), or if the kernel parameter happened to be writable for non-root (it isn't here).
Per-pod AppArmor profiles in Kubernetes use securityContext.appArmorProfile (beta in 1.30, GA in 1.31). The named profile must already be loaded on the node.
SELinux (RHEL/Fedora)
SELinux is label-based MAC. Every process and every file has a security context: user:role:type:level.
# On a Fedora/RHEL host with SELinux in enforcing mode:
getenforce
# Enforcing
# Process contexts.
ps -eZ | head
# system_u:system_r:init_t:s0 1 ? init
# ...
# Container process context.
podman run --rm -d --name demo registry.access.redhat.com/ubi9/ubi-minimal sleep 600
PID=$(podman inspect -f '{{.State.Pid}}' demo)
sudo cat /proc/$PID/attr/current
# system_u:system_r:container_t:s0:c123,c456
podman stop demo
The process type container_t is the policy bucket for normal containers. The :s0:c123,c456 suffix is MCS (Multi-Category Security): each container gets a unique pair of categories, and the policy permits access only when the categories of subject and object match. Two containers running as the same container_t cannot read each other's files because their MCS labels differ.
To see the file labels under a container's rootfs:
sudo ls -lZ /var/lib/containers/storage/overlay/<id>/diff/etc/ | head
# system_u:object_r:container_file_t:s0:c123,c456 ...
container_file_t is the policy bucket for container-managed files. The MCS pair matches the process's, which is what makes the access decision come out "allowed."
When SELinux denies an action that DAC and capabilities would allow, the audit log records it:
sudo ausearch -m AVC -ts recent | tail
# type=AVC msg=audit(...): avc: denied { read } for pid=...
# scontext=...:container_t:s0:c123,c456
# tcontext=...:container_file_t:s0:c789,c012
# tclass=file
An AVC denial line tells you which container (the categories), which kind of object (the type), and which permission the policy was missing (the action).
Distros ship either AppArmor or SELinux, not both. The kernel's LSM framework can stack additional minor LSMs (Yama, Lockdown, BPF-LSM) alongside, but the path-vs-label MAC layer is one or the other.
User Namespaces As A Security Boundary
User namespaces let a process hold root-equivalent capabilities inside the namespace without holding any host-level privilege. The chapter on namespaces showed how to create one; here is what the security implication looks like.
# Inside a user namespace where I am "root":
unshare --user --map-root-user -- bash -c '
# Try to mount the host /proc.
mount -t proc proc /mnt 2>&1
# mount: /mnt: permission denied. (only privileged user can mount)
# But create a new mount namespace - mount in there works:
unshare --mount -- bash -c "mount -t tmpfs tmpfs /mnt && echo mounted"
# mounted
'
The CAP_SYS_ADMIN the inner shell holds applies to resources owned by its user namespace. Mount namespaces created from inside that user namespace are owned by it; the host's mount namespace is not. A compromise that escalates to root inside such a container lands at an unprivileged host UID instead of host root — the threat-model improvement rootless containers exist for.
Kubernetes has spec.hostUsers: false (beta in 1.30, on track for GA) to give every pod its own user namespace. Inside, the pod's processes appear to run as their requested UIDs; outside, those UIDs map to a high-numbered host UID range (e.g. 100000-165535). A compromise that escalates to "root" inside the pod gets host UID 100000, which on the host is unprivileged.
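The mapping itself is visible in /proc/<pid>/uid_map, one line per mapped range — a quick check that needs no container runtime:

```shell
# Each line: <uid inside the namespace> <uid outside> <length of range>.
# In the initial user namespace this is the identity map over the full
# 32-bit range; in a rootless container or a hostUsers: false pod the
# second column would instead be a subordinate range such as 100000.
cat /proc/self/uid_map
```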
Caveats:
- The rootfs has to be chown-ed to match the namespace's mapping, or idmap-mounted so the kernel performs the translation at access time.
- File capabilities, ACLs, and setuid binaries on shared filesystems still resolve against the namespace-mapped UID, which is usually fine but occasionally surprising.
Masked And Read-Only Paths
/proc and /sys aggregate host-wide information that the PID and mount namespaces do not isolate. Two OCI fields plug the leaks:
- linux.maskedPaths — runc implements this by bind-mounting /dev/null over files and an empty tmpfs over directories. The OCI spec requires the path be inaccessible, not the specific mechanism.
- linux.readonlyPaths — remount the path read-only.
Default masked set in most runtimes:
/proc/asound
/proc/acpi
/proc/kcore
/proc/keys
/proc/latency_stats
/proc/timer_list
/proc/timer_stats
/proc/sched_debug
/proc/scsi
/sys/firmware
/sys/devices/virtual/powercap
Default read-only:
/proc/bus
/proc/fs
/proc/irq
/proc/sys
/proc/sysrq-trigger
To verify:
docker run --rm alpine:3.20 sh -c 'cat /proc/kcore; echo "exit code: $?"'
# exit code: 0   (no output -- /dev/null is bind-mounted over it)
docker run --rm alpine:3.20 sh -c 'echo 1 > /proc/sysrq-trigger' 2>&1
# sh: can't create /proc/sysrq-trigger: Read-only file system
/proc/kcore is masked because it would otherwise let a process with CAP_SYS_RAWIO read kernel memory. /proc/sysrq-trigger is read-only because writes to it can crash, reboot, or sync the host.
Most of these entries trace back to specific disclosures — /proc/sched_debug, for example, was masked after it was shown to leak kernel pointers. Any new /proc or /sys interface that exposes host state is a candidate.
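In an OCI bundle the two lists are plain string arrays under linux; a trimmed config.json sketch using a few of the defaults above:

```json
{
  "linux": {
    "maskedPaths": [
      "/proc/kcore",
      "/proc/keys",
      "/sys/firmware"
    ],
    "readonlyPaths": [
      "/proc/sys",
      "/proc/sysrq-trigger"
    ]
  }
}
```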
Device Access (cgroup v2 + eBPF)
cgroup v2 has no devices.allow file. Device policy is enforced by an eBPF program of type BPF_PROG_TYPE_CGROUP_DEVICE attached to the cgroup. runc compiles the OCI linux.resources.devices list into BPF and attaches it.
docker run --rm -d --name demo alpine:3.20 sleep 600
CGROUP=$(docker inspect -f '{{.HostConfig.CgroupParent}}/docker-{{.Id}}.scope' demo)
sudo bpftool cgroup tree /sys/fs/cgroup$CGROUP 2>/dev/null || \
sudo bpftool cgroup tree | grep -A1 "$(docker inspect -f '{{.Id}}' demo | head -c12)"
# /sys/fs/cgroup/.../docker-<id>.scope
# ID AttachType AttachFlags Name
# X cgroup_device sd_devices
docker stop demo
Try to access a device that is not in the allow list:
docker run --rm alpine:3.20 sh -c '
cat /dev/null > /dev/null # allowed
echo "null ok"
cat /dev/sda 2>&1 | head -1 # not allowed
'
# null ok
# cat: /dev/sda: Operation not permitted
The kernel returns EPERM because the BPF program denies the open. Privileged containers attach a BPF program that allows everything, the equivalent of an `a *:* rwm` rule in cgroup v1 syntax.
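What runc compiles is the linux.resources.devices array from the bundle: a default deny followed by specific allows. A trimmed sketch — the char device major:minor 1:3 is /dev/null:

```json
{
  "linux": {
    "resources": {
      "devices": [
        { "allow": false, "access": "rwm" },
        { "allow": true, "type": "c", "major": 1, "minor": 3, "access": "rwm" }
      ]
    }
  }
}
```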
Putting It Together
A non-privileged container's actual boundary, in the order it is built:
- Namespaces create separate views.
- Mount setup with masked and read-only paths closes /proc and /sys leaks.
- Capability bounding set strips kitchen-sink privileges.
- noNewPrivileges prevents privilege gain across exec.
- Seccomp filters dangerous syscalls.
- AppArmor or SELinux denies operations that DAC and capabilities would allow.
- Cgroup device BPF restricts which devices work.
- Cgroups bound resource consumption.
- User namespace mapping (when enabled) makes container root unprivileged on the host.
Each layer is independently configurable. --privileged clears most of these layers at once, which is why it is rarely the right answer when a container does not work.
Where This Goes
Part 3 picks up the OCI runtime side: how an OCI bundle is laid out, what config.json looks like in detail, and how runc translates the spec into the kernel state we have just spent four chapters cataloguing.
Sources And Further Reading
- capabilities(7): https://man7.org/linux/man-pages/man7/capabilities.7.html
- seccomp(2): https://man7.org/linux/man-pages/man2/seccomp.2.html
- Linux seccomp_filter docs: https://www.kernel.org/doc/html/latest/userspace-api/seccomp_filter.html
- AppArmor: https://gitlab.com/apparmor/apparmor/-/wikis/home
- SELinux project: https://github.com/SELinuxProject
- container-selinux: https://github.com/containers/container-selinux
- prctl(2) PR_SET_NO_NEW_PRIVS: https://man7.org/linux/man-pages/man2/prctl.2.html
- OCI runtime spec, capabilities: https://github.com/opencontainers/runtime-spec/blob/main/config.md#capabilities
- OCI runtime spec, seccomp: https://github.com/opencontainers/runtime-spec/blob/main/config-linux.md#seccomp
- Default Docker seccomp profile: https://github.com/moby/moby/blob/master/profiles/seccomp/default.json
- containerd seccomp default: https://github.com/containerd/containerd/blob/main/contrib/seccomp/seccomp_default.go
- BPF cgroup device program: https://docs.kernel.org/bpf/prog_cgroup_device.html