Chapter 5: Cgroups v2
Namespaces decide what a process sees. Cgroups decide what it can use.
Safety: writing to /sys/fs/cgroup requires root. The example that deliberately triggers an OOM kill is contained to its cgroup, but it still produces a real OOM event in the kernel log; use a disposable Linux VM. Examples were checked on Ubuntu 24.04 with kernel 6.8 and systemd 255 in cgroup v2 unified mode.
Confirm v2 Is Active
Ubuntu 22.04+, Fedora 31+, Debian 11+, and RHEL 9 default to v2. The v2 unified hierarchy is mounted as cgroup2 at /sys/fs/cgroup:
mount | grep cgroup2
# cgroup2 on /sys/fs/cgroup type cgroup2 (rw,nosuid,nodev,noexec,relatime,nsdelegate,memory_recursiveprot)
If you see additional cgroup mounts under /sys/fs/cgroup/<controller>/, the system is in legacy v1 or hybrid mode and the rest of this chapter does not apply directly. To force unified mode, set systemd.unified_cgroup_hierarchy=1 on the kernel command line and reboot.
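A quicker check is the filesystem type of the mount point: cgroup2fs means the unified v2 hierarchy, while tmpfs at this path indicates a legacy or hybrid layout.
stat -fc %T /sys/fs/cgroup/
# cgroup2fs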
Walk The Tree
Every directory under /sys/fs/cgroup is a cgroup. The root is /sys/fs/cgroup itself.
ls /sys/fs/cgroup/ | head
# cgroup.controllers
# cgroup.max.depth
# cgroup.max.descendants
# cgroup.procs
# cgroup.subtree_control
# cgroup.threads
# cpu.pressure
# cpu.stat
# init.scope
# io.pressure
# ...
cat /sys/fs/cgroup/cgroup.controllers
# cpuset cpu io memory hugetlb pids rdma misc
cgroup.controllers lists what is available to this cgroup; cgroup.subtree_control lists what is enabled for children:
cat /sys/fs/cgroup/cgroup.subtree_control
# cpuset cpu io memory pids
Walking the tree as systemd has shaped it:
systemd-cgls --no-pager | head -30
The standard layout: system.slice/ for system services, user.slice/user-<uid>.slice/ for user sessions, and (when present) kubepods.slice/ or machine.slice/ for orchestrated workloads.
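To see where a given process sits in that layout, read its /proc entry; on v2 it is a single line of the form 0::<path>, relative to the cgroup2 mount point (the session scope shown here is just an example):
cat /proc/self/cgroup
# 0::/user.slice/user-1000.slice/session-3.scope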
Make A Cgroup By Hand
Cgroups are directories. Creating one is mkdir:
sudo mkdir /sys/fs/cgroup/demo
ls /sys/fs/cgroup/demo/
# cgroup.controllers cgroup.events cgroup.procs cgroup.stat
# cgroup.subtree_control cgroup.threads cgroup.type
# cpu.stat cpu.weight io.pressure memory.pressure ...
The controller-specific files visible here come from whichever controllers the parent has enabled in its subtree_control. To enable more, write the diff to the parent's subtree_control (write + to add, - to remove):
echo +memory +pids > /sys/fs/cgroup/cgroup.subtree_control 2>/dev/null \
|| sudo sh -c 'echo "+memory +pids" > /sys/fs/cgroup/cgroup.subtree_control'
A controller is enabled for a cgroup's children, not for the cgroup itself; the kernel docs call this the top-down constraint.
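To confirm what a child actually received, read its own cgroup.controllers; it mirrors whatever the parent has enabled in subtree_control:
cat /sys/fs/cgroup/demo/cgroup.controllers
# cpuset cpu io memory pids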
Move A Process In
Put a process into a cgroup by writing its PID to cgroup.procs:
sudo sh -c 'sleep 600 & echo $! > /sys/fs/cgroup/demo/cgroup.procs; wait'
# In another terminal:
cat /sys/fs/cgroup/demo/cgroup.procs
# 12345
A process can only be in one cgroup at a time. Writing a PID to a different cgroup's cgroup.procs moves it out of its previous cgroup atomically.
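The membership is also visible from the process's side, again as a single 0:: line relative to the cgroup2 mount:
cat /proc/12345/cgroup
# 0::/demo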
cgroup.threads works the same way but for individual threads, and only on cgroups whose cgroup.type is threaded.
Set A Memory Limit, Watch It OOM
Set a small memory limit, run a process that allocates more than that, and watch the kernel kill it within the cgroup:
sudo mkdir -p /sys/fs/cgroup/oom-demo
echo "+memory" | sudo tee /sys/fs/cgroup/cgroup.subtree_control > /dev/null
echo 50M | sudo tee /sys/fs/cgroup/oom-demo/memory.max > /dev/null
# Move the current shell into the cgroup, then exec a Python that allocates.
sudo sh -c 'echo $$ > /sys/fs/cgroup/oom-demo/cgroup.procs; \
exec python3 -c "x=bytearray(200*1024*1024); print(\"alive\")"'
# Killed
echo $?
# 137 (128 + SIGKILL)
Confirm the OOM was scoped to the cgroup and did not affect anything else:
cat /sys/fs/cgroup/oom-demo/memory.events
# low 0
# high 0
# max 1
# oom 1
# oom_kill 1
memory.max is a hard limit. The kernel reclaims when the cgroup approaches it; if reclaim cannot free enough memory, the OOM killer fires inside the cgroup and picks a victim from the cgroup's processes.
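The kernel log records the same thing, and the message makes the scope explicit: a cgroup-local kill is reported as a memory-cgroup OOM, not a global one.
sudo dmesg | grep -i 'memory cgroup out of memory' | tail -1
# Memory cgroup out of memory: Killed process 12345 (python3) ...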
For softer pushback before OOM, set memory.high:
echo 30M | sudo tee /sys/fs/cgroup/oom-demo/memory.high > /dev/null
The kernel will throttle allocations and force reclaim on this cgroup when it exceeds 30M, but will not kill anything. Combined with a higher memory.max, this gives the workload time to react.
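To watch the pushback rather than just read about it, rerun the allocator inside oom-demo and sample usage and pressure from a second terminal; a rough sketch, with numbers that will vary:
# In another terminal, once a second:
while sleep 1; do
  cat /sys/fs/cgroup/oom-demo/memory.current
  grep some /sys/fs/cgroup/oom-demo/memory.pressure
done
# memory.current climbs past 30M only slowly while the "some" avg10 rises above zero.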
CPU Caps And Weights
Two CPU files matter most:
- cpu.weight: proportional share when CPUs are oversubscribed. Default 100, range 1–10000.
- cpu.max: <quota> <period> in microseconds. Default max 100000, meaning unlimited.
Set a half-CPU cap:
sudo mkdir -p /sys/fs/cgroup/cpu-demo
echo "+cpu" | sudo tee /sys/fs/cgroup/cgroup.subtree_control > /dev/null
echo "50000 100000" | sudo tee /sys/fs/cgroup/cpu-demo/cpu.max > /dev/null
# Run a busy loop in the cgroup.
sudo sh -c 'echo $$ > /sys/fs/cgroup/cpu-demo/cgroup.procs; \
exec timeout 5 sh -c "while :; do :; done"'
# After it ends, look at cpu.stat:
cat /sys/fs/cgroup/cpu-demo/cpu.stat
# usage_usec 2500000
# user_usec 2500000
# system_usec 0
# nr_periods ~50
# nr_throttled ~50
# throttled_usec ~2500000
nr_throttled and throttled_usec show the cgroup hit its quota every period and was paused for the remainder. Kubernetes' CPU throttling alerts read these counters.
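A derived number worth tracking is the fraction of periods that hit the quota; a quick sketch over cpu.stat:
awk '/^nr_periods/ {p=$2} /^nr_throttled/ {t=$2} END {printf "throttled in %.0f%% of periods\n", 100*t/p}' \
    /sys/fs/cgroup/cpu-demo/cpu.stat
# throttled in 100% of periods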
Limit Process Count
Forks happen all the time. A bug or attack can exhaust the host's pid space; pids.max defends against that:
sudo mkdir -p /sys/fs/cgroup/pid-demo
echo "+pids" | sudo tee /sys/fs/cgroup/cgroup.subtree_control > /dev/null
echo 5 | sudo tee /sys/fs/cgroup/pid-demo/pids.max > /dev/null
sudo sh -c 'echo $$ > /sys/fs/cgroup/pid-demo/cgroup.procs; \
for i in 1 2 3 4 5 6 7 8; do \
sleep 30 & echo "started $!"; \
done'
# After the limit:
# sh: fork: retry: Resource temporarily unavailable
pids.events records the rejections:
cat /sys/fs/cgroup/pid-demo/pids.events
# max 4 <- the count of fork rejections
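Alongside the limit, pids.current gives the live count and, on recent kernels, pids.peak records the high-water mark:
cat /sys/fs/cgroup/pid-demo/pids.current
# 4    <- the sleeps still running after the shell exited
cat /sys/fs/cgroup/pid-demo/pids.peak
# 5    <- the limit was reached before the rejections started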
Read Pressure
PSI (Pressure Stall Information) is exposed at cpu.pressure, memory.pressure, and io.pressure in every cgroup. The values report the percentage of time some / all processes in the cgroup were stalled on the resource. PSI reflects contention rather than nominal load — a CPU at 80% utilization with no waiting tasks shows zero pressure, while a CPU at 40% with queued tasks shows non-zero:
cat /sys/fs/cgroup/oom-demo/memory.pressure
# some avg10=0.00 avg60=0.00 avg300=0.00 total=0
# full avg10=0.00 avg60=0.00 avg300=0.00 total=0
some is "at least one task stalled"; full is "all tasks stalled."
"No Internal Processes" In Practice
A non-root cgroup cannot both contain processes and enable controllers in its subtree_control for its children. To see the rule fire:
sudo mkdir -p /sys/fs/cgroup/parent/child
sudo sh -c 'echo $$ > /sys/fs/cgroup/parent/cgroup.procs'
echo "+memory" | sudo tee /sys/fs/cgroup/parent/cgroup.subtree_control
# tee: /sys/fs/cgroup/parent/cgroup.subtree_control: Device or resource busy
Move the process out first, then the write succeeds:
sudo sh -c 'echo $$ > /sys/fs/cgroup/parent/child/cgroup.procs'
echo "+memory" | sudo tee /sys/fs/cgroup/parent/cgroup.subtree_control
# +memory
Containers always sit in leaf cgroups: the runtime creates a hierarchy of intermediate cgroups (slice → kubepods → pod → container), each of which has children but no processes, and puts the container's processes only in the leaf.
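With a container runtime on the box this is easy to see; for example with Docker (assuming it is installed; the exact path depends on the cgroup driver, systemd being the default on v2 hosts):
docker run -d --rm --name cg-demo nginx
cat /proc/$(docker inspect -f '{{.State.Pid}}' cg-demo)/cgroup
# 0::/system.slice/docker-<full-container-id>.scope
docker stop cg-demo
The .scope is the leaf; system.slice and the root above it hold no container processes of their own.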
systemd Owns The Tree
On systemd-managed hosts (most production Linux), systemd owns the cgroup tree. systemd's delegation rules describe a single-writer model per cgroup directory: each cgroup has exactly one manager, and if two processes both try to manage the same cgroup, one will lose. systemd's answer is delegation: it creates a parent with Delegate=yes, marks it as owned by another manager, and stops touching what is underneath.
systemd-run is the convenient way to launch a transient unit and observe its cgroup placement:
sudo systemd-run --slice=demo.slice --unit=oneshot.service sleep 60
systemctl status oneshot.service
# Look for "CGroup: /demo.slice/oneshot.service"
cat /sys/fs/cgroup/demo.slice/oneshot.service/cgroup.procs
# <pid of sleep>
When containerd is configured with the systemd cgroup driver, it does not write cgroup files directly. It asks systemd over D-Bus to create a transient scope for each container shim, and lets systemd populate the resource files. The OCI linux.cgroupsPath in this mode is a systemd path:
kubepods-besteffort.slice:cri-containerd:<container-id>
read as <slice>:<prefix>:<id>. The runtime creates a transient scope cri-containerd-<id>.scope under kubepods-besteffort.slice/.
The "cgroup driver" config in containerd, kubelet, and runc must agree. A common production failure: kubelet uses systemd, containerd uses cgroupfs, and pods fail to start because each side is trying to create cgroups the other does not see.
Delegation Enables Rootless
cgroup v2's delegation model is what gives a rootless container resource limits. Without v2 delegation, only the root user could write to cgroup files, and rootless containers would have no enforced limits. systemd creates a delegated subtree under user.slice/user-<uid>.slice/user@<uid>.service/ and grants the user ownership:
ls -ld /sys/fs/cgroup/user.slice/user-1000.slice/user@1000.service
# drwxr-xr-x 2 1000 1000 ... user@1000.service
Inside the delegated subtree, the user can mkdir, write +cpu +pids +memory to subtree_control (subject to what systemd enabled at the boundary), and place processes — without root.
# As an unprivileged user:
mkdir /sys/fs/cgroup/user.slice/user-$(id -u).slice/user@$(id -u).service/me-demo
echo $$ > /sys/fs/cgroup/user.slice/user-$(id -u).slice/user@$(id -u).service/me-demo/cgroup.procs
Rootless podman, buildah, and rootless containerd put their containers under this kind of subtree.
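The ceiling on what the user can enable is whatever systemd delegated into user@<uid>.service; read it from the delegated root (the exact set depends on the systemd version, with memory and pids at minimum and cpu, cpuset, and io added in newer releases):
cat /sys/fs/cgroup/user.slice/user-$(id -u).slice/user@$(id -u).service/cgroup.controllers
# memory pids ...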
Device Access Is BPF, Not Files
cgroup v2 does not have devices.allow or devices.deny files. Device access policy is enforced by an eBPF program of type BPF_PROG_TYPE_CGROUP_DEVICE attached to the cgroup. runc compiles the OCI linux.resources.devices list into a BPF program at container start.
To see the attached program on a running container's cgroup:
sudo bpftool cgroup tree | head
# CgroupPath
# ID AttachType AttachFlags Name
# /sys/fs/cgroup/system.slice/docker-<id>.scope
# 12 cgroup_device <prog-name>
v1 exposed device policy as devices.allow and devices.deny files; v2 exposes nothing in the cgroup directory. Tooling that used to walk the v1 files has to inspect the attached BPF program with bpftool cgroup show instead.
OCI Resource Mapping
The relevant linux.resources fields in config.json and where they land:
| OCI field | v2 file |
|---|---|
| memory.limit | memory.max |
| memory.reservation | memory.low |
| memory.swap | memory.swap.max |
| cpu.shares | cpu.weight (rescaled from 1024 → 100 default) |
| cpu.quota / cpu.period | cpu.max |
| cpu.cpus / cpu.mems | cpuset.cpus / cpuset.mems |
| pids.limit | pids.max |
| blockIO.weight, throttleReadBpsDevice, etc. | io.weight, io.max |
| devices | BPF program attached via BPF_PROG_TYPE_CGROUP_DEVICE |
The remap from v1 names to v2 files is the runtime's job, not the user's. The OCI spec keeps the v1-style names for compatibility; runc, crun, and youki translate.
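A fragment of linux.resources that exercises the first few rows, with illustrative values: on a v2 host, runc writes 52428800 to memory.max, "50000 100000" to cpu.max, and 5 to pids.max.
{
  "linux": {
    "resources": {
      "memory": { "limit": 52428800 },
      "cpu": { "quota": 50000, "period": 100000 },
      "pids": { "limit": 5 }
    }
  }
}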
Where This Goes
The next chapter covers the rootfs: content addressing, snapshotters, and how runc gets a process to see a custom /. Cgroups reappear in chapter 7 when the device cgroup BPF program comes up under the security boundary.
Sources And Further Reading
- Linux cgroup v2 admin docs: https://kernel.org/doc/html/next/admin-guide/cgroup-v2.html
- cgroups(7): https://man7.org/linux/man-pages/man7/cgroups.7.html
- systemd cgroup delegation: https://systemd.io/CGROUP_DELEGATION/
- PSI docs: https://docs.kernel.org/accounting/psi.html
- BPF cgroup device program: https://docs.kernel.org/bpf/prog_cgroup_device.html
- OCI runtime spec, Linux resources: https://github.com/opencontainers/runtime-spec/blob/main/config-linux.md#control-groups