Chapter 6: Container Filesystems

The container's filesystem is a three-stage pipeline: image content arrives as digest-addressed blobs, a snapshotter turns those blobs into a stack of directories, and runc swaps that stack in as / inside a fresh mount namespace.

Safety: the OverlayFS and pivot_root examples mutate kernel mount state and need root. The image-inspection examples are safe and read-only. Use a disposable Linux VM for the privileged sections. Examples were checked on Ubuntu 24.04 with kernel 6.8 and containerd 1.7.

What An OCI Image Looks Like On Disk

An OCI image is a content-addressed bundle: one manifest, one image config, and a list of layers, all addressed by SHA-256 digest. The easiest way to see the structure is to pull an image into an OCI layout directory:

sudo apt-get install -y skopeo jq
skopeo copy docker://alpine:3.20 oci:./alpine-oci:3.20
ls alpine-oci/
# blobs/  index.json  oci-layout
ls alpine-oci/blobs/sha256/ | head

oci-layout is a marker file. index.json is the entry point — for a single-arch image, it points at one manifest; for a multi-arch image, it points at an image index that selects per {architecture, os}.

Walk the chain:

# index.json -> manifest digest
MANIFEST_DIGEST=$(jq -r '.manifests[0].digest' alpine-oci/index.json | sed 's/sha256://')
jq . alpine-oci/blobs/sha256/$MANIFEST_DIGEST

# manifest -> config digest and layer digests
jq '{config: .config.digest, layers: [.layers[].digest]}' \
  alpine-oci/blobs/sha256/$MANIFEST_DIGEST

# config -> rootfs.diff_ids
CONFIG_DIGEST=$(jq -r '.config.digest' alpine-oci/blobs/sha256/$MANIFEST_DIGEST | sed 's/sha256://')
jq '.rootfs' alpine-oci/blobs/sha256/$CONFIG_DIGEST
# {
#   "type": "layers",
#   "diff_ids": ["sha256:..."]
# }

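Three digest types appear in image storage. Confusing them is the source of most layer-mismatch and snapshot-deduplication bugs:

  - Layer digest: the SHA-256 of the layer blob as stored and transferred, usually a gzip-compressed tar. This is what the manifest's layers[].digest records.
  - DiffID: the SHA-256 of the uncompressed layer tar. This is what the config's rootfs.diff_ids records.
  - ChainID: a digest derived from a layer's DiffID and the ChainID of the layers beneath it, so it identifies the result of applying the whole stack up to that layer. containerd uses ChainIDs as the keys of committed snapshots.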

To verify a layer's DiffID by hand:

LAYER_DIGEST=$(jq -r '.layers[0].digest' \
  alpine-oci/blobs/sha256/$MANIFEST_DIGEST | sed 's/sha256://')
zcat alpine-oci/blobs/sha256/$LAYER_DIGEST | sha256sum
# This should match the diff_ids[0] entry in the image config.

If a compressed layer blob does not decompress to a tar whose SHA-256 matches the corresponding DiffID in the config, containerd rejects the layer instead of committing a snapshot for it.
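
The difference between the two digests can be reproduced without pulling anything. The following is a minimal sketch that builds a throwaway layer in a temporary directory; the file names and the work variable are made up for illustration, and only tar, gzip, and sha256sum are assumed to be installed.

```shell
# Build a throwaway "layer": one file, tarred, then gzip-compressed.
work=$(mktemp -d)
mkdir -p "$work/layer/etc"
echo "hello" > "$work/layer/etc/motd"
tar -C "$work/layer" -cf "$work/layer.tar" .
gzip -c "$work/layer.tar" > "$work/layer.tar.gz"

# Digest of the compressed blob: the kind of value layers[].digest records.
blob_digest="sha256:$(sha256sum "$work/layer.tar.gz" | cut -d' ' -f1)"

# DiffID: digest of the uncompressed tar stream, as in rootfs.diff_ids.
diff_id="sha256:$(zcat "$work/layer.tar.gz" | sha256sum | cut -d' ' -f1)"

echo "blob digest: $blob_digest"
echo "diff_id:     $diff_id"
```

The two values differ even though they describe the same layer, and recompressing the tar with different gzip settings would change the blob digest while leaving the DiffID untouched.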

Layer Tar Conventions

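A layer is a tar archive with two encoded modifications:

  - Whiteout file: a zero-byte entry named .wh.<name> records that <name> in a lower layer has been deleted.
  - Opaque whiteout: an entry named .wh..wh..opq inside a directory records that everything the lower layers put in that directory is hidden.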

To see them in a real image, build one that deletes a file:

mkdir build && cat > build/Dockerfile <<'EOF'
FROM alpine:3.20
RUN rm /etc/motd
EOF
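# --provenance=false keeps index.json pointing straight at the image manifest
# rather than at an index that also carries an attestation manifest.
docker buildx build --provenance=false --output type=oci,dest=motd.tar build/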
mkdir motd-oci && tar -C motd-oci -xf motd.tar

# Find the top layer of the new image.
MANIFEST=$(jq -r '.manifests[0].digest' motd-oci/index.json | sed 's/sha256://')
TOP_LAYER=$(jq -r '.layers[-1].digest' motd-oci/blobs/sha256/$MANIFEST | sed 's/sha256://')

# List its contents.
zcat motd-oci/blobs/sha256/$TOP_LAYER | tar -tvf - | grep -E '\.wh\.|motd'
# -rw-r--r-- 0/0 0 ... etc/.wh.motd

The zero-byte etc/.wh.motd is how RUN rm /etc/motd is encoded into a layer the snapshotter can apply.
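
To make the encoding concrete, here is a rough sketch of applying a layer that contains a whiteout onto a plain directory. The apply_layer function is hypothetical and deliberately simplified: it ignores opaque markers, ownership, and xattrs, all of which a real unpacker has to handle.

```shell
work=$(mktemp -d)
mkdir -p "$work/rootfs/etc" "$work/layer/etc"
echo "welcome" > "$work/rootfs/etc/motd"
echo "alpine"  > "$work/rootfs/etc/hostname"

# The layer contains only the whiteout marker for etc/motd.
: > "$work/layer/etc/.wh.motd"
tar -C "$work/layer" -cf "$work/layer.tar" .

# apply_layer ROOTFS LAYER_TAR (hypothetical helper): extract the layer,
# then turn every .wh.<name> marker into a deletion of <name>.
apply_layer() {
  local rootfs=$1 layer=$2 marker dir name
  tar -C "$rootfs" -xf "$layer"
  find "$rootfs" -name '.wh.*' ! -name '.wh..wh..opq' |
    while IFS= read -r marker; do
      dir=$(dirname "$marker")
      name=$(basename "$marker")
      rm -rf "$dir/${name#.wh.}" "$marker"
    done
}

apply_layer "$work/rootfs" "$work/layer.tar"
ls "$work/rootfs/etc"
```

After the call, etc/hostname is still present, while both etc/motd and the marker itself are gone.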

containerd Content Store

When containerd pulls an image, it stores every blob in its content store, addressed by digest. The default location:

/var/lib/containerd/io.containerd.content.v1.content/
  blobs/sha256/<digest>
  ingest/<id>/

To see the content of a running containerd instance:

sudo ls /var/lib/containerd/io.containerd.content.v1.content/blobs/sha256/ | head
sudo ctr content ls | head

The content store treats every blob as opaque bytes addressed by digest and verifies each blob against its expected digest when the ingest is committed. The image service is what gives the bytes meaning: it tracks that alpine:3.20 resolves to a particular manifest digest, which references a config and a list of layers.
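
The idea of a digest-addressed store with a staging area can be sketched in a few lines of shell. This is an illustration only: it borrows the blobs/ and ingest/ directory names from the layout above, but put_blob and the store variable are hypothetical and this is not how containerd's implementation is written.

```shell
store=$(mktemp -d)
mkdir -p "$store/blobs/sha256" "$store/ingest"

# put_blob FILE [EXPECTED_DIGEST] (hypothetical helper):
# stage the bytes, hash them, check the digest, then commit by rename.
put_blob() {
  local src=$1 expected=${2:-} staged hex
  staged=$(mktemp "$store/ingest/XXXXXX")
  cat "$src" > "$staged"
  hex=$(sha256sum "$staged" | cut -d' ' -f1)
  if [ -n "$expected" ] && [ "$expected" != "sha256:$hex" ]; then
    rm -f "$staged"
    echo "digest mismatch: got sha256:$hex" >&2
    return 1
  fi
  # rename(2) is atomic within one filesystem, so readers never see a partial blob.
  mv "$staged" "$store/blobs/sha256/$hex"
  echo "sha256:$hex"
}

printf 'example blob\n' > "$store/input"
digest=$(put_blob "$store/input")
echo "stored as $digest"
```

Because the file name is the digest, storing the same bytes twice lands on the same path, which is where the deduplication comes from.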

Garbage collection treats content and snapshots together. A blob is reachable if some image, lease, or snapshot references it. Leases keep arbitrary content alive during in-flight operations:

sudo ctr leases ls

Snapshotters

The snapshotter turns image layers into a mountable filesystem. containerd defines an interface; plugins implement it. The default on Linux is the OverlayFS snapshotter.

sudo ctr snapshot ls | head
# KEY                                                                                 PARENT                                                                              KIND
# sha256:abcd... (chainID for layer 0)                                                                                                                                  Committed
# sha256:efgh... (chainID for layer 0+1)                                              sha256:abcd...                                                                      Committed
# k8s.io/12/<container-id>                                                            sha256:efgh...                                                                      Active

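Three kinds of snapshot:

  - Committed: immutable, and usable as the parent of other snapshots. Each unpacked image layer ends up as one.
  - Active: writable, created on top of a parent with Prepare. A container's read-write layer is an active snapshot; Commit turns an active snapshot into a committed one.
  - View: a read-only snapshot created on top of a parent. It cannot be committed.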

The snapshotter's directory layout:

/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/
  metadata.db
  snapshots/<id>/fs/        # the layer's directory
  snapshots/<id>/work/      # OverlayFS work dir (only for active)

fs/ is the data; work/ is OverlayFS's scratch area for atomic operations, required to be on the same filesystem as the upper layer.
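
The sha256:... keys of the committed snapshots in the listing above are chain IDs. The OCI image spec defines them recursively: the chain ID of the bottom layer is its DiffID, and each later one is the SHA-256 of the previous chain ID, a space, and the next DiffID. A small sketch of that rule follows; the chain_ids function name and the two DiffIDs are made up so the example is reproducible.

```shell
# chain_ids DIFF_ID...: print one chain ID per layer, bottom layer first.
chain_ids() {
  local chain="" diff_id
  for diff_id in "$@"; do
    if [ -z "$chain" ]; then
      chain=$diff_id                     # ChainID(L0) = DiffID(L0)
    else
      # ChainID(L0..Ln) = SHA256(ChainID(L0..Ln-1) + " " + DiffID(Ln))
      chain="sha256:$(printf '%s %s' "$chain" "$diff_id" | sha256sum | cut -d' ' -f1)"
    fi
    printf '%s\n' "$chain"
  done
}

# Two made-up DiffIDs derived from fixed strings.
d0="sha256:$(printf 'layer-0' | sha256sum | cut -d' ' -f1)"
d1="sha256:$(printf 'layer-1' | sha256sum | cut -d' ' -f1)"
chain_ids "$d0" "$d1"
```

Feeding it the diff_ids array of a real image config, in order, should reproduce the keys and parent links that ctr snapshot ls reports for that image.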

Build An Overlay By Hand

To see the OverlayFS internals that the snapshotter wraps, mount one yourself:

sudo mkdir -p /tmp/ov/{lower1,lower2,upper,work,merged}

# Lower1 = base layer
echo "from lower1" | sudo tee /tmp/ov/lower1/file-a > /dev/null
echo "lower1 only" | sudo tee /tmp/ov/lower1/file-b > /dev/null

# Lower2 = stacked above lower1 (left-most in lowerdir = top)
echo "from lower2 (overrides lower1)" | sudo tee /tmp/ov/lower2/file-a > /dev/null

# Upper = writable
sudo mount -t overlay overlay \
  -o lowerdir=/tmp/ov/lower2:/tmp/ov/lower1,upperdir=/tmp/ov/upper,workdir=/tmp/ov/work \
  /tmp/ov/merged

# What the merged view shows:
cat /tmp/ov/merged/file-a  # from lower2 (overrides lower1)
cat /tmp/ov/merged/file-b  # lower1 only

# Write to merged. Changes go to upper.
echo "new line" | sudo tee -a /tmp/ov/merged/file-a > /dev/null
ls /tmp/ov/upper/
# file-a

The lowerdir option lists layers from top to bottom: the first directory is the top of the stack and the last is the bottom. Writes always land in upperdir.
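
Image layers are normally listed bottom-first, so whatever builds the mount options has to reverse them. A minimal sketch of that reversal, using a hypothetical lowerdir_opt helper:

```shell
# lowerdir_opt DIR...: take layer directories bottom-first and print them
# in the top-first, colon-separated form that lowerdir= expects.
lowerdir_opt() {
  local opt="" layer
  for layer in "$@"; do
    opt="$layer${opt:+:$opt}"   # prepend, so later (higher) layers come first
  done
  printf '%s\n' "$opt"
}

lowerdir_opt /tmp/ov/lower1 /tmp/ov/lower2
# prints /tmp/ov/lower2:/tmp/ov/lower1
```

The output matches the lowerdir value used in the mount command above.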

To delete a lower-layer file from the merged view:

sudo rm /tmp/ov/merged/file-b
ls /tmp/ov/merged/
# file-a
ls -la /tmp/ov/upper/
# c--------- 0 0 ... file-b   <- character device, major:minor 0:0

OverlayFS represents whiteouts as character devices with major:minor 0:0, not as the OCI tar .wh.<name> convention. The unpacker translates between them at unpack time.
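
The translation in the other direction, from an upper directory back to tar entry names, comes down to two small checks. The helpers below are hypothetical and only sketch the idea; they assume GNU stat, which prints device numbers in hex with %t and %T.

```shell
# is_overlay_whiteout PATH: true if PATH is a character device 0:0.
is_overlay_whiteout() {
  [ -c "$1" ] && [ "$(stat -c '%t:%T' "$1")" = "0:0" ]
}

# oci_whiteout_name PATH: the tar entry name that records the same deletion.
oci_whiteout_name() {
  local dir base
  dir=$(dirname "$1")
  base=$(basename "$1")
  if [ "$dir" = "." ]; then
    printf '.wh.%s\n' "$base"
  else
    printf '%s/.wh.%s\n' "$dir" "$base"
  fi
}

oci_whiteout_name etc/motd
# prints etc/.wh.motd
```

Run against the upper directory from the example above, is_overlay_whiteout /tmp/ov/upper/file-b would be expected to succeed while the whiteout exists.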

To clean up:

sudo umount /tmp/ov/merged
sudo rm -rf /tmp/ov

Rootless OverlayFS

trusted.* xattrs (used by OverlayFS for opaque markers and redirects) can only be set with CAP_SYS_ADMIN in the initial user namespace, so an unprivileged user cannot create them. Linux 5.11 added support for mounting OverlayFS inside a user namespace, storing the same metadata in user.overlay.* xattrs instead. On older kernels, rootless tooling (rootless Docker, podman) falls back to fuse-overlayfs, a userspace reimplementation that uses user.* xattrs and accepts the performance cost.

To check which one is in use under a rootless engine:

podman info --format '{{.Store.GraphDriverName}}'
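# overlay (the driver name is the same with or without fuse-overlayfs)
podman info --format '{{.Store.GraphOptions}}'
# mentions overlay.mount_program (fuse-overlayfs) when the FUSE implementation is in use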
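
A quick way to guess which path a host will take is to compare its kernel release against the 5.11 threshold. The function below is a heuristic sketch with a made-up name; distribution kernels sometimes backport features, so a version comparison is not proof either way.

```shell
# supports_userns_overlay [RELEASE]: true if the kernel release (default:
# the running kernel) is 5.11 or newer.
supports_userns_overlay() {
  local release=${1:-$(uname -r)} major minor
  major=${release%%.*}
  minor=${release#*.}
  minor=${minor%%[!0-9]*}     # keep only the leading digits of the minor part
  [ "$major" -gt 5 ] || { [ "$major" -eq 5 ] && [ "$minor" -ge 11 ]; }
}

if supports_userns_overlay; then
  echo "kernel overlay in a user namespace should be available"
else
  echo "expect a fuse-overlayfs fallback"
fi
```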

Mount Setup In runc

When runc starts a container, it executes a fixed sequence inside a fresh mount namespace. Watching it from outside is awkward (the work happens between clone(2) and execve(2) of the user process); reading the kernel-side view is easier. After a container has started, look at its mounts via /proc/<pid>/mountinfo:

docker run --rm -d --name demo alpine:3.20 sleep 600
PID=$(docker inspect -f '{{.State.Pid}}' demo)
sudo cat /proc/$PID/mountinfo
# 1 0 0:34 / / rw,relatime master:1 - overlay overlay rw,lowerdir=...
# 2 1 0:35 / /proc rw,nosuid,nodev,noexec,relatime - proc proc ...
# 3 1 0:36 /pts /dev/pts rw,nosuid,noexec,relatime - devpts devpts ...
# ...

The first line is the rootfs — an OverlayFS mount with a lowerdir chain corresponding to the image's layers and an upperdir corresponding to the snapshotter's active snapshot. The rest are the standard runtime mounts: /proc, /dev as tmpfs, /dev/pts, /dev/shm as tmpfs, /dev/mqueue, /sys, and a read-only cgroup2 bind.

docker stop demo
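
Each mountinfo line has a fixed prefix (mount ID, parent ID, major:minor, root, mount point, options), then zero or more optional fields such as master:1, a lone - separator, and finally the filesystem type, source, and superblock options. A small awk filter, here wrapped in a hypothetical parse_mountinfo function, pulls out the mount point, type, and source regardless of how many optional fields are present:

```shell
# parse_mountinfo: read mountinfo lines on stdin and print
# "<mount point> <fstype> <source>" for each.
parse_mountinfo() {
  awk '{
    # Optional fields start at column 7; scan forward to the "-" separator.
    for (i = 7; i <= NF; i++) if ($i == "-") break
    print $5, $(i + 1), $(i + 2)
  }'
}

# A made-up sample line in the same shape as the rootfs entry above.
printf '%s\n' '36 35 0:34 / / rw,relatime master:1 - overlay overlay rw,lowerdir=/l1' |
  parse_mountinfo
# prints: / overlay overlay
```

For a live container, the same filter can be fed with sudo cat /proc/$PID/mountinfo.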

runc applies these mounts in a fixed order, because pivot_root(2) places constraints on the state of the mount tree:

  1. Make the new mount namespace's root mount private to prevent propagation back to the host.
  2. Mount the rootfs (OverlayFS or whatever the snapshotter returns).
  3. Mount the special filesystems described in config.json's mounts array.
  4. Create device nodes (bind from host) and link standard ones (/dev/stdin -> /proc/self/fd/0, etc.).
  5. Apply linux.maskedPaths (bind /dev/null over them).
  6. Apply linux.readonlyPaths (remount as read-only).
  7. pivot_root(2) to swap the prepared rootfs in as /.
  8. Unmount and remove the old root.
  9. Set propagation on / per linux.rootfsPropagation.

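The runc source for this lives in libcontainer/rootfs_linux.go; the OCI runtime spec describes the requirements in config.md (the mounts array) and config-linux.md (maskedPaths, readonlyPaths, rootfsPropagation), with the standard /dev symlinks listed in runtime-linux.md.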

pivot_root By Hand

To see what pivot_root(2) is doing without runc in the way:

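sudo unshare --mount -- bash
# Inside the new mount namespace.

# Make / private so pivot_root's propagation constraint is met and
# nothing done here propagates back to the host.
mount --make-rprivate /

# Mount the placeholder rootfs first, then populate it; directories
# created before the mount would be hidden underneath the tmpfs.
mkdir -p /tmp/newroot
mount -t tmpfs tmpfs /tmp/newroot
mkdir -p /tmp/newroot/{old,bin,proc,sys,dev}
cp /bin/busybox /tmp/newroot/bin/        # must be statically linked (Ubuntu: busybox-static)

# pivot_root requires new_root to be a mount point. The tmpfs already
# is one; for a plain-directory rootfs, bind-mount it onto itself first:
#   mount --bind /tmp/newroot /tmp/newroot

cd /tmp/newroot
pivot_root . old

# After this, `.` is the new /; `/old` is the old root.
# Call applets through /bin/busybox: nothing else exists in the new root.
exec /bin/busybox sh
/bin/busybox ls /old
# usr  etc  var  ... (the host's old root)

/bin/busybox umount -l /old
/bin/busybox ls /
# bin  dev  old  proc  sys  (old is now an empty directory)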

This is the core of what runc does. The important difference is timing: runc unmounts the old root before exec'ing the user process, so the container never sees it and cannot escape through it. The example above keeps /old mounted just long enough to inspect it.

OCI Mounts In Practice

The mounts array in config.json is what the runtime applies in step 3 above. Each entry has destination, type, source, and options. The conventional default set:

Destination     Type     Source   Notes
--------------  -------  -------  ----------------------------------------------------
/proc           proc     proc     Reflects PID namespace. Required for ps, /proc/self/.
/dev            tmpfs    tmpfs    mode=755.
/dev/pts        devpts   devpts   newinstance,ptmxmode=0666.
/dev/shm        tmpfs    shm      mode=1777,size=65536k.
/dev/mqueue     mqueue   mqueue   Matches IPC namespace.
/sys            sysfs    sysfs    Often ro for non-privileged.
/sys/fs/cgroup  cgroup2  cgroup   Read-only bind, view limited by cgroup namespace.

Bind mounts (type: bind, source: <host-path>) are how host paths show up inside containers — Docker volumes, Kubernetes hostPath, and secret mounts all use them. The options field controls propagation: rprivate (default) does not propagate; rslave accepts host changes; rshared propagates both ways. Kubernetes' mountPropagation: HostToContainer corresponds to rslave; Bidirectional corresponds to rshared.
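
The Kubernetes-to-OCI correspondence is small enough to write down as a lookup. The function name below is made up for illustration; the None case reflects the rprivate default mentioned above.

```shell
# k8s_to_oci_propagation MODE: print the OCI mount option that matches a
# Kubernetes mountPropagation value.
k8s_to_oci_propagation() {
  case "$1" in
    None|"")         echo rprivate ;;   # default: nothing propagates
    HostToContainer) echo rslave ;;     # host changes appear in the container
    Bidirectional)   echo rshared ;;    # changes propagate both ways
    *) echo "unknown mountPropagation: $1" >&2; return 1 ;;
  esac
}

k8s_to_oci_propagation HostToContainer
# prints rslave
```

The printed value is what would appear in the options array of the corresponding bind entry in config.json, next to rbind.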

To see a real container's mount config:

docker run --rm -d --name demo -v /tmp:/host-tmp alpine:3.20 sleep 600
PID=$(docker inspect -f '{{.State.Pid}}' demo)
sudo cat /proc/$PID/mountinfo | grep host-tmp
# .../tmp /host-tmp rw,relatime - ext4 /dev/...
docker stop demo

Where This Goes

The next chapter covers the security controls that compose with the filesystem to form the full container boundary — capabilities, seccomp, MAC, masked paths. The masked paths in particular layer on top of the mount setup described here: they are bind mounts performed after the OCI mount table is in place.

Sources And Further Reading