Chapter 2: What A Container Actually Is
A Linux container is a process — or a process tree — running with a deliberately configured environment. The kernel schedules it like any other process. What makes it a container is the configuration around it: what it can see, what it can use, which filesystem it sees as /, what authority it has, and how it reaches the network.
Why The Boundary Is Not A Hypervisor
VMs and containers both isolate workloads, but at different layers. A VM runs a guest kernel on hardware virtualization. A Linux container shares the host kernel — which is why startup is fast (process creation, not a boot) and why the container is tied to that kernel's version, config, and security state. A workload that needs a different kernel needs something more than a Linux container.
The container boundary is the assembled effect of namespaces, cgroups, mount setup, credentials, capabilities, seccomp, LSMs, device rules, and runtime policy. It is strong when the configuration is right, but its shape is nothing like a hypervisor's: a hypervisor isolates with a separate guest kernel and a virtual hardware interface, while a container is the host kernel applying scoping rules to one of its own processes. The practical consequence is that a kernel bug like CVE-2022-0185 (filesystem-context heap overflow) is a container-escape primitive; the same bug under a VM stops at the guest kernel.
Not An Image
An image is packaged content and metadata — in OCI terms, a manifest, an image config, and a set of layer blobs, all addressed by digest. It can sit in a registry or a local content store with no process running anywhere.
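That separation is easy to check: the manifest can be pulled and read with nothing running anywhere. A quick look, assuming skopeo is installed, with alpine:3.20 as the example:
# Print the raw manifest straight from the registry; no container is involved.
skopeo inspect --raw docker://alpine:3.20
# For a multi-arch image this returns an index that points at per-platform manifests.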
A container is what happens when that content is combined with runtime configuration to start a process. "Run an image" is shorthand for a fixed sequence: the runtime resolves the reference, fetches the manifest and layers, stores them by digest, hands them to a snapshotter that produces a writable root filesystem, and starts a process inside that filesystem under an OCI runtime spec. The image is an input. The container is the running result.
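With containerd's ctr the sequence shows up as discrete steps; a sketch, assuming containerd is running and the commands are run as root:
ctr images pull docker.io/library/alpine:3.20                # resolve, fetch, store blobs by digest
ctr run --rm docker.io/library/alpine:3.20 demo echo hello   # snapshot, OCI spec, process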
The Process At The Center
A container has an ordinary Linux process at its center, with everything any other Linux process has: a command, environment, credentials, file descriptors, signal handlers, parents and children, an exit status. The runtime does not replace any of that; it configures the world around it.
The consequences are mundane. When the configured process exits, the container's task is over. When it spawns children, the runtime and kernel track the whole tree. A signal from outside has to become a Linux signal delivered to the right PID. Bytes written to stdout have to land somewhere the runtime stack has wired up.
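Each of these consequences is observable with ordinary tools. The exit status, for instance, passes straight through (assuming Docker):
# The container's reported exit code is the process's exit code.
docker run --rm alpine:3.20 sh -c 'exit 3'; echo $?   # prints 3
# A stop request is a signal: docker stop sends SIGTERM, then SIGKILL after a timeout.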
Namespaces: What The Process Can See
Namespaces wrap global system resources so a process sees a scoped instance of each one. Eight namespace types are relevant to containers:
- mount — the mount table;
- PID — the process ID space;
- network — interfaces, addresses, routes, firewall state, ports;
- UTS — hostname and domain name;
- IPC — System V IPC and POSIX message queues;
- user — UID and GID mappings;
- cgroup — the cgroup hierarchy view;
- time — offsets for CLOCK_MONOTONIC and CLOCK_BOOTTIME (added in Linux 5.6).
Inside a container, a process might believe it is PID 1; on the host it has another PID, and both are real within their respective views. Namespaces on their own do not limit memory, CPU, or process count, and an isolated mount namespace can still expose dangerous host paths if a careless bind mount puts them there.
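Both views are directly observable. Namespace membership appears as inode-numbered symlinks under /proc/<pid>/ns, and two processes share a namespace exactly when the inode numbers match; a quick check, assuming Docker:
cid=$(docker run -d alpine:3.20 sleep 300)
docker exec "$cid" ls -l /proc/1/ns    # inside: sleep believes it is PID 1
docker top "$cid"                      # outside: the same process, an ordinary host PID
docker rm -f "$cid"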
cgroups: What The Process Can Use
cgroups organize processes into a hierarchy and attach resource controllers to it. Where namespaces govern visibility, cgroups govern consumption: memory limits and OOM behavior, CPU weights and quotas, cpuset placement, IO controls, pids limits, and the accounting that makes any of it observable.
Linux has shipped cgroup v2 since 4.5 (2016); systemd made it the default in v243 (2019), and major distributions followed by 2022 (RHEL 9). v2 is a unified hierarchy with one consistent controller model, replacing v1's per-controller hierarchies. On systemd hosts, runtimes ask systemd for a transient scope or slice over D-Bus and let systemd create the cgroup with Delegate=yes; the runtime itself never writes arbitrary paths under /sys/fs/cgroup.
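That delegation can be watched with no container runtime in the picture; a minimal sketch on a systemd host with cgroup v2 (property values illustrative):
# Ask systemd for a transient scope with resource limits; systemd creates the
# cgroup and places the command in it.
sudo systemd-run --scope -p MemoryMax=100M -p CPUQuota=50% -- sh -c 'cat /proc/self/cgroup'
# 0::/system.slice/run-r<id>.scope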
The split to hold onto:
- namespaces answer what world does this process see?
- cgroups answer what resources can this process group use?
The Root Filesystem
Inside a container, / is almost never the host's /. It is a prepared root filesystem assembled from image layers and runtime mounts.
Producing it follows a fixed sequence: layers arrive as content; the snapshotter unpacks them and stacks them, typically with overlayfs; a writable upper layer is added for this container; the runtime sets up the mount table inside a fresh mount namespace; and pivot_root(2) swaps the prepared tree in as the new /. Bind mounts then expose specific host paths, tmpfs is mounted where needed, certain paths are masked or read-only, and device nodes are restricted.
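The overlay step can be reproduced by hand; a minimal sketch with two lower layers and one writable upper (paths illustrative, run as root):
mkdir -p /tmp/l1 /tmp/l2 /tmp/upper /tmp/work /tmp/merged
echo base > /tmp/l1/file && echo override > /tmp/l2/file
mount -t overlay overlay \
  -o lowerdir=/tmp/l2:/tmp/l1,upperdir=/tmp/upper,workdir=/tmp/work /tmp/merged
cat /tmp/merged/file    # "override": the leftmost lowerdir is the topmost layer
touch /tmp/merged/new   # writes land in /tmp/upper; the lower layers stay untouched
umount /tmp/merged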
The image is the repeatable base; the runtime decides what host-specific mounts, secrets, devices, and writable paths show up.
Security Controls: What The Process May Do
In Unix, visibility and permission are different problems. A container process can see a path it cannot read, run as UID 0 inside a user namespace while mapping to an unprivileged UID on the host, hold a shrunken set of capabilities, and have most of its syscalls denied by seccomp and most of its file accesses denied by AppArmor or SELinux.
The usual controls:
- capabilities — split root-equivalent authority into individually grantable pieces;
- seccomp — filter or block system calls;
- AppArmor / SELinux — apply mandatory access control policy;
- user namespaces — map UIDs and GIDs;
- read-only and masked mounts — limit filesystem exposure;
- device cgroup rules — control which device nodes work.
The boundary is the combined configuration. Dropping CAP_NET_ADMIN is fine for a web server but breaks a CNI plugin that needs to manage routes inside the netns; turning off seccomp's default deny list lets a workload call keyctl(2) again, which is fine for some images and a privilege-escalation vector for others.
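One concrete instance of the trade-off, assuming Docker, which grants CAP_CHOWN by default:
# Root inside the container, but with CAP_CHOWN dropped: chown fails with EPERM.
docker run --rm --cap-drop=CHOWN alpine:3.20 sh -c 'chown nobody /tmp || echo "chown denied"'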
Networking: How The Process Communicates
A container usually runs in its own network namespace, with its own loopback device, addresses, routes, firewall state, and ports. Connecting it to anything else is left to the runtime: typical solutions include veth pairs into a bridge, routed setups, NAT, overlay networks, eBPF datapaths, or direct device assignment.
Kubernetes treats this slightly differently. It creates one network namespace per pod sandbox, and every container in that pod runs inside the same namespace. This is why containers in a pod can talk to each other over localhost — they share the namespace.
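The same sharing can be reproduced with Docker's container network mode, which is roughly how a pod sandbox is wired underneath (a sketch, not the kubelet's exact sequence):
cid=$(docker run -d alpine:3.20 sleep 300)     # stands in for the pod's sandbox process
docker run --rm --network container:"$cid" alpine:3.20 ip addr   # same interfaces, same lo
docker rm -f "$cid"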
The Container Network Interface (CNI) fits at this layer. CNI is a plugin specification, not a container or a runtime. The runtime creates (or receives) a network namespace, then invokes plugin binaries with ADD to set it up and DEL to tear it down. A concrete ADD call: containerd executes /opt/cni/bin/bridge, passes the namespace path in CNI_NETNS and a JSON config on stdin, and the bridge plugin creates a veth pair, moves one end into the namespace, attaches the other to a Linux bridge on the host, and delegates address allocation to an IPAM plugin. Part V walks through this in detail.
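The handshake can also be driven by hand; a sketch assuming the reference CNI plugins are installed under /opt/cni/bin (the network name, bridge name, and 10.30.0.0/24 subnet are illustrative):
ip netns add cni-demo
CNI_COMMAND=ADD CNI_CONTAINERID=demo CNI_NETNS=/run/netns/cni-demo \
CNI_IFNAME=eth0 CNI_PATH=/opt/cni/bin /opt/cni/bin/bridge <<'EOF'
{
  "cniVersion": "1.0.0",
  "name": "demo-net",
  "type": "bridge",
  "bridge": "cni-demo0",
  "isGateway": true,
  "ipam": { "type": "host-local", "subnet": "10.30.0.0/24" }
}
EOF
# Prints a JSON result with the allocated address; the same env with CNI_COMMAND=DEL tears it down.
ip -n cni-demo addr show eth0   # the veth end the plugin moved into the namespace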
The Metadata Outside
The running process is only half the story. Outside it, the runtime tracks the image reference, labels and annotations, the snapshot key, the runtime name, the OCI spec, task status, the shim process, IO pipes and logs, exit status, and the leases that keep garbage collection from reclaiming content still in use.
This outside state is why containerd separates a container from a task. The container object is metadata that can exist without any process; the task is the live execution of that metadata.
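containerd's own CLI makes the split tangible (assumes containerd is running and the alpine image is already pulled):
ctr containers create docker.io/library/alpine:3.20 demo   # container object: metadata only
ctr containers ls                                          # "demo" exists...
ctr tasks ls                                               # ...with no task, so no process
ctr containers delete demo                                 # metadata removed; nothing ever ran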
A Thought Experiment
Take /bin/sh on a Linux host. Run it normally; it sees the host's process tree, mounts, network, and resource environment. Now change one thing at a time. The commands below add isolation in roughly the order a runtime does. Run them on a disposable Linux VM as root — each one mutates kernel state — and refer to chapter 4 onward for the flag-by-flag detail.
Swap the root filesystem (a mount namespace plus a new root; chroot stands in for pivot_root(2) in this demo): / means something different.
# Build a tiny rootfs and enter a shell whose / is that directory.
cid=$(docker create alpine:3.20)
mkdir -p /tmp/rootfs && docker export "$cid" | tar -x -C /tmp/rootfs && docker rm "$cid"
sudo unshare --mount --uts --pid --fork --mount-proc=/tmp/rootfs/proc \
chroot /tmp/rootfs /bin/sh
# Inside: ls / shows alpine's layout, not the host's.
Add a UTS namespace: it gets its own hostname.
sudo unshare --uts -- /bin/sh -c 'hostname container-demo; hostname; exit'
hostname # unchanged on the host
Add a PID namespace: it sees a process tree where it might be PID 1.
sudo unshare --pid --fork --mount-proc -- /bin/sh -c 'echo "I am PID $$"; ps -ef'
# I am PID 1
# UID PID PPID ... CMD
# 0 1 0 ... /bin/sh -c echo "I am PID $$"; ps -ef
Add a cgroup with limits: its memory and CPU are bounded and accounted.
sudo mkdir /sys/fs/cgroup/demo
echo "100M" | sudo tee /sys/fs/cgroup/demo/memory.max
echo "50000 100000" | sudo tee /sys/fs/cgroup/demo/cpu.max # 50% of one CPU
sudo unshare --pid --fork --mount-proc -- /bin/sh -c '
echo $$ > /sys/fs/cgroup/demo/cgroup.procs
cat /sys/fs/cgroup/demo/memory.current
'
sudo rmdir /sys/fs/cgroup/demo # cleanup
Add a network namespace and a veth pair: it has its own network stack, plumbed to the host.
sudo ip netns add demo
sudo ip link add veth-h type veth peer name veth-c
sudo ip link set veth-c netns demo
sudo ip addr add 10.20.0.1/24 dev veth-h && sudo ip link set veth-h up
sudo ip -n demo addr add 10.20.0.2/24 dev veth-c
sudo ip -n demo link set veth-c up && sudo ip -n demo link set lo up
sudo ip netns exec demo /bin/sh -c 'ip addr; ping -c1 10.20.0.1'
sudo ip netns del demo && sudo ip link del veth-h 2>/dev/null
Drop capabilities, attach a seccomp profile, and apply an LSM policy: it has less authority even in a tree it appears to own.
# Capabilities: run a shell with every capability set cleared ("=" is the empty set).
sudo capsh --drop=all --caps="=" -- -c 'grep ^Cap /proc/self/status; id'
# CapEff: 0000000000000000; still uid=0, but with no authority
# Seccomp: deny mkdir(2) for this shell only (requires runc/docker for full profiles).
docker run --rm --security-opt seccomp=<(echo '{
"defaultAction":"SCMP_ACT_ALLOW",
"syscalls":[{"names":["mkdir","mkdirat"],"action":"SCMP_ACT_ERRNO"}]
}') alpine:3.20 sh -c 'mkdir /tmp/x || echo "mkdir blocked"'
# LSM: confine /bin/sh under the docker-default AppArmor profile.
docker run --rm --security-opt apparmor=docker-default alpine:3.20 \
sh -c 'cat /proc/self/attr/current'
# docker-default (enforce)
Each command on its own is a small kernel call. Stacked, they are what a runtime hands the kernel when it starts a container.
A Working Definition
The word container gets attached to all of these: a tarball, an image in a registry, a chroot, a single namespace or cgroup, a docker run command, a Kubernetes pod, even a VM. Each names a real part of the picture, and none of them is the whole thing. Mistaking the part for the whole is how the word loses meaning.
A container is a process or process tree, started from packaged content and a runtime spec, isolated and constrained by host kernel primitives, and tracked by metadata that lives outside it.
Chapter 3 picks up the contracts that hold these pieces together: the OCI runtime spec and the runtime v2 shim API.
Sources And Further Reading
- Linux namespaces: https://man7.org/linux/man-pages/man7/namespaces.7.html
- Linux cgroups: https://man7.org/linux/man-pages/man7/cgroups.7.html
- Linux cgroup v2 docs: https://kernel.org/doc/html/next/admin-guide/cgroup-v2.html
- OCI Runtime Specification: https://github.com/opencontainers/runtime-spec
- OCI Image Specification: https://github.com/opencontainers/image-spec
- CNI specification: https://www.cni.dev/docs/spec/