Chapter 1: The Container Stack Map
The word "container" gets used at every layer of the stack. A docker run command is a "container." A Kubernetes Pod "contains" containers. runc "creates a container." A clone3(2) call with namespace flags creates something the kernel does not call a container at all. The same word names a developer workflow, a Kubernetes resource, a runtime command, and a kernel construction — and people use the meanings interchangeably. That is how the same conversation can be technically correct on every line and still leave the room confused.
This book walks the runtime stack from the kernel up: namespaces and cgroups (control groups) and mount setup, the OCI (Open Container Initiative) runtime spec and runc, the runtime v2 shim, containerd's content store and snapshotters and task service, the Container Runtime Interface (CRI), and the Kubernetes objects that finally land on those primitives. The map below is what the rest of the chapters fill in.
A container system turns a request — typed by a developer or issued by an orchestrator — into a Linux process running with a configured environment. Along the way image references become local content, image layers become a mountable filesystem, runtime metadata becomes an OCI spec, and that spec becomes kernel state: namespaces, cgroups, mounts, security policy, and a network stack.
Docker, Kubernetes, nerdctl, ctr, Podman, and direct containerd clients all enter the stack at different layers. What stays consistent is the delegation pattern: high-level tools express intent, lower layers produce content and runtime state, and the kernel enforces the result.
A Short History
Containers were not designed; they accreted over four decades of Unix work. The modern stack — kernel primitives at the bottom, an OCI runtime, a per-container shim, a daemon, an orchestrator API, and developer tooling at the top — is a record of which problem each generation solved and which it deferred to the next.
Filesystem Isolation: 1979 To 2005
Unix chroot(2) shipped in V7 in 1979. It changes the apparent root directory of a process so pathname resolution starts somewhere other than /. That is a useful primitive — it lets a sandbox or a build environment see a smaller filesystem — and it is not a container. chroot does not isolate process IDs, networking, IPC, users, or privileges. Root inside a chroot is root on the host, and a root process can break out of one without much effort.
FreeBSD jails (FreeBSD 4.0, March 2000) extended the idea into something closer to OS-level virtualization. A jail is chroot plus its own hostname, IP-binding constraints, restricted process visibility, and a privilege model that downgrades a jailed root. Solaris Zones (Solaris 10, early 2005) went further: separate process trees, separate networking stacks (with "exclusive-IP" zones added in 2007), and separate user spaces, all sharing one kernel. Both systems demonstrated, several years before Linux containers were practical, that "many isolated user spaces on one kernel" was a real production model.
Linux took longer. Through the 2000s the kernel grew the same set of capabilities one feature at a time, and the userland followed.
The Linux Kernel Catches Up: 2002 To 2020
Linux containers did not arrive as a single feature. The kernel grew the parts list across more than a decade:
| Feature | Kernel | Year |
|---|---|---|
| POSIX capabilities | 2.2 | 1999 |
| Mount namespace | 2.4.19 | 2002 |
| SELinux | 2.6.0 | 2003 |
| seccomp (mode 1) | 2.6.12 | 2005 |
| UTS and IPC namespaces | 2.6.19 | 2006 |
| Process containers (cgroups) | 2.6.24 | 2008 |
| PID namespace | 2.6.24 | 2008 |
| Network namespace | 2.6.24 onward | 2008–2009 |
| AppArmor | 2.6.36 | 2010 |
| seccomp-bpf | 3.5 | 2012 |
| User namespace | 3.8 | 2013 |
| OverlayFS | 3.18 | 2014 |
| cgroup v2 | 4.5 | 2016 |
| cgroup namespace | 4.6 | 2016 |
| Time namespace | 5.6 | 2020 |
A short gloss on each name in the table:
- POSIX capabilities — root-equivalent authority broken into ~40 individually grantable units (CAP_NET_ADMIN, CAP_SYS_ADMIN, etc.).
- Namespaces — the kernel mechanism that gives a process a private view of one global resource (process IDs, mounts, network stack, hostname, IPC, user IDs, cgroup tree, time offsets).
- cgroups (control groups) — group processes into a hierarchy and account for or limit their CPU, memory, IO, and pids consumption.
- SELinux and AppArmor — Linux Security Modules (LSMs) that enforce mandatory access control. SELinux is label-based; AppArmor is path-based.
- seccomp — secure computing mode, the kernel feature that filters syscalls; seccomp-bpf lets userspace install a Berkeley Packet Filter program that makes the per-syscall decision.
- OverlayFS — a kernel filesystem that stacks read-only layers under a writable upper layer; the standard backend for layered container images.
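To make the namespace rows concrete, the sketch below — an illustration, not a production pattern — starts a shell in fresh UTS, PID, and mount namespaces from Go by passing clone flags through the standard library. It assumes a Linux host and root privileges; the flag names map directly onto the namespace entries in the table.

```go
// Minimal sketch (Linux, run as root): start a shell in new UTS, PID,
// and mount namespaces by passing clone flags through os/exec.
// This is the same kernel mechanism a runtime uses, stripped of everything else.
package main

import (
	"os"
	"os/exec"
	"syscall"
)

func main() {
	cmd := exec.Command("/bin/sh")
	cmd.Stdin, cmd.Stdout, cmd.Stderr = os.Stdin, os.Stdout, os.Stderr
	cmd.SysProcAttr = &syscall.SysProcAttr{
		// Each flag asks the kernel for one private view.
		Cloneflags: syscall.CLONE_NEWUTS | // private hostname/domainname
			syscall.CLONE_NEWPID | // process IDs restart at 1
			syscall.CLONE_NEWNS, // a private copy of the mount table
	}
	if err := cmd.Run(); err != nil {
		panic(err)
	}
}
```

Inside that shell, hostname changes stay local to the namespace, and once /proc is remounted the shell sees itself as PID 1.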
cgroups arrived under the name "process containers" — Google's Paul Menage and Rohit Seth proposed them in 2006, and they merged in 2.6.24 in 2008. The name was changed to avoid colliding with the rest of the kernel's "container" usage. (The rename did not stop "container" from becoming the overloaded word this chapter opened with.)
Through this period, Linux had container parts but no consensus container. OpenVZ — an open-source OS-level virtualization system distributed as a separate kernel patch set — had been running production VPS hosting since 2005, but never made it to mainline. Google had been running Borg, its internal cluster manager, on cgroups since the late 2000s. Neither was something a developer could install on a laptop.
LXC, Docker, And libcontainer: 2008 To 2014
LXC — short for Linux Containers — was the first userspace toolkit to assemble Linux's accumulating namespace and cgroup features into a usable container model. LXC 0.1.0 shipped in August 2008. It drove the kernel directly, exposed a CLI (lxc-create, lxc-start, lxc-attach), and worked — but it left the workflow problems open. There was no standard image format, no layered build model, no registry, no portable way to "ship a container."
Docker (March 2013) closed the workflow gap. The original 2013 release used LXC underneath; what Docker added was the part above the kernel: a daemon (dockerd), a developer-facing CLI (docker), a layered image format with build instructions (Dockerfile), and a registry protocol (Docker Registry, later Distribution). Containers became commodity infrastructure within eighteen months of Docker's launch — not because the kernel changed, but because the workflow finally fit a developer's day.
Docker 0.9 (March 2014) replaced LXC with libcontainer, a Go library that drove the kernel directly through namespace and cgroup syscalls. The motivations were operational: independence from a separate userland project, the ability to manage namespaces and cgroups directly from Go, and a path to platform-independent execution drivers. libcontainer is the codebase that became runc.
CoreOS launched rkt (pronounced "rocket") in December 2014 with a different model: no central daemon, an appc (App Container) image format, and one supervising process per invocation rather than a long-running engine. rkt did not survive as a runtime, but its existence pushed the ecosystem toward open, multi-vendor standards instead of Docker-defined ones.
Standards: OCI, 2015
The Open Container Initiative formed at DockerCon on June 22, 2015, under the Linux Foundation. The mandate was open specifications for container formats and runtimes; Docker donated runc (a brand-new extraction of libcontainer) as the reference implementation. OCI publishes three specifications, each versioned independently:
- Runtime Specification — what an OCI runtime consumes and how it behaves. Defines the OCI bundle format and the lifecycle commands.
- Image Specification — how images are packaged: manifests, image configs, layers, and digests.
- Distribution Specification — how images move through registries. Reached 1.0 in 2021 by formalizing the Docker Registry HTTP API V2 as an open standard.
The three specs separate concerns the Docker daemon had bundled together: building, distributing, and running an image are now three contracts owned by three standards.
runc 1.0 took until June 2021. Real production deployments had been using pre-1.0 runc for years; the version number was an acknowledgement of stability, not a moment of arrival.
containerd Becomes Its Own Layer: 2015 To 2020
containerd started inside Docker in early 2015 as a refactor that pulled lifecycle and supervision out of dockerd and into a separate daemon. By December 2015 it was a public project; in March 2017 Docker donated it to the Cloud Native Computing Foundation (CNCF), the Linux Foundation sub-project that hosts Kubernetes; CNCF announced graduation on February 28, 2019.
The split mattered because Kubernetes — by then the dominant orchestrator — did not want to depend on Docker the product. containerd 1.0 (December 2017) gave Kubernetes a daemon focused on the work in the middle: pulling and storing image content, snapshot management, container metadata, task supervision, and runtime integration through shims. Docker remained the developer-facing product on top; containerd became reusable infrastructure underneath.
containerd's runtime v2 shim model (containerd 1.2, October 2018) is the boundary that lets a single daemon work with runc, crun (a C reimplementation of runc by Red Hat), gVisor (a Google userspace kernel that intercepts syscalls), Kata Containers (each container in a lightweight VM), Wasm (WebAssembly) runtimes, and Windows host process containers through one ttrpc API. ttrpc is a smaller-footprint variant of gRPC for local Unix-socket transport, designed for the per-container shim use case. Runtime v1 was deprecated in containerd 1.4 (September 2020); current installations are all v2.
Kubernetes And CRI: 2016 To 2022
Kubernetes shipped in 2014 carrying the vocabulary of Borg, Google's internal cluster manager that Kubernetes was modelled after — pods, controllers, schedulers — and an in-tree integration with Docker Engine. By late 2016 the in-tree integration had become a maintenance problem: every container-runtime change required a Kubernetes patch. The Container Runtime Interface (CRI) was the answer. CRI shipped as v1alpha1 in Kubernetes 1.5 (December 2016) as a gRPC API that kubelet calls, and any runtime that implements it can plug into Kubernetes.
CRI is also where the word "runtime" splits in two. From Kubernetes' point of view, a "container runtime" is anything that implements CRI: containerd (via its CRI plugin), CRI-O. From OCI's point of view, the "runtime" is what consumes an OCI bundle: runc, crun, youki, runsc, kata-runtime. Both meanings are correct; the rest of the book uses the precise term.
For several years kubelet talked to Docker through an in-tree adapter called dockershim, which presented Docker Engine as if it were a CRI runtime. dockershim was deprecated in Kubernetes 1.20 (December 2020) and removed in 1.24 (May 3, 2022). The change did not break Docker-built images — those are OCI images, identical to what any other tool produces — but it ended the special case where Kubernetes called Docker. Kubernetes nodes now talk to containerd or CRI-O directly.
Why The Stack Looks Like This
The current shape — kernel primitives, OCI runtime, runtime shim, containerd, CRI plugin, kubelet, Kubernetes API — is the residue of those decisions. Each layer marks a boundary that someone, at some point, had to redraw because the layer above wanted to be replaceable.
- The kernel stayed the kernel. Containers compose its primitives without changing them.
- OCI exists because Docker-defined formats were not the stable contract a multi-vendor ecosystem needed.
- runc exists because the kernel-driving code had to live in a small, language-agnostic binary that did not bundle a registry or a daemon.
- The runtime v2 shim exists because containerd needs to survive its own restarts without killing running workloads.
- containerd exists because Docker's product surface and Kubernetes' runtime needs are different problems.
- CRI exists because Kubernetes wanted to stop tracking individual runtime APIs.
That layering is what the rest of the book inspects. Most of the time the layers are invisible — a developer types kubectl apply -f deploy.yaml and a process starts somewhere — but when something breaks, "where in the stack" is the only question that matters.
"Runtime" Means Two Things
The CRI/OCI split appears at every layer of the stack from chapter 3 onward. It deserves its own definition table.
| Phrase | Meaning | Examples |
|---|---|---|
| CRI runtime | A service kubelet calls over gRPC. Implements RuntimeService and ImageService. | containerd (via its CRI plugin), CRI-O |
| OCI runtime | A program that consumes an OCI bundle and produces a configured process. | runc, crun, youki, runsc, kata-runtime |
| containerd runtime plugin | containerd's configured runtime path; usually a runtime v2 shim binary. | io.containerd.runc.v2, io.containerd.runhcs.v1 |
| Runtime v2 shim | The supervisor process between containerd and the OCI runtime. | containerd-shim-runc-v2, containerd-shim-kata-v2 |
A request from a Kubernetes pod walks all four. kubelet asks the CRI runtime (containerd), which delegates to a runtime v2 shim (containerd-shim-runc-v2), which calls the OCI runtime (runc).
The Layers, From The Top
Intent
A person types docker run nginx, or nerdctl run nginx, or podman run nginx, or applies a Kubernetes Deployment that eventually causes kubelet to ask for a pod. The interfaces differ; the intent is the same — run a process from an image with a particular filesystem, environment, network, and resource policy.
Developer-facing tools name containers, expose ports, attach logs, and hide the runtime machinery. Hiding it does not eliminate it; it means the tool is doing translation. docker run nginx becomes a registry pull, a snapshot prepare, a container record, a task create, a shim launch, a runc invocation, and a clone3(2) call. Everything from the second word onward is what the rest of the book is about.
Docker
Docker is a developer-facing product: a CLI (docker), a daemon (dockerd), and a set of workflows for images, containers, networks, and volumes. Since Docker 1.11 (April 2016), dockerd has not done its own container lifecycle work — it delegates to containerd, and containerd delegates to runc. A docker run command and "a container runtime" are two different things at two different layers; the daemon is the workflow tier, the runtime is what configures the kernel.
Docker is one entry point into the stack. nerdctl is a Docker-compatible CLI for containerd. Podman is a daemonless Docker-compatible CLI that drives runc directly through conmon (container monitor), a small per-container supervisor that plays the same role as containerd's shim. The entry points differ; the layers underneath are the same.
Kubernetes
Kubernetes does not exist to run a single container conveniently. It exists to manage desired state across many machines. A user declares that a workload should exist; controllers and the scheduler decide where it runs; kubelet on each node makes local runtime calls to create pods and containers.
kubelet talks to the runtime over CRI, a gRPC API defined in k8s.io/cri-api. The methods break into a RuntimeService (pod sandbox lifecycle, container lifecycle, exec, attach, port-forward, status, stats, logs) and an ImageService (list, status, pull, remove, filesystem usage). Anything kubelet wants the runtime to do crosses one of those two interfaces.
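As a hedged illustration of what crossing that interface looks like, the sketch below dials a CRI endpoint directly and calls two RuntimeService methods. The socket path is an assumption (containerd's default; CRI-O listens elsewhere), and in a real cluster kubelet — or a debugging tool such as crictl — is the intended client.

```go
// Sketch: call the CRI RuntimeService over containerd's default socket.
package main

import (
	"context"
	"fmt"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
	runtimeapi "k8s.io/cri-api/pkg/apis/runtime/v1"
)

func main() {
	conn, err := grpc.Dial("unix:///run/containerd/containerd.sock",
		grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		panic(err)
	}
	defer conn.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	rt := runtimeapi.NewRuntimeServiceClient(conn)

	// One RuntimeService method: identify the runtime behind the socket.
	ver, err := rt.Version(ctx, &runtimeapi.VersionRequest{})
	if err != nil {
		panic(err)
	}
	fmt.Println("runtime:", ver.RuntimeName, ver.RuntimeVersion)

	// Another: list pod sandboxes, the CRI object kubelet creates per pod.
	pods, err := rt.ListPodSandbox(ctx, &runtimeapi.ListPodSandboxRequest{})
	if err != nil {
		panic(err)
	}
	fmt.Println("sandboxes:", len(pods.Items))
}
```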
A pod is a Kubernetes-only concept; CRI invents a "pod sandbox" to represent it at the runtime layer. The sandbox holds the network namespace, runtime endpoint, labels, and (on Linux) a pause container that pins the namespaces while the pod's workload containers come and go. Chapter 13 walks the full sandbox lifecycle.
containerd
containerd is where image references become content, content becomes a snapshot, and a snapshot plus a spec becomes a running task — three boundaries the rest of the book follows.
It is a daemon with a graph of plugins: a content store for digest-addressed bytes, snapshotters for filesystem state, an image service for name-to-descriptor mapping, a container metadata store, a task service, runtime v2, a CRI plugin, an events service, and a leases service for protecting in-flight work from garbage collection. Clients reach each plugin through gRPC.
containerd also enforces a distinction that cuts through the rest of the book's vocabulary: a container is metadata — image, labels, snapshot key, runtime config, OCI spec — while a task is the live process derived from that metadata. A container can exist without a task. A task can exit while the container record stays. Untangling them is the first step toward understanding what the daemon is actually doing.
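A sketch of that distinction against containerd's Go client (1.x import paths; containerd 2.x moved them under a /v2 module), assuming the default socket and namespace and an existing container record named "demo":

```go
// Sketch: a containerd "container" is metadata you can always load;
// the "task" is the live process, which may or may not exist.
package main

import (
	"context"
	"fmt"

	"github.com/containerd/containerd"
	"github.com/containerd/containerd/errdefs"
	"github.com/containerd/containerd/namespaces"
)

func main() {
	client, err := containerd.New("/run/containerd/containerd.sock")
	if err != nil {
		panic(err)
	}
	defer client.Close()
	ctx := namespaces.WithNamespace(context.Background(), "default")

	// Metadata: exists whether or not anything is running.
	container, err := client.LoadContainer(ctx, "demo")
	if err != nil {
		panic(err)
	}
	info, _ := container.Info(ctx)
	fmt.Println("image:", info.Image, "runtime:", info.Runtime.Name)

	// Live process: absence is a normal state, not an error in the record.
	task, err := container.Task(ctx, nil)
	if errdefs.IsNotFound(err) {
		fmt.Println("no task: the container record exists with no process")
		return
	} else if err != nil {
		panic(err)
	}
	status, _ := task.Status(ctx)
	fmt.Println("task pid:", task.Pid(), "status:", status.Status)
}
```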
Images, Content, And Snapshots
When the user says "run nginx," nothing about that reference is yet usable. The reference is a name; the kernel needs a filesystem.
Producing one takes six steps:
- resolve the reference to a manifest;
- fetch the manifest, image config, and layer blobs from a registry;
- store blobs by digest in the content store;
- unpack the layers;
- ask a snapshotter to prepare a mountable view, plus a writable layer for this container;
- mount that view as the root for the process.
The content store and the snapshotter are deliberately separate concerns. The content store holds digest-addressed image data, immutable and shareable across containers and across images. The snapshotter turns layers into filesystem state. Image pulls can succeed even when unpacking fails; many containers can share one set of immutable layers; and garbage collection has to reason about content, snapshots, containers, and leases at once. Chapter 11 covers the whole pipeline.
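A hedged sketch of those six steps through containerd's Go client follows. Steps 1–4 happen inside a single Pull call; step 5 is the snapshot prepared when the container record is created; step 6 happens only when a task is created (see the final sketch in this chapter). The image reference and IDs are illustrative, and the import paths are containerd 1.x.

```go
// Sketch of steps 1-5 via containerd's Go client.
package main

import (
	"context"
	"fmt"

	"github.com/containerd/containerd"
	"github.com/containerd/containerd/namespaces"
	"github.com/containerd/containerd/oci"
)

func main() {
	client, err := containerd.New("/run/containerd/containerd.sock")
	if err != nil {
		panic(err)
	}
	defer client.Close()
	ctx := namespaces.WithNamespace(context.Background(), "default")

	// Steps 1-4: resolve the name, fetch manifest/config/layers into the
	// content store by digest, and unpack them through the snapshotter.
	image, err := client.Pull(ctx, "docker.io/library/nginx:latest",
		containerd.WithPullUnpack)
	if err != nil {
		panic(err)
	}
	fmt.Println("pulled:", image.Name(), image.Target().Digest)

	// Step 5: prepare a writable snapshot on top of the image's layers,
	// and record the container metadata that points at it.
	container, err := client.NewContainer(ctx, "nginx-demo",
		containerd.WithNewSnapshot("nginx-demo-snapshot", image),
		containerd.WithNewSpec(oci.WithImageConfig(image)),
	)
	if err != nil {
		panic(err)
	}
	defer container.Delete(ctx, containerd.WithSnapshotCleanup)
	fmt.Println("created:", container.ID())
	// Step 6 (mounting the view as the process root) happens when a task
	// is created from this container.
}
```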
Shims
In the runtime v2 model, containerd launches one shim process per container (or per pod sandbox); for the standard Linux runc path, the binary is containerd-shim-runc-v2. The shim owns runtime-specific behavior and the lifecycle of the actual container process: it invokes runc, manages the IO pipes, reaps the process, and reports the exit status back to containerd. On a host running one container, pstree shows the layering:
```
systemd
├── containerd
└── containerd-shim-runc-v2
    └── nginx        ← the workload
```
The shim is not a duplicate of containerd. It is the supervision boundary that lets containerd be restarted without killing running workloads — the shim keeps holding the workload's stdout, exit status, and signal path until the daemon reconnects. Runtime-specific code lives behind the shim v2 task API over ttrpc (a lightweight gRPC variant for local Unix-socket transport), so swapping runc for crun, gVisor, or Kata is a configuration change at the containerd layer, not a code change.
runc
runc is a low-level OCI runtime. Its job is to take an OCI bundle — a directory containing config.json and a root filesystem — and produce a configured process. It does not talk to registries, schedule pods, or host a developer workflow.
config.json describes everything the kernel needs to know: the process to run, environment and working directory, root filesystem path, mounts, namespaces to enter or create, cgroup settings, capabilities, seccomp filter, hooks, and annotations. runc translates that into Linux operations — clone(2) or clone3(2) with the right namespace flags, mount setup, cgroup writes, credential and capability adjustments — and then calls execve(2) to start the configured process.
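The sketch below shows the skeleton of such a document, built with the runtime-spec project's own Go types rather than raw JSON. The field values are illustrative and deliberately incomplete — a real bundle produced by containerd or Podman carries far more (mounts, capabilities, a seccomp profile, cgroup settings).

```go
// Sketch: the skeleton of a config.json expressed with the runtime-spec
// Go types (github.com/opencontainers/runtime-spec). Illustrative only.
package main

import (
	"encoding/json"
	"os"

	specs "github.com/opencontainers/runtime-spec/specs-go"
)

func main() {
	spec := specs.Spec{
		Version: specs.Version,
		Process: &specs.Process{
			Args: []string{"/usr/sbin/nginx", "-g", "daemon off;"},
			Env:  []string{"PATH=/usr/sbin:/usr/bin:/sbin:/bin"},
			Cwd:  "/",
		},
		Root: &specs.Root{Path: "rootfs"}, // relative to the bundle directory
		Linux: &specs.Linux{
			Namespaces: []specs.LinuxNamespace{
				{Type: specs.PIDNamespace},
				{Type: specs.MountNamespace},
				{Type: specs.NetworkNamespace},
				{Type: specs.UTSNamespace},
				{Type: specs.IPCNamespace},
			},
		},
	}
	// runc reads this as <bundle>/config.json, next to the rootfs.
	out, _ := json.MarshalIndent(spec, "", "  ")
	os.Stdout.Write(append(out, '\n'))
}
```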
runc is one OCI runtime. crun is a C reimplementation maintained by Red Hat with lower memory and faster startup. youki is a Rust reimplementation. runsc (gVisor) intercepts syscalls in a userspace kernel for stronger isolation. kata-runtime runs each container inside a lightweight VM. All of them satisfy the same OCI bundle contract.
The Kernel
A container is built by configuring several Linux primitives at once:
- namespaces change what a process can see;
- cgroups account for and limit what it can use;
- mount setup determines what filesystem it sees as /;
- capabilities and seccomp reduce its authority;
- LSMs (AppArmor, SELinux) enforce mandatory access control;
- network namespaces and virtual devices route its traffic.
The kernel does not have a "container" type. It runs processes with credentials, mount tables, and namespace memberships — and the runtime configures those primitives so that, taken together, they behave like the abstraction the layers above promised.
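The cgroup half of that statement is small enough to show directly. The sketch below assumes a cgroup v2 unified hierarchy mounted at /sys/fs/cgroup, root privileges, and the memory and pids controllers enabled for the parent group; a runtime performs the same file writes with more bookkeeping.

```go
// Sketch (Linux, cgroup v2 at /sys/fs/cgroup, run as root): the cgroup
// half of "a container" is directory creation plus a few file writes.
package main

import (
	"os"
	"path/filepath"
	"strconv"
)

func main() {
	cg := "/sys/fs/cgroup/demo"
	if err := os.MkdirAll(cg, 0o755); err != nil {
		panic(err)
	}

	// Limit memory to 256 MiB and cap the group at 64 pids.
	must(os.WriteFile(filepath.Join(cg, "memory.max"), []byte("268435456"), 0o644))
	must(os.WriteFile(filepath.Join(cg, "pids.max"), []byte("64"), 0o644))

	// Move the current process into the group; children inherit membership.
	pid := strconv.Itoa(os.Getpid())
	must(os.WriteFile(filepath.Join(cg, "cgroup.procs"), []byte(pid), 0o644))
}

func must(err error) {
	if err != nil {
		panic(err)
	}
}
```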
What "Container" Means At Each Layer
The same word names different objects at different layers. When two engineers disagree about containers, they are usually pointing at different rows of this table:
| Layer | "A container" is... | The actual object |
|---|---|---|
| Kubernetes | An entry in a Pod spec. | A Container inside PodSpec.containers. |
| CRI | A runtime-level container, scoped to a pod sandbox. | A container ID returned by RuntimeService.CreateContainer. |
| containerd | Persistent metadata. | A row in the container metadata store: ID, image, snapshot key, runtime, OCI spec. |
| Runtime v2 / shim | A supervised task. | A task service ID with an attached shim process and runc state. |
| OCI | A bundle in the created or running state. | A directory with config.json plus a runtime state file. |
| Linux kernel | Nothing. | A process tree with namespace memberships, cgroup placement, mount table, credentials, and security policy. |
Most "is X a container" arguments resolve once both speakers agree which row they mean.
Following One Request Down
The exact calls vary by client, runtime configuration, and host, but the shape of the walk does not: intent at the top, containerd in the middle, a shim and an OCI runtime below it, kernel state at the bottom. The rest of Part I expands two of those layers: chapter 2 zooms in on the kernel-side question — what a container actually is — and chapter 3 covers the contracts that hold the stack together: the OCI runtime spec and the runtime v2 shim API.
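To close the loop on the walk, the sketch below performs the last hop programmatically: it turns an existing containerd container record into a running task, which is the step that launches the runtime v2 shim and invokes runc. It assumes the "nginx-demo" container created in the pull sketch earlier in this chapter, containerd 1.x import paths, and the default socket and namespace.

```go
// Sketch of the last hop: container record -> shim -> runc -> process.
package main

import (
	"context"
	"fmt"

	"github.com/containerd/containerd"
	"github.com/containerd/containerd/cio"
	"github.com/containerd/containerd/namespaces"
)

func main() {
	client, err := containerd.New("/run/containerd/containerd.sock")
	if err != nil {
		panic(err)
	}
	defer client.Close()
	ctx := namespaces.WithNamespace(context.Background(), "default")

	container, err := client.LoadContainer(ctx, "nginx-demo")
	if err != nil {
		panic(err)
	}

	// NewTask launches the shim, which invokes runc, which clones and
	// configures the workload process from the container's OCI spec.
	task, err := container.NewTask(ctx, cio.NewCreator(cio.WithStdio))
	if err != nil {
		panic(err)
	}
	defer task.Delete(ctx)

	exitCh, err := task.Wait(ctx) // subscribe before Start to avoid a race
	if err != nil {
		panic(err)
	}
	if err := task.Start(ctx); err != nil {
		panic(err)
	}
	fmt.Println("workload pid on the host:", task.Pid())

	status := <-exitCh // blocks until the workload exits
	code, _, _ := status.Result()
	fmt.Println("exit code:", code)
}
```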
Sources And Further Reading
- Docker overview: https://docs.docker.com/get-started/docker-overview/
- Docker 0.9 / libcontainer announcement: https://www.docker.com/blog/docker-0-9-introducing-execution-drivers-and-libcontainer/
- Open Container Initiative: https://opencontainers.org/
- OCI Runtime Specification: https://github.com/opencontainers/runtime-spec
- OCI Image Specification: https://github.com/opencontainers/image-spec
- OCI Distribution Specification: https://github.com/opencontainers/distribution-spec
- containerd docs: https://containerd.io/docs/main/
- containerd graduation announcement: https://www.cncf.io/announcements/2019/02/28/cncf-announces-containerd-graduation/
- containerd runtime v2: https://github.com/containerd/containerd/blob/main/docs/runtime-v2.md
- Kubernetes container runtimes: https://kubernetes.io/docs/setup/production-environment/container-runtimes/
- Kubernetes CRI introduction: https://kubernetes.io/blog/2016/12/container-runtime-interface-cri-in-kubernetes/
- Kubernetes dockershim removal FAQ: https://kubernetes.io/blog/2022/02/17/dockershim-faq/
- runc README: https://github.com/opencontainers/runc
- LXC project: https://linuxcontainers.org/lxc/introduction/
- FreeBSD jail(8): https://man.freebsd.org/cgi/man.cgi?query=jail&sektion=8
- Solaris Zones introduction: https://docs.oracle.com/cd/E19044-01/sol.containers/817-1592/zones.intro-1/index.html