Chapter 10: containerd Architecture

containerd is a daemon that hosts a plugin graph: content store, snapshotters, metadata, runtime v2, CRI, transfer, events, and services are all plugins, and clients reach each of them through a gRPC API.

To do that job it has to keep several kinds of state coordinated without collapsing them into one object: image references and descriptors, immutable content blobs, unpacked filesystem snapshots, persistent container metadata, live tasks, runtime shims, leases that protect work from garbage collection, and events that report what changed.

Those nouns carry weight for the rest of Part IV. A containerd container is a metadata object. A task is live execution. A snapshot is filesystem state managed by a snapshotter. Content is digest-addressed bytes, not a mounted root filesystem.

flowchart TB
    daemon[containerd daemon] --> plugins[Plugin graph]
    plugins --> metadata[Metadata plugin]
    metadata --> content[Content store]
    metadata --> snapshots[Snapshotters]
    metadata --> containers[Container store]
    plugins --> runtimev2[Runtime v2 task manager]
    runtimev2 --> shims[Runtime shims]
    plugins --> cri[CRI plugin]
    cri --> kubelet[kubelet]

The Daemon As Plugin Host

The containerd daemon loads each configured plugin at startup, passing in a context with the root and state directories, daemon addresses, and the results of plugins it depends on. The startup loop in cmd/containerd/server/server.go is three lines of substance:

result := p.Init(initContext)         // run the plugin's init with its dependencies' results
instance, err := result.Instance()    // surface the init error, if any
s.plugins = append(s.plugins, result) // keep the result so later plugins can depend on it

Required plugins are tracked, so a missing core dependency fails startup instead of producing a half-wired daemon. Content, snapshotters, metadata, runtime v2, CRI, transfer, events, and services all show up as nodes in this graph.

The metadata plugin is the best anchor. It depends on content, events, and snapshots; at startup it opens meta.db, collects registered snapshotters, and builds a metadata database over the content store and snapshotter map. Images, containers, snapshots, and content can then be reasoned about together without being forced into one physical backend.

The metadata database stores names, records, labels, relationships, and namespace-scoped state. The content store stores blobs. Snapshotters own filesystem snapshot state. On disk under /var/lib/containerd/, the three stores live side by side:

io.containerd.metadata.v1.bolt/meta.db
io.containerd.content.v1.content/blobs/sha256/
io.containerd.snapshotter.v1.overlayfs/

Namespaces Are Metadata Partitions

containerd namespaces are not Linux namespaces. They do not call unshare(2) or clone(2) and they do not create process, mount, or network namespaces. They partition containerd's own metadata so one daemon can serve multiple consumers without mixing images, containers, leases, and snapshots.

CRI uses the k8s.io namespace. ctr defaults to default. Other clients pick their own. A single daemon can hold Kubernetes-managed objects and ctr-created objects at the same time, with each client seeing only the namespace it asks for.

A Linux PID namespace changes what processes can see. A containerd namespace changes where records are stored and looked up inside containerd. The two travel together by convention but not by mechanism: a process can run with no new Linux namespace and still belong to a containerd namespace, and a process can run inside many Linux namespaces while its metadata lives under k8s.io.

The Smart Client Model

containerd deliberately leaves higher-level work to clients. The plugin documentation calls this a smart-client model: if a step does not need to live in the daemon, the client does it before asking a service to do anything.

ctr, nerdctl, Docker, and CRI each resolve image names, choose snapshotters, build an OCI spec, attach labels, select runtime options, and pick a namespace before calling containerd. The same daemon serves all of them; it never sees how they differ.

The CRI plugin is a special case in that it runs inside the daemon, but it follows the same model: it translates Kubernetes CRI calls into containerd service calls and runtime choices. CRI is not a second runtime below containerd.

Core Services

The useful axis for learning containerd services is lifecycle responsibility:

Service      Responsibility
Content      Store immutable blobs and active ingestions by digest.
Images       Map names to OCI descriptor targets and keep image metadata.
Snapshots    Create active, view, and committed filesystem snapshots.
Containers   Store persistent container metadata: spec, image, runtime, snapshot key, labels, extensions.
Tasks        Manage live execution through runtime plugins and shims.
Leases       Protect content, snapshots, and metadata while work is in progress.
Events       Publish daemon and runtime lifecycle events.

The rows compose: image metadata points at content; unpacking content creates snapshots; a container record points at an image and a snapshot key; a task uses the container record to start live execution through runtime v2; a lease holds intermediate objects together while a pull or unpack is in flight; events tell observers what happened after requests return.

Keeping those responsibilities separate is what lets a single daemon serve several modes at once. An image can be pulled but not unpacked. A container can exist with no running task. A task can exit while the container metadata stays. A snapshot can outlive the image name that originally produced it, because a container or lease still references the chain.

Runtime v2 In The Architecture

Runtime v2 is the boundary between containerd's task service and runtime-specific process supervision. containerd starts or reconnects to a shim and talks to that shim over the runtime v2 task API. The common Linux path is io.containerd.runc.v2, implemented by the containerd-shim-runc-v2 binary, which in turn drives runc.

The daemon never embeds the details of any specific runtime. The shim owns the runtime-specific work — invoking the OCI runtime, tracking exits, handling IO, publishing task events. That is why containerd can be restarted while existing tasks keep running under their shims, and why a different runtime can be slotted in by configuration alone.

The same boundary is why containerd has both synchronous calls and asynchronous events. A client calls Start and gets a response immediately, but the task's later exit arrives as an event. Runtime v2 defines task create, start, exit, delete, pause, resume, checkpoint, OOM, and exec events with ordering guarantees. A caller that only reads request responses misses task exits, OOM events, and any state change that happens after the synchronous call returns.

Where CRI Fits

The CRI plugin is a gRPC plugin inside containerd. It registers Kubernetes RuntimeService and ImageService servers on containerd's gRPC server and maps kubelet requests onto containerd services.

Kubelet gets a Kubernetes-shaped API and never has to know about containerd's image store, snapshotters, task service, leases, event monitor, or runtime v2 shims. It asks for a pod sandbox or a container start. The CRI plugin turns that into namespace-scoped containerd operations under k8s.io.

Four contracts stack on top of each other in the runtime path:

  1. CRI sits between kubelet and containerd.
  2. containerd services sit inside the daemon's own API surface.
  3. Runtime v2 sits between containerd and the shim.
  4. The OCI runtime spec sits between the shim and a runtime such as runc.

Each layer has its own vocabulary, and a word from one layer rarely means the same thing in another. The CRI runtime vs OCI runtime distinction from chapter 1 is the standing example.

Where This Goes

The rest of Part IV walks one object at a time across this graph. Chapter 11 takes image bytes from a reference into the content store and out through a snapshotter. Chapter 12 turns a container record into a task and a shim. Chapter 13 enters from the kubelet side and lands in the same containerd services.

Sources And Further Reading