Chapter 12: Containers, Tasks, And Shims

In containerd, a container is not a running process. A container is persistent metadata; a task is live execution; a shim is the process boundary that lets containerd supervise that task without becoming the workload's direct parent.

That single distinction is why ctr container ls can list a container while ctr task ls shows nothing for it, and why ctr task rm leaves the container record in place. A container can exist with no task. A task can exit while the container record remains. Deleting a task is not the same operation as deleting the container metadata, and starting a task is not the same operation as creating the container record.

| Word | containerd meaning | Concrete state |
| --- | --- | --- |
| Container | Persistent metadata | ID, OCI spec, image, snapshot key, runtime, labels, extensions, sandbox ID |
| Task | Live execution | PID, status, IO, exec processes, runtime operations |
| Shim | Runtime supervisor | Task service endpoint, runtime calls, IO, exit handling, event publishing |
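The container/task split can be made concrete with a toy model. This is a sketch in plain Go, not containerd's actual types: the `store`, `Container`, and `Task` names here are illustrative stand-ins for the metadata database and task service. It shows a container record existing with no task, and a task delete leaving the record behind.

```go
package main

import (
	"errors"
	"fmt"
)

// Container is persistent metadata: it describes intent, not execution.
type Container struct {
	ID          string
	Image       string
	SnapshotKey string
	Runtime     string
}

// Task is live execution state, keyed by the container it runs for.
type Task struct {
	ContainerID string
	PID         int
	Status      string
}

// store is a toy stand-in for containerd's metadata DB plus task service.
type store struct {
	containers map[string]*Container
	tasks      map[string]*Task
}

func newStore() *store {
	return &store{containers: map[string]*Container{}, tasks: map[string]*Task{}}
}

func (s *store) CreateContainer(c *Container) { s.containers[c.ID] = c }

// NewTask is the only call that crosses from metadata into execution.
func (s *store) NewTask(id string, pid int) (*Task, error) {
	if _, ok := s.containers[id]; !ok {
		return nil, errors.New("no container record: " + id)
	}
	t := &Task{ContainerID: id, PID: pid, Status: "created"}
	s.tasks[id] = t
	return t, nil
}

// DeleteTask removes live state only; the container record survives.
func (s *store) DeleteTask(id string) { delete(s.tasks, id) }

func (s *store) HasContainer(id string) bool { _, ok := s.containers[id]; return ok }
func (s *store) HasTask(id string) bool      { _, ok := s.tasks[id]; return ok }

func main() {
	s := newStore()
	s.CreateContainer(&Container{ID: "redis", Image: "docker.io/library/redis:7", Runtime: "io.containerd.runc.v2"})
	fmt.Println(s.HasContainer("redis"), s.HasTask("redis")) // container exists, no task yet
	s.NewTask("redis", 4242)
	s.DeleteTask("redis")
	fmt.Println(s.HasContainer("redis"), s.HasTask("redis")) // record remains after task delete
}
```

Mapping back to the table: ctr container ls reads the first map, ctr task ls reads the second, and ctr task rm only touches the second.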

The common Linux path is:

```mermaid
flowchart LR
    metadata[Container metadata] --> taskreq[Task create request]
    taskreq --> bundle[Runtime v2 bundle]
    bundle --> shim[containerd-shim-runc-v2]
    shim --> runc[runc]
    runc --> kernel[Linux kernel]
```

Container Metadata

A container record is the durable intent containerd needs later: runtime choice, image name, OCI spec, snapshotter, snapshot key, labels, extensions, and an optional sandbox ID. It is not an init process and it is not a cgroup.

The client-side Container interface reinforces the split. The metadata operations — Info, Spec, Image, Update, Delete — all stay on the metadata side; only NewTask crosses over into live execution. That is why ctr container create can return success while nothing is running on the host: the record exists, the process does not.

Task Creation

Container.NewTask turns a container record into a task service request. It creates or attaches IO, resolves snapshot mounts if the container has a snapshot key, carries runtime options across, and sends a CreateTaskRequest to the task service.

Strip away the supporting code in client/container.go and the call is two lines:

```go
request := &tasks.CreateTaskRequest{ContainerID: c.id}
response, err := c.client.TaskService().Create(ctx, request)
```

The surrounding code builds IO, reads the container spec, asks the snapshotter for mounts, and fills runtime options before the request goes out — but a container record does not become a task until a client calls NewTask.

Starting is a separate boundary again. The client Task interface offers Start, Kill, Pause, Resume, Exec, Pids, Checkpoint, Update, Metrics, Spec, Wait, and Delete. Create prepares runtime state. Start runs the process. The split lets a runtime build a container's disk and IO state, optionally take a checkpoint, and only then begin execution.

Runtime v2 Bundles

Before a shim starts, the runtime v2 task manager builds a bundle directory on disk. A bundle is not an image layer and not part of the content store; it is per-task runtime state, scoped to one container.

NewBundle validates the task ID, creates the namespace-scoped state and work directories, creates a rootfs directory inside the bundle, links the work directory back in, and writes config.json when the request carries an OCI spec. The snapshot mounts from chapter 11 are activated into this same path.

That bundle is the handoff point between containerd metadata and a real OCI runtime. The shim is given enough information to ask runc to create the container without ever consulting the containerd database.

Starting Or Reconnecting To A Shim

The runtime v2 task manager creates the bundle, activates mounts, hands off to the shim manager to start or reconnect to a shim, validates the shim's runtime features, and calls the shim task client's Create.

Runtime names look like Java package paths: io.containerd.runc.v2 is the canonical Linux/runc handler. The shim manager translates that to the binary name containerd-shim-runc-v2, finds it on PATH, executes it with the start action, parses the bootstrap address out of the response, connects over ttrpc (or gRPC, for shims that opt in), and persists the bootstrap data so containerd can reconnect to the same shim later.
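The name-to-binary translation is mechanical enough to sketch. This function mirrors the scheme described above — "containerd-shim-" plus the last two dotted fields of the runtime name — though it is a standalone illustration, not containerd's own code.

```go
package main

import (
	"fmt"
	"strings"
)

// binaryName maps a runtime v2 name such as io.containerd.runc.v2 to
// the shim binary the shim manager looks up on PATH.
func binaryName(runtime string) string {
	parts := strings.Split(runtime, ".")
	if len(parts) < 2 {
		return "" // not a valid runtime v2 name
	}
	return fmt.Sprintf("containerd-shim-%s-%s", parts[len(parts)-2], parts[len(parts)-1])
}

func main() {
	fmt.Println(binaryName("io.containerd.runc.v2")) // containerd-shim-runc-v2
}
```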

This is the daemon-restart survival case from chapter 3, made concrete: the shim keeps supervising the workload across a containerd restart and reattaches when the daemon comes back.

The startup sequence looks like this:

```mermaid
sequenceDiagram
    participant client as client or CRI
    participant cd as containerd
    participant tm as runtime v2 task manager
    participant shim as containerd-shim-runc-v2
    participant runc as runc
    client->>cd: create container metadata
    client->>cd: create task
    cd->>tm: Create(taskID, spec, rootfs)
    tm->>tm: write bundle/config.json
    tm->>shim: exec shim start
    shim-->>tm: bootstrap address
    tm->>shim: TaskService.Create
    shim->>runc: runc create
    client->>cd: start task
    cd->>shim: TaskService.Start
    shim->>runc: runc start
```

Shim Grouping

"One shim per container" is a useful first approximation and not a rule. containerd 2.3 ships shim API version 3, and when a task belongs to a sandbox whose endpoint is reachable, the shim manager connects to the existing sandbox shim instead of starting another binary. If the endpoint is missing or its API version is older than the sandbox protocol, it falls back to launching a fresh shim.

The CRI sandbox path is the most visible place this matters: every container in a pod can share a single sandbox shim. What stays constant is that the shim is the runtime supervision boundary; how many shims there are depends on the runtime and sandbox configuration.

The runc Shim

containerd-shim-runc-v2 is the runtime v2 task service for the standard Linux runc path. Its service object tracks containers and processes, watches OOM events, reaps exits, and publishes lifecycle events back to containerd.

On Create, the shim runs its runc container creation path: it converts the task request's mounts into process mounts, mounts the rootfs into the bundle's rootfs directory, writes runtime options, constructs the init process, and prepares a runc create invocation that carries the runtime root, bundle path, containerd namespace, runc binary name, and systemd-cgroup setting.

On Start, the shim calls the container's Start method and publishes a task-start or exec-start event. When the process exits, the shim picks the exit up through its wait handling and publishes a task-exit event. containerd never had to be the workload's direct parent to learn it died.
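That supervision pattern — the shim parents the workload and forwards its death, so containerd never has to — can be sketched with channels. The event struct and topic string here are toy stand-ins, not containerd's protobuf event types.

```go
package main

import (
	"fmt"
	"time"
)

// event is a toy stand-in for the shim's published lifecycle events.
type event struct {
	Topic string
	ID    string
	Code  int
}

// superviseExit models the shim's wait handling: block until the
// process exits, then publish a task-exit event back to containerd.
func superviseExit(id string, exited <-chan int, events chan<- event) {
	code := <-exited // the shim is the parent and reaps the exit
	events <- event{Topic: "/tasks/exit", ID: id, Code: code}
}

func main() {
	exited := make(chan int, 1)
	events := make(chan event, 1)
	go superviseExit("redis", exited, events)

	// The workload dies; containerd learns of it only through the
	// published event, never by being the direct parent.
	go func() {
		time.Sleep(10 * time.Millisecond)
		exited <- 137
	}()

	ev := <-events
	fmt.Printf("%s %s %d\n", ev.Topic, ev.ID, ev.Code)
}
```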

The snapshotter produced mounts in chapter 11; the task manager dropped them into the bundle path; the runc shim mounted the rootfs and handed runc an OCI bundle to create and start.

Events And Waiting

Task lifecycle is not only request and response. Callers can Wait, subscribe to events, or inspect status; runtime v2 shims publish task create, start, exit, delete, pause, resume, OOM, and exec events with defined ordering.

A Start can succeed and the process can exit a millisecond later. A client that only records the response from Start has observed the request, not the lifecycle. CRI, Docker, and nerdctl all rely on the wait and event paths to keep their own state honest.
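The defensive pattern is to subscribe before starting. The channel-based `proc` below is a toy model of that ordering, not the containerd client API: Wait is registered first, so even a process that exits immediately after Start is still observed.

```go
package main

import "fmt"

type proc struct {
	exit chan int
}

// Wait returns a channel that delivers the exit status exactly once.
// Clients register it before Start so a fast exit is never missed.
func (p *proc) Wait() <-chan int { return p.exit }

// Start runs the process; here it exits instantly with status 0,
// modeling the "Start succeeds, process dies a millisecond later" case.
func (p *proc) Start() {
	go func() { p.exit <- 0 }()
}

func main() {
	p := &proc{exit: make(chan int, 1)}
	status := p.Wait() // subscribe first
	p.Start()          // Start returning is the request, not the lifecycle
	fmt.Println("exited with", <-status)
}
```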

Where This Goes

Chapter 13 stacks Kubernetes on top of all of that. Kubelet does not call runc — it calls CRI, and the CRI plugin builds sandboxes and containers in containerd that ride the same task-and-shim machinery this chapter just walked through.

Sources And Further Reading