Chapter 3: Standards — OCI and Runtime v2

flowchart TB
    kubelet[kubelet] -->|CRI gRPC| containerd[containerd]
    containerd -->|exec + ttrpc Task service| shim[Runtime v2 shim]
    shim -->|OCI bundle on disk + lifecycle CLI| runtime[OCI runtime: runc, crun, ...]
    runtime --> kernel[Linux kernel]

Each arrow is a different contract. CRI is owned by Kubernetes. Runtime v2 is owned by containerd. The OCI runtime spec is owned by the OCI. They were designed at different times by different groups.

What OCI Is

The Open Container Initiative is a Linux Foundation project formed at DockerCon on June 22, 2015. Its mandate is to publish open specifications for container formats and runtimes. The point is interoperability: a runtime, a registry, or a builder should be replaceable without rewriting everything above it.

OCI publishes three specifications, each versioned independently:

  - The Runtime Specification (runtime-spec): how to configure and run a container from a filesystem bundle.
  - The Image Specification (image-spec): the layout of an image: manifests, layers, and configuration.
  - The Distribution Specification (distribution-spec): the HTTP API for pushing and pulling content from a registry.

The Runtime and Image specs were the original two. The Distribution spec, derived from Docker's Registry HTTP API V2, reached 1.0 in 2020 and turned an existing de facto standard into an open one.

This chapter focuses on the Runtime Specification.

The OCI Runtime Specification

The runtime spec defines two artifacts and a small lifecycle.

The Bundle

A bundle is a directory on disk containing two things: a config.json file and a root filesystem the runtime will use as /.

config.json describes the desired environment in fields that map to the chapter 2 model:

  - process: the command to run, its environment, user, capabilities, rlimits, and noNewPrivileges.
  - root: the rootfs directory and whether it is mounted read-only.
  - mounts: the filesystems assembled inside the container.
  - linux.namespaces: which namespaces to create or join.
  - linux.resources: cgroup limits.
  - hooks: lifecycle callbacks, covered below.

ociVersion pins the spec version the config targets; runc rejects bundles whose major version it does not recognize.

A trimmed but realistic config.json — the kind of file runc spec generates and the shim hands runc — looks like this:

{
  "ociVersion": "1.2.0",
  "process": {
    "terminal": false,
    "user": { "uid": 0, "gid": 0 },
    "args": ["/bin/sh"],
    "env": [
      "PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin",
      "TERM=xterm"
    ],
    "cwd": "/",
    "capabilities": {
      "bounding":  ["CAP_AUDIT_WRITE", "CAP_KILL", "CAP_NET_BIND_SERVICE"],
      "effective": ["CAP_AUDIT_WRITE", "CAP_KILL", "CAP_NET_BIND_SERVICE"],
      "permitted": ["CAP_AUDIT_WRITE", "CAP_KILL", "CAP_NET_BIND_SERVICE"]
    },
    "rlimits": [
      { "type": "RLIMIT_NOFILE", "hard": 1024, "soft": 1024 }
    ],
    "noNewPrivileges": true
  },
  "root": { "path": "rootfs", "readonly": true },
  "hostname": "runc",
  "mounts": [
    { "destination": "/proc", "type": "proc", "source": "proc" },
    { "destination": "/dev",  "type": "tmpfs", "source": "tmpfs",
      "options": ["nosuid", "strictatime", "mode=755", "size=65536k"] },
    { "destination": "/sys",  "type": "sysfs", "source": "sysfs",
      "options": ["nosuid", "noexec", "nodev", "ro"] }
  ],
  "linux": {
    "namespaces": [
      { "type": "pid" }, { "type": "network" }, { "type": "ipc" },
      { "type": "uts" }, { "type": "mount" }, { "type": "user" }
    ],
    "uidMappings": [{ "containerID": 0, "hostID": 100000, "size": 65536 }],
    "gidMappings": [{ "containerID": 0, "hostID": 100000, "size": 65536 }],
    "resources": {
      "memory": { "limit": 268435456 },
      "cpu":    { "shares": 512, "quota": 50000, "period": 100000 },
      "devices": [{ "allow": false, "access": "rwm" }]
    },
    "maskedPaths":  ["/proc/kcore", "/proc/keys", "/sys/firmware"],
    "readonlyPaths": ["/proc/asound", "/proc/bus", "/proc/sys"],
    "seccomp": { "defaultAction": "SCMP_ACT_ERRNO" }
  }
}

Every chapter-2 concept has a slot here: namespaces under linux.namespaces, cgroup limits under linux.resources, the rootfs under root, capabilities and noNewPrivileges under process, mounts and masked paths in their own sections. Generate a fully-populated default in any directory with runc spec, then edit it down — that is the path most runtime work takes.

The Lifecycle

A compliant OCI runtime exposes five lifecycle commands:

  - create <id>: read the bundle and set up namespaces, cgroups, and mounts, without running the user process.
  - start <id>: execute the process described in config.json inside the prepared environment.
  - state <id>: report the container's status (created, running, stopped) as JSON.
  - kill <id> <signal>: deliver a signal to the container process.
  - delete <id>: release the resources held by a stopped container.

create and start are split so the caller can do work between resource setup and process execution: attach IO, set up a network namespace from outside, run hooks. That is also why containerd's shim uses the two-phase form rather than runc run.

Hooks

Hooks let external code run at fixed points in the container's lifecycle. The current spec defines five:

  - createRuntime: on the host, after the container environment is created but before pivot_root.
  - createContainer: like createRuntime, but executed inside the container's mount namespace.
  - startContainer: inside the container, immediately before the user process runs.
  - poststart: on the host, after the user process has started.
  - poststop: on the host, after the container is deleted.

There is also a prestart hook that was deprecated in spec 1.0.2 in favor of the more granular hooks above; runtimes still support it for compatibility. Hooks are how networking, GPU device injection, and a lot of runtime extension behavior get wired in without modifying the runtime itself.
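Wired into config.json, a hooks section looks like this. The paths and arguments are hypothetical; any executable on the runtime host works, and createRuntime is where CNI-style network setup typically lands because namespaces exist but the user process has not started:

```json
"hooks": {
  "createRuntime": [
    { "path": "/usr/local/bin/setup-net", "args": ["setup-net", "--id", "demo"] }
  ],
  "poststop": [
    { "path": "/usr/local/bin/teardown-net" }
  ]
}
```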

OCI Runtimes In Practice

An OCI runtime is anything that consumes a valid bundle and implements those lifecycle commands. The interchangeability is the point: the layers above the runtime — containerd, Docker, Kubernetes — pick a runtime by configuration, not by code.

The common implementations:

  - runc: the reference implementation, written in Go; the default under containerd and Docker.
  - crun: a C implementation from the Podman ecosystem, smaller and faster to start.
  - Kata Containers: runs each pod or container inside a lightweight VM while presenting an OCI-compatible surface.
  - gVisor (runsc): intercepts syscalls with a user-space kernel for stronger isolation.
  - youki: a Rust implementation.

Containerd's Runtime v2

containerd does not exec the OCI runtime directly. It launches a shim per task, and the shim drives the runtime on its behalf. This is the runtime v2 model, defined in docs/runtime-v2.md in the containerd repo.

Why The Shim Exists

If containerd were the direct parent of every container process, three things would break:

  1. Restarting containerd would kill containers. When the parent dies, the children either die with it or are reparented to init.
  2. IO and exit handling would leak into containerd. Stdin/stdout pipes, exit notification, and signal forwarding are runtime-specific.
  3. Runtime swaps would require daemon changes. With the shim contract, swapping runc for crun is a configuration change.

v1 Versus v2

The original v1 shim was defined inside containerd and predated a stable wire protocol. Runtime v2 is a versioned API: the shim binary speaks a defined ttrpc Task service, and any runtime can be wrapped behind it without touching containerd. v2 also supports a single shim serving multiple containers — typically one shim per pod sandbox in CRI mode — which cuts per-container overhead in Kubernetes.

containerd 1.0 (2017) shipped v1; 1.4 (2020) deprecated it; v2 is what every current installation uses.

Naming

A v2 shim is identified by a name like io.containerd.runc.v2. The binary name is derived from the last two dot-separated components, containerd-shim-<name>-<version>, so io.containerd.runc.v2 resolves to containerd-shim-runc-v2. containerd resolves that binary on $PATH when starting a task.

The shims you are likely to see:

| Runtime name | Binary | Purpose |
| --- | --- | --- |
| io.containerd.runc.v2 | containerd-shim-runc-v2 | Standard Linux runc path |
| io.containerd.runhcs.v1 | containerd-shim-runhcs-v1 | Windows HCS |
| Vendor-specific | containerd-shim-kata-v2, containerd-shim-runsc-v1, etc. | Kata, gVisor, Firecracker |

The Wire Protocols

The arrows in this chapter's opening diagram each carry a different protocol, and there is one return channel the diagram does not show: events flowing from the shim back to containerd. Walking from top to bottom:

CRI gRPC: kubelet → containerd. The kubelet calls containerd over a Unix domain socket — /run/containerd/containerd.sock by default — using the RuntimeService and ImageService definitions in cri-api. These are full gRPC over HTTP/2: requests like RunPodSandbox, CreateContainer, and StartContainer are unary, while Attach, Exec, and PortForward use streaming. The kubelet picks the socket up from --container-runtime-endpoint; the path is the only handshake.

ttrpc Task service: containerd → shim. Once a task exists, containerd talks to its shim over ttrpc — a slimmer gRPC variant maintained at github.com/containerd/ttrpc. The protobuf definitions are gRPC-shaped, but the wire skips HTTP/2: framed protobuf over a Unix socket, no streaming, smaller binary, smaller memory. ttrpc fits here because shims are cheap and there is one per workload (or per pod) — a full HTTP/2 stack per shim would cost more than the shim itself. The Task service the shim implements covers the operations a daemon needs:

  - Create, Start, Delete: the container lifecycle, mirroring the OCI verbs.
  - Exec, ResizePty, CloseIO: additional processes and terminal plumbing.
  - Kill, Pause, Resume: signal delivery and cgroup freezer control.
  - State, Wait, Pids, Stats: status, exit notification, and introspection.
  - Connect, Shutdown: managing the shim process itself.

When the shim is sandbox-aware (i.e. it can host a Kubernetes pod), a separate Sandbox service sits alongside Task.

Events: shim → containerd, via a publish binary. Containers produce asynchronous events — TaskExit, TaskOOM, TaskCreate — that the shim has to push back to containerd. Rather than open a second long-lived ttrpc connection, the shim invokes a small binary it was given at startup. containerd execs the shim with -publish-binary /usr/local/bin/containerd and -address <main socket>; the shim runs containerd publish --topic=/tasks/exit --namespace=<ns> for each event, with a serialized protobuf envelope on stdin. The publish binary is a thin client that forwards the event over ttrpc to containerd's Events service and exits. From the shim's perspective, an event is one fork/exec.

OCI runtime CLI: shim → runc. The shim does not link runc as a library. It writes the bundle to disk and invokes runc as a subprocess for each lifecycle step: runc create, runc start, runc kill, runc delete. The "protocol" is the CLI — arguments and exit codes — with several out-of-band channels:

  - --pid-file: runc writes the PID of the container's init process so the shim can adopt and monitor it.
  - --console-socket: when process.terminal is true, runc passes the PTY master fd back to the shim over this Unix socket (an SCM_RIGHTS transfer).
  - stdio: the shim creates the FIFOs or pipes and hands them to runc as the process's stdin, stdout, and stderr.
  - exit codes: each runc invocation's exit status tells the shim whether the step succeeded.

Part III walks through the runc side of these calls in detail.

The Startup Handshake

containerd starts a shim by execing its binary with the subcommand start and a fixed set of flags: -namespace (a containerd namespace, not a Linux one), -id (the task id), -address (containerd's main ttrpc socket), and -publish-binary (the events client). The shim then:

  1. Creates its own ttrpc socket — a Unix domain socket under containerd's state directory, e.g. /run/containerd/s/<random>.
  2. Forks itself; the child runs the ttrpc server, the parent prints the socket address on stdout and exits.
  3. containerd reads the address, dials it, and from then on drives the Task service over ttrpc.

The shim process — not containerd — is the parent of the container's init process. Containerd holds a ttrpc connection to the shim, the shim holds the runc invocations and the init PID, and runc has already exited by the time Start returns. If containerd restarts, the shim keeps running and keeps holding the container; on reconnect, containerd dials the existing socket and resumes management without restarting the workload.

Part III opens up the OCI runtime spec and runc. Part IV opens up containerd, which is where CRI on top and runtime v2 underneath both meet.

Sources And Further Reading