Chapter 3: Standards — OCI and Runtime v2

flowchart TB
    kubelet[kubelet] -->|CRI gRPC| containerd[containerd]
    containerd -->|exec + ttrpc Task service| shim[Runtime v2 shim]
    shim -->|OCI bundle on disk + lifecycle CLI| runtime[OCI runtime: runc, crun, ...]
    runtime --> kernel[Linux kernel]

Each arrow is a different contract. CRI is owned by Kubernetes. Runtime v2 is owned by containerd. The OCI runtime spec is owned by the OCI. They were designed at different times by different groups.

What OCI Is

The Open Container Initiative is a Linux Foundation project formed at DockerCon on June 22, 2015. Its mandate is to publish open specifications for container formats and runtimes. The point is interoperability: a runtime, a registry, or a builder should be replaceable without rewriting everything above it.

OCI publishes three specifications, each versioned independently:

  - The Runtime Specification (runtime-spec): how to configure and run a container from a filesystem bundle.
  - The Image Specification (image-spec): the layout of an image: manifests, layers, and configuration.
  - The Distribution Specification (distribution-spec): the HTTP API for pushing and pulling content from a registry.

The Runtime and Image specs were the original two. The Distribution spec, derived from Docker's Registry HTTP API V2, reached 1.0 in 2020 and turned an existing de facto standard into an open one.

This chapter focuses on the Runtime Specification.

The OCI Runtime Specification

The runtime spec defines two artifacts and a small lifecycle.

The Bundle

A bundle is a directory on disk containing two things: a config.json file and a root filesystem the runtime will use as /.

config.json describes the desired environment in fields that map to the chapter 2 model:

  - process: the command to run, its environment, user, capabilities, rlimits, and noNewPrivileges.
  - root: the rootfs directory and whether it is mounted read-only.
  - mounts: the filesystems assembled inside the container.
  - linux.namespaces: which namespaces to create or join.
  - linux.resources: cgroup limits.
  - hooks: lifecycle callbacks, covered below.

ociVersion pins the spec version the config targets; runc rejects bundles whose major version it does not recognize.

A trimmed but realistic config.json — the kind of file runc spec generates and the shim hands runc — looks like this:

{
  "ociVersion": "1.2.0",
  "process": {
    "terminal": false,
    "user": { "uid": 0, "gid": 0 },
    "args": ["/bin/sh"],
    "env": [
      "PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin",
      "TERM=xterm"
    ],
    "cwd": "/",
    "capabilities": {
      "bounding":  ["CAP_AUDIT_WRITE", "CAP_KILL", "CAP_NET_BIND_SERVICE"],
      "effective": ["CAP_AUDIT_WRITE", "CAP_KILL", "CAP_NET_BIND_SERVICE"],
      "permitted": ["CAP_AUDIT_WRITE", "CAP_KILL", "CAP_NET_BIND_SERVICE"]
    },
    "rlimits": [
      { "type": "RLIMIT_NOFILE", "hard": 1024, "soft": 1024 }
    ],
    "noNewPrivileges": true
  },
  "root": { "path": "rootfs", "readonly": true },
  "hostname": "runc",
  "mounts": [
    { "destination": "/proc", "type": "proc", "source": "proc" },
    { "destination": "/dev",  "type": "tmpfs", "source": "tmpfs",
      "options": ["nosuid", "strictatime", "mode=755", "size=65536k"] },
    { "destination": "/sys",  "type": "sysfs", "source": "sysfs",
      "options": ["nosuid", "noexec", "nodev", "ro"] }
  ],
  "linux": {
    "namespaces": [
      { "type": "pid" }, { "type": "network" }, { "type": "ipc" },
      { "type": "uts" }, { "type": "mount" }, { "type": "user" }
    ],
    "uidMappings": [{ "containerID": 0, "hostID": 100000, "size": 65536 }],
    "gidMappings": [{ "containerID": 0, "hostID": 100000, "size": 65536 }],
    "resources": {
      "memory": { "limit": 268435456 },
      "cpu":    { "shares": 512, "quota": 50000, "period": 100000 },
      "devices": [{ "allow": false, "access": "rwm" }]
    },
    "maskedPaths":  ["/proc/kcore", "/proc/keys", "/sys/firmware"],
    "readonlyPaths": ["/proc/asound", "/proc/bus", "/proc/sys"],
    "seccomp": { "defaultAction": "SCMP_ACT_ERRNO" }
  }
}

Every chapter-2 concept has a slot here: namespaces under linux.namespaces, cgroup limits under linux.resources, the rootfs under root, capabilities and noNewPrivileges under process, mounts and masked paths in their own sections. Generate a fully-populated default in any directory with runc spec, then edit it down — that is the path most runtime work takes.

The Lifecycle

A compliant OCI runtime exposes five lifecycle commands:

  - create <id>: read the bundle and set up namespaces, cgroups, and mounts, without running the user process.
  - start <id>: execute the process described in config.json inside the prepared environment.
  - state <id>: report the container's status (created, running, stopped) as JSON.
  - kill <id> <signal>: deliver a signal to the container process.
  - delete <id>: release the resources held by a stopped container.

create and start are split so the caller can do work between resource setup and process execution: attach IO, set up a network namespace from outside, run hooks. That is also why containerd's shim uses the two-phase form rather than runc run.

Hooks

Hooks let external code run at fixed points in the container's lifecycle. The current spec defines five:

  - createRuntime: on the host, after the container environment is created but before pivot_root.
  - createContainer: like createRuntime, but executed inside the container's mount namespace.
  - startContainer: inside the container, immediately before the user process runs.
  - poststart: on the host, after the user process has started.
  - poststop: on the host, after the container is deleted.

There is also a prestart hook that was deprecated in spec 1.0.2 in favor of the more granular hooks above; runtimes still support it for compatibility. Hooks are how networking, GPU device injection, and a lot of runtime extension behavior get wired in without modifying the runtime itself.
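Wired into config.json, a hooks section looks like this. The paths and arguments are hypothetical; any executable on the runtime host works, and createRuntime is where CNI-style network setup typically lands because namespaces exist but the user process has not started:

```json
"hooks": {
  "createRuntime": [
    { "path": "/usr/local/bin/setup-net", "args": ["setup-net", "--id", "demo"] }
  ],
  "poststop": [
    { "path": "/usr/local/bin/teardown-net" }
  ]
}
```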

OCI Runtimes In Practice

An OCI runtime is anything that consumes a valid bundle and implements those lifecycle commands. The interchangeability is the point: the layers above the runtime — containerd, Docker, Kubernetes — pick a runtime by configuration, not by code.

The common implementations:

  - runc: the reference implementation, written in Go; the default under containerd and Docker.
  - crun: a C implementation from the Podman ecosystem, smaller and faster to start.
  - Kata Containers: runs each pod or container inside a lightweight VM while presenting an OCI-compatible surface.
  - gVisor (runsc): intercepts syscalls with a user-space kernel for stronger isolation.
  - youki: a Rust implementation.

Containerd's Runtime v2

containerd does not exec the OCI runtime directly. It launches a shim per task, and the shim drives the runtime on its behalf. This is the runtime v2 model, defined in docs/runtime-v2.md in the containerd repo.

Why The Shim Exists

If containerd were the direct parent of every container process, three things would break:

  1. Restarting containerd would kill containers. When the parent dies, the children either die with it or are reparented to init.
  2. IO and exit handling would leak into containerd. Stdin/stdout pipes, exit notification, and signal forwarding are runtime-specific.
  3. Runtime swaps would require daemon changes. With the shim contract, swapping runc for crun is a configuration change.

v1 Versus v2

The original v1 shim was defined inside containerd and predated a stable wire protocol. Runtime v2 is a versioned API: the shim binary speaks a defined ttrpc Task service, and any runtime can be wrapped behind it without touching containerd. v2 also supports a single shim serving multiple containers — typically one shim per pod sandbox in CRI mode — which cuts per-container overhead in Kubernetes.

containerd 1.0 (2017) shipped v1; 1.4 (2020) deprecated it; v2 is what every current installation uses.

Naming

A v2 shim is identified by a name like io.containerd.runc.v2. The binary name is derived from the last two dot-separated components, containerd-shim-<name>-<version>, so io.containerd.runc.v2 resolves to containerd-shim-runc-v2. containerd resolves that binary on $PATH when starting a task.

The shims you are likely to see:

| Runtime name | Binary | Purpose |
| --- | --- | --- |
| io.containerd.runc.v2 | containerd-shim-runc-v2 | Standard Linux runc path |
| io.containerd.runhcs.v1 | containerd-shim-runhcs-v1 | Windows HCS |
| Vendor-specific | containerd-shim-kata-v2, containerd-shim-runsc-v1, etc. | Kata, gVisor, Firecracker |

The Wire Protocols

The arrows in this chapter's opening diagram each carry a different protocol, and there is one return channel the diagram does not show: events flowing from the shim back to containerd. Walking from top to bottom:

CRI gRPC: kubelet → containerd. The kubelet calls containerd over a Unix domain socket — /run/containerd/containerd.sock by default — using the RuntimeService and ImageService definitions in cri-api. These are full gRPC over HTTP/2: requests like RunPodSandbox, CreateContainer, and StartContainer are unary, while Attach, Exec, and PortForward use streaming. The kubelet picks the socket up from --container-runtime-endpoint; the path is the only handshake.

ttrpc Task service: containerd → shim. Once a task exists, containerd talks to its shim over ttrpc — a slimmer gRPC variant maintained at github.com/containerd/ttrpc. The protobuf definitions are gRPC-shaped, but the wire skips HTTP/2: framed protobuf over a Unix socket, no streaming, smaller binary, smaller memory. ttrpc fits here because shims are cheap and there is one per workload (or per pod) — a full HTTP/2 stack per shim would cost more than the shim itself. The Task service the shim implements covers the operations a daemon needs:

  - Create, Start, Delete: the container lifecycle, mirroring the OCI verbs.
  - Exec, ResizePty, CloseIO: additional processes and terminal plumbing.
  - Kill, Pause, Resume: signal delivery and cgroup freezer control.
  - State, Wait, Pids, Stats: status, exit notification, and introspection.
  - Connect, Shutdown: managing the shim process itself.

When the shim is sandbox-aware (i.e. it can host a Kubernetes pod), a separate Sandbox service sits alongside Task.

Events: shim → containerd, via a publish binary. Containers produce asynchronous events — TaskExit, TaskOOM, TaskCreate — that the shim has to push back to containerd. Rather than open a second long-lived ttrpc connection, the shim invokes a small binary it was given at startup. containerd execs the shim with -publish-binary /usr/local/bin/containerd and -address <main socket>; the shim runs containerd publish --topic=/tasks/exit --namespace=<ns> for each event, with a serialized protobuf envelope on stdin. The publish binary is a thin client that forwards the event over ttrpc to containerd's Events service and exits. From the shim's perspective, an event is one fork/exec.

OCI runtime CLI: shim → runc. The shim does not link runc as a library. It writes the bundle to disk and invokes runc as a subprocess for each lifecycle step: runc create, runc start, runc kill, runc delete. The "protocol" is the CLI — arguments and exit codes — with several out-of-band channels:

  - --pid-file: runc writes the PID of the container's init process so the shim can adopt and monitor it.
  - --console-socket: when process.terminal is true, runc passes the PTY master fd back to the shim over this Unix socket (an SCM_RIGHTS transfer).
  - stdio: the shim creates the FIFOs or pipes and hands them to runc as the process's stdin, stdout, and stderr.
  - exit codes: each runc invocation's exit status tells the shim whether the step succeeded.

Part III walks through the runc side of these calls in detail.

The Startup Handshake

containerd starts a shim by execing its binary with the subcommand start and a fixed set of flags: -namespace (a containerd namespace, not a Linux one), -id (the task id), -address (containerd's main ttrpc socket), and -publish-binary (the events client). The shim then:

  1. Creates its own ttrpc socket — a Unix domain socket under containerd's state directory, e.g. /run/containerd/s/<random>.
  2. Forks itself; the child runs the ttrpc server, the parent prints the socket address on stdout and exits.
  3. containerd reads the address, dials it, and from then on drives the Task service over ttrpc.

The shim process — not containerd — is the parent of the container's init process. Containerd holds a ttrpc connection to the shim, the shim holds the runc invocations and the init PID, and runc has already exited by the time Start returns. If containerd restarts, the shim keeps running and keeps holding the container; on reconnect, containerd dials the existing socket and resumes management without restarting the workload.

Part III opens up the OCI runtime spec and runc. Part IV opens up containerd, which is where CRI on top and runtime v2 underneath both meet.

Sources And Further Reading