Chapter 9: runc Lifecycle

runc is containerd's default Linux runtime and the most common implementation of the OCI runtime contract. It takes the bundle from Chapter 8, turns config.json into libcontainer configuration, creates the requested Linux environment, and eventually calls execve for the configured process.

runc coordinates a parent process, an init process, namespace entry, cgroup placement, mount setup, hooks, seccomp, capabilities, labels, IO, state files, and cleanup. It is one implementation of the OCI runtime spec; crun, youki, runsc, and Kata are others.

sequenceDiagram
    participant caller as caller
    participant runc as runc
    participant parent as runc parent
    participant init as runc init
    participant kernel as Linux kernel
    caller->>runc: create bundle ID
    runc->>runc: read config.json
    runc->>parent: create parent process
    parent->>init: spawn runc init
    init->>kernel: namespaces, mounts, cgroups, security setup
    init-->>parent: ready, waiting on exec FIFO
    caller->>runc: start bundle ID
    runc->>init: release exec FIFO
    init->>kernel: execve configured process

create builds processes, namespaces, mounts, and cgroups but stops at the FIFO gate before execve; start releases the gate.

CLI Verbs

runc exposes the OCI lifecycle directly: create, start, run, state, kill, and delete, alongside operational verbs such as exec, pause, resume, ps, update, events, and list.

The run command is convenient for humans. Container managers such as containerd use the two-phase create/start shape because it gives them a setup point between environment creation and process execution.

Reading The Bundle

runc starts by entering the bundle directory, opening config.json, decoding it into the Go types for the OCI runtime spec, validating the process section, and converting the result into libcontainer configuration.

The handoff from JSON to libcontainer is small:

// cf is the opened config.json; spec holds the runtime-spec Go types.
if err = json.NewDecoder(cf).Decode(&spec); err != nil {
    return nil, err
}
return spec, validateProcessSpec(spec.Process)

From there runc translates the spec into libcontainer's model: namespaces, mounts, cgroups, devices, process settings, hooks, and runtime flags such as systemd cgroup mode, rootless mode, and no-pivot behavior.

The Parent And The Gate

For an init container, runc creates an exec FIFO before starting the container path:

if process.Init {
    if err := c.createExecFifo(); err != nil {
        return err
    }
}

The parent process then starts a cloned copy of /proc/self/exe running runc init. That init path receives pipes, bootstrap data, the init config, logging descriptors, and namespace setup instructions. runc marks unrelated file descriptors close-on-exec before starting runc init, a hardening detail added because leaked descriptors have produced container escapes in the past (CVE-2024-21626).

The FIFO is the gate. During runc create, the init process prepares the environment and waits. During runc start, runc releases that gate so the init path can continue to the final user process.

That gate is why created and running are different states: in created, namespaces, mounts, and cgroups exist and the init process is parked; the user-specified execve has not happened yet.

Namespace Entry

runc includes C namespace-entry code because namespace operations are sensitive to process and thread state. setns(2), PID namespace creation, and clone ordering do not fit cleanly into an already-running multithreaded Go runtime.

The parent starts runc init, the C nsenter code handles low-level namespace entry and clone staging, and the Go init code reads _LIBCONTAINER_* environment variables and init config from pipes. From there the init path chooses standard init for a new container or setns init for runc exec.

crun and youki use different internal structures to satisfy the same OCI contract.

Root Filesystem Setup

The standard Linux init path prepares networking and routes, initializes labeling state, then prepares the root filesystem:

if err := setupNetwork(l.config); err != nil {
    return err
}
if err := prepareRootfs(l.pipe, l.config); err != nil {
    return err
}

prepareRootfs is where the bundle's root and mount declarations become a mount table inside the container's mount namespace. runc opens the rootfs, iterates configured mounts, creates device nodes when needed, sets up /dev/ptmx and /dev symlinks, runs parent-side hooks at the correct point, and switches root:

for _, m := range config.Mounts {
    if err := setupAndMountToRootfs(pipe, config, mountConfig, m); err != nil {
        return err
    }
}
if err := pivotRoot(rootFd); err != nil {
    return err
}

The full code path adds hardening around /proc, /sys, user-namespace device behavior, and read-only remounts. The mount list in config.json becomes kernel mount state, then pivot_root(2), MS_MOVE, or chroot(2) swaps the prepared tree in as /.
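For concreteness, the mount declarations that loop consumes are plain JSON entries in config.json. These two are similar to what the default config from runc spec contains:

```json
"mounts": [
  {
    "destination": "/proc",
    "type": "proc",
    "source": "proc"
  },
  {
    "destination": "/dev",
    "type": "tmpfs",
    "source": "tmpfs",
    "options": ["nosuid", "strictatime", "mode=755", "size=65536k"]
  }
]
```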

Setup Order

runc's parent and init processes synchronize because setup order matters. The parent can apply cgroups before children escape placement, move configured network interfaces after it knows the child PID, run prestart and createRuntime hooks from the parent side, and pass file descriptors to the child. The child prepares the rootfs, runs container-side hooks, applies user and group settings, labels, capabilities, noNewPrivileges, seccomp, scheduler settings, I/O priority, and cwd checks close to the final exec.

That order is security-sensitive. A seccomp filter installed too early can block setup calls. A capability dropped too late gives more privilege to setup code than intended. A cwd outside the container root can become a host filesystem exposure.

Start, State, Kill, Delete

start only operates on a created container. It releases the exec FIFO so the init path can call execve for the configured program. state reports the OCI state fields from runc's stored state and live process information.
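An illustrative state payload looks roughly like this (the core fields come from the OCI state schema; runc adds extras such as rootfs and created timestamps that vary by version, and the id, pid, and paths here are invented):

```json
{
  "ociVersion": "1.0.2",
  "id": "mycontainer",
  "status": "created",
  "pid": 12345,
  "bundle": "/run/bundles/mycontainer"
}
```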

kill has more policy than a raw kill(2). runc has special handling for SIGKILL, stopped and running states, cgroup process killing, and cases where the container does not have a private PID namespace. delete --force kills before teardown and must handle processes that may remain in the cgroup after the init process exits, especially when PID namespaces are shared.

A runtime owns the lifecycle state and teardown rules, not just the start path: delete --force reaps stragglers in the cgroup, and shared PID namespaces require killing the whole tree.

Other Runtime Answers

runc is the book's main implementation path, but the OCI contract allows different answers:

| Runtime | What stays the same | What changes |
| --- | --- | --- |
| crun | OCI bundle and lifecycle | C implementation and libcrun's library-oriented design |
| youki | OCI bundle and lifecycle | Rust implementation and Rust abstractions for process, rootfs, cgroups, seccomp |
| gVisor (runsc) | OCI-facing command shape | Workload syscalls go through gVisor's userspace Sentry |
| Kata Containers | containerd/OCI-facing manager boundary | Workload runs inside a lightweight VM with a guest agent |

Part IV moves back up the stack to containerd, where image content, snapshots, container metadata, tasks, shims, and CRI all meet before the runtime ever receives a bundle.

Sources And Further Reading