Chapter 11: Images, Content, And Snapshots

An image pull in containerd is four operations on four different things: name resolution, content download, image-metadata recording, and (optionally) layer unpacking through a snapshotter.

The four things are an image reference, the registry-style name a user types; content blobs, the digest-addressed bytes that make up manifests, configs, and layers; image metadata, which maps a name to an OCI descriptor target; and snapshots, which are filesystem states produced by applying layers in order.

The reason to keep them apart is that containerd can be in any of the partial states between them. Image metadata can exist without all the content being present. Content can exist without a mounted root filesystem. A snapshot chain can exist without a running process. A running task's writable layer is a separate object from the committed image layers it sits on top of.

flowchart LR
    ref[Image reference] --> resolver[Resolver]
    resolver --> content[Content store blobs]
    content --> image[Image metadata target]
    content --> unpack[Unpacker]
    unpack --> snapshots[Committed snapshots]
    snapshots --> active[Active container snapshot]

The Content Store Is For Bytes

The content store is digest-addressed storage for immutable blobs: manifests, indexes, image configs, and compressed layers.

The interface in core/content/content.go is small enough to fit on one screen:

type Store interface {
    Manager
    Provider
    Ingester
}

Provider reads committed content by descriptor. Ingester handles active writes. Manager exposes metadata operations such as labels. An ingestion is invisible to readers until it is committed; after commit, callers can read the blob by digest and inspect or update its labels.

That commit boundary is the reason image pulls need leases. While a pull is running, containerd carries temporary blobs, partially traversed descriptors, and metadata that the garbage collector would happily delete if it ran in the gap. The pull path wraps the whole operation in a lease so the collector and the pull cannot race.

Image Metadata Is A Name To A Descriptor

The image store is not the image bytes. It maps a name to an OCI descriptor target — usually an image index or a manifest — and the content store holds whatever the descriptor points to.

The helpers around images.Image make the split visible. They resolve the image config, manifests, rootfs diff IDs, and size by walking the descriptor graph through a content provider. The image record is the named root of that graph; the content provider supplies the bytes at every node.

That split is why retagging an image is a metadata-only operation while fetching a missing layer is a content operation. It is also why deleting an image name does not delete the underlying blobs. The bytes might still be reachable from another image, a container, or a lease, and only the garbage collector — after looking at the whole reference graph — gets to decide.

Pulling An Image

A pull starts with a reference and a resolver, not with a download. The resolver locates the registry content. The fetcher walks the descriptor graph, downloading what it does not already have. The image service records a final image object at the end. If the caller asked for unpack, an unpacker is hooked into the traversal so layers are applied to the selected snapshotter as content arrives.

In containerd v2.3.0 the path through Client.Pull does six things, in order:

  1. wrap the operation in WithLease;
  2. resolve the reference to a descriptor;
  3. fetch content by walking the descriptor graph;
  4. if WithPullUnpack is set, run the unpacker as part of the walk;
  5. wait for unpack completion before creating the final image object;
  6. reject Docker schema 1 manifests, which are no longer accepted as of containerd 2.1.

A pull is therefore a descriptor walk with content ingestion, optional unpack, metadata creation, and lease protection.

Snapshotters Are Filesystem State Machines

A snapshotter is a small state machine over filesystem snapshots, with three states — active, view, committed — and five core operations: Prepare creates a writable active snapshot on top of a committed parent, View creates a read-only one, Commit turns an active snapshot into a committed one, Mounts returns the mounts for a snapshot, and Remove deletes one.

Image layers become committed snapshots; the container gets a new active writable snapshot on top of the committed chain at start time. That active snapshot is the container's writable layer. The image's committed layers stay immutable underneath it.

The default Linux backend is overlayfs, but the interface also has native, btrfs, zfs, blockfile, devmapper, Windows, and remote/lazy backends. Everything above this interface — containers, tasks, CRI — talks to "a snapshotter" without caring which one.

Unpack Connects Content To Snapshots

Unpacking is where image config and layer descriptors meet filesystem state. The unpacker reads the image config, lines layer descriptors up against the rootfs diff IDs, computes chain IDs, applies layers, and commits snapshots under stable names.

The core prepare/apply/commit loop in core/unpack/unpacker.go is three calls per layer:

mounts, err := sn.Prepare(ctx, key, parent, opts...) // active snapshot over the parent chain
diff, err := a.Apply(ctx, desc, mounts, ...)         // apply the layer diff to the mounts
err = sn.Commit(ctx, chainID, key, opts...)          // commit under the layer's chain ID

Three identifiers travel together and must not be confused. The descriptor digest names the compressed blob in the content store. The diff ID names the uncompressed filesystem change after the layer is applied. The chain ID is a digest over the sequence of diff IDs up to and including the current layer; it becomes the snapshot key for the committed layer chain.

That is also why unpack waits for the image config. The config's rootfs.diff_ids is the source of truth for the uncompressed changes containerd expects. After applying a layer, containerd recomputes the diff ID and checks it against the config; only on a match does it commit the snapshot.

Garbage Collection References

Unpack also writes the labels that let the garbage collector cross from content into snapshots. After verifying a layer, unpack stamps the layer's content blob with a label for its uncompressed diff ID. After committing the chain, it stamps the image config with a snapshot GC reference label for the selected snapshotter, pointing at the final chain ID.

Without those labels, the collector cannot bridge the two stores. The content store and the snapshotter are separate systems with separate garbage; the labels are the edges that turn them into one reference graph. An image config points at a final chain ID, that chain ID depends on earlier committed snapshots, and the collector preserves the whole subgraph as long as anything reachable still references it.

Leases sit on top of that. A pull, unpack, or container-creation flow takes a lease to protect the in-progress set of content, metadata, and snapshots from collection until it has assembled a consistent result. Without leases, the collector would race the pull and delete in-progress objects.

From Image Chain To Container Rootfs

Committed snapshots for the image chain are not yet a running container. A client (or the CRI path) calls Prepare with the chain's final chain ID as the parent, and task creation then asks the snapshotter for the active snapshot's mounts and passes them to runtime v2.

The full path from a typed reference to a mounted rootfs is seven steps:

  1. Resolve an image reference.
  2. Fetch descriptors and blobs into the content store.
  3. Record image metadata pointing at the descriptor target.
  4. Unpack layers into committed snapshots.
  5. Prepare an active snapshot for the container.
  6. Pass snapshot mounts to task creation.
  7. Let the shim and runtime mount the root filesystem for execution.

No single object in that list is "the image" in every sense. The name, the bytes, the metadata, the committed chain, and the active rootfs are five different objects with five different lifetimes.

Where This Goes

A missing blob is a content problem. A missing unpacked root is a snapshot problem. A wrong tag is image metadata. A writable layer that will not mount is an active snapshot problem. A process that never starts is in the task and shim path — which is where chapter 12 picks up, with a container record that already knows which snapshot it owns and a task request that turns that record into live execution.

Sources And Further Reading