Chapter 4: Namespaces
Safety: the commands below mutate kernel state and most need root. Use a disposable Linux VM. Examples here were checked on Ubuntu 24.04 with kernel 6.8 and the util-linux package supplying unshare, nsenter, and lsns.
What Lives In /proc/<pid>/ns/
Every process exposes its current namespaces as symlinks under /proc/<pid>/ns/. Two processes share a namespace exactly when their symlinks point at the same inode.
ls -l /proc/self/ns/
# lrwxrwxrwx 1 me me 0 ... cgroup -> 'cgroup:[4026531835]'
# lrwxrwxrwx 1 me me 0 ... ipc -> 'ipc:[4026531839]'
# lrwxrwxrwx 1 me me 0 ... mnt -> 'mnt:[4026531840]'
# lrwxrwxrwx 1 me me 0 ... net -> 'net:[4026531992]'
# lrwxrwxrwx 1 me me 0 ... pid -> 'pid:[4026531836]'
# lrwxrwxrwx 1 me me 0 ... pid_for_children -> 'pid:[4026531836]'
# lrwxrwxrwx 1 me me 0 ... time -> 'time:[4026531834]'
# lrwxrwxrwx 1 me me 0 ... time_for_children -> 'time:[4026531834]'
# lrwxrwxrwx 1 me me 0 ... user -> 'user:[4026531837]'
# lrwxrwxrwx 1 me me 0 ... uts -> 'uts:[4026531838]'
The _for_children entries apply to processes the current process spawns. PID and time namespace changes take effect at the next fork(2), not immediately, so the kernel exposes both the current value and the value newly-forked children will inherit.
The same information in tabular form:
lsns
lsns walks /proc/*/ns/ and groups processes by namespace; it is the easiest way to figure out which container is using which namespace in a debug session.
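As a rough sketch of the "same inode" rule, the comparison can be scripted with readlink. The helper names ns_inode and same_ns are made up for this example and are not part of util-linux, and reading the /proc/<pid>/ns/ entries of another user's process generally needs root.

```shell
# ns_inode TARGET: print the inode number from a link target
# such as "uts:[4026531838]".
ns_inode() {
    inode=${1#*:\[}               # drop the "uts:[" prefix
    printf '%s\n' "${inode%\]}"   # drop the trailing "]"
}

# same_ns PID_A PID_B TYPE: succeed when both processes are in the same
# namespace of TYPE (an entry name under /proc/<pid>/ns/, e.g. uts or net).
# Returns 2 when either link cannot be read.
same_ns() {
    a=$(readlink "/proc/$1/ns/$3" 2>/dev/null) || return 2
    b=$(readlink "/proc/$2/ns/$3" 2>/dev/null) || return 2
    [ "$(ns_inode "$a")" = "$(ns_inode "$b")" ]
}

ns_inode 'uts:[4026531838]'
# 4026531838
```

Run as root, same_ns "$$" 1 net would report whether the current shell shares a network namespace with PID 1.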
Creating Namespaces With unshare
unshare(1) is the user-space wrapper for unshare(2). It creates new namespaces and execs a command inside them.
A UTS namespace needs CAP_SYS_ADMIN to create directly:
sudo unshare --uts -- bash -c 'hostname inside; hostname'
# inside
hostname
# (unchanged on host)
To get the same behavior unprivileged, wrap it in a user namespace — the one namespace that does not require pre-existing privilege to create:
unshare --user --map-root-user --uts -- bash -c 'hostname inside; hostname'
# inside
hostname
# (unchanged on host)
--map-root-user writes a UID/GID mapping that maps the caller to UID 0 inside the new user namespace. The shell believes it is root; from the host it is the same unprivileged process.
PID Namespace And The PID 1 Problem
A new PID namespace makes the first process inside it PID 1.
sudo unshare --pid --fork --mount-proc -- bash -c 'echo "pid: $$"; ps -ef'
# pid: 1
# UID PID PPID ... CMD
# 0 1 0 ... bash -c echo "pid: $$"; ps -ef
# 0 2 1 ... ps -ef
--fork is required because unshare(2) does not move the calling process into the new PID namespace; only its children enter it, so unshare must fork a child, and that child becomes PID 1. --mount-proc implies a new mount namespace and mounts a fresh /proc inside it so ps reflects the new PID space.
PID 1 is special, and the special treatment is the reason container processes routinely fail to exit on docker stop or kubectl delete pod. The kernel does not deliver default signal actions to PID 1: SIGTERM will not terminate it unless the process installs a handler. To see the failure:
sudo unshare --pid --fork --mount-proc -- bash -c '
echo "pid 1 = $$"
while :; do sleep 1; done
'
# In another terminal:
# kill -TERM <pid-of-the-bash-process-on-host>
# bash continues running
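The kernel drops the SIGTERM because PID 1 has no handler installed for it. SIGKILL sent from the host still works: a process in an ancestor PID namespace can always deliver SIGKILL and SIGSTOP to a namespace's init, although the same signals sent from inside the namespace are ignored. Container runtimes work around this by injecting a tiny init like tini or dumb-init that installs handlers, forwards signals to its children, and reaps zombies on their behalf.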
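When the entrypoint is a shell script, the handler can also be installed directly with trap. The following is a minimal sketch, not a replacement for tini: serve_until_term is a hypothetical name, and it only handles SIGTERM for the shell itself, without forwarding signals or reaping anything.

```shell
# Hypothetical entrypoint loop, written so that SIGTERM stops it even when
# the shell is PID 1: the trap is the handler that PID 1 would otherwise lack.
serve_until_term() {
    stop=0
    trap 'stop=1' TERM
    while [ "$stop" -eq 0 ]; do
        # Sleep in the background and wait on it: a foreground sleep would
        # delay the trap until sleep exits, while wait returns as soon as
        # the signal arrives.
        sleep 1 &
        wait $!
    done
    echo "got SIGTERM, shutting down"
}
```

Using a loop like this in place of the bare while loop in the failure example should let kill -TERM from the host stop the shell.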
Mount Namespaces And Propagation
A mount namespace governs which mounts a process sees. To watch a mount appear and disappear:
sudo unshare --mount -- bash
# Inside the new namespace:
mount -t tmpfs tmpfs /mnt
mount | grep /mnt
# tmpfs on /mnt type tmpfs (rw,relatime)
exit
# Back on the host:
mount | grep /mnt
# (nothing)
The host never saw the mount, because the mount namespace's mount table is private. The catch is propagation. Many distributions configure / as a shared mount, which would make the new namespace inherit shared peers and propagate events back to the host. unshare --mount defaults to --propagation private on the new mount namespace's root to prevent that.
To inspect propagation:
findmnt -o TARGET,PROPAGATION /
# TARGET PROPAGATION
# / shared
When runc sets up a container, it sets the new mount namespace root to private (or whatever linux.rootfsPropagation requests), then mounts the rootfs and special filesystems before pivot_root(2) swaps it in. Chapter 6 walks through that sequence.
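The PROPAGATION column that findmnt prints comes from the optional fields of /proc/<pid>/mountinfo, where shared:N marks a shared mount and master:N a slave. A minimal sketch of that classification, assuming well-formed input lines without glob characters; propagation_of is a hypothetical helper name:

```shell
# propagation_of LINE: classify one line of /proc/<pid>/mountinfo.
# Fields 1-6 are fixed; optional fields follow until a lone "-" separator.
propagation_of() {
    set -- $1          # split the line into fields (assumes no glob characters)
    shift 6            # skip the six fixed fields
    result=private     # no optional field means a private mount
    while [ $# -gt 0 ] && [ "$1" != "-" ]; do
        case $1 in
            shared:*)   result=shared ;;
            master:*)   [ "$result" = shared ] && result="shared,slave" || result=slave ;;
            unbindable) result=unbindable ;;
        esac
        shift
    done
    printf '%s\n' "$result"
}

propagation_of "22 1 8:1 / / rw,relatime shared:1 - ext4 /dev/sda1 rw"
# shared
```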
Network Namespaces
Network namespaces are visible enough to deserve their own subcommand, ip netns. Unlike unshare, ip netns add keeps the namespace alive after the calling process exits by bind-mounting it to /var/run/netns/<name>.
sudo ip netns add demo
sudo ip netns list
# demo
ls /var/run/netns/
# demo
Each namespace starts with a loopback interface, which is down:
sudo ip netns exec demo ip link
# 1: lo: <LOOPBACK> mtu 65536 state DOWN ...
sudo ip netns exec demo ip link set lo up
sudo ip netns exec demo ping -c1 127.0.0.1
Connecting two namespaces requires a virtual link. The classic recipe is a veth pair:
sudo ip netns add a
sudo ip netns add b
# Create a pair with two ends.
sudo ip link add va type veth peer name vb
# Move each end into a namespace.
sudo ip link set va netns a
sudo ip link set vb netns b
# Bring up loopback and the veths.
sudo ip -n a link set lo up
sudo ip -n b link set lo up
sudo ip -n a link set va up
sudo ip -n b link set vb up
# Address each side.
sudo ip -n a addr add 10.10.0.1/24 dev va
sudo ip -n b addr add 10.10.0.2/24 dev vb
# Test.
sudo ip netns exec a ping -c1 10.10.0.2
This is one bridge away from the standard container networking pattern: replace vb going into namespace b with vb attached to a host bridge, and add NAT rules.
Joining An Existing Namespace With nsenter
nsenter(1) calls setns(2) on one or more target namespaces and execs a command. To run a command inside a process's mount and PID namespaces:
# Find a target pid.
pgrep -f some-container-process
# Enter its mount and PID namespaces.
sudo nsenter -t <pid> -m -p -- ls /proc
kubectl exec, crictl exec, and docker exec all reduce to a setns chain plus an exec. The order of the chain matters because some transitions invalidate state that later steps rely on. Notably, joining a mount namespace changes what every path resolves to, so the /proc/<pid>/ns/* files for all target namespaces have to be opened before the first setns call.
User Namespaces And Privilege Scoping
User namespaces are what make rootless containers possible: they let an unprivileged user gain root-equivalent capabilities scoped to a namespace they own.
To create one and observe the mapping:
unshare --user --map-root-user -- bash -c '
echo "id inside: $(id)"
cat /proc/self/uid_map
'
# id inside: uid=0(root) gid=0(root) groups=0(root)
# 0 1000 1
The mapping reads as <inside-uid> <outside-uid> <length>. UID 0 inside maps to UID 1000 outside, for one UID. The shell believes it is root inside; from the host it is the same unprivileged user.
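The translation a map line describes can be sketched in a few lines of shell. outside_uid is a hypothetical helper, and the two-line map in the second call is an invented example of what a rootless runtime might write, not output from the command above.

```shell
# outside_uid INSIDE_UID: read uid_map-style lines
# ("<inside-uid> <outside-uid> <length>") on stdin and print the outside
# UID that INSIDE_UID maps to. Fails if no line covers INSIDE_UID.
outside_uid() {
    while read -r inside outside length; do
        if [ "$1" -ge "$inside" ] && [ "$1" -lt $((inside + length)) ]; then
            echo $((outside + $1 - inside))
            return 0
        fi
    done
    return 1
}

# The single-line map from the example above.
echo "0 1000 1" | outside_uid 0
# 1000

# A hypothetical two-line map: root plus a subordinate range.
printf '0 1000 1\n1 100000 65536\n' | outside_uid 33
# 100032
```

An inside UID that no line covers has no host identity at all, which is why the helper fails instead of guessing.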
A user namespace gives root-equivalent capabilities only over resources owned by that user namespace. Try to do something that requires actual host root:
unshare --user --map-root-user -- bash -c '
# Try to mount something that requires CAP_SYS_ADMIN on the host.
mount -t tmpfs tmpfs /mnt
'
# mount: /mnt: permission denied. (only privileged user can mount)
The same CAP_SYS_ADMIN, scoped to the user namespace, is enough to create more namespaces:
unshare --user --map-root-user -- bash -c '
unshare --uts -- bash -c "hostname inside; hostname"
'
# inside
# inside
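The mount fails because the shell is still in the host's mount namespace, which is owned by the initial user namespace, and the new CAP_SYS_ADMIN does not extend to it. Creating a namespace only requires CAP_SYS_ADMIN in the caller's own user namespace, so that works.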
For unprivileged users to map UIDs other than their own, they need newuidmap(1) and newgidmap(1) (setuid helpers from the shadow project, packaged as uidmap on Debian and Ubuntu and as shadow-utils on Fedora) and entries in /etc/subuid and /etc/subgid:
grep "^$(whoami):" /etc/subuid /etc/subgid
# /etc/subuid:me:100000:65536
# /etc/subgid:me:100000:65536
These say: the user me may map host UIDs 100000 through 165535 inside any user namespace they own. This is what gives a rootless container a 64K-UID range to allocate inside.
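The range arithmetic is easy to get off by one, so here is a small sketch of it. subid_range is a hypothetical helper, and the entry passed to it is the one shown above.

```shell
# subid_range ENTRY: print the first and last ID granted by one
# /etc/subuid or /etc/subgid entry of the form name:start:count.
subid_range() {
    rest=${1#*:}          # "start:count"
    start=${rest%%:*}
    count=${rest#*:}
    echo "$start $((start + count - 1))"
}

subid_range "me:100000:65536"
# 100000 165535
```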
Time Namespaces
Time namespaces offset only CLOCK_MONOTONIC and CLOCK_BOOTTIME. CLOCK_REALTIME is shared with the host and cannot be changed per-namespace.
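sudo unshare --time -- bash -c '
# No --fork: bash stays in the original time namespace and only its
# children enter the new one, so the offsets are still writable here.
read -r before _ < /proc/uptime
echo "Before offset: $before"
# Apply offsets via /proc/<pid>/timens_offsets.
# Format: <clock-id> <secs> <nanosecs>
# clock-id is monotonic (CLOCK_MONOTONIC) or boottime (CLOCK_BOOTTIME).
# /proc/uptime is derived from CLOCK_BOOTTIME, so offset that clock.
echo "boottime 100 0" > /proc/$$/timens_offsets
echo "After offset:"
cat /proc/uptime
'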
timens_offsets must be written before any process executes inside the namespace, which is why runtimes write it in the brief window after unshare(CLONE_NEWTIME) and before exec.
Putting It Together: A Hand-Rolled Container
The same assembly without runc: a shell with its own user, PID, mount, UTS, IPC, and net namespaces, and an Alpine rootfs as /.
# Get a small rootfs.
mkdir alpine-rootfs
docker export $(docker create alpine:3.20) | tar -C alpine-rootfs -xf -
# Run a shell in fresh namespaces with that as /.
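# /proc is mounted inside the rootfs so that ps still works after chroot.
sudo unshare \
--user --map-root-user \
--pid --fork --mount-proc=alpine-rootfs/proc \
--mount --uts --ipc --net \
-- chroot alpine-rootfs /bin/sh
# Inside:
# / # hostname
# (same as the host -- a new UTS namespace starts as a copy of the parent's)
# / # ps
# PID USER TIME COMMAND
# 1 root 0:00 /bin/sh
# 2 root 0:00 ps
# / # ls /
# bin dev etc home lib media mnt opt proc root run sbin srv sys tmp usr var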
This is intentionally crude. There is no cgroup yet, no seccomp, no capability dropping, no pivot_root (just chroot), no IO or signal supervision, no networking inside the new netns.
OCI Mapping
The runtime spec's linux.namespaces array names which namespaces to enter or create. Each entry is {type, path}; if path is set, the runtime calls setns(2) on it instead of creating a new one.
This is how Kubernetes shares a pod's network namespace across containers: every container in the pod has its config's network entry pointing at the same namespace path, such as /proc/<pid>/ns/net. The pod sandbox (the pause container in most setups) creates and holds the namespace; the application containers join it.
The linux.uidMappings and linux.gidMappings arrays carry the user-namespace ID mappings. linux.timeOffsets carries the time-namespace clock offsets.
Where This Goes
Namespaces by themselves do not constrain CPU or memory and do not prevent a process from forking until the host runs out of process slots. The next chapter — cgroups v2 — covers the resource side.
Sources And Further Reading
- namespaces(7): https://man7.org/linux/man-pages/man7/namespaces.7.html
- unshare(1): https://man7.org/linux/man-pages/man1/unshare.1.html
- nsenter(1): https://man7.org/linux/man-pages/man1/nsenter.1.html
- user_namespaces(7): https://man7.org/linux/man-pages/man7/user_namespaces.7.html
- pid_namespaces(7): https://man7.org/linux/man-pages/man7/pid_namespaces.7.html
- mount_namespaces(7): https://man7.org/linux/man-pages/man7/mount_namespaces.7.html
- network_namespaces(7): https://man7.org/linux/man-pages/man7/network_namespaces.7.html
- time_namespaces(7): https://man7.org/linux/man-pages/man7/time_namespaces.7.html
- OCI Runtime Spec, namespaces: https://github.com/opencontainers/runtime-spec/blob/main/config-linux.md#namespaces