When Netflix engineers traced a pattern of CPU spikes on their container hosts, the investigation eventually led into a corner of the Linux kernel that most container platform operators never have to think about: the mount namespace propagation subsystem. The findings describe a scalability problem with specific kernel-level causes, and the interaction between mount peer groups and modern NUMA hardware is worth understanding in detail.
What the Kernel Tracks for Every Mount
Linux mount namespaces, created via clone(CLONE_NEWNS) or unshare(CLONE_NEWNS), give each container an isolated view of the filesystem hierarchy. The kernel represents the mount table for each namespace as a collection of struct mount entries, organized in a hash table keyed on (parent mount, mountpoint dentry) pairs.
The internal structure of struct mount carries more state than most people realize:
```c
struct mount {
	struct hlist_node mnt_hash;
	struct mount *mnt_parent;
	struct dentry *mnt_mountpoint;
	struct list_head mnt_share;      /* circular peer group list */
	struct list_head mnt_slave_list; /* mounts receiving propagation */
	struct mount *mnt_master;        /* propagation source */
	struct mnt_namespace *mnt_ns;
	/* ... */
};
```
The mnt_share field is a circular doubly-linked list connecting all mounts at equivalent positions across different namespaces, when those positions are marked as shared. This is the peer group, and it is the data structure at the center of the performance problem.
Mount propagation modes are documented in the kernel’s shared subtrees documentation. When a mount point is shared — either explicitly via mount --make-shared or by inheriting from a parent namespace — any event that affects it must propagate to every peer in the group. The kernel walks the mnt_share list to do this. With N containers sharing a peer group, that walk is O(N).
Why Reading /proc/mountinfo Is Not Cheap
The performance impact becomes visible through a routine operation: reading /proc/self/mountinfo. Container runtimes check it on startup. Health monitoring tools query it continuously. System libraries parse /proc/mounts through getmntent(3) as part of normal initialization. On a host starting 200 containers in a short window, hundreds of these reads happen concurrently.
Generating the output requires the kernel to acquire namespace_sem as a read lock, then iterate every struct mount in the namespace. Each container typically carries 30 to 50 mounts: an overlayfs root, bind mounts for /dev, /proc, /sys, and whatever application-specific paths the runtime adds. Individually, the iteration cost is modest.
The problem is lock interaction. The kernel’s namespace.c uses a seqlock (mount_lock) to protect mount table lookups. Seqlocks are read-optimized: you read a sequence number, perform your work, re-read the sequence number, and retry if it changed. This works well when writes are rare. When hundreds of containers are simultaneously initializing and each namespace setup touches the mount table, writes become frequent, and readers spend time retrying rather than making progress.
The retry loop means CPU time nominally spent “reading mount information” is actually spent spinning on a shared counter. The more concurrent namespace operations, the worse the retry rate becomes.
How NUMA Hardware Turns a Slow Path Into a Hot Path
The phrase “modern CPUs” in Netflix’s title points to a hardware topology change the Linux mount subsystem was not designed for. A dual-socket AMD EPYC Genoa server with sub-NUMA clustering enabled can expose 8 NUMA nodes to the OS. Intel Sapphire Rapids uses a tile-based chiplet design that creates similar intra-socket access asymmetry. The OS scheduler sees a topology where accessing memory on a remote NUMA node costs 100 to 300 ns, versus 4 to 10 ns for a local L3 cache hit.
The mount_lock seqlock and the mount hash table are global data structures. They are not NUMA-aware. A CPU on node 5 reading a cache line last written by a CPU on node 0 crosses the Infinity Fabric or UPI interconnect. On a machine with 256 hardware threads spread across 8 NUMA nodes, the probability that two concurrent operations land on the same node is low. Every seqlock retry touching a remotely-cached sequence counter pays full cross-NUMA latency.
At Netflix’s container density, this is not an edge case. The mount_lock hot cache line bounces continuously between NUMA nodes, and each bounce adds latency to every operation touching the mount table. The CPU utilization numbers look like lock contention because they are lock contention, mediated by physical hardware topology rather than software queuing.
Diagnosing It
The investigation path for this kind of problem starts with perf record run system-wide during a spike:
```shell
perf record -ag -F 999 -- sleep 10
perf report --no-children
```
The flame graph output from Brendan Gregg’s flamegraph tooling will show unexpectedly wide frames in kernel paths like __legitimize_mnt, mnt_get_count, or the seqlock retry loops in namespace.c. These function names do not immediately suggest container-related overhead, which is part of why the problem is easy to misattribute to application code or network I/O.
Once you have a hypothesis, bpftrace lets you quantify it without kernel modifications. Note that lock_mount_hash itself is a static inline helper, so attach to a non-inlined symbol on the same path, such as __legitimize_mnt:

```shell
bpftrace -e 'kprobe:__legitimize_mnt { @[cpu] = count(); }'
```
Cross-referencing CPU hit counts with NUMA topology via numactl --hardware reveals whether the lock is bouncing across nodes or staying local. Even distribution across all CPUs with high absolute counts confirms the worst case: every NUMA node is contributing to the contention.
The Fix Space
Reducing peer group depth is the most direct mitigation. If containers do not require mount propagation from the host, making mounts private before entering the container namespace eliminates the peer group chain:
```shell
mount --make-rprivate /
unshare --mount
```
Modern container runtimes like containerd and crun apply this by default. Older runtimes, custom setups, and anything built before the performance implications of shared subtrees were understood often do not. Auditing which mount points in a container have non-private propagation and trimming unnecessary shared subtrees provides a significant reduction in peer group walk length.
Kernel-level fixes are the other lever. The commit history for fs/namespace.c shows ongoing work across the 5.x and 6.x series to reduce lock scope in the mount path, shift more of the read-side to RCU-only operation, and limit propagation walk depth. Netflix’s scale means they can reproduce these performance cliffs reliably, which makes them well-positioned to either backport relevant patches or identify new ones.
Scheduling constraints are a third option, though operationally expensive. Ensuring containers sharing a peer group are always scheduled onto CPUs within the same NUMA domain keeps the seqlock’s cache line local. Netflix’s Titus scheduler has the infrastructure to express these constraints, but maintaining them correctly as workloads shift is its own challenge, and most container platforms do not have this capability built in.
What This Means for Container Platform Design
Mount namespaces predate modern multi-socket hardware as a common workload platform. The peer group mechanism solves a real problem — keeping shared filesystems consistent across namespaces — but the linked list traversal and global seqlock were not designed with hundreds of concurrent namespaces and complex NUMA topologies in mind.
Container platform designers working on high-density hosts should treat mount propagation settings as a performance-relevant configuration decision rather than just a security boundary. The default behavior of inheriting shared subtrees from the host is correct from a correctness standpoint and often wrong from a performance standpoint. Measuring mount-related kernel time during container burst scenarios is worthwhile before reaching the density at which this becomes a production problem.
Linux container isolation mechanisms were designed to be correct first and scalable later. Network namespaces had similar scalability problems years ago (per-namespace routing tables, conntrack tables, and socket accounting all required targeted fixes), and most of those have been addressed. Mount namespaces now appear to be going through the same process, accelerated in part because Netflix documented what they found and shared it publicly.