When /proc/self/mountinfo Becomes the Enemy: Linux Mount Namespace Contention at Scale
Source: lobsters
Netflix published a detailed post-mortem they’re calling Mount Mayhem about a class of performance degradation they hit when scaling containers on modern, high-core-count CPUs. The short version: operations that look completely unrelated to mount namespaces, like a simple stat() or open() call on a file, were stalling for hundreds of milliseconds under load. The culprit was contention on global kernel locks protecting Linux’s mount namespace subsystem.
This is worth unpacking carefully, because the problem is not unique to Netflix, and the pattern of how it manifests is instructive for anyone running containers at density on modern hardware.
The Lock in Question
Linux containers get their filesystem isolation from mount namespaces, a feature that has been in the kernel since 2.4.19. Every container gets its own struct mnt_namespace, which holds a list of active mounts. This is standard, well-understood stuff.
The problem surfaces when you read /proc/self/mountinfo or /proc/mounts. Those files are implemented as seq_file handles in fs/proc_namespace.c, and reading them requires taking namespace_sem, a global struct rw_semaphore in fs/namespace.c that serializes access to mount trees. The read walks the entire list of mounts in the namespace linearly.
That is O(n) in mount count, per reader, under a lock. For a container with 50 bind mounts, each read of /proc/self/mountinfo touches all 50 entries while holding a read lock. This is not normally a problem. But on a host running 400-500 containers simultaneously, with container runtimes, health checkers, and init systems all periodically reading mountinfo, you end up with hundreds of concurrent readers hammering the same code paths across hundreds of namespaces.
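To make the per-read cost concrete, here is a minimal parser for the mountinfo record format, run against two hypothetical sample lines rather than a live file. Every read of the real file walks every record like this, under the lock; the sample content and field choices are illustrative.

```python
# Two hypothetical mountinfo records. The real file has one such line
# per mount in the namespace, and the kernel emits all of them per read.
SAMPLE = """\
36 35 98:0 / / rw,relatime shared:1 - ext4 /dev/sda1 rw
37 36 0:32 / /proc rw,nosuid,nodev,noexec shared:2 - proc proc rw
"""

def parse_mountinfo(text):
    mounts = []
    for line in text.splitlines():
        # Fields before the " - " separator are fixed plus optional tags;
        # after it come the fs type, mount source, and super-block options.
        pre, _, post = line.partition(" - ")
        fields = pre.split()
        fstype, source = post.split()[:2]
        mounts.append({
            "mount_id": int(fields[0]),   # the ID allocated under mnt_id_lock
            "parent_id": int(fields[1]),
            "mount_point": fields[4],
            "fstype": fstype,
            "source": source,
        })
    return mounts

print(len(parse_mountinfo(SAMPLE)))  # one entry per mount: 2
```

A container with 50 bind mounts produces 50 such records, and every periodic reader pays for all of them on every read.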
The second lock that creates worse problems is mnt_id_lock. This is a global spinlock in fs/namespace.c used to allocate unique mount IDs for every new mount point, system-wide. Not per-namespace. Global. Every container that mounts anything, every bind mount setup during container initialization, every overlayfs layer, all serialize through this one spinlock.
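A toy model makes the serialization visible. This is not kernel code, just a sketch of the shape of the problem: many "container" threads, each performing several mounts, all funneling ID allocation through one lock. Class and function names are illustrative.

```python
import threading

# Toy model of a global, lock-protected ID allocator in the style of
# mnt_id_lock: every container, regardless of what it mounts, serializes
# through the same lock for every mount it creates.
class GlobalIdAllocator:
    def __init__(self):
        self._lock = threading.Lock()  # stand-in for the global spinlock
        self._next_id = 0

    def alloc(self):
        with self._lock:               # every CPU contends here
            mnt_id = self._next_id
            self._next_id += 1
            return mnt_id

allocator = GlobalIdAllocator()
ids = []
ids_lock = threading.Lock()

def start_container(mounts_per_container=8):
    # Each container start performs several mounts, each needing an ID.
    for _ in range(mounts_per_container):
        i = allocator.alloc()
        with ids_lock:
            ids.append(i)

threads = [threading.Thread(target=start_container) for _ in range(50)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(len(ids), len(set(ids)))  # prints "400 400": all unique, all serialized
```

Fifty containers doing eight mounts each is 400 trips through one lock; scale the thread count and per-container mounts up to production numbers and the lock becomes the whole story.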
NUMA Makes It Worse
Modern server CPUs are not monolithic. A 192-core AMD EPYC system is actually multiple chiplets connected by an interconnect fabric, with each group of cores having local memory and local caches. This is NUMA topology, and it means that memory access times are not uniform: hitting a cache line held by a core on a different NUMA node can cost 100-300 nanoseconds, versus around 5 nanoseconds for an L1 hit.
Global kernel spinlocks are cache lines. When hundreds of cores on a 4-socket machine all try to acquire mnt_id_lock, the cache line holding that lock bounces across NUMA nodes constantly. Every acquisition from a “remote” NUMA node pays the full cross-node penalty. Under high contention, CPUs spend most of their time not actually doing work, but waiting for the cache line to arrive from wherever it currently lives.
perf makes this visible. Netflix’s profiles reportedly showed 30-50% of CPU cycles going to lock-related kernel code paths during container startup stress tests, rather than to application work. CPU count is not the bottleneck; the global lock is, and adding more cores does not help. It makes things worse, because it increases contention.
Why Container Runtimes Trigger This So Hard
Container startup is mount-heavy by design. A typical container launch with containerd and overlayfs involves:
- Creating an overlayfs mount combining multiple layer directories
- Bind mounting /dev, /proc, and /sys subtrees
- Bind mounting any volume mounts the user configured
- Potentially setting up devpts, tmpfs, and cgroup mounts
Each of those mounts allocates a mount ID via mnt_id_lock. Then, once the container starts, its init process (and often its runtime health check logic) reads /proc/self/mountinfo to verify its filesystem state. This is not unusual behavior; tools like systemd, mount, and many container runtimes do it routinely.
Now imagine 500 containers starting concurrently on a 192-core host. You have thousands of mnt_id_lock acquisitions happening in parallel, and hundreds of concurrent /proc/self/mountinfo reads, all on a system where the cost of a contested cache line is measured in hundreds of nanoseconds.
The Linux kernel’s own documentation on locking is clear that spinlocks are appropriate for very short critical sections where contention is expected to be low. mnt_id_lock violates this assumption at scale.
What the Kernel Community Has Been Doing
This is not a new problem that Netflix discovered in isolation. Christian Brauner, who maintains the VFS and mount namespace subsystems, has been working on mount namespace scalability for several kernel cycles. The Linux 6.8 release included significant refactoring of mount ID handling and namespace internals.
The direction of the upstream work involves several angles. One is replacing the global mnt_id_lock spinlock with a more scalable ID allocator, such as an XArray-based or per-CPU approach, so mount ID allocation no longer serializes across all CPUs. Another is reducing the scope of namespace_sem during /proc/self/mountinfo reads, possibly by snapshotting the mount list rather than walking it under the lock.
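The general shape of a per-CPU-style allocator can be sketched briefly. This mirrors the technique, not the actual upstream patches: each shard hands out IDs from its own arithmetic progression, so allocations on different shards never touch shared state and need no global lock. All names here are illustrative.

```python
import itertools

# Sketch of a sharded ID allocator: shard k hands out k, k+S, k+2S, ...
# for S shards. IDs are globally unique by construction, with no
# cross-shard synchronization. (itertools.count is not guaranteed
# thread-safe in general; a real per-CPU allocator gets exclusivity
# from running on a single CPU with preemption disabled.)
class ShardedIdAllocator:
    def __init__(self, num_shards):
        self.num_shards = num_shards
        self._counters = [itertools.count(start=shard, step=num_shards)
                          for shard in range(num_shards)]

    def alloc(self, shard):
        return next(self._counters[shard])

alloc = ShardedIdAllocator(num_shards=4)
ids = [alloc.alloc(shard) for shard in range(4) for _ in range(3)]
print(sorted(ids))  # prints [0, 1, 2, ..., 11]: unique with no global lock
```

The trade-off is that IDs are no longer dense or monotonically ordered across the system, which is one reason such conversions need careful review of everything that consumes the IDs.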
Miklos Szeredi, a longtime VFS developer and the maintainer of overlayfs and FUSE, and Brauner have both reviewed patches in this area on the Linux kernel mailing list. Some of this work landed in Linux 6.8 and 6.9; some is still in progress.
Netflix’s contribution here was empirical validation at scale and the kind of production perf data that is hard to synthesize in a lab. Kernel developers can reason about lock contention theoretically, but data showing 50% of CPU cycles disappearing into mnt_id_lock on a real production workload is a compelling argument for prioritization.
Mitigations Before the Kernel Fix
For operators running containers now, before these kernel changes land in your distribution, a few things help.
Reduce mount count per container. Every bind mount is another mnt_id_lock acquisition at startup and another entry to traverse during mountinfo reads. Auditing and trimming unnecessary mounts from your container specs reduces both the lock acquisition count and the time spent under namespace_sem.
Avoid reading /proc/self/mountinfo from hot paths inside containers. Some tooling re-reads it on a timer to detect filesystem changes. If that tooling runs at high frequency inside dense container environments, it contributes directly to lock pressure. Note that inotify does not deliver events for procfs files; the supported mechanism, documented in proc(5), is to poll() or epoll the mountinfo file itself, which the kernel marks with a priority event when the mount table changes, so a full re-read is only needed after an actual change.
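proc(5) documents that the mounts files are pollable: a mount or unmount marks the open descriptor with a priority/error event. A minimal, Linux-specific sketch of event-driven change detection, instead of re-reading on a timer:

```python
import select

# Register /proc/self/mountinfo with poll(); the kernel raises
# POLLERR|POLLPRI when the mount table changes, so the (O(n), lock-taking)
# full re-read only happens after a real change. Linux-specific.
def wait_for_mount_change(timeout_ms):
    with open("/proc/self/mountinfo") as f:
        p = select.poll()
        p.register(f, select.POLLERR | select.POLLPRI)
        events = p.poll(timeout_ms)   # empty list on timeout
        return bool(events)

# With a 0 ms timeout this returns immediately: True only if a mount
# table change is already pending on this freshly opened descriptor.
print(wait_for_mount_change(0))
```

A long-running watcher would keep one descriptor open, block in poll(), and re-read only on wakeup; reopening per call, as the sketch does for simplicity, resets the change tracking.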
Stagger container starts. Batch startup at full concurrency maximizes the window during which mnt_id_lock contention peaks. Container orchestrators that support configurable startup concurrency limits can spread the mount allocation work over a longer window.
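A concurrency limit on the mount-heavy startup phase can be sketched with a semaphore. This is a generic pattern, not any particular orchestrator's implementation; the limit value and names are illustrative.

```python
import threading

# At most `limit` container starts run their mount-heavy phase at once,
# spreading mnt_id_lock and namespace_sem pressure over a longer window.
limit = 4
gate = threading.BoundedSemaphore(limit)
state_lock = threading.Lock()
current = 0   # starts currently inside the gated section
peak = 0      # highest concurrency ever observed

def start_container():
    global current, peak
    with gate:                     # blocks once `limit` starts are active
        with state_lock:
            current += 1
            peak = max(peak, current)
        # ... mount setup would happen here ...
        with state_lock:
            current -= 1

threads = [threading.Thread(target=start_container) for _ in range(32)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(peak <= limit)  # prints True: concurrency never exceeds the limit
```

The total work is unchanged; what improves is the worst-case number of simultaneous contenders on the global locks at any instant.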
Kernel version matters. If your container host can run Linux 6.8 or later, the mount namespace work that landed there provides measurable improvement over 5.x kernels on high-core-count systems.
The Broader Pattern
What makes this story interesting beyond the specific bug is what it illustrates about the relationship between Linux kernel primitives and modern hardware. Spinlocks and global locks were designed for a model of hardware that does not exist anymore. A 192-core NUMA system is fundamentally different from the 4-core systems where much of this kernel infrastructure was designed.
The container ecosystem adopted Linux namespaces heavily, correctly, because they provide strong isolation with minimal overhead at reasonable scale. What “reasonable scale” means has shifted dramatically as CPU core counts have grown. A host that ran 50 containers in 2015 now runs 500, on hardware with 4x the core count and substantially worse lock contention characteristics.
The Linux kernel has been adapting, but the adaptation is reactive. Percpu-rwsems, scalable ID allocators, and NUMA-aware data structures exist in the kernel, but they get applied to specific subsystems only after someone demonstrates the problem at scale. Netflix’s post is valuable precisely because it is that demonstration for mount namespaces.
Container runtimes and operators built their systems on the assumption that namespace operations are cheap. They are cheap in isolation. The interesting engineering problem, the one that shows up in post-mortems like this one, is that cheap-in-isolation and cheap-under-concurrency are different properties, and on modern hardware the gap between them can swallow a significant fraction of your CPU budget.