
The Global Semaphore That Turns 192 Cores Into a Single-Threaded Mount Queue


Modern server hardware has been moving steadily toward higher core counts and multi-socket NUMA topologies. A single machine with 192 cores across two or four NUMA nodes is now an ordinary configuration for cloud-scale infrastructure. The Linux kernel has kept up with this trend in many places, but the VFS mount namespace subsystem is one area where the original design assumptions quietly stopped holding at container density.

Netflix’s post-mortem on this exact problem is one of the more instructive kernel debugging stories to come out of a production engineering team in a while. The short version: running many containers on a high-core-count NUMA server caused severe CPU stalls rooted in lock contention inside the mount namespace subsystem. But the details of why are worth unpacking, because they point to a class of kernel scalability problem that will keep showing up as container density continues to climb.

How Mount Namespaces Work Internally

Mount namespaces, introduced in Linux 2.4.19, give each process its own view of the filesystem hierarchy. When you call unshare(CLONE_NEWNS) or clone() with CLONE_NEWNS, the kernel calls copy_mnt_ns() in fs/namespace.c, which walks the parent namespace’s mount tree and duplicates it into a new mnt_namespace struct. Each mount point is represented by a struct mount (which, since Linux 3.3, embeds the older struct vfsmount) linked into a tree.

Coordinating all this is namespace_sem, a global rw_semaphore declared at the top of fs/namespace.c:

static DECLARE_RWSEM(namespace_sem);

This lock is held in write mode during namespace creation and cloning, and in read mode for a surprising number of operations, including reads of /proc/<pid>/mountinfo. The global scope is the critical detail: every mount namespace operation on the entire system competes for this single lock, regardless of which namespaces are actually involved.

For a system running a dozen namespaces, this is completely fine. The lock is rarely contended and the overhead is negligible. For a system running 2,000 containers, each with its own namespace, you have thousands of potential contestants serializing through a single point.

The /proc/mountinfo Problem

The /proc/<pid>/mountinfo file is how userspace inspects mount namespaces. Tools like systemd, containerd, and monitoring agents read it constantly, and the kernel generates its contents by walking the mount tree of the target process’s namespace. In the seq_file iterator behind these files (m_start() in fs/namespace.c, with the per-line show functions in fs/proc_namespace.c), this involves acquiring namespace_sem as a reader:

down_read(&namespace_sem);
/* walk the mount tree and emit lines */
up_read(&namespace_sem);

A reader-writer semaphore allows multiple concurrent readers, which sounds fine. The problem is that rw_semaphore in the Linux kernel uses a fairness mechanism to prevent writer starvation: once a writer is waiting, new reader acquisitions block. This means a single unshare(CLONE_NEWNS) call waiting to clone a namespace can momentarily freeze all concurrent /proc/mountinfo reads across the entire system.

At container startup, you often get both at the same time: the container runtime calls clone() with CLONE_NEWNS (a writer), while monitoring and service discovery tools are reading /proc/*/mountinfo for the containers already running (many readers). The writer blocks, the readers pile up behind it, and you get a convoy effect. On a 192-core machine, “piling up” means hundreds of threads stalling simultaneously.

NUMA Makes Everything Worse

The NUMA dimension amplifies the contention significantly. A rw_semaphore is a single kernel object living at a fixed physical memory address. On a two-socket system, that memory address belongs to one socket’s local DRAM. When a thread on the other socket tries to acquire or release the semaphore, it pays remote memory access latency, which can be 2-3x higher than local access depending on the interconnect.

When hundreds of threads across both sockets are competing for namespace_sem, you get cache line bouncing: the semaphore’s internal counters and wait-list are being read and written by CPUs on both sockets, causing each write to invalidate the cache line on all other CPUs that hold a copy. This is the classic NUMA scalability anti-pattern, and it’s why the problem appears most sharply on multi-socket systems even when single-socket core counts are already high.

The Linux kernel has addressed this pattern in other subsystems using techniques like per-CPU counters, MCS locks, and queue-based spinlocks (the qspinlock used by default since Linux 4.2). The VFS path cache uses rcu_read_lock() heavily for exactly this reason. Mount namespaces haven’t gotten the same treatment.

Mount Propagation and mnt_lock

namespace_sem isn’t the only lock in play. Mount propagation, which governs how mount events in one namespace affect peer and slave namespaces, is implemented in fs/pnode.c, and operations that change the mount tree in a namespace with propagation relationships, like bind-mounting a directory, walk the propagation tree under the global mount_lock seqlock (taken via lock_mount_hash() in fs/namespace.c) in addition to namespace_sem.

At high container counts with shared subtrees, this lock can become its own bottleneck separate from namespace_sem. The two locks interact: some operations acquire both, which introduces additional ordering constraints and potential for lock inversion if care isn’t taken. Netflix’s containers likely used private mount namespaces with limited propagation, but the propagation lock is a lurking issue for deployments that use shared subtrees between namespaces.

What the Kernel Community Has Done About It

Reducing the scope of namespace_sem has been an ongoing effort in the kernel. Al Viro, the primary VFS maintainer, has made incremental changes over several kernel versions to narrow the critical sections. One direction is moving /proc/mountinfo generation to use RCU (Read-Copy-Update) rather than holding namespace_sem for the entire walk, which would let reads proceed without blocking writers at all.

The mount_lock seqlock, which replaced the older vfsmount_lock back in Linux 3.13, already lets some read paths validate against a sequence counter instead of holding namespace_sem. The 6.x series has continued to push more of the mount tree traversal under RCU-compatible access patterns, and the statmount() and listmount() syscalls added in Linux 6.8 give userspace a way to query individual mounts without generating the full mountinfo text at all. The mnt_id_unique work in recent kernels also cleaned up some namespace bookkeeping that was previously done under the global lock.

But none of this has fully eliminated the global bottleneck yet. The fundamental challenge is that mount propagation requires global knowledge: when you mount something, the kernel needs to know which other namespaces should see the event, which means touching data structures spanning multiple namespaces atomically. That’s hard to do without some form of global coordination.

Practical Mitigations at the Deployment Level

Until the kernel catches up, there are deployment-level approaches that reduce pressure on namespace_sem. The most effective is reducing the number of mount points per container. Each mount point in a namespace adds to the work done during copy_mnt_ns() and to the size of the /proc/mountinfo output. Containers that inherit a large host mount tree, or that have many bind mounts layered on top, hit the lock harder than containers with minimal mount trees.

Using private mount namespaces with MS_PRIVATE propagation (rather than MS_SHARED) reduces the propagation tree that mnt_lock has to walk, at the cost of not propagating mount events between namespaces. For most container workloads, this is the right tradeoff.

Reducing the frequency of /proc/mountinfo reads from monitoring agents also helps. If a tool reads mountinfo every second for every container on the host, it’s generating continuous reader pressure that amplifies the writer-starvation dynamic. Batching or caching these reads at the agent level can meaningfully reduce lock contention.

The Broader Pattern

Mount namespaces are not unique in hitting this kind of scalability wall. Network namespaces have had similar issues with net_mutex and the RTNL lock. PID namespaces have their own coordination overhead. The Linux namespace subsystem was designed when namespaces were expected to number in the tens, used primarily for containers in a development or staging context. Production deployments at Netflix scale, where a single host might have thousands of containers over its lifetime, stress the original design far beyond what was envisioned.

The kernel is adapting, but kernel changes move slowly and safely. In the meantime, understanding the internals well enough to engineer around them, as Netflix has done, is the practical path. The detailed performance analysis in their writeup is a good model for this kind of work: trace down to the lock, understand the acquisition patterns, measure the NUMA behavior, then evaluate both kernel-level and deployment-level interventions. The kernel will eventually make this better. Until then, fewer mounts and private namespaces go a long way.
