
One Global Lock, 192 Cores: How Linux Mount Namespaces Break at Container Scale

Source: lobsters

The discovery Netflix documented in their tech blog is one of those performance problems that looks mysterious from the outside and makes complete sense once you understand the kernel’s mount architecture. They were running containerized workloads on 192-core AMD EPYC servers and found, under load, that a disproportionate fraction of CPU time was vanishing into kernel code, specifically the mount subsystem, rather than into their application or I/O.

The culprit was a global seqlock that every container runtime, every container startup, and every monitoring agent on the system was contending simultaneously. At 192 cores with hundreds of containers per host, that contention compounds quickly.

How mount namespaces are structured

When a container runtime creates a container, it calls clone(2) or unshare(2) with CLONE_NEWNS to give the container its own mount namespace. From that point on, the container has its own view of the filesystem hierarchy. Mount operations inside the container do not affect the host. This is the foundational mechanism behind container filesystem isolation.

The kernel represents a mount namespace with struct mnt_namespace in fs/mount.h:

struct mnt_namespace {
    struct ns_common    ns;
    struct mount       *root;
    struct list_head    list;       // every mount in this namespace
    struct rw_semaphore sem;        // protects the list
    unsigned int        mounts;     // count of mounts
    // ...
};

Every mount in the namespace sits on list. To iterate that list safely, you need sem. Every container has its own mnt_namespace, its own list, its own sem. So far, this composes reasonably. The contention is per-namespace, not system-wide.

The problem sits a layer above that: a global seqlock_t mount_lock declared in fs/namespace.c:

__cacheline_aligned_in_smp DEFINE_SEQLOCK(mount_lock);

This lock coordinates mount and unmount operations across the entire system. When anything modifies the mount tree anywhere, it acquires mount_lock as a writer. Readers take it as a sequence reader: they record the sequence counter before their operation and check it again after, retrying if the counter changed, which indicates that a write occurred during their read.

On a two-core machine, this is acceptable. On a 192-core machine, a single writer incrementing mount_lock causes all 191 potential readers to detect the changed counter and retry their entire operation. Each retry re-walks the namespace’s mount list. The cost of a single mount event becomes proportional to the number of cores multiplied by the number of mounts in the namespace.

The /proc/self/mountinfo bottleneck

/proc/self/mountinfo is a synthetic file generated by the kernel that describes every mount in the current process’s namespace. A typical line looks like:

36 35 8:1 / /home rw,relatime shared:3 - ext4 /dev/sda1 rw,errors=remount-ro

The fields encode mount ID, parent mount ID, device numbers, the mount root within the filesystem, the mount point relative to the process root, mount options, propagation type, filesystem type, source device, and superblock options.

The kernel generates this file on demand in fs/proc_namespace.c. The read path holds the namespace semaphore for the entire duration of the read (lightly simplified here):

static void *m_start(struct seq_file *m, loff_t *pos)
{
    struct proc_mounts *p = m->private;
    down_read(&p->ns->sem);   // acquired here
    return seq_list_start(&p->ns->list, *pos);
}

static void m_stop(struct seq_file *m, void *v)
{
    struct proc_mounts *p = m->private;
    up_read(&p->ns->sem);     // released here, after full iteration
}

Between m_start and m_stop, the kernel iterates every mount in the namespace and writes one line per mount. If the namespace has 40 mounts, that is 40 lines of formatted output generated under the lock. Concurrent readers can share the semaphore, but for as long as any of them holds it, every mount or unmount needing the write side must wait, and rwsem writer-fairness then queues later readers behind the waiting writer. A steady stream of mountinfo reads can therefore stall mount operations, and one pending mount operation can stall every subsequent reader.

With hundreds of containers on a host, each with its own namespace being read by multiple concurrent readers, you get hundreds of locks all being contended simultaneously, each churning through mount_lock whenever any mount event occurs anywhere on the system.

Who reads mountinfo, and how often

The list of things that read /proc/self/mountinfo is longer than most people expect:

  • Container runtimes (runc, crun, containerd): read mountinfo during container setup to verify that mounts were applied correctly. runc reads it in multiple places during libcontainer’s rootfs preparation, using the moby/sys/mountinfo library, which opens and parses the file fresh on every call with no caching.
  • kubelet: Kubernetes’s node agent reads mountinfo for volume management and garbage collection of orphaned mounts.
  • systemd: reads it on service activation to check mount state.
  • Monitoring agents: Prometheus’s node exporter reads it to expose filesystem metrics.
  • Any program using libmount: the util-linux libmount library, which backs mount, umount, and findmnt, reads /proc/self/mountinfo as its primary data source.

Each container startup involves multiple reads. Each monitoring scrape involves at least one. On a host running 200 containers with a 15-second scrape interval and continuous workload scheduling, this adds up to thousands of mountinfo reads per minute, each holding namespace->sem for the duration of the read.

The mount count problem

A vanilla runc container with no additional volumes gets at least twelve to fifteen mounts by default: the overlay rootfs mount, /proc, a series of read-only bind mounts under /proc/bus, /proc/fs, /proc/irq, /proc/sys, and /proc/sysrq-trigger to mask sensitive kernel interfaces from inside the container, plus /dev as a tmpfs, /dev/pts, /dev/shm, /dev/mqueue, and /sys with cgroup mounts. Add a few application volumes and you are at 30 to 50 mounts per container.

The OCI runtime spec allows runtimes to omit some of these. The /proc/bus and /proc/irq bind mounts exist primarily as a security measure, preventing container processes from accessing those interfaces. If the threat model allows it, or if the process runs in a restricted user namespace where those interfaces are already inaccessible, they can be removed. Netflix’s approach included reducing the number of mounts per container as one component of the fix, trimming the default mount list in their container configurations.

The impact is multiplicative: halving the number of mounts per container halves the time each mountinfo read holds namespace->sem, which roughly halves the contention per namespace. Across 200 containers, that reduction compounds significantly.

Bind mounts and overlayfs have different cost profiles here. An overlay rootfs contributes one mount entry per container, regardless of how many files are in the image. Each volume bind mount contributes one mount entry. Containers with many volumes, or container runtimes that use bind mounts where overlayfs would suffice, end up with inflated mount counts and correspondingly worse mountinfo generation cost.

What the kernel community has been doing

The mount namespace locking architecture has been a known scalability concern for years. LWN covered the design in depth in a 2016 two-part series on mount namespace semantics, and the scalability limits of the global seqlock have come up in kernel mailing list discussions repeatedly since then.

Al Viro, the primary VFS maintainer, has been incrementally reducing the scope and hold time of mount_lock across multiple kernel versions. Christian Brauner, who has contributed substantial namespace work, has pushed patches reducing contention in specific code paths.

The deeper fix requires either making mount_lock per-namespace (complex, because some operations genuinely need to coordinate across namespaces, particularly during mount propagation) or making the mountinfo read path use RCU-style lockless techniques rather than holding a semaphore for the entire read. The RCU approach is technically feasible but requires careful handling to ensure a consistent snapshot of the mount list.

One partial fix that helps immediately in practice: watch /proc/self/mountinfo for changes rather than re-reading it on a timer. The kernel supports this through poll(2)/epoll rather than inotify: when the namespace's mount set changes, the mountinfo file descriptor is flagged with POLLERR|POLLPRI. Tools that implement this pattern read mountinfo once, cache the result, and re-read only when the kernel signals a change. This dramatically reduces read frequency when the mount set is stable, which it usually is during steady-state container operation. libmount exposes this through its monitor API; adoption in container runtimes has been inconsistent.

A related notification API, fanotify, can mark entire mounts (FAN_MARK_MOUNT) and so watch many paths at once, though the integration work required in container tooling is non-trivial.

What 192 cores reveal

The Netflix situation illustrates a pattern that recurs throughout systems software: global locks that perform acceptably on the hardware of their era become bottlenecks as core counts increase. The mount_lock seqlock was a meaningful improvement over earlier mutex-based designs. It scales well enough for workloads that were typical when it was designed: a few hundred mounts system-wide, mount events happening infrequently, and a modest number of cores.

Container density changes all three parameters simultaneously. A single host now carries mounts numbering in the thousands, mount events during container startup happen in bursts, and the core count has increased by an order of magnitude. The seqlock’s retry-on-write property, which is cheap when contention is rare, becomes a throughput amplifier when writers and readers are both numerous and frequent.

The Netflix tech blog post is worth reading for the diagnostic methodology: they used performance profiling to identify the kernel functions dominating CPU time, traced those back to the lock hierarchy, and worked both sides of the problem, reducing workload on the lock from userspace while contributing kernel-side fixes. That combination is the practical way to address deep infrastructure bottlenecks, since a pure kernel fix requires upstream review cycles that can take months, and a pure userspace fix can only reduce pressure rather than eliminate the architectural constraint.

The kernel will eventually get per-namespace or RCU-based mount locking that handles container-scale densities. Until then: minimize mounts per container, switch mountinfo consumers from timer polling to change notification where possible, and watch %sys CPU time closely when scaling container density on high-core-count hosts.
