
Linux Mount Peer Groups and the O(n) Work Problem Hidden Inside Lock Contention

Source: lobsters

The Netflix tech blog post on mount namespace performance correctly identifies lock contention in the kernel’s mount subsystem as the culprit for CPU stalls at container scale. The diagnostic work is solid, and the mitigations are practical. What the post necessarily glosses over is the underlying data structure reason that makes the lock so expensive to hold: mount propagation peer groups, and the O(n) walk the kernel performs through them on every mount operation.

Understanding this layer explains why reducing the number of mounts per container is not just a quantitative fix but a qualitative one, and it also explains why the problem appears when it does rather than earlier in the scaling curve.

Shared Subtrees and Peer Groups

Linux’s shared subtrees model, introduced in 2.6.15, allows mount events in one namespace to propagate to other namespaces according to a configured relationship. A mount point can be MS_SHARED, MS_SLAVE, MS_PRIVATE, or MS_UNBINDABLE. Shared mounts form peer groups: a set of mounts across potentially many namespaces that all see the same mount and unmount events. Slave mounts receive events from a master but do not send them back.

When the kernel creates a new mount namespace via clone(CLONE_NEWNS) or unshare(CLONE_NEWNS), it calls copy_mnt_ns() in fs/namespace.c, which duplicates the parent’s mount tree. Mounts that were MS_SHARED in the parent get copies in the new namespace that are added to the same peer group. This is the default behavior when no explicit propagation type is set.

On a typical Linux server where the root filesystem is mounted MS_SHARED (which is the default for systemd-managed systems), every container created on that host joins the root mount’s peer group, along with the peer groups of every other shared mount inherited from the parent namespace.

Why Every Bind Mount Walks the Peer Group

The problem becomes apparent when you look at propagate_mnt() in fs/pnode.c. This function is called for every mount() syscall on a shared mount point, and it iterates the complete peer group to propagate the event:

int propagate_mnt(struct mount *dest_mnt, struct mountpoint *dest_mp,
                  struct mount *source_mnt, struct hlist_head *tree_list)
{
    struct mount *m;
    int ret = 0;

    /* iterate all peers and slaves, calling propagate_one for each */
    for (m = next_peer(dest_mnt); m != dest_mnt; m = next_peer(m)) {
        ret = propagate_one(m);
        if (ret)
            goto out;
    }
    /* also walk slaves ... */
out:
    return ret;
}

The cost of a single mount() call on a shared mount point is proportional to the size of the peer group. If 200 containers are running on the host and each joined the root mount’s peer group at creation time, then every bind mount during a new container’s setup calls propagate_mnt() over a list of 200-plus entries.

A typical runc container performs 30 to 50 bind mounts during setup: the overlay rootfs, /proc, a series of masking bind mounts under /proc/bus, /proc/irq, /proc/sys, and /proc/sysrq-trigger, plus /dev, /dev/pts, /dev/shm, and application volumes. Each of these is a separate mount() syscall. Each one, on a shared mount point with 200 peers, walks 200 entries.

For a single container starting in isolation, this is cheap. For 50 containers starting simultaneously on a 192-core machine, you have 50 × 40 = 2,000 bind mounts, each doing O(200) work, on the order of 400,000 peer-group iterations in total, all serialized under the global namespace_sem rwsem. The lock hold time is long, the number of waiters is large, and NUMA cache-line bouncing on the semaphore itself adds latency on top of the actual work.

The Fix the Runtime Should Apply

The correct mitigation is to call mount() with MS_REC | MS_PRIVATE on the new namespace’s root immediately after namespace creation, before any other mount operations. This removes the new namespace from all shared peer groups:

/* After clone/unshare with CLONE_NEWNS, but before any bind mounts: */
if (mount(NULL, "/", NULL, MS_REC | MS_PRIVATE, NULL) == -1)
        err(1, "MS_REC | MS_PRIVATE on /");  /* must abort setup: every
                                                later bind mount would
                                                otherwise propagate */

With this call in place, subsequent bind mounts within the namespace do not call propagate_mnt() at all, because there are no peers. The cost of each bind mount drops from O(peer_group_size) to O(1). The reduction is not just in total work; it also dramatically reduces the time each operation holds namespace_sem, which cuts contention for all other concurrent namespace operations.

The OCI runtime spec includes a rootfsPropagation field for exactly this purpose. runc supports it. The issue is that the default propagation type, when not explicitly configured, inherits whatever the parent namespace had. On a systemd host with a shared root, that default is shared propagation. Container deployments that do not set rootfsPropagation: private are silently opting into full peer group propagation for every container they start.
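In an OCI bundle, the field lives under the top-level linux object of config.json; per the runtime spec, the valid values are shared, slave, private, and unbindable. A minimal fragment (version string and paths are illustrative):

```json
{
  "ociVersion": "1.0.2",
  "root": { "path": "rootfs" },
  "linux": {
    "rootfsPropagation": "private"
  }
}
```

runc applies the configured propagation to the new namespace before performing the container's bind mounts, which is exactly the window in which it matters.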

Containerd’s default CRI plugin configuration passes through to runc’s defaults. Kubernetes deployments that rely on containerd without explicit runtime class configuration therefore inherit the problematic default. Setting the propagation explicitly in pod spec volume mount configurations, or in the container runtime’s global configuration, is the deployment-level fix that costs nothing in kernel patches.
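For the per-volume knob, the pod spec's volumeMounts entries take a mountPropagation field: None maps to private propagation in the kernel's terms and is the default, HostToContainer maps to rslave, and Bidirectional to rshared. A hypothetical pod fragment making the default explicit:

```yaml
# Illustrative pod spec; names and paths are placeholders.
apiVersion: v1
kind: Pod
metadata:
  name: example
spec:
  containers:
  - name: app
    image: example-image
    volumeMounts:
    - name: data
      mountPath: /data
      mountPropagation: None   # private: no peer-group membership
  volumes:
  - name: data
    hostPath:
      path: /srv/data
```

Note that this governs individual volume mounts; the rootfs propagation itself is set at the runtime layer, not in the pod spec.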

How Other Systems Avoided This

FreeBSD jails do not have a peer group model. Filesystem isolation in jails is implemented by restricting the jail’s visible filesystem to a subtree of the host, with nullfs providing bind-mount-equivalent functionality. Because there is no event propagation between jails, there is no per-operation O(n) walk over all active jails. Isolation is enforced at the VFS lookup level rather than through namespace copying and propagation.

The trade-off is expressiveness. Linux shared subtrees can model relationships that FreeBSD jails cannot: a slave mount that receives events from the host without sending events back is useful for scenarios where you want network filesystem mounts on the host to automatically appear inside containers. FreeBSD’s simpler model gives up that capability in exchange for a flatter, more scalable isolation primitive.

systemd-nspawn gets the Linux behavior right by default. Before setting up the container’s mount tree, it remounts the tree non-shared, the equivalent of MS_REC | MS_PRIVATE. The number of bind mounts it creates by default is also smaller than runc’s, because it sets up /proc and /sys fresh inside the container rather than masking subtrees of the host’s versions. Both choices shrink the peer group and the mount count simultaneously.

Reading mountinfo Under Lock

The second layer of the problem is /proc/self/mountinfo. Every container runtime, Kubernetes node agent, monitoring tool, and service using libmount reads this file to inspect mount state. The kernel generates its contents by walking the namespace’s mount list in fs/proc_namespace.c, holding namespace_sem as a reader for the entire duration:

static void *m_start(struct seq_file *m, loff_t *pos)
{
    struct proc_mounts *p = m->private;
    down_read(&namespace_sem);
    return seq_list_start(&p->ns->list, *pos);
}

static void m_stop(struct seq_file *m, void *v)
{
    struct proc_mounts *p = m->private;
    up_read(&namespace_sem);
}

Because Linux’s rw_semaphore is fair, a writer waiting to acquire the lock will block new readers. A container startup calling clone(CLONE_NEWNS), which takes the write lock, causes every concurrent mountinfo read across the entire system to stall until the clone completes. If monitoring agents are reading mountinfo once per second for every running container, and container startups are happening continuously, the convoy effect is persistent rather than transient.

The practical mitigation is event-driven monitoring rather than timer-based polling. procfs does not deliver inotify events for mountinfo; instead, the kernel flags the open file descriptor as POLLERR | POLLPRI through poll() or epoll whenever the mount set changes. Tools that implement this pattern read and cache the file once, then re-read only on notification. During steady-state operation, when containers are running but not starting, the mount set does not change and mountinfo reads drop to near zero.

Some versions of libmount support this mode via mnt_monitor_enable_kernel(). Adoption in container runtimes and Kubernetes components has been uneven. runc’s use of moby/sys/mountinfo opens and parses the file fresh on each call, with no caching or change-notification integration. Fixing this in the runtime would reduce read frequency without requiring any kernel changes.

Kernel-Level Progress

The kernel community is aware of the scalability limits. Al Viro’s ongoing VFS work has reduced the scope of namespace_sem hold times incrementally across several recent kernel versions. The direction for mountinfo reads is toward RCU-based access that does not require holding the global semaphore at all, allowing reads to proceed without blocking writers.

Fully per-namespace locking for mount operations is harder. Propagation genuinely requires coordinating across namespaces, and doing that correctly with per-namespace locks requires either a carefully ordered multi-lock acquisition protocol or deferred propagation, both of which introduce significant complexity and risk. The global lock exists because it was the conservative, correct solution for 2006-era workloads. Replacing it while maintaining correctness under propagation is not a small undertaking.

In the meantime, the container deployment space has the tools it needs to work around the problem. Set rootfsPropagation: private. Minimize bind mounts per container. Reduce mountinfo read frequency with poll()-based change notification instead of timer-driven re-reads. These interventions do not require kernel changes, and their combined effect is significant enough that the worst-case scaling behavior becomes manageable on current hardware while the kernel catches up with a proper architectural fix.
