When Your Linux Kernel Becomes the Bottleneck: Container Density and the Mount Namespace Problem
Source: lobsters
Netflix’s recent Mount Mayhem post is one of those performance investigations that starts with an innocuous symptom ("why does throughput plateau before we run out of CPU?") and ends somewhere deep in the Linux kernel’s filesystem code. The short answer is mount namespaces. The longer answer involves seqlocks, NUMA cache-line contention, OverlayFS mount multiplication, and a set of kernel data structures designed before anyone imagined running hundreds of isolated containers per host.
This post is not a summary of the Netflix article. It is an attempt to explain the kernel mechanics that make this problem hard, trace how the Linux community has been responding, and think through what it means for people building systems around container density.
The VFS Mount Table Is Older Than Your Container Runtime
Every path lookup on Linux, every open(), stat(), or execve(), passes through the Virtual Filesystem layer. Part of that traversal involves checking the mount table: a kernel-internal hash table called mount_hashtable (in fs/namespace.c) that maps (parent vfsmount, dentry) pairs to child mount structs. This is how the kernel knows that /proc is a procfs mount and /dev/shm is a tmpfs, even when everything looks like one continuous tree from userspace.
The global mount_hashtable is protected by mount_lock, a seqlock. Seqlocks are a clever trick: readers do not block writers, but they do check a sequence counter before and after reading, retrying if a write happened mid-read. The assumption baked in here is that writes are rare and reads are cheap to retry. On a system with four cores running a handful of processes, that assumption holds.
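The read/retry protocol is easier to see in code than in prose. Below is a minimal, illustrative model of a seqlock in Python: the class name, field names, and the dict-as-mount-table usage are all stand-ins of mine, not the kernel's implementation, but the protocol (even counter means stable, odd means write in progress, readers retry on mismatch) is the same.

```python
import threading

class SeqLock:
    """Toy model of a Linux-style seqlock. Writers serialize among
    themselves; readers never block, they just retry if a write
    happened while they were reading."""

    def __init__(self):
        self.seq = 0                   # even = stable, odd = write in progress
        self._wlock = threading.Lock()

    def write(self, fn):
        with self._wlock:              # writers exclude each other
            self.seq += 1              # now odd: in-flight readers will retry
            fn()                       # mutate the protected data
            self.seq += 1              # even again: data stable

    def read(self, fn):
        while True:
            start = self.seq
            if start % 2:              # a write is in progress, spin
                continue
            result = fn()              # speculative read
            if self.seq == start:      # no writer intervened
                return result
            # otherwise retry: we may have seen torn state

# A dict standing in for the mount table:
lock = SeqLock()
table = {}
lock.write(lambda: table.update({("/", "proc"): "procfs"}))
fstype = lock.read(lambda: table.get(("/", "proc")))
```

The pathology described in the next section falls directly out of the `read` loop: every `write` bumps `seq`, and every reader that was mid-`fn()` throws its work away and spins again, touching the shared counter each time.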
On a 2-socket AMD EPYC 9654 with 192 hardware threads running 300 containers, it falls apart completely.
Why NUMA Turns Seqlocks Into Serialization Points
A seqlock’s sequence counter is a single memory location. When any CPU writes to it, the cache line containing it is invalidated everywhere. Every other CPU that was in the middle of reading must retry. On a single NUMA node, this is a few nanoseconds. Across NUMA nodes, invalidating a shared cache line requires a cross-socket coherence message, which on modern Intel and AMD platforms costs roughly 100-300ns depending on topology.
With hundreds of containers each issuing path lookups constantly, the mount_lock seqlock becomes a cross-NUMA serialization point. CPUs on socket 1 are continuously invalidating cache lines that CPUs on socket 0 are trying to read, and vice versa. The kernel’s sys time climbs while user work stagnates. A perf record profile shows time accumulating in __legitimize_mnt, mnt_want_write, and lock_mount, none of which correspond to any actual work being done.
This is sometimes loosely called false sharing, though strictly it is contention on genuinely shared data at the architectural level: nothing is logically wrong, but the hardware coherence protocol is paying a tax that scales with CPU count rather than with work done.
OverlayFS Makes the Problem Worse
Containers typically use OverlayFS to layer a read-write upper layer on top of a read-only image lower layer. From the kernel’s perspective, a single overlay mount is not one mount struct: it is several. There is the overlay mount itself, plus the underlying mounts for each layer. A container with a multi-layer image can contribute five, ten, or more mount structs to the mount_hashtable.
Multiply that across 300 containers and you have potentially thousands of entries in the hash table. Hash chains grow longer. Lookups that should be O(1) become O(n) walks under the seqlock. The per-mount mnt_count reference counts, touched on every mntget() and mntput(), are themselves individual atomic variables that bounce between CPU caches.
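The chain-length arithmetic is worth making concrete. This sketch distributes synthetic (parent, dentry)-style keys across a fixed-size table; the 256-bucket size and integer keys are assumptions of mine for illustration (the real mount_hashtable is sized at boot based on memory and hashes pointer pairs), but the scaling behavior is the point.

```python
def chain_stats(num_mounts, buckets=256):
    """Bucket synthetic mount keys into a fixed-size hash table and
    report chain lengths. Bucket count and key shape are illustrative
    stand-ins for the kernel's mount_hashtable."""
    table = [0] * buckets
    for i in range(num_mounts):
        key = (i // 10, i)             # fake (parent mount, dentry) pair
        table[hash(key) % buckets] += 1
    return {"avg": num_mounts / buckets, "max": max(table)}
```

With 30 mounts, the average chain is a fraction of an entry and lookups are effectively O(1). With 3000 mounts (300 containers times roughly 10 overlay-related mounts each), the average lookup walks a chain of about a dozen entries, every step of it under the seqlock.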
There is also a secondary effect. Tools that read /proc/self/mountinfo to understand the mount topology (systemd, findmnt, container runtimes performing setup) become proportionally slower as mount count grows. A namespace with 3000 mounts produces a /proc/self/mountinfo that takes non-trivial time just to parse, and anything that does this on a hot path will show up in latency profiles.
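Counting mounts per namespace is cheap to do yourself, and it is the first number worth tracking on a dense host. A minimal parser, with the field layout taken from proc(5) (the variable-length optional fields end at the "-" separator); the sample lines are made up:

```python
def parse_mountinfo(text):
    """Parse /proc/self/mountinfo content into (mount_point, fstype)
    pairs. Per proc(5): field 5 is the mount point, and the fstype
    is the first field after the "-" separator."""
    mounts = []
    for line in text.strip().splitlines():
        fields = line.split()
        sep = fields.index("-")        # end of optional fields
        mounts.append((fields[4], fields[sep + 1]))
    return mounts

sample = """\
22 27 0:20 / /proc rw,nosuid - proc proc rw
28 27 0:23 / /dev/shm rw,nosuid - tmpfs tmpfs rw
31 27 0:4 / /run/netns rw - nsfs nsfs rw
"""
```

In real use you would feed it `open("/proc/self/mountinfo").read()` and alert when the count per namespace, or the total across the host, crosses a threshold you have validated against your hardware.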
The Kernel’s Response: Christian Brauner’s Mount API
The Linux kernel community has been aware of mount namespace scalability problems for years. Christian Brauner, who now co-maintains the VFS subsystem, has been the most active contributor to rethinking how mounts work.
The new mount API, introduced in kernel 5.2, supplements the old mount() system call with a set of file-descriptor-based operations: fsopen(), fsconfig(), fsmount(), and move_mount(). The fd-based approach enables atomic configuration of complex mount trees before they become visible, reducing the window during which intermediate states can trigger lookups against partially-constructed namespaces.
More directly relevant, Brauner’s work on mount ID uniqueness and namespace handling has addressed some of the reference counting overhead. Al Viro, the long-time VFS maintainer, has made incremental improvements to mnt_want_write to reduce atomic operations on the read path.
Kernel 6.8 introduced improvements to how mount namespace IDs are allocated and tracked, reducing some per-namespace overhead. These are not dramatic rewrites; they are careful, incremental changes to code that thousands of production systems depend on.
Practical Mitigations That Don’t Require Kernel Patches
Netflix’s investigation, like most good systems debugging writeups, is useful precisely because it documents what actually worked without waiting for upstream fixes.
The most impactful option where security policy permits is sharing mount namespaces across containers on the same host. If containers share a namespace, there is one set of mount structs rather than N, and the hash table pressure scales with distinct images rather than container count. This trades isolation for density, which is a real trade-off, but it is the right trade-off for workloads where the containers are trusted or where the isolation boundary is enforced at another layer.
Reducing the number of bind mounts and overlay layers per container also helps. A container image with three layers instead of ten contributes meaningfully fewer entries to the mount_hashtable. Build practices that minimize layer count, using multi-stage builds and merging RUN steps, have a direct performance consequence at container density.
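The interaction between these two levers (namespace sharing and layer count) is simple multiplication, and a back-of-the-envelope model makes the trade-off concrete. The per-container accounting below (one overlay mount, one mount per layer, a handful of runtime bind mounts) is an illustrative assumption of mine, not a measured breakdown:

```python
def host_mount_estimate(containers, image_layers, bind_mounts=3,
                        shared_namespace=False, distinct_images=1):
    """Rough mount_hashtable entry count for one host. Per-container
    cost model (assumed): the overlay mount itself + one mount per
    image layer + runtime bind mounts (resolv.conf, hostname, ...)."""
    per_container = 1 + image_layers + bind_mounts
    if shared_namespace:
        # one mount tree per distinct image instead of per container
        return distinct_images * per_container
    return containers * per_container
```

Under this model, 300 isolated containers with 10-layer images contribute 300 × 14 = 4200 entries, while the same fleet sharing namespaces across 12 distinct images contributes 12 × 14 = 168; and cutting images from ten layers to three shrinks the isolated case by half even without sharing.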
Using pivot_root instead of chroot in container setup is a smaller but consistent improvement. pivot_root actually swaps the namespace’s root mount, which lets the old root (and with it, the host’s entire mount tree) be detached and dropped from the namespace, whereas chroot only changes the process’s view while leaving every host mount visible to the VFS mount traversal code.
Tuning container runtimes (containerd, runc, crun) to avoid gratuitous bind mounts of host paths is another lever. Default runtime configurations often bind-mount things like /etc/hostname, /etc/resolv.conf, and /etc/hosts as separate mounts. Collapsing these into the image or using tmpfs overlays instead of bind mounts reduces mount count.
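Auditing a runtime for this is mechanical: the OCI runtime spec puts every mount in the `mounts` array of config.json, with bind mounts expressed via the `bind`/`rbind` options (or a `bind` type). A small counter, with a made-up sample spec:

```python
import json

def count_bind_mounts(config_json):
    """Count bind mounts in an OCI runtime spec (config.json).
    Field names follow the OCI runtime spec; the sample is invented."""
    spec = json.loads(config_json)
    return sum(
        1 for m in spec.get("mounts", [])
        if m.get("type") == "bind"
        or {"bind", "rbind"} & set(m.get("options", []))
    )

sample_spec = json.dumps({"mounts": [
    {"destination": "/etc/resolv.conf", "type": "bind",
     "source": "/run/containers/resolv.conf", "options": ["rbind", "ro"]},
    {"destination": "/etc/hostname", "type": "bind",
     "source": "/run/containers/hostname", "options": ["rbind", "ro"]},
    {"destination": "/dev/shm", "type": "tmpfs",
     "source": "tmpfs", "options": ["nosuid", "nodev"]},
]})
```

Run it over the specs your runtime actually generates (not just the ones you wrote) before and after a config change, multiplied by container count, and you have a direct measure of the pressure you are adding to or removing from the mount table.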
What This Reveals About Container Scheduling Assumptions
The deeper issue here is not a bug. The Linux VFS mount machinery works correctly. The problem is that the design assumptions about mount table size and lock contention rate were made in an era when a single machine ran tens of processes, not hundreds of isolated containers each with their own namespace.
Container orchestration systems like Kubernetes schedule based on CPU and memory requests. They do not account for mount namespace pressure, mount_hashtable chain length, or cross-NUMA seqlock traffic. A host can appear to have available capacity by all conventional metrics while actually being saturated at the kernel VFS layer. This is a gap between the abstraction that container schedulers operate on and the physical constraints of the hardware they schedule against.
The Netflix findings suggest that container density should be modeled not just as CPU and memory utilization but as a function of mount namespace complexity per container times container count per host. Nobody has good tooling for this yet. perf can show you the symptom; attributing it to mount namespace pressure requires knowing what to look for.
For anyone building container platforms on modern multi-socket systems, the practical takeaway is to instrument for kernel sys time separately from user time, to track mount count per namespace and total mounts per host, and to validate container density targets on real hardware that matches production NUMA topology rather than assuming linear scaling from development environments.
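The sys-versus-user split is already exposed by the aggregate `cpu` line of /proc/stat, with fields documented in proc(5) (user, nice, system, idle, iowait, irq, softirq, steal, in USER_HZ ticks). A sketch of the metric; note this computes a since-boot fraction, and a live dashboard would sample twice and take deltas:

```python
def sys_time_fraction(proc_stat_cpu_line):
    """Fraction of non-idle CPU time spent in the kernel, from the
    aggregate 'cpu' line of /proc/stat. Counting irq/softirq as
    kernel time is a choice made here, not mandated anywhere."""
    fields = [int(x) for x in proc_stat_cpu_line.split()[1:9]]
    user, nice, system, idle, iowait, irq, softirq, steal = fields
    busy = user + nice + system + irq + softirq + steal
    return (system + irq + softirq) / busy
```

A host whose kernel fraction climbs as container count rises, while user throughput stays flat, is showing exactly the saturation pattern this post describes; pairing that signal with the per-host mount count closes the attribution gap that perf alone leaves open.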
The kernel will keep improving. Brauner and the VFS team are doing serious work. But the gap between the kernel that shipped with your distro and the kernel with those improvements is usually measured in months to years, and the systems running your workloads do not always get upgraded on a comfortable timeline. Understanding what the kernel is doing underneath your container runtime is not optional when you are pushing density hard enough to find these edges.