Container Density and the Third Wave of Linux Global Lock Bottlenecks
Source: lobsters
The Netflix Tech Blog recently published a detailed investigation into why container workloads on 192-core AMD EPYC hosts burned CPU in ways that defied conventional metrics. The investigation ended at mount namespace lock contention: a global read-write semaphore in fs/namespace.c that serializes all mount operations across the entire system. The specific mechanism is novel to most readers, but the shape of the problem, its cause, and the trajectory of the fix are recognizable. This exact scenario has played out before in Linux kernel history, at least twice, and the playbook is consistent each time.
The Big Kernel Lock: The Archetype
When Linux gained symmetric multiprocessing support in version 2.0, the kernel needed mutual exclusion between CPUs. Before SMP, kernel code ran on one processor and disabling interrupts was sufficient. Adding more CPUs required something broader, and the answer was the Big Kernel Lock: a single global spinlock that any kernel code path could acquire when it needed to do something that had to be serialized. The BKL was explicitly framed as transitional scaffolding, to be replaced incrementally with per-subsystem locking as code was audited.
That audit took roughly 15 years. The last BKL users were removed in Linux 2.6.39 in 2011. The removal was not a single heroic refactor but a years-long campaign by Arnd Bergmann, Ingo Molnar, and many others, working through hundreds of individual code paths. What made it hard was not the technical complexity of any single removal, but the number of places where code had quietly depended on the global lock without saying so. Entire subsystems had correctness properties that only held because the BKL excluded concurrent access, and discovering this required reading code that had not been touched in years.
The BKL established the pattern: a global coarse lock, invisible in low-contention scenarios, catastrophic when workload density scales beyond its design assumptions, and slow to eliminate because the fix requires auditing every code path that relies on it.
Cgroups v1: The Pattern’s First Container-Era Recurrence
Control groups arrived in Linux 2.6.24 to provide resource accounting and enforcement for groups of processes. The v1 design gave each resource controller (CPU, memory, block I/O, and others) its own separate hierarchy. The data structure binding tasks to their cgroup membership was css_set, protected by css_set_lock, a global spinlock. Fork, exec, task migration between cgroups, and cgroup creation and deletion all required holding this lock in write mode.
For the workloads of 2008, this was fine. A deployment might have a handful of cgroup subtrees for a few server processes, with task migration happening rarely. Contention was negligible and the lock did not appear in profiles.
Container runtimes rewrote the usage pattern. Each container required its own cgroup subtrees, one per controller. A host running 500 containers had thousands of cgroup objects and a constant stream of creation and deletion events during rolling deployments. On high-core-count hosts, css_set_lock write contention became a dominant bottleneck, with the profile characteristic of contended global locks: a high contention count, short average hold time, and very long average wait time. The lock itself was fast to acquire when uncontested; the problem was the depth of the queue waiting for it.
The response was cgroups v2, a redesign that unified the hierarchy structure and rethought the locking model around per-cgroup granularity rather than a global per-subsystem spinlock. The design work started appearing in Linux 3.16 in 2014. A stable, production-usable v2 implementation arrived around Linux 4.5 in 2016. Container runtimes began migrating between 2019 and 2021. Kubernetes defaulted to cgroups v2 in version 1.25, released in 2022. The timeline from “the kernel has a better design” to “this is the default in major container orchestrators” was approximately eight years.
Cgroups v1 is still present in current kernels. The migration is not complete.
Mount Namespaces: The Current Instance
Mount namespaces appeared in Linux 2.4.19 in 2002 as the mechanism that gives each container its own view of the filesystem tree. The locking model used namespace_sem, a global read-write semaphore in fs/namespace.c, to protect all namespace modification operations: mount(2), umount(2), unshare(CLONE_NEWNS), and clone(CLONE_NEWNS). Write-locking the semaphore for the duration of a mount or namespace clone was correct and reasonable when a system might have a handful of isolated environments and mount operations were infrequent.
On a 192-core EPYC host running 400 containers, traffic-driven scaling events trigger thousands of mount operations, all serializing through the same global semaphore. Container initialization alone involves creating an overlayfs mount, binding /dev, /proc, and /sys subtrees, mounting volumes and secrets, and setting up devpts and tmpfs entries. A typical container startup sequence acquires namespace_sem in write mode ten to thirty times. With hundreds of containers starting concurrently, the queue of threads waiting for write access to the semaphore becomes the primary consumer of CPU time rather than any application work.
The NUMA topology of modern servers compounds this significantly. The cache line holding namespace_sem lives on one NUMA node. Cores on a remote socket pay 100-300 nanoseconds of additional latency per lock acquisition, on top of the contention wait time. Adding CPU cores worsens the problem rather than helping, because each new core adds another waiter to the queue and increases the rate of cross-socket cache line invalidations.
A perf lock record session followed by perf lock report surfaces this immediately. The signature is unambiguous: namespace_sem appears near the top of the contention list with write contention counts in the hundreds of thousands and average wait times an order of magnitude longer than hold times.
The Fix Follows the Same Two Stages
The Linux kernel’s response to this class of problem follows a consistent trajectory, visible now across all three instances.
The first stage is a new API that reduces critical section scope for new callers without breaking existing ones. For mount namespaces, this is the new mount API introduced in Linux 5.2: fsopen(), fsconfig(), fsmount(), and move_mount(). These syscalls decompose the old monolithic mount(2) call into stages where configuration and instantiation happen outside the global lock scope, and only the final attachment step requires brief write access. Container runtimes that adopt the new API spend a fraction of the time holding namespace_sem compared to legacy mount(2) callers. The analogous first stage for cgroups was the early v2 implementation that existing tools could ignore while new tooling was built around it.
The second stage is converting the global lock to per-object granularity. For cgroups, this was the full v2 redesign with per-cgroup locking. For mount namespaces, Christian Brauner’s work across Linux 6.8 and 6.9 converted namespace_sem from a system-wide semaphore to per-namespace locking, so that mount operations in independent containers no longer contend with each other at all. The mailing list work for this spanned multiple development cycles and required careful audit of namespace cloning semantics, pivot_root behavior, and bind mount propagation modes, all of which had correctness properties that depended on the old global lock scope.
The second stage always takes longer, and the gap between landing in mainline and reaching production is substantial. Most container fleets run 5.15 or 6.1 LTS kernels. The 6.8 improvements will not reach those fleets on any short timeline.
The Gap That Keeps Mattering
Netflix’s investigation is useful not only for documenting the specific bottleneck but for illustrating a structural gap in how container density is modeled. CPU and memory utilization metrics do not capture kernel subsystem pressure. A host can report available capacity by every conventional signal while being fully saturated at the VFS locking layer. Container schedulers like Kubernetes allocate based on resource requests that have no concept of mount namespace write lock contention or cross-NUMA semaphore queue depth. The abstraction used for scheduling and the physical constraints of the hardware underneath it are not aligned for this class of problem.
The practical signals are available. Elevated kernel %sys time that does not correspond to application work is the leading indicator. perf lock report will name the lock. Mount count per namespace and total active mounts per host are trackable metrics, and both correlate directly with the severity of the bottleneck under load.
The mitigation path before the 6.8 fixes reach production is the same one Netflix documents: reduce bind mount count per container, stagger container startup concurrency, and use MNT_DETACH on unmount to shorten write lock hold time. None of these are elegant solutions; they are adaptations to a kernel design that has not yet been updated to match current hardware.
The pattern itself suggests that this will happen again. The mount namespace locking problem was invisible at 4-core container density and catastrophic at 192-core density. Other kernel subsystems have global or coarse-grained synchronization primitives built for the hardware of their era: process namespace operations, network namespace creation, and seccomp filter application all have locking models that have not been stress-tested at the concurrency levels that current hardware makes possible. The next post-mortem in this lineage has not been written yet, but the conditions that will produce it already exist.