Container Runtimes Have a Better Mount API. Most Aren't Using It Yet.

The Netflix Tech Blog’s Mount Mayhem post traces a performance cliff to global kernel locks in the mount namespace subsystem: namespace_sem, mnt_id_lock, and the mount_lock seqlock all serialize in ways that become catastrophic on 192-core NUMA machines running hundreds of containers. The root cause is well described there. What the article leaves implicit is that part of this problem flows directly from the design of the old mount(2) syscall, and that Linux has shipped a structurally better alternative since kernel 5.2. The runtime adoption story is where things get interesting.

Why mount(2) Forces Long Lock Hold Times

The legacy mount(2) syscall packs configuration and attachment into a single atomic step. The kernel makes the mount visible in the namespace tree while holding namespace_sem for write, which means the lock scope covers everything: parsing option strings, opening underlying directories, initializing the filesystem’s inode cache, inserting into the mount hash table, and allocating a new mount ID through the global mnt_id_lock spinlock. The API design forces this. Because there is no way to separate “set up the filesystem” from “attach it to the namespace,” the write lock has to cover both.

For a single OverlayFS mount, that critical section might run for tens of microseconds. For 50 containers starting concurrently, each requiring 8 to 12 mounts (an overlay rootfs, bind mounts for /dev, /proc, /sys, and application volumes), you sustain a period where the write lock is held almost continuously. Cores queue behind each other across NUMA nodes, and the cross-socket coherence traffic for the lock cache line amplifies the wait time further. The CPUs are not doing useful work; they are waiting for a lock to free up so they can spend a few microseconds holding it.

The Fd-Based Replacement

Linux 5.2 introduced six new syscalls that decompose the old mount(2) operation into stages. The kernel documentation covers the mount API in detail, and the patchset from David Howells makes the design rationale explicit.

The key syscalls:

  • fsopen(2): Open a filesystem context for a named type. Returns a file descriptor. No namespace locks acquired.
  • fsconfig(2): Set attributes on the context through the fd. No namespace locks.
  • fsmount(2): Instantiate a mounted filesystem object from the configured context. Returns a mount fd. The filesystem is fully initialized here, but not yet visible in any namespace tree.
  • move_mount(2): Attach the mount fd to a location in the namespace. This is the only step that requires a brief namespace_sem write acquisition.
Side by side, the two styles look like this (the fsopen() family has glibc wrappers since 2.36; on older libcs the calls go through syscall(2). Error handling is omitted for brevity):

/* Old API: full setup and attachment under write lock */
mount("overlay", "/container/rootfs", "overlay", 0,
      "lowerdir=/layer1:/layer2,upperdir=/upper,workdir=/work");

/* New API: setup outside the lock, attachment inside */
int fs_fd = fsopen("overlay", FSOPEN_CLOEXEC);
fsconfig(fs_fd, FSCONFIG_SET_STRING, "lowerdir", "/layer1:/layer2", 0);
fsconfig(fs_fd, FSCONFIG_SET_STRING, "upperdir", "/upper", 0);
fsconfig(fs_fd, FSCONFIG_SET_STRING, "workdir", "/work", 0);
fsconfig(fs_fd, FSCONFIG_CMD_CREATE, NULL, NULL, 0);
int mnt_fd = fsmount(fs_fd, FSMOUNT_CLOEXEC, 0);

/* Only this call acquires namespace_sem for write, and only briefly: */
move_mount(mnt_fd, "", AT_FDCWD, "/container/rootfs",
           MOVE_MOUNT_F_EMPTY_PATH);
close(mnt_fd);
close(fs_fd);

The locking consequence is substantial. With the old API and 50 containers starting in parallel, each with 10 mounts, you have 500 write-lock acquisitions each covering the full setup duration. With the new API, you still have 500 move_mount calls that need the write lock, but each critical section covers only the tree attachment. The filesystem initialization work, which dominates the old lock hold time, happens concurrently across all cores without competing for namespace_sem. Mount ID allocation via mnt_id_lock can also be moved outside the primary attachment path when the new API is used, reducing contention on that spinlock as well.

Runtime Adoption

The new API has been available since kernel 5.2, released in July 2019. Six years is a long time for production adoption to remain incomplete.

runc, the reference OCI runtime, still uses legacy mount(2) in its primary path for most mount types. There are open issues and early patches on the runc repository discussing the migration, but as of early 2025 the default build path continues to call the old syscall. Part of the friction is the OCI runtime spec itself: the spec’s mount configuration structure maps naturally to mount(2)’s argument layout, and translating it to the fd-based API requires runtime-side logic that has no spec-level definition yet.

crun, the C-based OCI runtime developed primarily by Giuseppe Scrivano at Red Hat, has progressed further. Where the underlying libmount library integrates with the new syscalls, crun takes advantage of it. Scrivano has been one of the more active contributors pushing new-API adoption in userspace tooling. Beyond the mount API, crun avoids the Go runtime overhead that runc carries; the per-container startup cost is lower, which means less time competing for locks even on the legacy path.

containerd delegates actual mount operations to whichever OCI runtime it invokes, so its behavior tracks runc or crun depending on configuration. The containerd project has done significant work on OverlayFS snapshot management and image unpacking, but the mount attachment path in its runc-based configuration still goes through the old syscall.

Podman and its libpod layer present the same picture: when using crun as the runtime, you get whatever improvements crun provides; when using runc, you do not.

The kernel requirement is one blocker. The new API needs kernel 5.2 at minimum. Production fleets on Amazon Linux 2 (kernel 5.4 baseline) have it available; fleets on 4.19-based distributions do not. Most enterprise distributions in active use as of 2025 ship kernels where the new API is present, so this is less of a constraint than it was in 2020, but runtime code still needs to handle fallback to mount(2) for older kernels, which adds implementation complexity.

Sandboxed Runtimes as an Architectural Alternative

Two other runtimes handle this differently at the architectural level rather than the syscall level.

gVisor runs each container with a user-space kernel, called the Sentry, that implements the Linux syscall ABI. Mount operations inside a gVisor container are handled by the Sentry’s own VFS, not by the host kernel’s VFS. The host kernel sees a single process per container rather than per-container mount namespace entries in its mount_hashtable. The host’s namespace_sem, mnt_id_lock, and mount_lock seqlock do not see traffic proportional to container mount count.

The trade-off is syscall interposition overhead: every syscall crosses the gVisor layer, which costs roughly 2 to 3 times native for I/O-heavy workloads. For CPU-bound or network-bound containers on high-density hosts where mount namespace contention is the primary constraint, gVisor’s architecture removes that constraint entirely. It is not the right choice for all workloads, but the performance profile is not uniformly worse.

Kata Containers takes a different approach: each container runs in a lightweight VM with its own kernel. The host sees virtio block device accesses, not mount namespace operations. The per-container VM overhead is higher than gVisor (primarily in memory), but the host VFS lock contention disappears by construction.

Neither is a drop-in replacement for runc in the general case. But they demonstrate that the lock contention is not inherent to container isolation. It is a consequence of shared-kernel namespace isolation combined with an API that forces long lock hold times.

The Practical Path Forward

For operators on kernels 5.2 or later who want improvement without waiting for kernel 6.8, switching from runc to crun as the container runtime is the most accessible option that does not require kernel modifications. Combined with an aggressive reduction in per-container mount count, that switch captures most of the mitigation available from userspace.

Minimizing per-container mount count deserves emphasis because it affects both old and new API paths. Every bind mount avoided is one fewer move_mount (or legacy mount) call. Default runtime configurations frequently add individual bind mounts for /etc/hostname, /etc/resolv.conf, and /etc/hosts. Consolidating these into the image, or using tmpfs overlays that merge into fewer mounts, reduces both the lock acquisition count and the time spent in namespace_sem per container startup.

The kernel-side fix from Christian Brauner’s work in 6.8 provides per-namespace locking and is the most complete solution. The migration path from “merged in 6.8” to “running in your production container hosts” is typically 18 to 36 months on enterprise distributions. The fd-based mount API has been available for six years and still has incomplete runtime adoption. Both gaps reflect how kernel infrastructure improvements actually reach production systems: slowly, through distribution releases and runtime migration work, not through kernel version upgrades alone.

For platform teams planning capacity on modern multi-socket hardware, the mount contention problem is predictable: container count, per-container mount count, and kernel version together determine whether it becomes visible. The most useful thing to do with this information is to measure kernel sys time during container-start bursts before reaching production density, rather than after.
