When an Atomic Instruction Locks the Entire Bus: Split Locks on x86-64

Source: lobsters

The x86 LOCK prefix has a reputation for being expensive, but most of the time that reputation is overblown. A LOCK CMPXCHG on a properly aligned operand that hits L1D cache costs somewhere in the 5 to 8 cycle range on modern Intel and AMD microarchitectures. The cache coherence protocol handles it entirely within the die, no external signaling required. That is the common case, and it is fast enough that most code using atomics never needs to think twice.

There is a different case, however, and it sits about two orders of magnitude slower.

The Two Paths for Atomic Operations

When the processor encounters a LOCK-prefixed read-modify-write instruction, it takes one of two paths depending on where the operand falls in memory.

If the operand fits entirely within a single 64-byte cache line, the processor uses cache line locking. It acquires that line in an exclusive state through the cache coherence protocol (MESI or one of its vendor-specific variants), performs the read-modify-write, and releases it. The lock is purely logical, entirely local to the cache hierarchy. Other cores accessing different cache lines are unaffected.

If the operand straddles a cache line boundary, the processor falls back to bus locking. It asserts the LOCK# pin on the external memory bus, a physical signal that every other bus agent in the system must honor. All pending transactions drain. No new transactions may begin. The processor acquires both cache lines, performs the operation across the split, and only then releases LOCK#. The Chips and Cheese investigation puts the penalty at roughly 900 to 1,100 cycles on Skylake-era Intel hardware, compared to the 5 to 8 cycle aligned case.
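
Which path an access takes comes down to simple address arithmetic. The sketch below assumes 64-byte cache lines, as in the text; crosses_cache_line is a hypothetical helper name, not part of any library.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: 64-byte cache lines, matching the figure used in the text. */
#define CACHE_LINE_SIZE 64u

/* Returns true when the byte range [addr, addr + size) touches more than
 * one cache line, i.e. when a LOCK-prefixed access to it would be a
 * split lock rather than a cache line lock. */
bool crosses_cache_line(uintptr_t addr, size_t size)
{
    assert(size > 0);
    uintptr_t first_line = addr / CACHE_LINE_SIZE;
    uintptr_t last_line  = (addr + size - 1) / CACHE_LINE_SIZE;
    return first_line != last_line;
}
```

An 8-byte operand at offset 56 within a line ends exactly at the boundary and stays on the fast path; the same operand at offset 60 spans two lines and takes the bus lock.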

The x86 architecture maintains this bus-locking behavior for backward compatibility stretching back to the 8086, where LOCK# was the only available mechanism because there was no cache. On-chip caches arrived with the 486, and from the P6 family onward Intel documents cache locking as the optimization for locked accesses that sit entirely within one cached line; bus locking stayed as the fallback for any case that cache locking could not handle cleanly. Split accesses were exactly that case, and the fallback path has survived every microarchitecture revision since.

Why the Penalty Is So Large

The 1,000-cycle figure is not just bus arbitration overhead. Several things happen in sequence before and after the actual atomic operation.

First, the store buffer must drain completely before LOCK# can be asserted. On Skylake, the store buffer holds 56 entries; Sunny Cove expanded this to 72. Every pending write must commit to the cache hierarchy before the bus locks. If the store buffer is full, this flush alone can consume 200 to 400 cycles. This is the primary contributor to the penalty on modern out-of-order machines.

Second, the processor must acquire both cache lines in an appropriate coherence state. If either half of the split operand is in Modified state in another core’s L1 or L2 cache, the coherence traffic required to transfer ownership adds latency on top of the bus assertion overhead.

Third, LOCK# release is itself a synchronization barrier. All agents that were waiting for LOCK# to deassert begin their stalled transactions simultaneously, creating a burst of bus traffic that can delay the original thread’s subsequent operations.

The global effect is worth stating plainly. Every core on the system stalls its memory operations for the duration of one thread’s split lock. Take a single thread executing one million split locks per second at roughly 1,000 cycles each: on a 3 GHz part, the bus is locked for about a third of every second. On a 64-core server, that can cost the other 63 cores up to roughly 21 core-seconds of stalled time per second in the worst case, where all of them are waiting on memory. The thread paying the per-operation penalty is only part of the cost.
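
A rough way to estimate the system-wide cost is sketched below. The helper names are hypothetical, and the inputs (cycles per split lock, clock rate) are assumptions you would replace with measurements. The result is an upper bound, since a core that is not touching memory while the bus is locked loses nothing.

```c
#include <assert.h>

/* Fraction of each second during which the bus is held by split locks.
 * Clamped to 1.0 because the bus cannot be locked for more than all of
 * the time. */
double bus_locked_fraction(double split_locks_per_sec,
                           double cycles_per_lock,
                           double clock_hz)
{
    double fraction = split_locks_per_sec * cycles_per_lock / clock_hz;
    return fraction > 1.0 ? 1.0 : fraction;
}

/* Worst-case core-seconds lost per second of wall-clock time by the
 * cores other than the one issuing the split locks. */
double stalled_core_seconds(unsigned cores, double locked_fraction)
{
    assert(cores >= 1);
    return (double)(cores - 1) * locked_fraction;
}
```

With one million split locks per second, 1,000 cycles each, and an assumed 3 GHz clock, bus_locked_fraction comes to about 0.33, and stalled_core_seconds for 64 cores comes to roughly 21.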

Linux’s Response: split_lock_detect

Intel introduced a detection mechanism in MSR_TEST_CTRL (MSR address 0x33), bit 29, starting with Tremont (Atom-class) and Tiger Lake client hardware. Setting this bit causes an #AC (Alignment Check) fault whenever a split lock is attempted, in kernel or user mode, independent of the EFLAGS.AC flag that controls general alignment checking.

The feature is advertised via IA32_CORE_CAPABILITIES (MSR 0xCF, bit 5). The Linux kernel added support in version 5.7, controlled by the split_lock_detect boot parameter, with four modes:

  • off: detection disabled entirely
  • warn: a rate-limited kernel warning is logged, execution continues (this is the default on supported hardware)
  • fatal: SIGBUS is delivered to the offending process
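  • ratelimit:N: bus locks are capped at N per second system-wide by delaying the tasks that cause them; this mode belongs to the related bus lock detection (#DB) mechanism and does not apply to split lock #AC detection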
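
Before picking a mode, it helps to know whether the running kernel has the feature active at all. On Linux, the flags line of /proc/cpuinfo lists split_lock_detect when it does. The sketch below is a minimal whole-word matcher for such a line; has_cpu_flag is a hypothetical helper, not a kernel or libc API, and reading the file itself is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Returns true if `flag` appears as a whole, whitespace-delimited token
 * in `flags_line` (for example the "flags : ..." line of /proc/cpuinfo).
 * Substring hits such as "xsplit_lock_detect" are rejected. */
bool has_cpu_flag(const char *flags_line, const char *flag)
{
    assert(flags_line != NULL && flag != NULL);
    size_t n = strlen(flag);
    if (n == 0)
        return false;

    const char *p = flags_line;
    while ((p = strstr(p, flag)) != NULL) {
        bool start_ok = (p == flags_line) ||
                        p[-1] == ' ' || p[-1] == '\t' || p[-1] == ':';
        char end = p[n];
        bool end_ok = end == '\0' || end == ' ' ||
                      end == '\t' || end == '\n';
        if (start_ok && end_ok)
            return true;
        p += n;  /* skip past this partial match and keep searching */
    }
    return false;
}
```

Feeding it each line of /proc/cpuinfo and asking for "split_lock_detect" is enough for a startup check in a test harness.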

The trap handler in arch/x86/kernel/traps.c differentiates between a genuine alignment fault and a split lock #AC by checking the current sld_state and inspecting the faulting instruction. Known-legacy kernel paths that may produce split locks are whitelisted to avoid self-inflicted failures during boot.

This mechanism does not interact with SMAP (Supervisor Mode Access Prevention), which also uses the AC flag to prevent kernel code from accessing user memory. The split lock detection fires from the MSR path, not from EFLAGS.AC, so the two features coexist without conflict.

AMD has no equivalent MSR. AMD processors execute split locks with the same bus-locking penalty but provide no hardware-level interception before execution completes. There is no split_lock_detect=fatal analog for AMD; the kernel cannot catch AMD split locks before they take their full penalty and affect the system.

Virtualization and the Bus Lock VM Exit

The split lock problem has a particular edge in virtualized environments. A guest VM executing split locks at high frequency holds the memory bus long enough per operation that co-resident VMs experience measurable memory latency increases. This is not a software scheduling problem; the bus is physically asserted at the silicon level, below any hypervisor’s visibility.

Intel addressed this by introducing Bus Lock VM Exits in VMX, available starting with Tiger Lake Server and Ice Lake Xeon hardware. When enabled through bit 30 of the Secondary Processor-Based VM-Execution Controls, the processor exits to the hypervisor on every guest bus lock. The VMM can log the event, throttle the vCPU, or terminate the offending guest entirely. Crucially, the VM exit penalty (roughly 1,000 to 4,000 cycles) is borne by the offending guest, making high-frequency split locks in a guest self-penalizing rather than just a nuisance to neighbors.

Linux KVM support for bus lock VM exits landed in kernel 5.12. The perf kvm stat tool was extended to expose per-VM bus lock event counts, giving host operators visibility into which guests are generating excessive bus lock traffic.

AMD’s lack of a corresponding interception mechanism means AMD-based cloud infrastructure has fewer tools for containing this behavior. The asymmetry is real and has been a consideration in hardened multi-tenant deployments.

Profiling for Split Locks

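On Intel hardware with Performance Monitoring Unit support, locked loads show up in the MEM_INST_RETIRED.LOCK_LOADS event (event 0xD0, umask 0x21 on Skylake-family). This event counts all retired LOCK-prefixed loads, aligned or not, so it narrows the search to locked instructions rather than counting split locks directly. The neighboring umask 0x41 selects MEM_INST_RETIRED.SPLIT_LOADS, which counts loads that cross a cache line whether or not they are locked; a callsite that shows up under both events is a strong split lock candidate. Finding the callsites of locked loads:

perf record -e cpu/event=0xd0,umask=0x21/ -g ./program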
perf report

The specific event encoding varies by microarchitecture; consult the Intel Software Developer’s Manual Volume 3B Appendix A for the precise values for your CPU model. On AMD, there is no equivalent event that specifically targets split, bus-locked operations.

What Generates Split Locks in Practice

The most common sources are not bugs so much as inattention to alignment guarantees:

  • Structs marked __attribute__((packed)) strip padding, which can misalign fields that are later used with std::atomic or intrinsics.
  • Ring buffer implementations where head/tail counters sit at computed offsets that happen to straddle 64-byte boundaries.
  • Protocol buffer or serialization code that places atomic fields at deserialized, potentially unaligned offsets.
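  • 8-byte locked operations such as LOCK CMPXCHG8B on memory that is only 4-byte aligned, which can land across a cache line boundary. (CMPXCHG16B is a different story: it requires a 16-byte-aligned operand and raises #GP otherwise, so it faults rather than splitting.)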
  • JIT compilers that allocate memory for emitted code or data without enforcing cache-line alignment before emitting locked instructions.
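
The packed-struct case is the easiest to demonstrate. The sketch below uses two hypothetical headers (the struct and field names are invented for illustration) and assumes 64-byte cache lines and a GCC- or Clang-compatible compiler. In the packed version the counter sits at offset 61, so whenever the struct starts on a cache line boundary an 8-byte locked access covers bytes 61 through 68 and splits. The second version pushes the counter onto its own line with alignas.

```c
#include <assert.h>
#include <stdalign.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Packing strips the padding that would normally align `counter`. */
struct __attribute__((packed)) packed_header {
    uint8_t  tag[61];
    uint64_t counter;               /* offset 61: misaligned */
};

/* Same fields, but the counter is forced onto its own 64-byte line. */
struct aligned_header {
    uint8_t tag[61];
    alignas(64) _Atomic uint64_t counter;   /* offset 64 */
};

/* True when a field of `size` bytes at `offset` from a cache-line-aligned
 * base would straddle a 64-byte boundary. */
bool field_splits_line(size_t offset, size_t size)
{
    return (offset % 64) + size > 64;
}

/* Compile-time guard: fails the build if the layout ever regresses. */
static_assert(offsetof(struct aligned_header, counter) % 64 == 0,
              "counter must start on a cache line boundary");
```

The static_assert is the part worth copying: it turns an alignment assumption into a build failure instead of a runtime throughput mystery. The cost is size, since aligned_header grows to 128 bytes.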

Running with split_lock_detect=fatal in development, or split_lock_detect=warn with log monitoring in CI, will surface split locks on supported Intel hardware before they reach production. The kernel warning includes the process name, PID, and instruction pointer, which is enough to identify the callsite without needing perf.

A Compatibility Artifact Worth Understanding

The x86 split lock behavior is a direct consequence of maintaining binary compatibility with 1978 software. The 8086 had no cache, so LOCK# was a physical bus signal for every locked instruction. When caches arrived, Intel optimized the aligned case but preserved the bus-locking fallback for everything else, including split accesses, because changing the behavior would have been a compatibility break.

Nearly fifty years later, every modern x86 processor still asserts a physical bus signal and stalls every core on the system when a single instruction crosses a 64-byte boundary. The optimization has been in place for decades; the fallback has never needed to change because the fallback is the guarantee.

For most code this never matters. For code running on shared infrastructure, or in tight loops where even one misaligned atomic per iteration compounds, understanding that the LOCK prefix has two very different implementations depending on alignment is the kind of detail that turns a mysterious throughput regression into a five-minute fix.
