
The Hidden Cost of Crossing a Cache Line: x86 Split Locks Explained

Source: lobsters

When you use std::atomic or reach for a LOCK-prefixed instruction in hand-written assembly, you probably expect the hardware to handle the details quietly. On x86-64, it usually does. The exception is a locked operation whose target straddles a 64-byte cache line boundary. That is a split lock, an architectural quirk that can sit unnoticed in production systems, cause mysterious latency spikes, and be weaponized in virtualized environments.

The Chips and Cheese investigation into split locks puts concrete numbers on what the hardware actually does, and the results are striking enough to be worth unpacking in depth.

What Cache Locking Actually Means

To understand split locks, you need to understand what normally happens when a CPU executes a LOCK ADD [rax], eax.

Since the Pentium Pro (P6) in 1995, x86 processors have supported cache locking. If the target memory address is cached in the L1 data cache, and the operation is entirely contained within a single cache line, the processor does not need to assert any external bus signal. Instead, it uses the MESI cache coherence protocol: the cache line transitions to Modified or Exclusive state on the locking core, other cores’ snoop requests stall briefly, the operation completes, and nobody outside the local cache hierarchy ever knows it happened. The cost is modest: somewhere between 10 and 40 cycles depending on microarchitecture and coherency state, more than a plain cache-hit load but far less than a round trip to DRAM.

This is documented in the Intel Software Developer’s Manual, Volume 3A, Section 8.1.4:

“For the P6 and more recent processor families, if the area of memory being locked during a LOCK operation is cached in the processor that is performing the LOCK operation as write-back memory and is completely contained within a cache line, the processor may not assert the LOCK# signal on the bus.”

The operative phrase is “completely contained within a cache line.” When an operation crosses a boundary, that guarantee evaporates.

What Happens When You Cross the Boundary

Consider a 4-byte atomic add targeting address 0x3E (decimal 62). The operation reads and writes bytes 62 through 65. On a 64-byte cache line layout, byte 63 is the last byte of one line and byte 64 is the first byte of the next. No single cache line can contain both bytes 62 and 65; the operation spans two.
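The boundary arithmetic is easy to check mechanically. A minimal C++ sketch, assuming a 64-byte line size; the helper name crosses_cache_line is made up for this illustration and is not part of any library:

```cpp
#include <cstddef>
#include <cstdint>

// Assumed cache line size: 64 bytes, as in the example above.
constexpr std::uint64_t kCacheLineBytes = 64;

// Returns true when the access [addr, addr + size) touches more than one
// cache line. Hypothetical helper for illustration only.
constexpr bool crosses_cache_line(std::uint64_t addr, std::size_t size) {
    if (size == 0) return false;
    const std::uint64_t first_line = addr / kCacheLineBytes;
    const std::uint64_t last_line = (addr + size - 1) / kCacheLineBytes;
    return first_line != last_line;
}

// The 4-byte access at 0x3E covers bytes 62..65 and spans two lines.
static_assert(crosses_cache_line(0x3E, 4), "bytes 62..65 span two lines");
// The same access two bytes earlier covers bytes 60..63 and fits in one.
static_assert(!crosses_cache_line(0x3C, 4), "bytes 60..63 fit in one line");
```

A check like this can sit in a debug assertion in front of a suspect atomic, with the pointer converted to an integer first.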

The processor cannot cache-lock two lines simultaneously. Instead, it falls back to bus locking: asserting the LOCK# signal (or its modern interconnect equivalent) to freeze the entire memory subsystem for the duration of the operation. Both cache lines are fetched. The read-modify-write completes across both. The lock releases. Normal operation resumes.

The latency penalty for this is not a small constant. Empirical measurements place split lock latency at roughly 60 to 100 times the cost of an equivalent aligned operation on modern Intel hardware. Where an aligned LOCK XCHG might cost 25 to 40 cycles, a split lock can cost 1,500 to 2,500 cycles on the same core.

That alone would be a curiosity worth noting. What makes split locks genuinely dangerous is what “bus locking” means in a multi-core context.

The System-Wide Stall

When the LOCK# signal is asserted on a modern processor, the scope is not limited to the locking core. The entire memory subsystem serializes. Every other core on every socket must drain its pipeline and pause memory operations. DMA transfers pause. Interrupt delivery may be delayed. The memory controller services only the locking transaction until the signal deasserts.

The implication scales brutally with core count. On a 2-core system, one split lock stalls one other core. On a 64-core system, it stalls 63. On a modern dual-socket server with 96 cores total, a single misaligned atomic in a hot path can waste roughly 95 core-cycles of useful work for every cycle it occupies the bus.
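To make the scaling concrete, here is a back-of-the-envelope C++ sketch. It assumes, as a simplification, that every other core is fully stalled for the whole duration of the bus lock; the function name and the 2,000-cycle figure (picked from the 1,500 to 2,500 range quoted earlier) are illustrative, not measurements:

```cpp
#include <cstdint>

// Core-cycles of other cores' work lost to one bus lock, assuming every
// other core is stalled for the entire lock duration (a simplification).
constexpr std::uint64_t stalled_core_cycles(std::uint64_t total_cores,
                                            std::uint64_t lock_cycles) {
    return total_cores == 0 ? 0 : (total_cores - 1) * lock_cycles;
}

// 96 cores: each cycle on the bus costs 95 core-cycles elsewhere.
static_assert(stalled_core_cycles(96, 1) == 95, "95 other cores stalled");
// One 2,000-cycle split lock on the same machine: 190,000 core-cycles.
static_assert(stalled_core_cycles(96, 2000) == 190000, "95 * 2000");
```

Under this model the cost of a single split lock is linear in core count, which is why the same bug that goes unnoticed on a laptop can dominate a profile on a large server.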

This is not a hypothetical edge case. A tight loop issuing split locks at moderate frequency, say millions of operations per second from a single thread, can collapse overall system memory throughput by 50 to 90 percent on a loaded multi-core machine. The thread doing the split locks may appear to be running normally from its own perspective, while every other thread suffers.

How Split Locks Happen in Practice

In most codebases, split locks do not appear intentionally. The common sources are:

  • Struct packing or #pragma pack that removes natural alignment guarantees from fields used in atomic operations
  • Memory allocators that satisfy size but not alignment requirements, particularly for small allocations
  • Code ported from 32-bit x86 where alignment assumptions from narrower data types no longer hold
  • Stack allocations for types that get used atomically without explicit alignment attributes
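The first item is the easiest to reproduce. A C++ sketch with a hypothetical header layout (the struct and field names are invented for illustration): under #pragma pack(1), a 4-byte counter that follows 62 bytes of payload lands at offset 62, so if the struct starts on a cache line boundary the counter occupies bytes 62 through 65.

```cpp
#include <cstddef>
#include <cstdint>

#pragma pack(push, 1)
struct PackedHeader {           // hypothetical layout, packing forced to 1
    std::uint8_t  tag[62];      // 62 bytes of payload
    std::uint32_t refcount;     // no padding inserted: offset 62
};
#pragma pack(pop)

struct NaturalHeader {          // same fields, default packing
    std::uint8_t  tag[62];
    std::uint32_t refcount;     // compiler pads to a 4-byte boundary: offset 64
};

static_assert(offsetof(PackedHeader, refcount) == 62, "packed: misaligned field");
static_assert(offsetof(NaturalHeader, refcount) == 64, "natural: aligned field");
```

A locked read-modify-write on the packed refcount would straddle two lines whenever the struct begins at a 64-byte boundary; with default packing the same field is 4-byte aligned and cannot cross one.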

The XCHG instruction is worth specific mention because, when one operand is in memory, it carries an implicit LOCK even without the explicit 0xF0 prefix. Code using XCHG for lightweight locking can trigger split locks without the programmer ever writing the LOCK prefix.

For most programs, split locks in cold paths are not worth worrying about. The problem arises when they appear in hot paths: spin loops, frequently-contested mutexes, reference counting, anything that executes millions of times per second.

Twenty-Four Years Without a Detection Mechanism

Intel introduced cache locking in 1995. For the next 24 years, split locks silently incurred the full bus lock penalty with no OS-visible mechanism to detect them. The hardware would simply perform the slow operation, stall all other cores, and continue. No counter incremented in the PMU that would flag the occurrence at the OS level. No exception. No signal to the process.

This was tolerable in the era of single-core processors, where bus locking was slow but harmless: no other thread existed to starve. With the shift to multicore CPUs after 2005, the DoS potential emerged gradually. The virtualization explosion of the 2010s made it concrete: a malicious or buggy guest VM issuing split locks in a tight loop could degrade all co-resident VMs on the same physical host, with no recourse for the hypervisor.

Intel added hardware split lock detection in the Tremont microarchitecture (2019), and Tiger Lake (2020) was the first mainstream client part to include it. The mechanism is a control bit in MSR_TEST_CTRL (address 0x33), bit 29, labeled SPLIT_LOCK_DETECT. When set, the CPU raises a #AC (Alignment Check) exception rather than silently proceeding with the bus lock. No bus lock occurs if the detection fires; the exception handler gets to decide what to do instead.
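In code, the control described above is a single bit test. The C++ sketch below only manipulates a 64-bit value that the caller has already read from MSR_TEST_CTRL; actually reading or writing the MSR requires ring 0 (RDMSR/WRMSR) or an OS facility such as the Linux msr driver, which is outside this example. The constant and function names are chosen for illustration.

```cpp
#include <cstdint>

constexpr std::uint32_t kMsrTestCtrl = 0x33;    // MSR_TEST_CTRL address
constexpr unsigned kSplitLockDetectBit = 29;    // SPLIT_LOCK_DETECT bit
constexpr std::uint64_t kSplitLockDetectMask =
    std::uint64_t{1} << kSplitLockDetectBit;

// True when the given MSR_TEST_CTRL value has SPLIT_LOCK_DETECT set.
constexpr bool split_lock_detect_enabled(std::uint64_t test_ctrl) {
    return (test_ctrl & kSplitLockDetectMask) != 0;
}

// Returns the MSR value with SPLIT_LOCK_DETECT set or cleared, leaving
// every other bit untouched.
constexpr std::uint64_t set_split_lock_detect(std::uint64_t test_ctrl,
                                              bool enable) {
    return enable ? (test_ctrl | kSplitLockDetectMask)
                  : (test_ctrl & ~kSplitLockDetectMask);
}

static_assert(kSplitLockDetectMask == 0x20000000, "bit 29");
```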

Linux merged support for this in kernel 5.7 (2020), exposing a split_lock_detect= kernel parameter with several modes:

  • off: disabled, legacy behavior; split locks proceed silently
  • warn: log a kernel warning (rate-limited) and allow the process to continue
  • fatal: send SIGBUS to the offending process
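  • ratelimit:N: cap bus locks at N per second system-wide (0 < N <= 1000); added in a later release alongside #DB-based bus lock detection, and not applicable to #AC split lock detection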
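To see which mode a running system was booted with, one option is to look for the parameter on the kernel command line, which Linux exposes at /proc/cmdline. A small C++17 sketch; the function names are hypothetical, and an empty result only means the parameter was not given, in which case the kernel's built-in default applies.

```cpp
#include <cstddef>
#include <fstream>
#include <string>
#include <string_view>

// Returns the value of the last split_lock_detect= token on a kernel
// command line, or an empty view if the parameter is absent.
constexpr std::string_view split_lock_mode(std::string_view cmdline) {
    constexpr std::string_view key = "split_lock_detect=";
    std::string_view mode{};
    std::size_t pos = 0;
    while (pos < cmdline.size()) {
        // Tokens are separated by whitespace.
        std::size_t end = cmdline.find_first_of(" \t\n", pos);
        if (end == std::string_view::npos) end = cmdline.size();
        const std::string_view token = cmdline.substr(pos, end - pos);
        if (token.substr(0, key.size()) == key) {
            mode = token.substr(key.size());
        }
        pos = end + 1;
    }
    return mode;
}

// Reads the first line of /proc/cmdline (Linux only); empty on failure.
inline std::string read_kernel_cmdline() {
    std::ifstream in("/proc/cmdline");
    std::string line;
    std::getline(in, line);
    return line;
}
```

Calling split_lock_mode(read_kernel_cmdline()) on a machine booted with split_lock_detect=fatal yields "fatal".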

The kernel code lives primarily in arch/x86/kernel/cpu/intel.c (MSR configuration and CPU feature detection) and arch/x86/kernel/traps.c (the #AC exception handler). For kernel-mode split locks, the response is fatal: the handler treats any split lock in kernel code as a bug and oopses, rather than warning and continuing.

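The documentation is in the kernel tree: the feature is described in Documentation/arch/x86/buslock.rst (Documentation/x86/buslock.rst in older trees), and the parameter itself in Documentation/admin-guide/kernel-parameters.txt.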

The Virtualization Complication

Enabling strict split lock detection in a hypervisor is not straightforward. When the host kernel sets SPLIT_LOCK_DETECT and a guest VM triggers a split lock, the host receives the #AC exception. KVM must trap this and inject it into the guest, allowing the guest OS to apply its own policy.

This created real compatibility problems when the feature was first deployed. Windows guests, particularly when running 32-bit applications under WOW64, contained legitimate (from their perspective) code paths that generated split locks. Enabling fatal mode on a Linux KVM host would crash these Windows guests. Microsoft addressed many of these in subsequent Windows updates, but the episode illustrates how deeply split lock behavior was baked into existing software.

Intel classified the split lock DoS vector as a moderate severity security issue; the relevant advisory is INTEL-SA-00614. The recommended mitigation for cloud providers is enabling split_lock_detect in the hypervisor configuration and ensuring guest VMs receive the #AC for their own handling.

What Other Architectures Do Differently

The split lock problem is specific to x86 because of a particular design commitment: the ISA does not require alignment for locked operations. The LOCK prefix guarantees atomicity regardless of address alignment. That guarantee, combined with 64-byte cache lines, creates the split lock scenario structurally.

ARM64 takes the opposite approach. Exclusive load/store instructions (LDXR/STXR) require natural alignment: an LDXR into a 4-byte register must target a 4-byte-aligned address, and into an 8-byte register an 8-byte-aligned one. An unaligned exclusive access is not permitted and generates an alignment fault. The load-linked/store-conditional model means there is no global bus lock mechanism to fall back to; the exclusive monitor simply tracks which cache line the reservation covers. Split locks cannot occur because misaligned exclusive accesses never execute.

RISC-V follows the same pattern. The LR.W/SC.W (Load Reserved/Store Conditional) pair requires natural alignment; a misaligned access raises an address-misaligned exception (cause code 4, Load Address Misaligned, for LR.W; cause code 6, Store/AMO Address Misaligned, for SC.W) or an access fault. The AMO instructions (AMOADD.W, AMOSWAP.D, and so on) likewise require alignment. No split lock analog exists.

POWER, SPARC, and every other major RISC architecture enforce alignment for their atomic primitives at the ISA level. x86 is the outlier because its backward compatibility heritage extends to the original 8086, where the LOCK prefix predates cache hierarchies entirely. The semantics were defined in terms of a physical bus signal, and that semantic contract has been preserved ever since.

What to Do With This Information

For most application developers, the practical action is straightforward: ensure that any data used in atomic operations, especially in hot paths, is naturally aligned. In C and C++, this usually means avoiding #pragma pack on structs that contain atomic fields, using alignas where needed, and being cautious with memory allocators that do not guarantee alignment beyond the allocation size.
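A short sketch of that advice in C++17, using a hypothetical Counter type and an assumed 64-byte line size: alignas pins the atomic to a cache line boundary, and static_assert turns the layout assumptions into compile-time checks, so a later #pragma pack or field reshuffle fails the build instead of silently producing a split lock.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kCacheLineBytes = 64;  // assumed line size

// Hypothetical counter that owns a whole cache line.
struct alignas(kCacheLineBytes) Counter {
    std::atomic<std::uint64_t> value{0};
};

static_assert(alignof(Counter) == kCacheLineBytes, "starts on a line boundary");
static_assert(sizeof(Counter) == kCacheLineBytes, "padded to one full line");

// Natural alignment: the address is a multiple of the object size. For
// power-of-two sizes up to the line size, this is enough to keep the
// object inside a single cache line.
constexpr bool is_naturally_aligned(std::uintptr_t addr, std::size_t size) {
    return size != 0 && addr % size == 0;
}
```

Giving the counter a full line is more than split lock avoidance strictly needs, since natural alignment already suffices; the extra padding is a design choice that also keeps unrelated data off the same line.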

For system programmers and infrastructure engineers running Linux 5.7+ on Tremont, Tiger Lake, or newer hardware, adding split_lock_detect=warn to the kernel command line provides a low-cost way to surface existing split locks in production workloads. For performance investigations, PMU events offer another instrumentation path: perf lists a split lock event as sq_misc.split_lock on Intel microarchitectures that expose it.

For hypervisor operators, the security angle warrants attention. A guest VM that issues split locks in a tight loop can measurably degrade performance for the whole host whenever the detection bit is not set. Enabling the detection and configuring appropriate guest behavior is the right posture for multi-tenant environments.

The Chips and Cheese investigation provides the empirical grounding for what the penalty actually looks like across specific microarchitectures, including the per-core interference measurements that confirm the system-wide scope of the stall. The numbers are consistent with what the architecture docs predict: bus locking is a global serialization point, and the cost grows with the number of cores being starved. For a hardware behavior that has existed since the mid-1990s, it took a surprisingly long time to get tooling that makes it observable from the OS level. The detection infrastructure that finally landed in Linux 5.7 closes a gap that has existed for most of the multicore era.
