When Atomics Cross a Cache Line Boundary: The Full Cost of x86-64 Split Locks
Source: lobsters
There is a class of x86-64 behavior that looks completely fine from userspace, passes every correctness check, and still brings multi-core systems to their knees: the split lock. A recent investigation by Chips and Cheese put these under a microscope, and the results are worth unpacking in detail, especially if you work anywhere near low-level atomic operations, memory-mapped hardware, or systems code running under a hypervisor.
What Actually Happens at the Hardware Level
The LOCK prefix on x86 guarantees that an instruction executes atomically with respect to all other processors and agents on the bus. For aligned accesses to ordinary cacheable memory, modern Intel and AMD CPUs fulfill this guarantee cheaply through their cache coherency protocols (MESIF on Intel, MOESI on AMD). When your LOCK XADD or LOCK CMPXCHG targets an address fully contained within a single 64-byte cache line, the core acquires exclusive ownership of that line and holds it for the duration of the read-modify-write. The operation completes without locking the external bus at all. Other cores wait only as long as it takes to process the cache coherency handshake, which is measured in nanoseconds.
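As a point of reference, here is a minimal C11 sketch of the cheap case. The names `hits` and `record_hit` are illustrative and do not come from the investigation; the instruction named in the comment is what GCC and Clang typically emit on x86-64, not a guarantee.

```c
#include <stdalign.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical counter. alignas(64) places it at the start of a cache line,
   so its 8 bytes can never straddle a 64-byte boundary. */
static alignas(64) _Atomic uint64_t hits;

/* Atomically increments the counter and returns the new value.
   On x86-64 this typically compiles to a single LOCK XADD. */
uint64_t record_hit(void)
{
    return atomic_fetch_add(&hits, 1) + 1;
}
```

Because the operand sits entirely inside one line, the locked instruction can be satisfied by cache-line ownership alone.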
A split lock breaks this path entirely. If the operand of a LOCK-prefixed instruction straddles a 64-byte cache line boundary, even by a single byte, the CPU cannot satisfy the atomicity guarantee using cache-line locking alone. It needs to ensure that no other agent on the system can observe a partial write to either of the two affected cache lines simultaneously. The only mechanism available for this is asserting the legacy LOCK# signal, which on modern systems maps to a bus lock that serializes the entire memory subsystem.
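The straddling condition is easy to state in code. The following sketch assumes a 64-byte line size; `CACHE_LINE_SIZE` and `crosses_cache_line` are hypothetical names chosen for illustration.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed line size; 64 bytes matches the x86-64 parts discussed here. */
#define CACHE_LINE_SIZE 64u

/* Returns true if an access of `size` bytes starting at `addr` touches two
   cache lines, i.e. a LOCK-prefixed access there would be a split lock. */
bool crosses_cache_line(uintptr_t addr, size_t size)
{
    if (size == 0)
        return false;
    uintptr_t first_line = addr / CACHE_LINE_SIZE;
    uintptr_t last_line = (addr + size - 1) / CACHE_LINE_SIZE;
    return first_line != last_line;
}
```

An 8-byte operand at offset 56 within a line is fine (bytes 56 to 63), while the same operand at offset 57 spills one byte into the next line and takes the slow path.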
During a bus lock, every other memory operation on every other core stalls. This is not a per-socket effect. It is a system-wide pause. The latency of a split lock on a modern server CPU can be orders of magnitude higher than that of a normal aligned atomic. On a multi-socket NUMA system the situation is worse, because the bus lock must propagate across the interconnect.
The Instruction Set Context
The x86 architecture has carried the LOCK prefix since the 8086. In that era of single-threaded processors sharing one bus, bus locking was the only atomicity primitive available. The cache-coherency-based locking that makes modern atomics cheap is a later innovation, added as caches and multi-core CPUs replaced the original bus-based shared-memory model.
The Intel Software Developer’s Manual Volume 3 (the System Programming Guide), in its discussion of locked atomic operations, is explicit about when a bus lock is required: when the locked operation accesses a memory location that spans two cache lines, the processor will use a bus lock. This is not implementation-defined or a quality-of-implementation issue. It is documented architectural behavior.
AMD’s behavior here is slightly different and has varied across generations. Some Zen-era processors have handling that avoids the full external bus lock in certain split lock scenarios, though the behavior is not fully documented and should not be relied upon for correctness or performance guarantees.
How Linux Detects and Handles Split Locks
For years, split locks were simply allowed to happen. Userspace code that triggered them paid the performance penalty, but the kernel had no visibility into which processes were responsible. This changed with work merged in Linux 5.7, which introduced the split_lock_detect kernel feature.
The detection mechanism repurposes the #AC alignment check fault (exception vector 17). Ordinarily #AC is raised only for unaligned accesses in user mode, and only when both the AC flag (bit 18 of RFLAGS) and the AM bit in CR0 are set. Split lock detection is a separate, model-specific control: on CPUs that advertise it, the kernel sets bit 29 of the MSR_TEST_CTRL register, after which any LOCK-prefixed access that crosses a cache line raises #AC at any privilege level, regardless of AC and AM. Because #AC is a fault, it is delivered before the split lock executes, and ordinary unaligned loads and stores are unaffected.
Intel later added a second hardware feature, bus lock detection, on newer processors. It raises a #DB debug exception after a user-mode instruction has taken a bus lock, giving the kernel another interception point. Since the trap arrives after the fact, it can be used to warn about, rate limit, or kill the offender, but not to prevent that particular bus lock.
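On Linux, both hardware features surface as CPU flags in /proc/cpuinfo, named split_lock_detect and bus_lock_detect. A rough sketch of checking for them from C follows; `has_cpu_flag` and `cpu_reports_flag` are hypothetical helpers, and the parsing assumes the usual space-separated "flags" line.

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Returns true if `flag` appears as a whole word in `line`, so that
   searching for "sse" does not match "sse2". */
bool has_cpu_flag(const char *line, const char *flag)
{
    size_t len = strlen(flag);
    if (len == 0)
        return false;
    for (const char *p = strstr(line, flag); p != NULL; p = strstr(p + 1, flag)) {
        bool starts = (p == line) || p[-1] == ' ' || p[-1] == '\t' || p[-1] == ':';
        bool ends = p[len] == '\0' || p[len] == ' ' || p[len] == '\n';
        if (starts && ends)
            return true;
    }
    return false;
}

/* Scans /proc/cpuinfo (Linux only) for the first "flags" line and checks it.
   Returns false if the file cannot be read or has no such line. */
bool cpu_reports_flag(const char *flag)
{
    FILE *f = fopen("/proc/cpuinfo", "r");
    if (f == NULL)
        return false;
    char line[8192];
    bool found = false;
    while (fgets(line, sizeof line, f) != NULL) {
        if (strncmp(line, "flags", 5) == 0) {
            found = has_cpu_flag(line, flag);
            break;
        }
    }
    fclose(f);
    return found;
}
```

Calling `cpu_reports_flag("split_lock_detect")` and `cpu_reports_flag("bus_lock_detect")` tells you which of the two mechanisms the kernel can use on a given machine. Inside a VM, the result reflects what the hypervisor chooses to expose.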
The split_lock_detect boot parameter accepts several modes:
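off: No detection, split locks proceed normally.
warn: A rate-limited warning in dmesg naming the offending task. This is the default on CPUs that support detection. Recent kernels also deliberately slow the offending task down, which the kernel.split_lock_mitigate sysctl can turn off.
fatal: A SIGBUS is sent to the offending process or thread.
ratelimit:N: Limits bus locks system-wide to N per second. This mode requires bus lock detection (#DB) and does not apply to the #AC mechanism.
There is no separate mitigate boot mode. The closest equivalent is the slowdown applied in warn mode.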
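Mainline kernels do not appear to expose the active mode through a dedicated file. You can confirm that the CPU supports detection, then look for the message the kernel logs at boot:
$ grep -m1 -o split_lock_detect /proc/cpuinfo
split_lock_detect
$ dmesg | grep -i "split lock"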
And force a mode at boot:
split_lock_detect=fatal
The fatal mode is the most useful for catching bugs during development. If your application is triggering split locks, a SIGBUS will surface the problem immediately rather than letting it silently degrade system throughput.
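To confirm that detection is actually wired up on a test machine, a deliberate reproducer helps. The sketch below is an assumption-laden illustration: `two_lines`, `split_lock_target`, and `trigger_split_locks` are hypothetical names, the `__atomic_fetch_add` builtin is specific to GCC and Clang, and forming and using a misaligned `uint32_t` pointer is undefined behavior in ISO C that merely happens to produce a locked instruction on x86-64.

```c
#include <stdalign.h>
#include <stdint.h>

/* Two cache lines of storage, aligned so the line boundary is at offset 64. */
static alignas(64) unsigned char two_lines[128];

/* Returns a deliberately misaligned pointer: bytes 62..65 straddle the
   boundary between the two lines. */
uint32_t *split_lock_target(void)
{
    return (uint32_t *)(void *)(two_lines + 62);
}

/* Performs `iterations` locked increments on the straddling word.
   WARNING: undefined behavior in ISO C, and intentionally disruptive.
   With split_lock_detect=fatal the calling process should receive SIGBUS;
   with warn, expect a message in dmesg and possibly a throttled process. */
void trigger_split_locks(unsigned iterations)
{
    uint32_t *p = split_lock_target();
    for (unsigned i = 0; i < iterations; i++)
        __atomic_fetch_add(p, 1, __ATOMIC_SEQ_CST);
}
```

Run something like `trigger_split_locks(1000)` only on a machine you control, since without detection enabled it stalls every other core for the duration.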
The Virtualization Multiplier
Split locks become substantially more expensive inside virtual machines. When a guest OS triggers a bus lock, the hypervisor must intercept it, handle the system-wide serialization, and then resume the guest. This involves a VM exit, which on modern hardware costs thousands of cycles on its own before accounting for the actual lock overhead.
KVM and Xen both have handling for this. In KVM, a guest split lock on a host with detection enabled traps to the hypervisor instead of proceeding as an uncontrolled bus lock, and newer Intel hardware adds a dedicated bus-lock VM exit that KVM exposes to userspace as KVM_CAP_X86_BUS_LOCK_EXIT. Either way the hypervisor decides when and how the guest proceeds, which keeps a malicious or buggy guest from using split locks as a denial-of-service vector against the host.
This is not theoretical. A guest that issues split locks in a tight loop can degrade the performance of co-located guests and of the host itself, and closing off that avenue is one of the stated motivations for the detection work in the kernel and in KVM.
Writing Code That Avoids Split Locks
The root cause is simple: a misaligned address used with an atomic operation. In C and C++, the standard way to guarantee alignment is alignas or _Alignas:
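#include <stdalign.h>  // provides alignas in C11 and C17; built in from C23 and in C++
#include <stdint.h>
// Fine at file scope on x86-64: the ABI already aligns a uint64_t to 8 bytes
uint64_t counter;
// Explicit: guaranteed to be 8-byte aligned, never crossing a cache line
alignas(8) uint64_t counter_explicit;
// Even better for cache performance: start on a cache line boundary
alignas(64) uint64_t counter_line;
// Bad: packing drops the padding, so hits sits at offset 1 and can straddle a line
struct __attribute__((packed)) record { uint8_t tag; uint64_t hits; };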
For embedded or systems code working with memory-mapped registers or DMA buffers, alignment must be verified at the hardware protocol level, not just assumed from C layout rules. A structure packed with __attribute__((packed)) can silently produce split locks if any field receiving atomic operations ends up misaligned.
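One way to catch the packed-struct case early is to check field offsets at compile time. This is a sketch for GCC or Clang on x86-64; the struct names and the `FIELD_NATURALLY_ALIGNED` macro are hypothetical, and the check only helps if the struct itself is placed at its required alignment.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical wire-format header: packing removes padding, so `seq`
   lands at offset 1 and is misaligned for 8-byte atomics. */
struct __attribute__((packed)) packed_hdr {
    uint8_t  tag;
    uint64_t seq;
};

/* Same fields with natural layout: the compiler pads `seq` to offset 8. */
struct natural_hdr {
    uint8_t  tag;
    uint64_t seq;
};

/* Evaluates to 1 if the field's offset is a multiple of its own size. */
#define FIELD_NATURALLY_ALIGNED(type, field) \
    (offsetof(type, field) % sizeof(((type *)0)->field) == 0)

/* Fails the build if someone later packs or reorders the struct. */
_Static_assert(FIELD_NATURALLY_ALIGNED(struct natural_hdr, seq),
               "seq must be naturally aligned for atomic use");
```

A naturally aligned 8-byte field cannot cross a 64-byte line, because 8 divides 64. Applying the same assertion to `struct packed_hdr` fails to compile, which is the point.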
In Rust, primitive integer types on x86-64 are aligned to their own size by default, and std::sync::atomic::AtomicU64 is always 8-byte aligned. The language makes this much harder to get wrong unintentionally, because safe code cannot take a reference to a misaligned field of a repr(packed) struct. FFI and unsafe code that builds atomics from raw pointers are still capable of producing the problem.
When reviewing assembly output, look for LOCK-prefixed instructions and verify that the operand addresses are naturally aligned. GCC and Clang can emit split-lock-generating code from otherwise normal-looking C, particularly when using packed structs, bitfields with atomic updates, or when volatile is used alongside unaligned pointer casts.
What the Chips and Cheese Investigation Adds
The Chips and Cheese article goes further than the kernel documentation by actually measuring split lock behavior across different microarchitectures and quantifying the system-wide impact with real benchmarks. The key value of that kind of low-level empirical investigation is that it tells you what actually happens on silicon rather than what the architecture manual says should happen. Those two things frequently differ in ways that matter for performance-sensitive code.
Microarchitectural behavior around memory serialization is an area where vendor documentation is often incomplete. The interaction between the MESIF protocol, the ring bus or mesh interconnect, and the legacy bus lock signal involves hardware decisions that are not fully specified. Empirical testing is frequently the only way to establish ground truth.
Practical Takeaways
Split locks are infrequent in well-written modern code, but they are not rare in legacy codebases, in code that does a lot of casting through packed structures, or in hypervisor and kernel code handling hardware with unusual memory layouts. The performance penalty when they do occur is severe enough to be observable at the application level, and in multi-tenant environments the blast radius extends beyond the offending process.
Leaving split_lock_detect=warn enabled in development and staging environments (it is the default on CPUs that support detection) costs little and gives you early warning before a subtle misalignment makes it into production. For new code, using language-level alignment guarantees and keeping atomic operations on naturally aligned addresses avoids the problem entirely. The x86 architecture will likely continue to support split locks for compatibility reasons, but there is no reason to trigger them accidentally.