Split Locks and the Performance Tax Hidden in x86's Compatibility Story
Source: lobsters
When a locked x86 instruction straddles a cache line boundary, the processor cannot perform the atomic operation through normal cache coherency mechanisms. Instead, it falls back to asserting a physical bus lock, stalling every other memory agent on the system until the operation completes. The penalty runs to thousands of cycles rather than tens, and its effects extend across every core and socket on the system.
Chips and Cheese investigated this across multiple Intel and AMD microarchitectures, measuring how the penalty scales with contention and differs between vendors. The findings confirm the theoretical picture with concrete numbers. Before getting into what they found, it helps to understand why the fallback mechanism exists and why it has survived essentially unchanged from the 1980s to the present.
The Architecture of Locked Operations
Cache lines on x86-64 are 64 bytes, aligned to 64-byte boundaries. A locked operation on 8 bytes starting at address 0x38 touches the range 0x38-0x3F, entirely within one cache line. The same operation at 0x3C touches 0x3C-0x43, crossing the boundary at 0x40 into a second cache line. This is a split lock.
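The boundary arithmetic above can be sketched as a small helper. This is a minimal illustration, assuming the 64-byte line size stated in the text; the names `kCacheLineSize`, `crosses_cache_line`, and `object_crosses_cache_line` are hypothetical and do not come from any library.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumption: 64-byte cache lines, as stated for x86-64 in the text.
constexpr std::uintptr_t kCacheLineSize = 64;

// True if an access of `size` bytes starting at `addr` touches more than
// one cache line. `size` must be non-zero.
constexpr bool crosses_cache_line(std::uintptr_t addr, std::size_t size) {
    return (addr / kCacheLineSize) != ((addr + size - 1) / kCacheLineSize);
}

// Runtime convenience wrapper for a typed pointer (hypothetical helper).
template <typename T>
bool object_crosses_cache_line(const T* p) {
    assert(p != nullptr);
    return crosses_cache_line(reinterpret_cast<std::uintptr_t>(p), sizeof(T));
}

// The two cases from the text, checked at compile time.
static_assert(!crosses_cache_line(0x38, 8), "0x38-0x3F stays within one line");
static_assert(crosses_cache_line(0x3C, 8), "0x3C-0x43 crosses the 0x40 boundary");
```

A check like this could sit in a debug assertion in front of any locked operation on a pointer whose alignment is not guaranteed by its type.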
For the normal (single-line) case, the processor uses cache locking: it brings the relevant line to the Modified state via the MESI coherency protocol, performs the atomic read-modify-write, and releases it. This has been the behavior since the P6 architecture; the Intel Software Developer’s Manual, Volume 3A, Section 8.1.4 describes it as operating “on the processor’s caches rather than the system bus.” Other cores contending for the same line stall, but only because of coherency, and the operation produces no signal on the external bus.
For the split-lock case, cache locking is architecturally impossible. Two lines cannot be atomically owned exclusively through the normal coherency protocol. The fallback is bus locking: the processor asserts the LOCK# signal (or its point-to-point equivalent in modern Intel interconnects) and holds it for the entire duration of the read-modify-write. The SDM is direct about this: “If a LOCK operation specifies a memory location that is not contained within a cache line, the entire bus is locked during the operation.” This behavior is preserved to maintain backward compatibility with software written for much older Intel processors, where bus locking was the only mechanism and alignment was simply a matter of programmer discipline.
How Much It Costs
A properly aligned LOCK XADD or LOCK CMPXCHG costs roughly 20 to 40 cycles in an uncontended scenario on a modern Intel core. The split-lock equivalent costs on the order of 1,000 to 2,000 cycles or more, a factor of 25 to 100 depending on the microarchitecture and how many other threads are active. The Chips and Cheese measurements fall within this range across multiple Intel generations; the exact figure varies with cache hierarchy and memory bandwidth, but the order of magnitude is consistent.
The more important figure is the system-wide stall. During a bus lock, every core waiting to perform a memory operation is blocked. On a multi-socket system, the stall propagates across the interconnect fabric. A single thread repeatedly issuing split locks can measurably degrade throughput across all other threads on all cores. Cloud providers have documented noisy-neighbor incidents where one tenant’s workload caused performance degradation across co-located VMs on the same physical host, which is one reason the Linux kernel eventually grew mechanisms to detect and restrict the behavior.
AMD handles split locks differently. Zen-generation processors manage split-locked accesses differently at the microarchitectural level, and the Chips and Cheese measurements show that AMD does not impose the same system-wide stall. The penalty on AMD, while still substantially higher than an aligned operation, is more contained. AMD does not implement Intel’s #AC-on-split-lock detection mechanism, so the difference surfaces through performance counters rather than kernel-level policy. AMD’s public manuals do not formally document the microarchitectural reason, but the performance measurements make the divergence observable across generations.
The Linux Kernel’s Slow Response
The Linux kernel had no mechanism to detect or mitigate split locks in user-space code until Linux 5.7, released in May 2020. The feature was driven largely by work from Tony Luck at Intel, and it relies on hardware support introduced in Intel’s Tremont and Tiger Lake microarchitectures: bit 29 of the MSR_TEST_CTRL register (SPLIT_LOCK_DETECT), which causes the processor to raise a #AC (Alignment Check) exception instead of silently proceeding with a bus lock when a misaligned locked operation is attempted.
The kernel exposes this via the split_lock_detect= boot parameter. The warn mode emits a rate-limited kernel message while allowing the operation to continue. The fatal mode sends SIGBUS to the offending process. A ratelimit:N mode throttles rather than kills, capping bus locks at N per second system-wide. According to the kernel's parameter documentation, ratelimit applies to the separate bus lock detection (#DB) mechanism on CPUs that support it, not to #AC-based split lock detection.
The rollout presented immediate compatibility problems. When some distributions enabled split_lock_detect=fatal by default, it broke applications running under Wine that contained misaligned atomic operations from compiled Windows binaries. Older DirectX titles, DRM systems, and legacy software crashed with SIGBUS. LWN covered the fallout in 2021 as the kernel community worked through the compatibility implications. The eventual consensus settled on warn as a safer default for distributions with broad user bases. The episode illustrates the persistent difficulty of enforcing correctness on a platform carrying forty years of compatibility expectations.
The fatal mode also has a security dimension. A process that can execute arbitrary code can use split locks as a denial-of-service vector, issuing bus-locking operations in a tight loop to stall every other core. Container environments and cloud hypervisors have real motivation to run with fatal or ratelimit modes enabled. KVM gained support for emulating the detection MSR around Linux 5.10 to 5.12, so guest kernels could apply the same policy to their own user-space processes.
Avoiding Split Locks in Practice
The most common source of split locks in compiled code is packed structures combined with atomic operations.
// The danger pattern (GCC/Clang attribute syntax):
#include <stdint.h>
struct __attribute__((packed)) Shared {
    char version;
    int32_t counter;  // sits at offset 1, not 4
};
// Depending on where this struct lands in memory,
// a LOCK ADD on 'counter' may cross a cache line.
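To make "depending on where this struct lands" concrete, the sketch below counts the positions within a 64-byte line at which the packed counter would straddle a boundary. It assumes GCC or Clang (for the packed attribute) and 64-byte lines; `counter_splits` and `count_splitting_offsets` are hypothetical names introduced for illustration.

```cpp
#include <cstddef>
#include <cstdint>

struct __attribute__((packed)) Shared {
    char version;
    std::int32_t counter;  // offset 1 because of the packed attribute
};

static_assert(offsetof(Shared, counter) == 1, "packed removes the padding");
static_assert(sizeof(Shared) == 5, "1 byte + 4 bytes, no padding");

// Assumption: 64-byte cache lines.
constexpr std::size_t kCacheLineSize = 64;

// True if `counter` straddles a line boundary when the struct starts at
// byte `base_offset_in_line` (0..63) within a cache line.
constexpr bool counter_splits(std::size_t base_offset_in_line) {
    return (base_offset_in_line + offsetof(Shared, counter)) % kCacheLineSize
               + sizeof(std::int32_t) > kCacheLineSize;
}

// Number of byte offsets within a line that make `counter` a split access.
constexpr int count_splitting_offsets() {
    int n = 0;
    for (std::size_t off = 0; off < kCacheLineSize; ++off) {
        if (counter_splits(off)) ++n;
    }
    return n;
}

static_assert(count_splitting_offsets() == 3, "struct offsets 60, 61, 62 split");
```

Three of the 64 possible byte offsets put the counter across a boundary. The split is therefore rare for any one placement, which is part of why it can go unnoticed in testing and then show up in a large array or a heap allocation in production.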
In practice, std::atomic<T> is naturally aligned on mainstream x86-64 toolchains: for lock-free sizes, alignof(std::atomic<T>) >= sizeof(T). The C++ standard does not state that inequality as a requirement, but GCC, Clang, and MSVC all give std::atomic<int64_t> 8-byte alignment on x86-64, so standard atomic code placed in properly aligned storage does not produce split locks. The problem arises when developers combine packed structures with manual atomic operations, or when compiling legacy code that predates standard alignment rules.
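The alignment property can be checked at compile time rather than taken on trust. The following is a minimal sketch, assuming a mainstream ABI on which 4- and 8-byte atomics are lock-free; `SharedAligned` and `atomic_is_naturally_aligned` are hypothetical names, not part of the standard library.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>

// Same fields as the packed example, but with natural layout and a
// std::atomic member. The compiler inserts padding after `version`.
struct SharedAligned {
    char version;
    std::atomic<std::int32_t> counter;
};

// True if std::atomic<T> is at least as aligned as T is large.
template <typename T>
constexpr bool atomic_is_naturally_aligned() {
    return alignof(std::atomic<T>) >= sizeof(T);
}

// These fail to compile on a toolchain where the assumption does not hold.
static_assert(atomic_is_naturally_aligned<std::int32_t>(),
              "4-byte atomic should be 4-byte aligned");
static_assert(atomic_is_naturally_aligned<std::int64_t>(),
              "8-byte atomic should be 8-byte aligned");
static_assert(offsetof(SharedAligned, counter) % sizeof(std::int32_t) == 0,
              "counter sits on a 4-byte boundary inside the struct");
```

A naturally aligned object whose size is a power of two no larger than the line size cannot straddle a cache line, provided the enclosing object is itself placed at an address that respects its alignment.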
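Detection on recent Linux is straightforward. Intel performance counters expose bus lock events. The raw encoding below is the one given for microarchitectures from Ice Lake onward; event encodings are model-specific, so confirm it against the output of perf list or Intel's event tables for the CPU in question: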
perf stat -e cpu/event=0xf4,umask=0x10/ ./my_program
A non-zero count means bus locks occurred. On systems with split lock detection hardware, booting with split_lock_detect=warn and monitoring dmesg will identify the offending process and instruction pointer. Intel VTune’s Microarchitecture Exploration analysis also surfaces bus lock hotspots with call-stack attribution.
For new code the rules are simple: use std::atomic<T> and keep atomically accessed fields out of packed structures. For existing code with suspected split locks, perf and the kernel’s warn mode are the practical diagnostic path. The hardware will absorb the cost silently otherwise, and the only evidence is a system performing substantially worse than its memory subsystem should allow, with no obvious explanation in a standard profiler.