The Split Lock Penalty: Why Misaligned Atomics Stall Every Core on Your System

The split lock is one of those hardware behaviors that looks like a minor edge case until you measure it. An atomic read-modify-write operation that crosses a 64-byte cache line boundary carries a penalty that can exceed a thousand cycles, and the damage extends well beyond the core executing the instruction. When Chips and Cheese investigated split locks across several x86 microarchitectures, the results confirmed what the CPU manuals describe but rarely emphasize: misaligned atomic operations are not just slow, they stall the entire memory subsystem.

How x86 Atomic Operations Normally Work

The LOCK prefix on x86 instructions like ADD, XCHG, CMPXCHG, and others guarantees atomicity across the visible memory system. In practice, modern CPUs do not need to assert anything on a physical bus to achieve this; they use a MESI-family cache coherence protocol (MESIF on Intel, MOESI on AMD) to ensure exclusive ownership of the relevant cache line before modifying it.

The protocol works by having the issuing core request the cache line in the M (Modified) or E (Exclusive) state. Once the cache line is in that state, no other core holds a valid copy, so the core can perform its read-modify-write without external coordination. The entire operation stays within the cache hierarchy, and contention is handled by coherence traffic between caches rather than by locking any physical bus. A locked operation on a reasonably hot cache line costs somewhere between 30 and 100 cycles depending on contention, microarchitecture, and NUMA topology. Expensive relative to an uncontended load, but manageable.
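As a minimal sketch of the aligned case described above, the following uses C11 atomics on naturally aligned 32-bit objects. On x86-64, GCC and Clang typically lower these calls to LOCK-prefixed instructions such as LOCK XADD and LOCK CMPXCHG, though the exact instruction choice is up to the compiler. The names ticket, take_ticket, and update_max are hypothetical and exist only for illustration.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical shared counter. An _Atomic uint32_t gets 4-byte alignment
 * by default on x86-64, so it always sits inside a single cache line. */
static _Atomic uint32_t ticket;

/* Atomically take the next ticket number and return the previous value. */
static uint32_t take_ticket(void)
{
    return atomic_fetch_add(&ticket, 1);
}

/* Raise *max to candidate if candidate is larger.
 * Returns true if this call performed the update. */
static bool update_max(_Atomic uint32_t *max, uint32_t candidate)
{
    uint32_t cur = atomic_load(max);
    while (cur < candidate) {
        /* On failure, cur is reloaded with the current value of *max. */
        if (atomic_compare_exchange_weak(max, &cur, candidate))
            return true;
    }
    return false;
}
```

Both operations touch exactly one cache line, so they stay on the coherence-protocol path rather than the bus lock path.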

The Cache Line Boundary Problem

A cache line on x86 is 64 bytes, and cache lines are naturally aligned: they start at addresses divisible by 64. A 4-byte integer at address 60 occupies the last 4 bytes of one cache line. A 4-byte integer at address 62 spans bytes 62-63 of one cache line and bytes 0-1 of the next.
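The arithmetic above can be sketched as a small helper. This assumes a 64-byte line size, which holds for current x86 parts; the function name crosses_cache_line is hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE_SIZE 64u  /* assumed line size for x86 */

/* Return true if the byte range [addr, addr + size) touches more than
 * one cache line, i.e. its first and last bytes fall in different lines. */
static bool crosses_cache_line(uintptr_t addr, size_t size)
{
    if (size == 0)
        return false;
    return (addr / CACHE_LINE_SIZE) != ((addr + size - 1) / CACHE_LINE_SIZE);
}
```

With the addresses from the text, crosses_cache_line(60, 4) is false and crosses_cache_line(62, 4) is true.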

When a LOCK-prefixed instruction accesses memory that spans two cache lines, the MESI protocol is not sufficient to guarantee atomicity. The protocol can grant exclusive ownership of one cache line at a time through a single coherence request. Acquiring exclusive ownership of two cache lines in a single atomic sequence requires something more, because another core could in principle modify the second cache line between the two ownership acquisitions.

x86 solves this by falling back to a bus lock. The CPU asserts a LOCK# signal on the memory bus, or its modern equivalent through the interconnect fabric, which prevents all other agents from accessing memory for the duration of the operation. This is not cache-level synchronization; it is a system-wide exclusive hold on the memory interface. The Intel Software Developer’s Manual, Volume 3, Chapter 8 describes the behavior:

"For the P6 family processors, locked operations serialize all outstanding load and store operations (that is, wait for them to complete)."

The consequence is that while one core holds a bus lock for a split operation, every other core that tries to access memory is stalled. Not just accesses to the contested cache lines, but all memory traffic. On a system with many cores, this is a machine-wide serialization event triggered by a single misaligned access.

Measuring the Damage

The Chips and Cheese investigation measures split lock latency across multiple Intel and AMD microarchitectures, isolating the cost of the split lock itself rather than surrounding code. On modern Intel cores, a split LOCK XADD lands in the range of several hundred to over a thousand cycles, compared to 15-40 cycles for the same operation aligned within a cache line. The ratio varies by microarchitecture, but the split case is consistently at least an order of magnitude more expensive; under contention the gap widens further because each bus lock serializes all competing cores.
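A rough way to reproduce the comparison on your own hardware is sketched below. This is not the Chips and Cheese methodology, only a minimal illustration: it times a loop of locked increments using CPU time from clock(), and the helper names target_at and ns_per_locked_add are hypothetical. Pointing a uint32_t pointer at a misaligned address is undefined behavior in ISO C and is done here deliberately, on the assumption of an x86 target with GCC or Clang.

```c
#include <stdint.h>
#include <time.h>

/* Return a uint32_t pointer `offset` bytes into a buffer that the caller
 * guarantees is 64-byte aligned and at least 128 bytes long. Offset 0
 * stays within one cache line; offset 62 straddles two lines.
 * A misaligned result is undefined behavior in ISO C (x86-only sketch). */
static volatile uint32_t *target_at(unsigned char *line_aligned_buf,
                                    unsigned offset)
{
    return (volatile uint32_t *)(void *)(line_aligned_buf + offset);
}

/* Average CPU-time nanoseconds per locked increment over `iters` runs. */
static double ns_per_locked_add(volatile uint32_t *p, uint64_t iters)
{
    if (iters == 0)
        return 0.0;
    clock_t t0 = clock();
    for (uint64_t i = 0; i < iters; i++)
        __atomic_fetch_add(p, 1u, __ATOMIC_SEQ_CST);
    clock_t t1 = clock();
    return (double)(t1 - t0) / CLOCKS_PER_SEC * 1e9 / (double)iters;
}
```

Calling ns_per_locked_add on target_at(buf, 0) and then on target_at(buf, 62) for the same buffer gives the two figures to compare; converting to cycles means multiplying by the core clock. No expected output is shown because the numbers depend on the machine, and on Linux the kernel's split_lock_detect mode changes what happens to the offset-62 run (under fatal, the process receives SIGBUS).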

AMD processors follow the same pattern. The AMD Architecture Programmer’s Manual specifies the same bus lock assertion requirement for accesses spanning cache line boundaries, and empirically AMD microarchitectures show comparable overhead: aligned atomics go through the coherence protocol efficiently, misaligned atomics fall back to bus locking.

The system-wide impact is not easy to observe from a single thread in isolation, but it becomes clear under concurrent load. When one thread executes split locks at high frequency, all other threads on the machine observe elevated memory latency because they are blocked waiting for the bus to become available. This is what makes split locks particularly damaging in production: the cost does not stay in the offending thread or even the offending core.

What the Linux Kernel Does About It

Intel added hardware detection for split lock attempts in the Tremont microarchitecture, introduced in 2019. When the MSR_TEST_CTRL register has the SPLIT_LOCK_DETECT bit set, the CPU raises an alignment check exception (#AC) on any locked instruction that crosses a cache line boundary, before executing the bus lock.

The Linux kernel added support for this in version 5.7. The behavior is controlled by the split_lock_detect= boot parameter, with three modes:

off: No detection; split locks proceed silently.
warn: The kernel logs a warning with the offending process name and instruction pointer, then allows the operation to complete. Useful for auditing production systems without disrupting running processes.
fatal: The kernel sends SIGBUS to the offending process.

Later kernels added a ratelimit mode to throttle warning output when an application generates split locks at high frequency. The kernel documentation covers all modes and their interactions with virtualization.

From kernel 5.7 onward, the default mode on supported hardware is warn. Any x86 system running a modern kernel on Tremont-or-later hardware, including the Jasper Lake and Elkhart Lake Atom derivatives, will log warnings for split locking processes. This makes dmesg a first-class diagnostic tool for this class of bug.
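To check whether a given machine is covered, one option is to look for the feature flag in /proc/cpuinfo. The sketch below assumes the kernel lists the feature as the token split_lock_detect on the flags line; treat that token name as an assumption to verify on your kernel. The function names line_has_flag and cpu_reports_split_lock_detect are hypothetical.

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* True if `flag` appears in `line` as a whole whitespace-delimited token. */
static bool line_has_flag(const char *line, const char *flag)
{
    size_t n = strlen(flag);
    if (n == 0)
        return false;
    const char *p = line;
    while ((p = strstr(p, flag)) != NULL) {
        bool start_ok = (p == line) || p[-1] == ' ' || p[-1] == '\t'
                        || p[-1] == ':';
        char end = p[n];
        bool end_ok = end == '\0' || end == ' ' || end == '\t' || end == '\n';
        if (start_ok && end_ok)
            return true;
        p += n;  /* keep searching after this partial match */
    }
    return false;
}

/* Scan /proc/cpuinfo for the assumed "split_lock_detect" flag token.
 * Returns 1 if found, 0 if not found, -1 if the file cannot be opened. */
static int cpu_reports_split_lock_detect(void)
{
    FILE *f = fopen("/proc/cpuinfo", "r");
    if (!f)
        return -1;
    char line[8192];
    int found = 0;
    while (fgets(line, sizeof line, f)) {
        if (strncmp(line, "flags", 5) == 0 &&
            line_has_flag(line, "split_lock_detect")) {
            found = 1;
            break;
        }
    }
    fclose(f);
    return found;
}
```

A result of 0 does not prove the hardware lacks the feature; the kernel may be too old to report it, or detection may have been turned off at boot.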

Virtualization Amplifies the Problem

Split locks in guest virtual machines are a meaningful operational concern for cloud providers. The bus lock asserted by a split lock in a guest is a physical bus lock; the hypervisor cannot contain it within the guest’s resource domain. A guest executing split locks at high frequency degrades memory throughput for every other VM on the same physical host.

KVM added split lock handling to address this. The KVM documentation describes two strategies: pass the #AC exception through to the guest when the guest has split lock detection configured, or intercept and emulate the split lock in the hypervisor. The emulation path is slower but allows the hypervisor to rate-limit the damage.

This was not a theoretical concern. The multi-tenant interference problem motivated both the kernel-level detection work and the KVM integration. A single buggy guest workload generating unaligned atomics at high frequency was capable of measurably degrading latency for neighboring tenants, which is the kind of problem cloud providers notice quickly.

Why This Happens in Real Code

Split locks generally arise from three sources: manually packed structs where fields are placed without alignment padding, position-independent code that places atomic counters at whatever offset follows the preceding layout, and occasionally compiler or linker placement decisions that put a shared variable across a cache line boundary.

The most common scenario in systems code is a packed struct. On the mainstream x86-64 ABIs, primitive integer types are aligned to their natural alignment by default, so a uint32_t will be 4-byte aligned. (The C11 and C++11 standards themselves leave the value of _Alignof(T) to the implementation.) But __attribute__((packed)) disables this default entirely, placing each field at the offset immediately following the previous one regardless of alignment.

struct __attribute__((packed)) header {
    uint8_t  version;     // offset 0
    uint8_t  flags;       // offset 1
    uint32_t sequence;    // offset 2 -- crosses 4-byte boundary, potentially cache-line boundary
    uint32_t checksum;    // offset 6
};

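If sequence in a layout like this ends up at offset 62 within a cache line, an _Atomic access or a __sync_fetch_and_add on it generates a bus lock. The fix is either to add explicit padding or to annotate the field with __attribute__((aligned(4))), which GCC and Clang respect even inside a packed struct. Explicit padding alone only helps if every instance of the struct also starts on a 4-byte boundary, since a packed struct otherwise has an alignment of 1.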
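A sketch of the attribute-based fix, assuming GCC or Clang, looks like this. The struct name header_fixed and the helper next_sequence are hypothetical; the static assertions are there so a layout regression fails at compile time rather than showing up as a bus lock in production.

```c
#include <stddef.h>
#include <stdint.h>

/* Same fields as the packed header, but sequence is forced back to
 * 4-byte alignment. The compiler inserts 2 bytes of padding after flags. */
struct __attribute__((packed)) header_fixed {
    uint8_t  version;                               /* offset 0 */
    uint8_t  flags;                                 /* offset 1 */
    uint32_t sequence __attribute__((aligned(4)));  /* offset 4 */
    uint32_t checksum;                              /* offset 8 */
};

/* Fail the build if the atomic field is ever misaligned again. */
_Static_assert(offsetof(struct header_fixed, sequence) % 4 == 0,
               "sequence must be 4-byte aligned within the struct");
_Static_assert(_Alignof(struct header_fixed) >= 4,
               "struct instances must start on a 4-byte boundary");

/* Atomically bump the sequence number and return the previous value. */
static uint32_t next_sequence(struct header_fixed *h)
{
    return __sync_fetch_and_add(&h->sequence, 1);
}
```

The trade-off is that the layout changes: the struct grows and the field offsets move, so this is not a drop-in fix when the packed layout mirrors a wire or on-disk format. In that case, copying the field into an aligned variable before operating on it atomically is one alternative.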

For shared atomic variables in concurrent data structures, the rule of thumb is simple: ensure each atomic is aligned to its own size. A naturally aligned object of width N cannot cross a cache line boundary when N is a power of two no larger than 64, because N then divides 64 evenly and every N-aligned slot lies entirely within one 64-byte line. The C11 _Alignas and C++11 alignas specifiers handle this at the declaration site:

struct counters {
    _Alignas(8) _Atomic uint64_t hits;
    _Alignas(8) _Atomic uint64_t misses;
};
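The declaration can be paired with compile-time checks so the guarantee does not silently erode when fields are added or reordered. The following is a minimal sketch; the record helper is hypothetical, and relaxed memory ordering is chosen only because these are plain statistics counters.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

struct counters {
    _Alignas(8) _Atomic uint64_t hits;
    _Alignas(8) _Atomic uint64_t misses;
};

/* Each 8-byte atomic must sit on an 8-byte boundary, and the struct
 * itself must be at least 8-byte aligned for that to hold at runtime. */
_Static_assert(offsetof(struct counters, hits) % 8 == 0,
               "hits must be 8-byte aligned");
_Static_assert(offsetof(struct counters, misses) % 8 == 0,
               "misses must be 8-byte aligned");
_Static_assert(_Alignof(struct counters) >= 8,
               "struct counters must be 8-byte aligned");

/* Count one lookup as a hit (nonzero) or a miss (zero). */
static void record(struct counters *c, int hit)
{
    if (hit)
        atomic_fetch_add_explicit(&c->hits, 1, memory_order_relaxed);
    else
        atomic_fetch_add_explicit(&c->misses, 1, memory_order_relaxed);
}
```

These checks only rule out split locks. They do not prevent false sharing between hits and misses, which still live in the same cache line here.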

Checking for existing split locks is straightforward on supported hardware: boot with split_lock_detect=warn, run the workload, and inspect dmesg. The output includes process name, PID, and the instruction pointer of the offending instruction, which is usually enough to identify the source directly or to feed into addr2line.

The Architectural Lesson

Split locks are a case where the hardware’s performance model has a hard discontinuity. Cache-coherent locking scales reasonably well: contention is a function of how many cores compete for the same cache line, and the coherence traffic stays proportional to that competition. Bus locking does not scale; it serializes the entire machine regardless of which cores actually need the contested data.

The x86 architecture chose to support unaligned locked accesses at all rather than trap them as errors, which is a backward-compatibility decision that predates multicore systems. The Tremont detection feature and the Linux kernel integration represent the ecosystem eventually catching up to the cost of that decision. Measuring the penalty empirically, as the Chips and Cheese investigation does, puts numbers on behavior that too many developers have absorbed only as vague folklore: misaligned atomics are slow in the way that bus contention is slow, meaning the problem compounds under load and the cost shows up where you are least likely to look for it.