
The Split Lock Tax: Why One Misaligned Atomic Can Stall Your Whole x86-64 System

Source: lobsters

The Intel LOCK prefix is older than most of the software running on it. It dates back to the original 8086, where asserting LOCK meant physically driving a /LOCK pin low on the chip package, holding the entire system bus and preventing any other device, whether another CPU or a DMA controller, from accessing memory for the duration of the instruction. That mechanism sounds absurdly blunt by modern standards. It was. But it was also the only way to implement atomic read-modify-write operations on a shared bus architecture.

Modern x86-64 systems are nothing like the 8086 bus topology. They have private L1 and L2 caches, shared L3 caches, and a cache coherence protocol (some variant of MESI or its derivatives) that handles concurrent access between cores without ever touching a physical bus lock. For the common case of a LOCK-prefixed instruction accessing a single, cache-line-aligned memory location, the CPU just follows the coherence protocol: it brings the line into the Modified state, performs the operation, and the line stays coherent. Fast, scalable, invisible to other cores that are not actually contending on that address.

The ghost of the 8086 bus lock survives, however, in exactly one situation: when a LOCK-prefixed instruction accesses data that spans two cache lines.

Cache Lines and the Alignment Contract

On x86-64, a cache line is 64 bytes. A naturally aligned 8-byte value sits entirely within a single 64-byte block, because any address divisible by 8 is at most 56 bytes from the start of its cache line. A 4-byte value at a 4-byte-aligned address works the same way. The alignment ensures the value fits inside one line.

The problem arises when alignment is broken. A 4-byte integer at byte offset 62 within a cache line occupies bytes 62 and 63 of that line, then bytes 0 and 1 of the next. When code performs an atomic operation on that address with the LOCK prefix, the CPU faces a fundamental constraint: the cache coherence protocol can ensure exclusive access to one cache line at a time, not two simultaneously. Atomicity across two lines cannot be guaranteed by MESI alone.
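The boundary arithmetic can be sketched as a small C helper. This is only an illustration: the name crosses_cache_line and the CACHE_LINE_SIZE constant are made up for this example, and the line size is assumed to be the 64 bytes described above.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: 64-byte cache lines, as on current x86-64 parts. */
#define CACHE_LINE_SIZE 64u

/* Hypothetical helper: true if the byte range [addr, addr + size)
 * touches two cache lines. */
bool crosses_cache_line(uintptr_t addr, size_t size)
{
    uintptr_t offset = addr % CACHE_LINE_SIZE; /* position within the line */
    return size > 0 && offset + size > CACHE_LINE_SIZE;
}
```

With these definitions, crosses_cache_line(62, 4) is true for the 4-byte integer at offset 62, while crosses_cache_line(56, 8) is false because a naturally aligned 8-byte value ends exactly at the line boundary.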

The fallback is to reassert the legacy bus lock behavior: lock the memory bus (or its modern equivalent, the coherence interconnect) for the entire duration of the operation, preventing any other agent from accessing any memory. This is a split lock, and it is the subject of recent microarchitectural investigation that makes clear how severe the penalty remains on modern hardware.

What Actually Happens During a Split Lock

The performance cost breaks down into two components. First, there is the direct stall on the core performing the split lock. The CPU must drain its pipeline, acquire exclusive access to both cache lines across two separate cache line state machines, perform the operation, and then release the lock. The cycle cost for this alone can run into the hundreds of cycles on modern microarchitectures.

Second, and more damaging at a system level, is the broadcast effect. The bus lock stalls memory traffic across all cores on the socket, not just the one performing the operation. Any core that attempts a memory access while the split lock is in flight must wait. On a system with many cores, this turns one misaligned atomic into a system-wide stall. Under contention, where multiple threads might be hitting the same split-locked address, the performance degradation compounds in a way that is non-linear and nearly impossible to attribute cleanly in a profiler.

Intel’s own optimization manual has noted for years that split locks should be avoided. The documentation describes the bus locking behavior and recommends natural alignment for all shared atomic variables. This advice has been in the manual for decades; the question is whether everyone writing lock-free data structures has read it.

How Split Locks Happen in Real Code

The most common sources are packed structs and pointer arithmetic that discards alignment information.

#include <stdint.h>

// Packed struct: the compiler will not insert padding.
// 'counter' may land exactly on a cache line boundary.
struct __attribute__((packed)) stats {
    char label[62];
    int32_t counter;   // starts at offset 62, spans cache line boundary
};

// Aligning the instance to 64 bytes pins 'counter' across two lines.
static struct stats s __attribute__((aligned(64)));

// Somewhere later, a lock-free increment:
void bump(void) {
    __atomic_fetch_add(&s.counter, 1, __ATOMIC_SEQ_CST);
    // s is cache-line aligned, so this is a split lock.
}

The same problem occurs in C++ with #pragma pack, in hand-written assembly using LOCK-prefixed instructions on calculated addresses, and in any code that casts a misaligned pointer to an atomic type. The compiler has no obligation to warn about this, and in the general case it cannot: alignment is a runtime property of the address, not the type.
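Because alignment is only known at run time, one option is a debug-time guard in front of the atomic operation. The sketch below is an assumption about how such a guard might look, not something taken from a particular library: is_naturally_aligned and checked_fetch_add_i32 are hypothetical names, and the only real API used is the GCC/Clang __atomic_fetch_add builtin.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical debug helper: true if p is aligned to `size`,
 * where size is a power of two (1, 2, 4, 8, 16). */
bool is_naturally_aligned(const void *p, size_t size)
{
    return size != 0 && ((uintptr_t)p & (size - 1)) == 0;
}

/* Hypothetical guard: performs the atomic add only when addr is 4-byte
 * aligned. On a misaligned address it returns false and touches nothing,
 * instead of issuing a locked add that could span two cache lines. */
bool checked_fetch_add_i32(void *addr, int32_t delta, int32_t *old)
{
    if (!is_naturally_aligned(addr, sizeof(int32_t)))
        return false;
    *old = __atomic_fetch_add((int32_t *)addr, delta, __ATOMIC_SEQ_CST);
    return true;
}
```

Natural alignment is stricter than strictly necessary (a 4-byte value at offset 2 is misaligned but stays inside one line), but it is the simplest rule that rules out every split.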

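The C11 _Atomic qualifier and the <stdatomic.h> atomic_* functions do not help here either. A compiler aligns the objects it declares as _Atomic, but that protection is gone once a misaligned pointer is cast to an atomic type. The functions guarantee atomicity in the memory model sense, and they say nothing about the hardware cost of achieving that atomicity when the address happens to be misaligned.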

Intel’s Hardware Detection: SPLIT_LOCK_DETECT

Intel added a hardware mechanism to detect split locks starting with the Tremont Atom microarchitecture, and it is present on Tiger Lake (11th generation, 2020) and later client parts. Support is enumerated by the SPLIT_LOCK_DETECT bit in the IA32_CORE_CAPABILITIES MSR, and the OS turns the feature on by setting the corresponding enable bit in the TEST_CTRL MSR. Once enabled, the CPU raises an #AC (Alignment Check) fault whenever a split lock would otherwise occur. This converts a silent performance disaster into a detectable event.
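For orientation, the bit layout can be written down as plain C. The MSR numbers and bit positions below follow the Linux kernel's msr-index.h as I understand it (enumeration in bit 5 of IA32_CORE_CAPABILITIES at 0xCF, enable in bit 29 of TEST_CTRL at 0x33); treat them as assumptions to verify against the Intel SDM. The function names are hypothetical, and the code only manipulates values: actually reading or writing an MSR requires ring 0.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed values, mirroring Linux's msr-index.h. */
#define MSR_IA32_CORE_CAPS               0x000000cfu
#define CORE_CAPS_SPLIT_LOCK_DETECT_BIT  5
#define MSR_TEST_CTRL                    0x00000033u
#define TEST_CTRL_SPLIT_LOCK_DETECT_BIT  29

/* True if a raw IA32_CORE_CAPABILITIES value enumerates the feature. */
bool core_caps_has_split_lock_detect(uint64_t core_caps)
{
    return ((core_caps >> CORE_CAPS_SPLIT_LOCK_DETECT_BIT) & 1u) != 0;
}

/* Returns the TEST_CTRL value with the split lock #AC enable bit
 * set or cleared; all other bits are preserved. */
uint64_t test_ctrl_set_split_lock_detect(uint64_t test_ctrl, bool enable)
{
    uint64_t mask = UINT64_C(1) << TEST_CTRL_SPLIT_LOCK_DETECT_BIT;
    return enable ? (test_ctrl | mask) : (test_ctrl & ~mask);
}
```

Keeping the enumeration check and the enable step separate mirrors the hardware split: one register says whether the feature exists, the other controls whether it is active.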

The #AC exception was originally designed for general misaligned access detection, gated by the AC flag in RFLAGS and the AM bit in CR0. Intel repurposed the exception delivery mechanism for split lock detection specifically, though the two mechanisms are separate: enabling general alignment checking in user mode does not enable split lock detection, and vice versa.

Intel also added a related feature called Bus Lock Detection on newer processors. Instead of an #AC fault before the access, it delivers a #DB (debug) trap after a user-mode instruction has taken a bus lock. Bus locks cover a broader class of events than split locks (locked accesses to uncacheable memory also generate bus locks), but split locks are the most common source in ordinary application code.

The Linux Kernel Response

Linux added split lock detection support in kernel 5.7, controlled by the split_lock_detect boot parameter. The available modes are:

  • off: detection disabled, the CPU performs split locks silently
  • warn: log a rate-limited warning to the kernel log and allow the process to continue
  • fatal: send SIGBUS to the offending process
  • ratelimit:N: cap bus locks at N per second system-wide (applies to the #DB-based bus lock detection, not to the #AC split lock path)

The implementation lives in arch/x86/kernel/cpu/intel.c and related files. When the CPU supports the SPLIT_LOCK_DETECT feature, the kernel sets the detection bit during CPU initialization and installs a #AC handler that dispatches to the configured mode.
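From user space, one way to see whether the running kernel has the feature active is to look for the split_lock_detect flag in /proc/cpuinfo. The sketch below assumes that flag name and the usual "flags : ..." line format; line_has_flag and cpuinfo_has_flag are hypothetical helpers written for this example.

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* True if `flag` appears as a whole whitespace-separated token in `line`. */
bool line_has_flag(const char *line, const char *flag)
{
    size_t n = strlen(flag);
    if (n == 0)
        return false;
    for (const char *p = strstr(line, flag); p != NULL; p = strstr(p + 1, flag)) {
        bool starts = (p == line) || p[-1] == ' ' || p[-1] == '\t';
        bool ends = p[n] == '\0' || p[n] == ' ' || p[n] == '\n';
        if (starts && ends)
            return true;
    }
    return false;
}

/* Checks the first "flags" line of a cpuinfo-style file.
 * Returns false if the file cannot be opened. */
bool cpuinfo_has_flag(const char *path, const char *flag)
{
    char line[8192];
    bool found = false;
    FILE *f = fopen(path, "r");
    if (f == NULL)
        return false;
    while (fgets(line, sizeof line, f) != NULL) {
        if (strncmp(line, "flags", 5) == 0) {
            found = line_has_flag(line, flag);
            break;
        }
    }
    fclose(f);
    return found;
}
```

A call such as cpuinfo_has_flag("/proc/cpuinfo", "split_lock_detect") then reports whether the first listed CPU advertises the flag. Matching whole tokens matters here because flag names often share prefixes and suffixes.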

For virtual machines, the behavior is more nuanced. A hypervisor may choose to emulate split locks for guest compatibility rather than propagating the fault, particularly when running older operating systems or workloads that were written without awareness of the detection mechanism. KVM contains explicit handling for split locks triggered by guests for exactly this reason.

The warn mode is the kernel default, chosen as a balance between catching regressions and not breaking legacy code that might inadvertently rely on split lock behavior. Setting fatal would expose any misaligned atomics in user-space libraries by killing the offending process, which is useful in a testing environment but risky in production without prior auditing.

AMD processors handle split locks differently. AMD does not have an equivalent of Intel’s SPLIT_LOCK_DETECT hardware feature in the same MSR-based form, though they still incur the performance penalty for split lock operations. The Linux kernel’s split lock detection path is conditional on Intel-specific capability bits.

Avoiding Split Locks in Practice

The fix is alignment. For shared atomic variables, ensure they are aligned to at least their natural alignment. For frequently contended hot variables, consider aligning to a full cache line to also prevent false sharing.

// Natural alignment via _Alignas (C11); needs <stdint.h>:
_Alignas(4) _Atomic int32_t counter;

// Cache-line aligned to also prevent false sharing:
_Alignas(64) _Atomic int64_t hot_counter;

// In C++17 (needs <atomic> and <new>):
alignas(std::hardware_destructive_interference_size) std::atomic<int> counter;

For structs containing atomic members, avoid the packed attribute where the struct contains shared atomics. If you need packing for serialization, separate the atomic fields into their own naturally aligned struct.
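One way to sketch that separation in C11 is shown below. The struct and function names (stats_wire, stats_live, stats_to_wire) are hypothetical, and the packed attribute is the GCC/Clang extension used earlier in this article. The packed struct exists only as a serialization format; all atomic traffic goes to the unpacked one.

```c
#include <assert.h>
#include <stdalign.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Wire format: packed for serialization, never touched atomically. */
struct __attribute__((packed)) stats_wire {
    char    label[62];
    int32_t counter;
};

/* In-memory form: the compiler pads after 'label' so the atomic
 * field sits on its natural 4-byte boundary. */
struct stats_live {
    char            label[62];
    _Atomic int32_t counter;
};

static_assert(offsetof(struct stats_wire, counter) == 62,
              "wire layout is packed");
static_assert(offsetof(struct stats_live, counter) % alignof(int32_t) == 0,
              "atomic field is naturally aligned");

/* Snapshot the live counter into the packed wire struct. The store into
 * wire->counter is an ordinary (non-atomic) unaligned write. */
void stats_to_wire(struct stats_live *live, struct stats_wire *wire)
{
    memcpy(wire->label, live->label, sizeof wire->label);
    wire->counter = atomic_load(&live->counter);
}
```

The static assertions turn the layout assumptions into compile-time checks, so a later change that reintroduces packing around the atomic fails the build instead of silently producing split locks.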

GCC provides -Waddress-of-packed-member, which warns when taking the address of a packed struct member that would be misaligned. This catches some cases, but only when the address is taken directly, not when a previously computed misaligned pointer is passed to an atomic operation elsewhere.

In Rust, the standard library’s atomic types are naturally aligned by definition, but placing an atomic inside a #[repr(packed)] struct and then operating on it through a raw pointer in unsafe code produces the same hardware penalty. The type system does not protect you from misaligned atomics when you are operating in unsafe territory.

What the Microarchitectural Investigation Reveals

The Chips and Cheese investigation of split locks on x86 provides empirical cycle-count data across recent microarchitectures, filling in a gap that Intel’s documentation describes only in qualitative terms. The measurements confirm that the penalty is not a minor accounting error in a hot loop: it is the kind of slowdown that surfaces in production profiles if the code path is exercised with any regularity.

What makes split locks particularly difficult to diagnose is that the cost is not localized. A profiler attributes samples to the offending instruction on the core performing the split lock, but the wall-clock time lost by other cores waiting for the bus lock does not appear against any instruction in those cores’ profiles. The penalty is distributed invisibly across the system, which is one reason the Linux kernel added explicit detection rather than relying solely on profiling to find these.

The deeper issue is that x86’s commitment to backward compatibility means this 1970s-era bus locking mechanism persists in current silicon. The architecture has layered cache coherence, hyperthreading, NUMA topologies, and generations of optimization features on top of itself, but when you hand it a LOCK instruction that crosses a cache line, it falls back to something functionally equivalent to what the 8086 did. That is a remarkable amount of legacy to carry, and it has real costs for anyone who is not careful about alignment.
