
The x86 Split Lock: How a Misaligned Atomic Becomes a System-Wide Stall

Source: lobsters

The x86 atomic model makes a simple guarantee: a locked instruction executes indivisibly. No other CPU, DMA controller, or PCIe device should observe a partial read or write. What the guarantee leaves unstated is that enforcing it costs very different amounts depending on where the operand lands in memory.

The Chips and Cheese investigation into split locks measures this directly across multiple microarchitectures, and the numbers are striking enough to warrant understanding the full mechanism behind them.

The Two Locking Paths

On any modern x86 CPU, there are two ways to honor the atomicity guarantee for a LOCK-prefixed instruction.

The common path is the cache-line lock. When the operand fits entirely within a single cache line, the CPU acquires exclusive ownership of that line through the MESI coherency protocol, performs the read-modify-write, and releases ownership. The whole operation happens inside the coherency fabric and is invisible to the rest of the system. On cache-hot data, this costs roughly 10 to 20 cycles.

The split lock path activates when the operand crosses a 64-byte cache line boundary. Holding exclusive ownership of two cache lines at once is not something MESI handles directly, so the CPU falls back to a bus lock, the mechanism inherited from the LOCK# pin of the 1978 Intel 8086. The CPU asserts this lock on the memory bus (on current parts as a bus-lock protocol rather than a literal pin), preventing any other bus master from completing a memory transaction for the duration of the operation. Both cache lines are fetched, the read-modify-write completes, both are written back, and then LOCK# is de-asserted.
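The boundary condition that selects between the two paths is simple arithmetic. A minimal sketch, assuming a 64-byte line; `kCacheLine` and `crosses_cache_line` are names made up for this illustration, not part of any library:

```cpp
#include <cstddef>
#include <cstdint>

// Assumed cache line size; 64 bytes matches the x86 parts discussed here.
constexpr std::size_t kCacheLine = 64;

// True when the `size` bytes starting at `addr` touch two cache lines.
// Hypothetical helper for illustration; assumes 1 <= size <= kCacheLine.
constexpr bool crosses_cache_line(std::uintptr_t addr, std::size_t size) {
    return (addr % kCacheLine) + size > kCacheLine;
}
```

A 4-byte operand at offset 60 ends exactly on the boundary and stays on the cache-line path. The same operand at offset 61 spills one byte into the next line and takes the split lock path.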

That path costs somewhere between 500 and 1,000 cycles on a single core depending on microarchitecture. Against the 10-to-20-cycle baseline, that is roughly 25 to 100 times slower.

Why the SMP Case Is So Much Worse

Single-core performance is bad enough, but split locks became a serious concern only after multicore processors arrived, because of the scope of the LOCK# assertion. When one core asserts the bus lock, every other core stalls all of its pending memory operations, not just those touching the affected cache lines. Every core on the package that needs memory waits until the lock clears.

On a 2-socket server with 64 cores, a single split lock in one thread stalls 63 other cores. If multiple threads hit split locks simultaneously, they queue behind each other, each serializing the entire memory bus for hundreds of cycles. The effective throughput penalty in a high-thread-count system can exceed 500 times compared to properly aligned operations.

This is the finding the Chips and Cheese article documents empirically. Using a microbenchmark based on LOCK XADD with the operand offset varied one byte at a time relative to a cache line boundary, the latency cliff at the boundary crossing is consistent across Intel Skylake, Ice Lake, Alder Lake, and multiple AMD Zen generations. The exact cycle counts differ, but the performance collapse at misalignment is present on all of them.

The Alignment Rule That Prevents It

A split lock requires the operand to straddle a 64-byte boundary. That can only happen if the operand is not naturally aligned to its own size. A 4-byte operand at any 4-byte-aligned address can never span two 64-byte cache lines, since 64 is a multiple of 4. The same holds for 2, 8, and 16-byte operands: natural alignment is a complete preventive measure.
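The alignment argument can be checked exhaustively at compile time. The sketch below again assumes a 64-byte line, and its function names are invented for this example. It walks every naturally aligned offset for each power-of-two size and confirms none of them splits, then counts how many misaligned starting offsets within one line do split:

```cpp
#include <cstddef>

constexpr std::size_t kLine = 64;  // assumed cache line size

// True when `size` bytes starting at `offset` span two lines.
constexpr bool splits(std::size_t offset, std::size_t size) {
    return (offset % kLine) + size > kLine;
}

// Walk every naturally aligned offset across two lines; report any split.
constexpr bool aligned_can_split(std::size_t size) {
    for (std::size_t off = 0; off < 2 * kLine; off += size) {
        if (splits(off, size)) return true;
    }
    return false;
}

// Count the byte offsets within one line at which an operand would split.
constexpr std::size_t splitting_offsets(std::size_t size) {
    std::size_t n = 0;
    for (std::size_t off = 0; off < kLine; ++off) {
        if (splits(off, size)) ++n;
    }
    return n;
}

static_assert(!aligned_can_split(2) && !aligned_can_split(4) &&
              !aligned_can_split(8) && !aligned_can_split(16),
              "naturally aligned operands never straddle a line");
static_assert(splitting_offsets(8) == 7,
              "an 8-byte operand splits at offsets 57 through 63");
```

With arbitrary byte placement, 7 of the 64 possible offsets split an 8-byte operand. This is why a misaligned atomic tends to show up as an occasional stall rather than a consistent one.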

std::atomic<T> in C++ is naturally aligned by default for the lock-free integer and pointer sizes. Mainstream x86-64 implementations give these specializations alignof(std::atomic<T>) >= sizeof(T), so standard atomic operations cannot produce a split lock through ordinary code. The compiler will never generate one on your behalf when you use the standard library correctly.
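You can verify the alignment property on your own toolchain instead of taking it on trust. A small sketch, where `naturally_aligned_atomic` is a name invented here and the assertions only cover the common integer widths:

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical trait: does the implementation align std::atomic<T>
// at least as strictly as the size of T?
template <typename T>
constexpr bool naturally_aligned_atomic =
    alignof(std::atomic<T>) >= sizeof(T);

static_assert(naturally_aligned_atomic<std::uint16_t>, "2-byte atomics");
static_assert(naturally_aligned_atomic<std::uint32_t>, "4-byte atomics");
static_assert(naturally_aligned_atomic<std::uint64_t>, "8-byte atomics");
```

If one of these assertions fails on a given target, atomics of that width deserve a closer look there.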

The main hazard is packed structs. Annotating a struct with __attribute__((packed)) or #pragma pack allows fields to be placed at arbitrary byte offsets, which means an atomic-sized field inside a packed struct can land at a misaligned address.

// This is dangerous
#include <stdint.h>

struct __attribute__((packed)) Counter {
    uint8_t  tag;      // offset 0
    uint64_t count;    // offset 1: not 8-byte aligned
};

// An atomic increment on count issues a LOCK-prefixed
// instruction to an 8-byte operand at a non-8-aligned address.
// If that address falls at offset 57 through 63 within a 64-byte
// cache line, the operand spills into the next line and you get
// a split lock. Whether it does depends on where the struct lands
// in memory, which means the bug is intermittent.
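To make the layout dependence concrete, the sketch below lays the packed struct out in an array and lists which elements would have a `count` field straddling a line. It assumes the array base is 64-byte aligned and that the compiler is GCC or Clang (for `__attribute__((packed))`). `splitting_indices` and `AlignedCounter` are names introduced only for this example:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

struct __attribute__((packed)) Counter {
    std::uint8_t  tag;
    std::uint64_t count;
};
static_assert(sizeof(Counter) == 9, "packed layout: 1 + 8 bytes");
static_assert(offsetof(Counter, count) == 1, "count sits at offset 1");

// Indices of a Counter array (base assumed 64-byte aligned) whose
// count field crosses a 64-byte line boundary.
std::vector<std::size_t> splitting_indices(std::size_t n) {
    std::vector<std::size_t> out;
    for (std::size_t i = 0; i < n; ++i) {
        std::size_t start = i * sizeof(Counter) + offsetof(Counter, count);
        if (start % 64 + sizeof(std::uint64_t) > 64) out.push_back(i);
    }
    return out;
}

// One possible fix: drop the packing and force natural alignment.
struct AlignedCounter {
    std::uint8_t tag;
    alignas(8) std::uint64_t count;
};
static_assert(offsetof(AlignedCounter, count) == 8, "count is 8-byte aligned");
```

Under these assumptions, 7 of the first 64 elements split, starting at index 14. The other 57 are misaligned but stay inside one line, so a test that only touches the first few elements would never see the stall.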

JIT-compiled runtimes are another historical source. Older JVM releases and some .NET CLR versions had cases where object field layouts or array element access patterns produced misaligned atomic operations in hot paths. These were corrected after the issue gained broader visibility, but any runtime that generates native code must track alignment through its allocator and field placement logic to stay safe.

Manual pointer arithmetic that discards alignment information is a third path. Casting a char* offset into the middle of a buffer to an atomic<uint32_t>* and then operating on it is undefined behavior in C++ and can produce a split lock in practice.
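A safer pattern for carving an atomic out of a raw buffer is to round the address up to the required alignment and construct the object there, rather than casting whatever offset you happen to have. A minimal sketch using `std::align` and placement new; `make_aligned_counter` is a hypothetical helper, not an existing API:

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <new>

// Hypothetical helper: construct an atomic counter at the first suitably
// aligned address at or after `p`, using at most `space` bytes.
// Returns nullptr when the aligned object does not fit.
inline std::atomic<std::uint32_t>* make_aligned_counter(void* p,
                                                        std::size_t space) {
    using A = std::atomic<std::uint32_t>;
    void* aligned = std::align(alignof(A), sizeof(A), p, space);
    if (aligned == nullptr) return nullptr;
    return new (aligned) A(0);  // begins the object's lifetime properly
}
```

Called with a pointer 61 bytes into a 64-byte-aligned buffer, this skips three padding bytes and places the counter at offset 64. The counter then sits in a single cache line.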

How the Linux Kernel Responds

For most of the multicore era, split locks executed silently. The CPU would take the expensive bus lock path, software would observe a performance anomaly, and without careful instrumentation the cause would remain invisible.

Around 2018 to 2020, cloud infrastructure teams observed that a guest VM could deliberately issue split locks to degrade host performance. Because the LOCK# stall affects the entire physical machine, a tenant in one VM could impose memory serialization on other tenants. This elevated what had been a performance-tuning concern to a reliability and security issue.

Intel responded by adding hardware split lock detection to Tremont and Ice Lake in 2019. The mechanism is controlled by bit 29 of MSR_TEST_CTRL (MSR address 0x33). When SPLIT_LOCK_DETECT is set, the CPU raises an #AC (Alignment Check) fault before executing the split lock rather than silently proceeding. The OS then has a hook to intercept and handle the event. The feature is enumerated through bit 5 of the IA32_CORE_CAPABILITIES MSR (address 0xCF), whose presence is in turn advertised by CPUID.(EAX=7, ECX=0):EDX[30], and it appears in /proc/cpuinfo as split_lock_detect on supported hardware.
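The bit arithmetic and the /proc/cpuinfo check are easy to sketch. Reading the MSR itself needs privileged access and is not shown. The constants come from the paragraph above, while `split_lock_detect_enabled` and `has_cpu_flag` are hypothetical helpers that operate on values you have already obtained:

```cpp
#include <cstdint>
#include <sstream>
#include <string>

constexpr std::uint32_t kMsrTestCtrl = 0x33;               // MSR_TEST_CTRL
constexpr std::uint64_t kSplitLockDetectBit = 1ull << 29;  // SPLIT_LOCK_DETECT

// Interprets a value previously read from MSR_TEST_CTRL.
constexpr bool split_lock_detect_enabled(std::uint64_t test_ctrl_value) {
    return (test_ctrl_value & kSplitLockDetectBit) != 0;
}

// True if a whitespace-separated flags line (as found in /proc/cpuinfo)
// contains `flag` as a whole token.
inline bool has_cpu_flag(const std::string& flags_line,
                         const std::string& flag) {
    std::istringstream in(flags_line);
    std::string token;
    while (in >> token) {
        if (token == flag) return true;
    }
    return false;
}
```

Matching on whole tokens matters here: a plain substring search for split_lock would also match unrelated flags that happen to share the prefix.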

Linux 5.7 (2020) added support via the split_lock_detect= boot parameter, with the implementation primarily from Tony Luck at Intel. The parameter takes three values:

  • off: detection disabled
  • warn: log the offending process name and PID, allow the operation to complete
  • fatal: send SIGBUS to the offending process

When the #AC handler fires, the kernel clears SPLIT_LOCK_DETECT so the instruction can complete without looping, applies the configured policy, then re-enables the detection bit on the next context switch via a per-task flag. A later patch added a ratelimit:N mode that caps the rate of bus locks instead of killing the offender, accommodating workloads that produce occasional locks without being worth killing. That mode applies on CPUs with the separate bus lock detection feature rather than to #AC-based split lock detection.
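The warn-mode sequence described above can be modeled as a tiny state machine. This is a toy model for illustration only, not kernel code: every type and function name below is invented, and it ignores SMT siblings, rate limiting, and everything else the real handler deals with.

```cpp
enum class Policy { Off, Warn, Fatal };
enum class Outcome { Completed, Logged, Killed };

struct ToyCpu  { bool detect_enabled; };             // models the MSR bit
struct ToyTask { bool reenable_on_switch = false; }; // models the per-task flag

// Models one split lock executed by `task` on `cpu`.
inline Outcome on_split_lock(Policy policy, ToyCpu& cpu, ToyTask& task) {
    if (policy == Policy::Off || !cpu.detect_enabled) {
        return Outcome::Completed;   // silent bus lock, no #AC raised
    }
    if (policy == Policy::Fatal) {
        return Outcome::Killed;      // the real kernel sends SIGBUS here
    }
    cpu.detect_enabled = false;      // let the faulting instruction retry
    task.reenable_on_switch = true;  // remember to restore detection
    return Outcome::Logged;          // warn: log process name and PID
}

// Models the context switch away from `prev`.
inline void on_context_switch(ToyCpu& cpu, ToyTask& prev) {
    if (prev.reenable_on_switch) {
        cpu.detect_enabled = true;
        prev.reenable_on_switch = false;
    }
}
```

In this model a second split lock by the same task before the next context switch completes silently. That is the trade-off the clear-then-restore approach makes to avoid a fault loop.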

The warn mode proved important because some real-world software depended on misaligned atomics. Certain Windows binaries running under Wine and older JVM-based tools generated split locks in hot paths and would be broken by fatal mode. The warn default gives operators visibility without breaking production.

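A separate but related mechanism covers bus locks that fall outside what SPLIT_LOCK_DETECT catches, particularly locked accesses to uncacheable (UC) or write-combining (WC) memory regions, which do not raise #AC. Newer Intel CPUs add a bus lock detection feature that raises a #DB (debug) trap after the locked instruction completes, regardless of what caused the bus lock. Performance counter events for bus locks also exist on some CPUs, but their names vary by microarchitecture, so check perf list on the target machine rather than assuming a bus-lock event is available.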

Finding Split Locks in Practice

With a supported CPU and Linux, the most direct approach is split_lock_detect=warn at boot. Any split lock anywhere in the system produces a kernel log entry with the process name and PID. For testing a specific binary this is immediate.

With perf on Ice Lake or newer, mem_inst_retired.split_loads and mem_inst_retired.split_stores count all split memory accesses including non-atomic ones. The bus-lock event specifically counts the expensive bus lock path. Intel VTune’s Platform Analysis mode can identify split lock hotspots at the source line level, which is useful when the offending code is generated by a JIT and the kernel log entry only gives a process name.

For static analysis, the risk concentrates around __attribute__((packed)) structs combined with atomic operations, and around reinterpret_cast patterns that take a pointer to non-atomic storage and treat it as atomic. A source audit searching for packed struct definitions that contain fields of atomic width is a reasonable starting point.
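Part of that audit can be pushed into the compiler. The sketch below defines a hypothetical `IS_NATURALLY_ALIGNED` macro that checks both the field offset and the alignment of the enclosing type, since a packed struct has alignment 1 and can itself start at any address. The struct names are examples only:

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical audit macro: the field must sit at a multiple of its own
// size, and the enclosing type must be aligned at least that strictly.
#define IS_NATURALLY_ALIGNED(Type, field)              \
    (offsetof(Type, field) % sizeof(Type::field) == 0 && \
     alignof(Type) >= sizeof(Type::field))

struct __attribute__((packed)) PackedStats {
    std::uint8_t  tag;
    std::uint64_t hits;   // offset 1
};

struct Stats {
    std::uint8_t tag;
    alignas(8) std::uint64_t hits;   // offset 8
};

static_assert(IS_NATURALLY_ALIGNED(Stats, hits),
              "fields used with atomic operations must be naturally aligned");
```

Placing a `static_assert` like the last line next to each struct whose fields are touched atomically turns a layout regression into a build failure. The same assertion on `PackedStats` would fail to compile.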

Forty Years of Bus Locking

Split locks have been architecturally possible since the 8086. On the Intel486 and Pentium, a locked instruction always asserted LOCK# on the bus. The P6 family, starting with the Pentium Pro (1995), introduced cache locking for operands held within a single cache line, and from that point the performance distinction became meaningful. Intel's optimization manuals have noted the penalty since the P6 era, framing it as a programmer-beware issue.

What changed is scale. A split lock on a uniprocessor is an obscure performance bug that affects one process on one machine. On a 64-core server in a cloud environment shared across tenants, the same instruction is a mechanism for one process to impose system-wide memory serialization on every other process on the physical host. The hardware detection that landed in 2019 and the kernel policy infrastructure that followed in Linux 5.7 are the industry’s response to a threat model that did not exist when the instruction set was designed.

The practical upshot is simple: keep atomic operands naturally aligned. The standard library guarantees this for you when you use it correctly. Packed structs with atomic-width fields are the common footgun, and they are worth auditing explicitly because the failures they produce are intermittent, layout-dependent, and easy to miss without the right instrumentation.
