The Hidden Tax of Misaligned Atomics: Split Locks on x86-64

Source: lobsters

Most developers working with atomics on x86 think about the LOCK prefix as a fairly cheap operation. Cache coherency protocols handle it, you pay a few cycles for MESI to do its work, and you move on. That mental model is correct for the common case. It breaks down completely the moment a locked access crosses a 64-byte cache line boundary, and the result is one of the more dramatic single-instruction performance cliffs in modern microarchitecture.

The Chips and Cheese investigation into split locks measures this empirically across several microarchitectures. The numbers are worth sitting with before diving into the mechanism.

What a Cache Line Boundary Actually Means for Atomicity

Cache coherency on x86 is managed through the MESI protocol (and extended variants like MESIF on Intel and MOESI on AMD). When a core wants to perform a locked read-modify-write on a memory location, the cache controller negotiates exclusive ownership of the relevant cache line. Other cores holding that line in shared state are invalidated. The operation proceeds within the L1 cache, and no bus signal ever goes out. This is fast. On modern cores, a lock xchg on a hot, well-aligned address costs roughly 40 cycles.

The problem with a split lock is geometric. If a 4-byte or 8-byte locked access starts 2 bytes before the end of a cache line, it spans two lines. MESI cannot atomically transfer ownership of two cache lines simultaneously. The protocol simply doesn’t have that operation. To maintain the atomicity guarantee the x86 ISA has promised since the 8086 era, the CPU falls back to a different mechanism: a bus lock. Historically this meant asserting the LOCK# signal on the external memory bus. Modern parts no longer have a single shared bus, but they implement an equivalent that serializes all memory traffic across every core for the duration of the access.
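The geometry is easy to check in code. The following is a minimal sketch, assuming 64-byte cache lines; the helper names crosses_cache_line and bytes_in_lower_line are made up for illustration and do not come from any library.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE_SIZE 64u  /* assumption: 64-byte lines, as in the article */

/* Returns 1 if the access [addr, addr + size) touches two cache lines. */
int crosses_cache_line(uintptr_t addr, size_t size)
{
    assert(size > 0 && size <= CACHE_LINE_SIZE);
    uintptr_t first_line = addr / CACHE_LINE_SIZE;
    uintptr_t last_line  = (addr + size - 1) / CACHE_LINE_SIZE;
    return first_line != last_line;
}

/* Number of bytes of the access that fall in the lower (first) line. */
size_t bytes_in_lower_line(uintptr_t addr, size_t size)
{
    size_t room = CACHE_LINE_SIZE - (size_t)(addr % CACHE_LINE_SIZE);
    return size < room ? size : room;
}
```

For the case described above, a 4-byte access at offset 62 crosses a line with 2 bytes on each side, while the same access at offset 60 stays inside one line.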

This is a system-wide stall. Every core that wants to touch memory waits. The penalty scales with core count in the worst case, and on a busy multi-socket system it can reach into the microsecond range for a single instruction.

The Hardware Path for a Split Lock

The core recognizes that a LOCK-prefixed instruction will cross a cache line boundary before the memory access is issued. It then:

  1. Drains the store buffer and ensures all prior memory operations are visible
  2. Asserts the system bus lock, blocking all other agents from issuing transactions
  3. Performs the first half of the access (the portion in the lower cache line)
  4. Performs the second half (the portion in the upper cache line)
  5. Deasserts the bus lock

The total latency includes two separate cache line accesses plus the serialization cost. On Skylake-era hardware, measurements from Intel’s optimization manual and third-party benchmarks suggest split lock latency in the range of 1,000 to 5,000 cycles, versus roughly 40 cycles for a non-split lock. That is a 25x to 125x penalty for a single instruction, and the ratio only gets worse under contention.

AMD hardware behaves similarly in principle, though the exact microarchitectural path differs. AMD’s implementation historically caused the bus lock assertion to be visible at the APIC level, where it could delay interrupt delivery in addition to stalling memory traffic.

Why This Still Exists

The question of why x86 still supports split locks at all is reasonable. The answer is backward compatibility at its most uncompromising. Code from the DOS era, BIOS implementations, older JIT compilers that didn’t align their atomic emission correctly, and various firmware blobs have relied on split lock behavior for decades. Intel and AMD cannot remove it without breaking a long tail of software.

What Intel did instead, starting around the Ice Lake and Tremont generations, was add a detection and fault mechanism. Bit 29 of the MSR_TEST_CTRL model-specific register (address 0x33) controls SPLIT_LOCK_DETECT behavior. When set, the CPU raises a #AC (Alignment Check) exception instead of silently performing the bus lock. This gives the OS a hook to either terminate the offending process, emulate the access without a bus lock, or at minimum log a diagnostic.
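As a rough illustration of what checking that control bit looks like from user space, here is a sketch that reads the register through the Linux msr driver (/dev/cpu/N/msr, which needs the msr module and root). Bit 29 is the position the Linux kernel headers use for SPLIT_LOCK_DETECT; the function names are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

#define MSR_TEST_CTRL          0x33u  /* register address given in the article */
#define SPLIT_LOCK_DETECT_BIT  29     /* assumption: position used by Linux headers */

/* Pure helper: is the split lock #AC enable bit set in a raw MSR value? */
int split_lock_detect_enabled(uint64_t test_ctrl)
{
    return (int)((test_ctrl >> SPLIT_LOCK_DETECT_BIT) & 1u);
}

/* Reads MSR_TEST_CTRL on one CPU via the msr driver, where the file
 * offset is the MSR address. Returns 0 on success, -1 on any failure
 * (driver not loaded, insufficient privileges, MSR not implemented). */
int read_test_ctrl(int cpu, uint64_t *value)
{
    char path[64];
    snprintf(path, sizeof path, "/dev/cpu/%d/msr", cpu);

    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    ssize_t n = pread(fd, value, sizeof *value, MSR_TEST_CTRL);
    close(fd);
    return n == (ssize_t)sizeof *value ? 0 : -1;
}
```

On CPUs without the feature the read simply fails, so a caller should treat -1 as "unknown" rather than "disabled".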

Linux merged support for this in kernel 5.7 via work from Intel engineers including Tony Luck and Fenghua Yu. The split_lock_detect boot parameter controls behavior:

  • off: no detection, silent bus lock (legacy behavior)
  • warn: log a rate-limited warning, allow the operation to proceed
  • fatal: send SIGBUS to the offending process
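  • ratelimit:N: cap bus locks system-wide at N per second; this mode belongs to the related bus lock detection (#DB) path rather than to #AC split lock detection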
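To see which mode a running system was booted with, one option is to look for the parameter in the kernel command line (for example the contents of /proc/cmdline). The parser below is a small sketch with a hypothetical name, split_lock_mode. It returns the first occurrence only, which is a simplification and not necessarily how the kernel resolves duplicates.

```c
#include <stddef.h>
#include <string.h>

/* Copies the value of split_lock_detect=<value> from a kernel command
 * line into out. Returns 1 if the parameter is present, 0 otherwise. */
int split_lock_mode(const char *cmdline, char *out, size_t out_len)
{
    static const char key[] = "split_lock_detect=";
    const size_t key_len = sizeof key - 1;
    const char *p = cmdline;

    if (out_len == 0)
        return 0;

    while ((p = strstr(p, key)) != NULL) {
        /* Accept a match only at the start of a whitespace-separated token. */
        if (p == cmdline || p[-1] == ' ' || p[-1] == '\t') {
            const char *val = p + key_len;
            size_t n = strcspn(val, " \t\n");
            if (n >= out_len)
                n = out_len - 1;   /* truncate to fit the buffer */
            memcpy(out, val, n);
            out[n] = '\0';
            return 1;
        }
        p += key_len;
    }
    return 0;
}
```

A return value of 0 means the parameter was not given explicitly, in which case the kernel's built-in default applies.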

Upstream kernels default to warn on hardware that supports detection, and distributions can override that. Fedora and RHEL have shipped with detection enabled at the warn level in some configurations, which occasionally surfaced surprising findings: certain applications, including some older games running under Wine and specific Java workloads, were generating split locks at non-trivial rates.

The Virtualization Complication

Split locks interact badly with virtualization. When a guest VM performs a split lock, the hypervisor must decide how to handle the bus lock assertion. KVM gained explicit split lock handling with the introduction of #AC-based detection. Without it, a misbehaving guest could effectively denial-of-service the host by generating split locks in a tight loop, starving all other VMs of memory access.

The KVM implementation intercepts the #AC fault, determines whether it originated from a split lock condition (as opposed to a legitimate alignment fault from user code that set the AC flag), and either injects a #AC into the guest or emulates the instruction without a bus lock. The emulation path is expensive but bounded.
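The distinction rests on an architectural rule: a legacy alignment-check fault can only occur when CR0.AM and EFLAGS.AC are both set and the code is running at CPL 3. The sketch below expresses that test. The guest_state struct and function names are hypothetical and are not KVM's actual interfaces.

```c
#include <stdbool.h>
#include <stdint.h>

#define X86_CR0_AM     (1ull << 18)  /* CR0.AM: alignment mask */
#define X86_EFLAGS_AC  (1ull << 18)  /* EFLAGS.AC: alignment check */

/* Hypothetical snapshot of guest state at the time of the #AC. */
struct guest_state {
    uint64_t cr0;
    uint64_t rflags;
    unsigned cpl;   /* current privilege level, 0..3 */
};

/* A legacy #AC is only possible with CR0.AM = 1, EFLAGS.AC = 1, CPL = 3. */
bool legacy_ac_possible(const struct guest_state *g)
{
    return g->cpl == 3 &&
           (g->cr0 & X86_CR0_AM) != 0 &&
           (g->rflags & X86_EFLAGS_AC) != 0;
}

/* If a legacy #AC was impossible, the fault can only be a split lock. */
bool ac_must_be_split_lock(const struct guest_state *g)
{
    return !legacy_ac_possible(g);
}
```

When legacy_ac_possible returns true the cause is ambiguous, which is the case where forwarding the #AC to the guest is the conservative choice.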

Hyper-V handles this similarly through its own hardware-assisted virtualization path, and the Linux kernel’s handling has been refined through several patch series to correctly distinguish split lock faults from other alignment exceptions.

Alignment and the Compiler’s Role

For most application code, split locks are an accidental artifact rather than intentional design. They arise from:

  • Structs packed with __attribute__((packed)) that place atomic fields at non-natural alignments
  • Manual memory layout decisions that prioritize size over alignment
  • Dynamic allocation that returns memory aligned to 8 or 16 bytes, not necessarily 64, so a layout designed around cache-line alignment ends up shifted
  • Lock-free data structures where a field straddles the natural boundary of the containing struct

The C and C++ standards treat access to a misaligned atomic object as undefined behavior, and compilers give atomic types at least their natural alignment by default. But _Atomic in C and std::atomic in C++ don’t prevent you from placing an atomic member in a packed struct. The compiler may emit correct code by its own accounting while producing a split lock at runtime.

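A useful diagnostic pass on Linux is perf stat with a split lock event. Event names are microarchitecture-specific, so check perf list for what your CPU exposes; Skylake-family cores, for example, provide sq_misc.split_lock. On kernels with split_lock_detect set to warn, the kernel log also names each task that trips the detector. Seeing any non-zero count in production is worth investigating.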

// This produces a split lock when a locked read-modify-write hits
// 'counter' and the struct happens to start on a 64-byte boundary
struct __attribute__((packed)) bad_layout {
    char pad[62];
    _Atomic int counter;  // offset 62: bytes 62..65 span two cache lines
};

Fixing it is usually straightforward once identified: add alignment attributes, restructure the type, or pad to ensure atomic fields land within a single cache line.
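One possible shape for the fix, assuming C11 and 64-byte cache lines: drop the packed attribute, give the atomic field an explicit alignment, and add a compile-time check so the layout cannot silently regress. The FITS_IN_ONE_LINE macro and the good_layout name are illustrative, not an established idiom from any library.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>

#define CACHE_LINE_SIZE 64  /* assumption: 64-byte cache lines */

/* Same fields as bad_layout, but the counter starts on its own line. */
struct good_layout {
    char pad[62];
    alignas(CACHE_LINE_SIZE) _Atomic int counter;
};

/* True when a field sits entirely inside one cache line, provided the
 * enclosing object itself is allocated at its required alignment. */
#define FITS_IN_ONE_LINE(type, field)                                  \
    (offsetof(type, field) % CACHE_LINE_SIZE +                         \
         sizeof(((type *)0)->field) <= CACHE_LINE_SIZE)

/* Compile-time guard: the build fails if the layout regresses. */
static_assert(FITS_IN_ONE_LINE(struct good_layout, counter),
              "counter straddles a cache line");
```

The guarantee only holds if the object is actually placed at its declared alignment, so heap instances should come from aligned_alloc or an equivalent rather than plain malloc. The trade-off is size: the struct grows from 66 to 128 bytes.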

What the Measurements Reveal

Empirical work like the Chips and Cheese investigation is valuable here because it puts concrete cycle counts against what the architecture manuals describe only in qualitative terms. The findings across different microarchitectures confirm that the penalty is not uniform. Older cores tend to serialize more aggressively. Newer cores have tightened some of the surrounding pipeline behavior. But across all tested parts, the split lock cost is in a different order of magnitude from an aligned atomic, not just a modest surcharge.

The investigation also highlights variance under contention, which is where split locks go from a performance curiosity to a real systems problem. A single thread generating split locks occasionally is noise. Multiple threads generating them concurrently amplifies the bus lock duration because each one must wait for the previous to complete. The degradation is not linear.

Understanding the mechanism clarifies why the Linux kernel team treats split lock detection as a useful hardening feature rather than a narrow BIOS concern. In virtualized and containerized environments, one misbehaving workload can impose latency on everything else sharing the physical host. Making that behavior detectable and controllable is the right call, even if the defaults are gentle by necessity.