When a Misaligned Atomic Stops Every Core on the Socket

A few weeks ago, Chips and Cheese published an investigation into split locks on x86-64, benchmarking the penalty across Intel and AMD microarchitectures. The numbers are significant enough to be worth understanding from first principles, because the mechanism behind them reveals something fundamental about how cache coherency actually works.

The Cache Line Boundary Problem

Every x86-64 CPU uses 64-byte cache lines. When you issue a LOCK-prefixed instruction on an aligned address, the CPU handles atomicity entirely within the cache hierarchy: it acquires exclusive ownership of the relevant cache line (the Exclusive state in MESI terms, becoming Modified once written), holds the line for the duration of the read-modify-write, and the whole operation never touches the external memory bus. That’s why aligned atomics on modern hardware are relatively cheap, typically in the 10-20 nanosecond range under no contention.

The problem appears when the target address straddles a cache line boundary. A 4-byte LOCK XADD at offset 62 within a cache line needs bytes from two different cache lines. MESI cannot atomically lock two cache lines at once; the coherency protocol is fundamentally per-line. There is no “lock two lines” transaction in any standard coherency implementation.

So the CPU falls back to the bus lock. Historically this meant asserting the LOCK# signal on the external bus; modern parts implement the equivalent lock on the on-die interconnect, hold it for the duration of the read-modify-write across both lines, and release it only after both writes complete. On modern Intel ring-bus or mesh designs, this serializes the entire uncore fabric. Every other core waiting on a memory operation has to wait out your bus lock.

The condition for triggering a split lock is straightforward:

(address % 64) + operand_size > 64

For a 4-byte atomic, that means any address at offset 61, 62, or 63 within a cache line. For an 8-byte atomic, it’s offsets 57 through 63. CMPXCHG16B is the exception: it requires a 16-byte-aligned operand and raises #GP on a misaligned one, so it faults instead of split-locking.
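The condition above translates directly into a small predicate. This is a minimal sketch; the function name is_split_lock and the CACHE_LINE_SIZE constant are illustrative, and the 64-byte line size is the assumption stated earlier.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE_SIZE 64u  /* assumed line size, as in the text */

/* Returns true when an access of operand_size bytes starting at address
 * would span two cache lines, i.e. (address % 64) + operand_size > 64. */
bool is_split_lock(uintptr_t address, size_t operand_size)
{
    return (address % CACHE_LINE_SIZE) + operand_size > CACHE_LINE_SIZE;
}
```

For a 4-byte operand this returns true only at line offsets 61 through 63, and for an 8-byte operand at offsets 57 through 63, matching the ranges listed above.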

What the Penalty Looks Like in Practice

The performance gap between aligned and split-locked atomics is not subtle. An aligned LOCK XADD with no contention runs in roughly 10-20 ns. A split-locked LOCK XADD under the same conditions takes approximately 1,000-2,000 ns, around 100 times slower. Under contention with multiple cores hammering the same split-locked address, latencies can climb into the tens of microseconds per operation.
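A rough way to see the gap on your own machine is to time the same atomic at two offsets. The sketch below is not the benchmark used in the investigation; ns_per_atomic_add is a hypothetical helper, and it relies on the GCC/Clang __atomic builtins. Passing a splitting offset such as 62 deliberately performs misaligned atomics, which is formally undefined behavior in C and, on a kernel with split lock detection enabled, may be throttled or killed with SIGBUS.

```c
#define _POSIX_C_SOURCE 200809L
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Time `iters` atomic increments of a 4-byte counter placed `offset` bytes
 * into a 64-byte-aligned, 128-byte buffer (offset must be <= 124).
 * Returns the average nanoseconds per operation, or -1.0 if allocation fails.
 * Offset 0 is aligned; offset 62 straddles the boundary between the two lines. */
double ns_per_atomic_add(size_t offset, long iters)
{
    unsigned char *buf = aligned_alloc(64, 128);
    if (buf == NULL)
        return -1.0;
    memset(buf, 0, 128);

    uint32_t *counter = (uint32_t *)(buf + offset);
    struct timespec start, end;

    clock_gettime(CLOCK_MONOTONIC, &start);
    for (long i = 0; i < iters; i++)
        __atomic_fetch_add(counter, 1, __ATOMIC_SEQ_CST);
    clock_gettime(CLOCK_MONOTONIC, &end);

    free(buf);
    double elapsed_ns = (double)(end.tv_sec - start.tv_sec) * 1e9
                      + (double)(end.tv_nsec - start.tv_nsec);
    return elapsed_ns / (double)iters;
}
```

Comparing ns_per_atomic_add(0, n) with ns_per_atomic_add(62, n) on a test machine gives a single-threaded, uncontended figure for each case; the absolute numbers will vary with the microarchitecture and kernel settings.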

What makes this particularly damaging is the system-wide nature of the stall. When one thread holds a bus lock, every other core on the socket stalls for any memory operations that need to complete. This means a single misbehaving thread can degrade throughput across all cores, not just its own. On a 32-core server under a memory-intensive workload, one thread doing a tight loop of split-locked atomics can noticeably reduce the throughput of the other 31.

On multi-socket systems the situation is worse: the lock has to propagate across the QPI or UPI inter-socket link, which is both slower and more globally disruptive.

Intel vs. AMD Implementation

Both Intel and AMD implement bus locking for split locks, but the specifics differ. Intel’s implementation on its ring-bus and mesh interconnects serializes the entire uncore fabric. AMD, using its Infinity Fabric (the successor to HyperTransport), implements the same logical guarantee through its own interconnect locking mechanism. The Chips and Cheese investigation found measurably different timing characteristics between the two, though both confirm the same rough magnitude of penalty.

More practically, Intel and AMD diverge on detection hardware. Intel added a split lock detection mechanism in Tremont and Tiger Lake (2019-2020 timeframe), exposed through bit 29 of MSR_TEST_CTRL (address 0x33). Setting this bit causes the CPU to raise an #AC (Alignment Check) fault when an instruction at any privilege level triggers a split lock, giving the OS a hook to catch and handle them. AMD added an equivalent mechanism in Zen 4 (2022). Earlier AMD processors (Zen 1 through Zen 3) have no hardware detection capability at all.
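To check whether the detection bit is currently set, you can read the MSR through the Linux msr driver. This is a sketch under assumptions: it needs root, the msr kernel module loaded, and a CPU that implements MSR_TEST_CTRL (the read fails otherwise). The helper names read_msr and split_lock_detect_bit_set are illustrative, not part of any library.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

#define MSR_TEST_CTRL         0x33  /* MSR address given in the text */
#define SPLIT_LOCK_DETECT_BIT 29    /* bit position given in the text */

/* True if bit 29 is set in a raw MSR_TEST_CTRL value. */
bool split_lock_detect_bit_set(uint64_t test_ctrl)
{
    return ((test_ctrl >> SPLIT_LOCK_DETECT_BIT) & 1u) != 0;
}

/* Read one MSR on the given CPU via /dev/cpu/<cpu>/msr.
 * Returns 0 on success, -1 on any failure (no driver, no permission,
 * or the MSR is not implemented on this CPU). */
int read_msr(int cpu, uint32_t msr, uint64_t *value)
{
    char path[64];
    snprintf(path, sizeof path, "/dev/cpu/%d/msr", cpu);

    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    /* The msr driver uses the file offset as the MSR address. */
    ssize_t n = pread(fd, value, sizeof *value, (off_t)msr);
    close(fd);
    return n == (ssize_t)sizeof *value ? 0 : -1;
}
```

A caller would invoke read_msr(0, MSR_TEST_CTRL, &v) and, if it succeeds, pass v to split_lock_detect_bit_set. The bit is per-CPU state, so checking a single CPU only tells you about that CPU.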

Linux’s Response: split_lock_detect

Linux 5.7, released in May 2020, added the split_lock_detect feature, largely driven by Tony Luck at Intel. The kernel uses the MSR_TEST_CTRL mechanism to enable #AC faults for split locks on supported hardware, then handles those faults in the alignment check exception handler.

The behavior is configurable via a boot parameter:

split_lock_detect=off          # silent, legacy behavior
split_lock_detect=warn         # log a rate-limited warning for the offending process, continue
split_lock_detect=fatal        # send SIGBUS to the offending process
split_lock_detect=ratelimit:N  # cap bus locks at N per second system-wide (bus lock #DB detection only)

Linux 6.2 added the kernel.split_lock_mitigate sysctl, which controls at runtime whether warn mode also deliberately slows down offending tasks. The default mode in recent kernels is warn, which logs something like:

x86/split lock detection: #AC: myprocess/1234 took a split_lock trap at address: 0x7f...

This is useful for finding offenders in production without immediately killing them. The fatal mode is appropriate for environments where split locks represent a clear bug, such as new code under development or containers where the operator wants hard enforcement.

Handling kernel-mode split locks (ring 0) is more delicate, since the kernel cannot send itself a signal. A split lock in kernel code is treated as a kernel bug: in both warn and fatal modes the #AC handler produces an oops rather than a per-process warning. Part of the feature’s development included an audit of in-tree kernel code for misaligned atomic operations.

The Virtualization Problem

The security implications of split locks are most pronounced in multi-tenant virtualization. A guest VM executing split-locked atomics causes the physical host’s bus to be held, stalling all other VMs on that host. No privilege escalation is required. A normal user inside a VM can run:

while (1) {
    // atomic on address that straddles a cache line boundary
    __atomic_fetch_add(misaligned_ptr, 1, __ATOMIC_SEQ_CST);
}

and measurably degrade every other tenant’s memory-intensive workloads. This is a denial-of-service vector that predates split_lock_detect and was part of the motivation for the feature.

Intel addressed this more directly in Ice Lake Server (third-generation Xeon Scalable, 2021) with the Bus Lock VM Exit feature. It is a VMX feature, enabled through a secondary processor-based VM-execution control in the VMCS; CPUID leaf 0x7.0:ECX bit 24 enumerates the related but separate Bus Lock Detect (#DB) feature. When a hypervisor enables the control, a bus lock in a guest generates a VM exit (exit reason 74) instead of passing unnoticed. The VMM can then throttle, log, or terminate the offending guest.

KVM support for bus lock VM exits landed in Linux 5.12 (April 2021). QEMU was subsequently updated to expose KVM_CAP_X86_BUS_LOCK_EXIT so operators can configure per-VM policies. AMD’s equivalent virtualization support arrived later with Zen 4.

Finding Split Locks in Your Code

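The most reliable runtime approach is the Linux kernel’s own warning infrastructure. With split_lock_detect=warn enabled, any offending process will appear in dmesg. On Intel CPUs with PMU support, you can also count split lock events directly. The event name and encoding vary by microarchitecture, so check perf list first; on Skylake-family cores, for example, the event is sq_misc.split_lock:

perf stat -e sq_misc.split_lock ./program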

For more targeted analysis, perf c2c does not flag split locks itself, but it identifies contended cache lines, which narrows down where misaligned atomics are worth looking for:

perf c2c record ./program
perf c2c report

At the source level, the most common causes are packed structs with atomic members and pointer arithmetic that ends up misaligned:

// packed struct places counter at offset 1, so it is always misaligned
struct __attribute__((packed)) stats {
    char type;
    uint64_t counter;  // offset 1: a split lock whenever these 8 bytes straddle a 64-byte boundary
};

// or manual pointer arithmetic gone wrong
uint64_t *ptr = (uint64_t *)((char *)base + 1);  // misaligned; splits if ptr lands at line offset 57-63
__atomic_fetch_add(ptr, 1, __ATOMIC_SEQ_CST);

The fix is alignment. _Alignas(8) for 8-byte atomics, alignas(64) if you also want to avoid false sharing:

_Alignas(8) _Atomic uint64_t counter;
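Applied to the packed struct from earlier, the fix might look like the sketch below: drop the packed attribute, align the counter explicitly, and let the compiler check the layout at build time. The stats_bump helper is hypothetical and only there to show the counter in use.

```c
#include <assert.h>
#include <stdalign.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Unpacked layout: the counter is forced onto an 8-byte boundary, so an
 * 8-byte atomic on it can never straddle a 64-byte cache line. */
struct stats {
    char type;
    alignas(8) _Atomic uint64_t counter;
};

/* Fail the build if a later edit reintroduces a misaligned counter. */
static_assert(offsetof(struct stats, counter) % 8 == 0,
              "counter must be 8-byte aligned");
static_assert(alignof(struct stats) >= 8,
              "struct must keep the counter aligned in arrays");

/* Atomically increment the counter and return the new value. */
uint64_t stats_bump(struct stats *s)
{
    return atomic_fetch_add(&s->counter, 1) + 1;
}
```

The cost is seven bytes of padding after type. Swapping alignas(8) for alignas(64) would also give each counter its own cache line, at the price of a much larger struct.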

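For CMPXCHG16B (128-bit atomics), the alignment requirement is 16 bytes, and the instruction faults with #GP on a misaligned operand rather than splitting. The C and C++ standards leave the alignment of atomic types to the implementation, but mainstream compilers give std::atomic<T> and _Atomic objects their natural alignment, so using the standard library correctly protects you. The hazards come from packed structs, custom allocators returning insufficiently aligned memory, and hand-rolled inline assembly.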
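When memory comes from a custom allocator or from pointer arithmetic, a cheap runtime check in debug builds can catch a misaligned atomic before it reaches production. The following is a sketch with hypothetical helper names (is_aligned, checked_fetch_add); it uses the GCC/Clang __atomic builtins already shown above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* True if p is a multiple of alignment, which must be a power of two. */
bool is_aligned(const void *p, size_t alignment)
{
    return ((uintptr_t)p & (alignment - 1)) == 0;
}

/* Atomic add that asserts natural alignment first. With NDEBUG defined the
 * check compiles away and this is a plain atomic fetch-add. */
uint64_t checked_fetch_add(uint64_t *p, uint64_t value)
{
    assert(is_aligned(p, sizeof *p));
    return __atomic_fetch_add(p, value, __ATOMIC_SEQ_CST);
}
```

A naturally aligned 8-byte operand can never cross a 64-byte boundary, so asserting natural alignment is stricter than strictly necessary but far simpler to reason about than checking for line crossings at each call site.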

Why This Keeps Coming Up

Split locks have been a known issue since the original Pentium, but they surface periodically because the conditions that cause them are easy to create accidentally. The __attribute__((packed)) pattern is common in network code and serialization libraries. Custom memory allocators sometimes guarantee only 4-byte alignment. Code ported from 32-bit x86 may have assumptions that held on older hardware but become split locks on 64-bit atomics.

The Wine project ran into this concretely: Windows does not kill processes for misaligned atomics, so some Windows applications rely on the silent behavior. When those applications run under Wine on Linux with split_lock_detect=fatal, they crash. Wine added a compatibility mode to disable split lock detection per-process for this reason.

The Chips and Cheese investigation adds useful empirical grounding to the theoretical picture, particularly in comparing Intel and AMD’s microarchitectural handling in detail. The hardware behavior has been documented in Intel’s SDM for decades, but seeing cycle-accurate measurements across several microarchitectures makes the performance characteristics concrete rather than theoretical.

The broader lesson is about what cache coherency protocols actually guarantee. MESI is elegant and fast for single-cache-line operations, but the moment you cross a line boundary with an atomic, you’ve left the coherency layer entirely. The x86 LOCK prefix was part of the ISA before modern cache coherency existed, and the bus lock fallback is the seam between two eras of processor design, one that still shows up in production code forty years later.