
The Bus Lock Hangover: What Split Locks Reveal About x86 Atomics

Source: lobsters

The LOCK prefix on x86 is one of those things that looks simple from the outside. You slap it in front of an instruction, and the operation becomes atomic. The hardware handles the rest. But the “rest” has changed dramatically over the decades, and there is a narrow class of accesses where modern CPUs fall back to behavior that would be familiar to a programmer from the early 1980s: asserting a physical lock signal on the memory bus and stalling every other processor on the system until the operation completes.

That fallback is called a split lock, and a recent investigation on Chips and Cheese digs into how it actually behaves on modern x86-64 silicon. The measurements are striking, but to appreciate why the numbers look the way they do, it helps to understand what the CPU is doing and why.

From Bus Pins to Cache Lines

In the original 8086 and through much of the x86 lineage, the LOCK prefix was straightforward in a brutal way: it caused the CPU to assert the LOCK# pin on the front-side bus, which prevented any other bus master from accessing memory until the operation completed. Every locked instruction was, in effect, a system-wide pause.

This worked fine when systems had one CPU and the bus was the only path to memory. As systems grew to multiple CPUs and the memory hierarchy deepened, bus-level locking became a significant bottleneck. The solution that emerged was cache-level locking, made possible by the cache coherence protocols (MESI and its extensions) that multi-processor x86 systems adopted.

With cache locking, a LOCK-prefixed instruction that operates on a single cache line does not need to assert LOCK# at all. The processor can lock the relevant cache line within its own L1 cache, perform the read-modify-write atomically, and let the coherence protocol propagate the result. The Intel Software Developer’s Manual describes this as follows: “For the Intel486 and Pentium processors, the LOCK# signal is always asserted on the bus during a LOCK operation, even if the area of memory being locked is cached in the processor.” Starting with the P6 family, Intel shifted this: if the memory being locked is cached and wholly within a cache line, the processor uses cache locking and does not assert LOCK#.

The critical phrase is “wholly within a cache line.” x86 cache lines are 64 bytes. If a locked access fits inside one 64-byte-aligned block, the processor can handle it efficiently via the coherence protocol. If the access spans two cache lines, the processor has no choice: it reverts to the old bus-locking mechanism.

What Makes a Split Lock

A split lock occurs when a LOCK-prefixed memory operation straddles a cache line boundary. Consider a 4-byte locked read-modify-write at address 0x7ffc000000003e. The low six bits of that address are 0x3e (62), so the access starts 2 bytes before the end of one 64-byte cache line and extends 2 bytes into the next. The CPU cannot lock two separate cache lines atomically without going through the bus. So it asserts LOCK# (or the equivalent mechanism in modern systems without a physical front-side bus), and the entire memory subsystem waits.
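The arithmetic behind that example is easy to check mechanically. Here is a minimal sketch, assuming 64-byte cache lines; the names bytes_in_first_line and crosses_line are illustrative and do not come from any library.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 64u  /* assumed line size */

/* Bytes of a `size`-byte access at `addr` that fall in the cache line
 * containing `addr`. Whatever is left over spills into the next line. */
size_t bytes_in_first_line(uint64_t addr, size_t size) {
    assert(size > 0);  /* a zero-length access is meaningless here */
    size_t room = CACHE_LINE - (size_t)(addr % CACHE_LINE);
    return size < room ? size : room;
}

/* True when the access touches two cache lines. */
bool crosses_line(uint64_t addr, size_t size) {
    return bytes_in_first_line(addr, size) < size;
}
```

For the address above, bytes_in_first_line(0x7ffc000000003e, 4) is 2: two bytes sit in one line and two in the next, so crosses_line reports true.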

This can happen innocently in C:

#include <stdatomic.h>
#include <stdint.h>

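// Without packed, the compiler would pad counter up to offset 64.
// packed (a GCC/Clang extension) keeps it at offset 62, and aligned(64)
// starts the struct on a cache line boundary.
struct __attribute__((packed, aligned(64))) BadLayout {
    char padding[62];  // pushes the next field across a cache line
    _Atomic uint32_t counter;
};

// counter spans bytes 62-65, crossing the 64-byte boundary at offset 64.
// Any locked read-modify-write on counter is a split lock.
void increment(struct BadLayout *s) {
    atomic_fetch_add(&s->counter, 1);
}

The compiler has no obligation to warn you about this. Left to its own devices it would insert two bytes of padding and place counter at offset 64, which is why split locks in practice come from packed layouts, pointer casts, and hand-computed offsets rather than from ordinary declarations. Once the padding is suppressed, the layout is accepted and the atomic operation will succeed. It will just be catastrophically slow, both for the thread that issues it and for every other core on the system.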

You can check alignment at runtime:

#include <assert.h>
#include <stddef.h>
#include <stdint.h>

// size must be nonzero; 64-byte cache lines are assumed.
void check_no_split_lock(const void *addr, size_t size) {
    uintptr_t start = (uintptr_t)addr;
    uintptr_t end   = start + size - 1;
    // Both start and end must fall within the same 64-byte cache line.
    assert((start >> 6) == (end >> 6) &&
           "atomic access spans cache line boundary: split lock risk");
}
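For struct members, the same test can often be moved to compile time. The sketch below is one way to do it, assuming the struct itself starts on a 64-byte boundary; MEMBER_IN_ONE_LINE, GoodLayout, and SplitLayout are hypothetical names used only for illustration, and the packed attribute is a GCC/Clang extension.

```c
#include <assert.h>   /* static_assert (C11) */
#include <stddef.h>   /* offsetof */
#include <stdint.h>

/* Nonzero if `member` of `type` lies within one 64-byte line, measured
 * from the start of the struct. Only meaningful when the struct itself
 * is 64-byte aligned. */
#define MEMBER_IN_ONE_LINE(type, member)                                  \
    ((offsetof(type, member) / 64) ==                                     \
     ((offsetof(type, member) + sizeof(((type *)0)->member) - 1) / 64))

struct GoodLayout {
    _Alignas(64) char tag[60];
    uint32_t counter;              /* offset 60, bytes 60-63 */
};

/* Illustration of the failing case: counter lands at offset 62. */
struct __attribute__((packed)) SplitLayout {
    char pad[62];
    uint32_t counter;              /* bytes 62-65 */
};

/* Fails the build if a later edit pushes counter across a line. */
static_assert(MEMBER_IN_ONE_LINE(struct GoodLayout, counter),
              "counter straddles a cache line");
```

The same static_assert applied to SplitLayout would stop the build, which is the point: the layout mistake is caught before any locked instruction runs.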

At the assembly level, a split lock looks identical to any other locked access. The LOCK CMPXCHG instruction does not carry an “I am a split lock” annotation. The difference is entirely in the relationship between the address and the cache line boundaries, which the CPU determines at execution time.

The Performance Gap

The numbers from empirical measurement are not subtle. A well-aligned locked compare-exchange on a modern Intel core costs roughly 10 to 40 cycles when there is no contention and the data is in L1 cache. Under contention, the cost rises as cores spin against each other, but the coherence protocol manages this gracefully.

A split lock changes the picture entirely. Because the CPU must lock the bus, the operation serializes against every other CPU on the system, not just those competing for the same cache line. Measurements on modern Intel CPUs show split lock latencies in the range of several hundred to several thousand cycles, depending on system topology. On a multi-socket server, the penalty is higher still because the lock has to propagate across socket interconnects.

More importantly, the impact is not local. Every other CPU that attempts a memory access while a split lock is in progress must wait. A single thread doing split locks at high frequency can measurably degrade throughput across the entire system. In a cloud environment where many virtual machines share a physical host, this is a significant problem: one misbehaving or malicious guest can impose latency on unrelated workloads.
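To see the gap on your own hardware, a rough timing loop is enough. This is a sketch under stated assumptions, not the methodology used in the Chips and Cheese article: it relies on the GCC/Clang __atomic_fetch_add builtin, it measures CPU time with clock(), and the function name ns_per_locked_add is made up for this example. Passing a misaligned offset is deliberately outside what ISO C defines and is only meaningful on x86.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Average CPU nanoseconds per locked add on the 4 bytes at base + offset.
 * The caller picks the offset: 0 into a 64-byte-aligned buffer is the
 * cache-locked fast path, 62 makes every iteration a split lock on x86. */
double ns_per_locked_add(void *base, size_t offset, long iters) {
    assert(iters > 0);
    uint32_t *p = (uint32_t *)((unsigned char *)base + offset);
    clock_t start = clock();
    for (long i = 0; i < iters; i++) {
        /* GCC/Clang builtin; a LOCK-prefixed add on x86. */
        __atomic_fetch_add(p, 1, __ATOMIC_SEQ_CST);
    }
    clock_t end = clock();
    return (double)(end - start) * 1e9 / CLOCKS_PER_SEC / (double)iters;
}
```

Calling it twice on a buffer from aligned_alloc(64, 128), once with offset 0 and once with offset 62, gives the two numbers to compare. Be careful where you run the second call: on a kernel booted with split_lock_detect=fatal the process will be killed, and in the default mode it will show up in the kernel log.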

How the Linux Kernel Responded

For most of x86’s history, split locks were a performance antipattern but not something the kernel could detect or prevent. That changed in Linux 5.7, which added split lock detection for CPUs that advertise it in the IA32_CORE_CAPABILITIES MSR (MSR index 0xCF). When bit 5 of that MSR is set, the CPU supports split lock detection: once the kernel enables it, a split lock attempt raises an #AC (Alignment Check) fault instead of silently proceeding with the slow path.
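On a running Linux system you do not need to read the MSR yourself to find out whether the feature is present. A minimal sketch, assuming the kernel lists the capability as split_lock_detect in the flags line of /proc/cpuinfo (bus lock detection, covered later, appears as bus_lock_detect); the function names are illustrative.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* True if `flag` appears as a whole, space-delimited word in `line`. */
bool flags_line_has(const char *line, const char *flag) {
    assert(line != NULL && flag != NULL);
    size_t n = strlen(flag);
    if (n == 0) return false;
    for (const char *p = strstr(line, flag); p != NULL;
         p = strstr(p + 1, flag)) {
        bool starts = (p == line) || p[-1] == ' ' || p[-1] == '\t';
        bool ends = p[n] == '\0' || p[n] == ' ' || p[n] == '\n';
        if (starts && ends) return true;
    }
    return false;
}

/* Returns 1 if the first "flags" line of /proc/cpuinfo contains `flag`,
 * 0 if it does not, and -1 if the file cannot be opened.
 * Assumes the flags line fits in the 8 KiB buffer. */
int cpu_reports_flag(const char *flag) {
    FILE *f = fopen("/proc/cpuinfo", "r");
    if (f == NULL) return -1;
    char line[8192];
    int found = 0;
    while (fgets(line, sizeof line, f) != NULL) {
        if (strncmp(line, "flags", 5) == 0) {
            found = flags_line_has(line, flag) ? 1 : 0;
            break;
        }
    }
    fclose(f);
    return found;
}
```

A result of 0 from cpu_reports_flag("split_lock_detect") means either the CPU lacks the capability or the kernel was told not to use it.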

The kernel exposes control over this via the split_lock_detect boot parameter, with several modes:

  • off: detection disabled, split locks proceed silently
  • warn: #AC is raised, the kernel prints a warning with a rate limit, and the process continues
  • fatal: #AC is raised, and the kernel sends SIGBUS to the offending process
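  • ratelimit:N: caps bus locks system-wide at N per second (0 < N ≤ 1000); note the colon rather than an equals sign. This mode applies only to bus lock detection, described below, not to #AC split lock detection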
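To find out which mode a machine was actually booted with, you can look for the parameter in /proc/cmdline. The following is a small sketch of the parsing step only; cmdline_value is a hypothetical helper, it returns the first match, and it ignores quoted values that contain spaces.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Copy the value of "key=value" from a kernel command line into out.
 * Returns false if the key is absent or the value does not fit. */
bool cmdline_value(const char *cmdline, const char *key,
                   char *out, size_t out_len) {
    assert(cmdline != NULL && key != NULL && out != NULL);
    size_t klen = strlen(key);
    const char *p = cmdline;
    while (*p != '\0') {
        while (*p == ' ' || *p == '\n') p++;            /* skip separators */
        const char *tok = p;
        while (*p != '\0' && *p != ' ' && *p != '\n') p++;  /* token end */
        size_t tlen = (size_t)(p - tok);
        if (tlen > klen && strncmp(tok, key, klen) == 0 && tok[klen] == '=') {
            size_t vlen = tlen - klen - 1;
            if (vlen >= out_len) return false;
            memcpy(out, tok + klen + 1, vlen);
            out[vlen] = '\0';
            return true;
        }
    }
    return false;
}
```

Read /proc/cmdline into a buffer and pass it in with the key split_lock_detect. If the function returns false, the parameter was not given and the kernel's default applies.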

The default as of recent kernels is warn, which is conservative but at least surfaces the problem. You can see split lock events in the kernel log:

[12345.678] x86/split lock detection: #AC: foo/1234 took a split_lock trap at address: 0x...

This is separate from, but related to, bus lock detection. “Bus locks” in the kernel’s terminology covers a slightly broader category: locked accesses that cannot be handled by cache locking for any reason, not just cache line boundary crossings. Split locks are the most common source of bus locks, but not the only one; a locked access to uncacheable memory is another. Kernels 5.13 and later handle this broader category on CPUs that support bus lock detection, which raises a #DB trap after the locked instruction has executed. It is controlled through the same split_lock_detect parameter, with similar warn/fatal semantics.

For virtualization hosts, the relevant tool is the split_lock_detect=fatal setting, which prevents guest VMs from imposing split lock overhead on the host. KVM exposes split lock detection to guests as well, so a properly configured stack can both protect the host and signal the guest that its code needs fixing.

AMD’s Position

AMD’s handling of split locks differs from Intel’s in some microarchitectural respects, though the architectural behavior (bus lock fallback) is the same. AMD Zen and later cores do not expose the same IA32_CORE_CAPABILITIES MSR bit for split lock detection. AMD added their own detection mechanism in later Zen generations, controlled through different means.

The practical consequence is that the Linux kernel’s split lock detection code has to check CPUID and MSR availability before enabling the feature, and the support matrix across CPU generations is somewhat fragmented. The kernel documentation for bus lock detection covers the specifics.

Avoiding Split Locks in Practice

The standard guidance is to keep atomic variables naturally aligned, meaning aligned to their own size. Mainstream C11 and C++ implementations give _Atomic and std::atomic objects of lock-free sizes that alignment by default, and a naturally aligned object of 64 bytes or less can never cross a 64-byte boundary. The problems arise from manual struct layout, packed structures, and atomics over externally-provided buffers.

The __attribute__((packed)) attribute in GCC and Clang is a common source of split locks. Packed structs disable the compiler’s normal alignment padding, which can easily push members across cache line boundaries:

// Dangerous: packed removes alignment guarantees
struct __attribute__((packed)) PackedHeader {
    char magic[3];
    _Atomic uint32_t seq;  // at offset 3: not 4-byte aligned
};

// Safer: drop packed and make the padding explicit
struct PaddedHeader {
    char magic[3];
    char _pad;
    _Atomic uint32_t seq;  // at offset 4: 4-byte aligned
};

For performance-critical code where false sharing is a concern, aligning atomics to a full cache line eliminates split locks and reduces coherence traffic simultaneously:

// Align to cache line, preventing both split locks and false sharing
typedef _Atomic uint64_t aligned_counter __attribute__((aligned(64)));
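If you would rather avoid the compiler-specific attribute, standard C11 offers _Alignas. Below is a minimal sketch of a per-thread counter array built that way, assuming 64-byte lines; padded_counter, per_thread, NSLOTS, bump, and total are all hypothetical names for illustration.

```c
#include <assert.h>     /* static_assert (C11) */
#include <stdatomic.h>
#include <stdint.h>

#define CACHE_LINE 64   /* assumed line size */
#define NSLOTS 8        /* hypothetical number of threads */

/* _Alignas raises the struct's alignment to 64, and the compiler pads
 * its size to match, so each array element owns a whole cache line. */
struct padded_counter {
    _Alignas(CACHE_LINE) _Atomic uint64_t value;
};

static_assert(sizeof(struct padded_counter) == CACHE_LINE,
              "one counter per cache line");
static_assert(_Alignof(struct padded_counter) == CACHE_LINE,
              "counters start on a line boundary");

static struct padded_counter per_thread[NSLOTS];

/* Each thread increments only its own slot: no split lock, no false sharing. */
void bump(unsigned slot) {
    atomic_fetch_add_explicit(&per_thread[slot % NSLOTS].value, 1,
                              memory_order_relaxed);
}

/* Sum the slots; the result is approximate while writers are running. */
uint64_t total(void) {
    uint64_t sum = 0;
    for (unsigned i = 0; i < NSLOTS; i++)
        sum += atomic_load_explicit(&per_thread[i].value, memory_order_relaxed);
    return sum;
}
```

The cost is memory: eight counters occupy 512 bytes instead of 64. That trade is usually worth making only for counters that are written from several cores at once.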

On the tooling side, AddressSanitizer (-fsanitize=address) does not detect split locks, but there are Valgrind plugins and hardware performance counter events that can surface them. On Linux, perf stat with the bus-cycles event can give an indirect signal, and on Intel platforms with Processor Trace (PT), split lock events can be correlated to specific instruction addresses.

Why This Keeps Coming Up

Split locks are decades old as a problem, but they resurface repeatedly because the gap between “the code is correct” and “the code is fast” is invisible at the source level. Atomic operations look uniform in C and C++. The compiler does not annotate which ones will become bus locks. The performance difference can be a factor of 100 or more in the worst case, but it tends to go unnoticed until the split access becomes hot or the machine is under load, which means it often passes testing and emerges in production.

The Chips and Cheese investigation is valuable precisely because it quantifies this across different microarchitectures. The penalty is not uniform. How quickly a CPU can release the bus lock, how the interconnect topology affects propagation time, and whether the CPU’s prefetch and memory ordering machinery stalls during the lock all vary by generation. Knowing the rough penalty shape on Alder Lake versus Zen 3 versus a multi-socket Xeon is useful context for anyone doing performance engineering on code that touches shared memory.

The LOCK prefix is not magic. It is a contract with the hardware, and the hardware has kept fulfilling that contract through forty years of microarchitectural change. Split locks are the seam where that continuity becomes expensive, and understanding the seam is part of understanding x86 atomics at any depth.
