The Bus Lock Hangover: How Misaligned Atomics Can Stall Every Core on Your Machine
Source: lobsters
There is a class of performance bug on x86-64 that is nearly impossible to detect during development, trivial to produce accidentally, and capable of serializing every CPU on a server simultaneously. Split locks combine all of those properties, sitting at the intersection of LOCK prefix semantics, cache line geometry, and three decades of backward compatibility obligations; they only became detectable by the operating system in 2019.
The Chips and Cheese investigation digs into the observable hardware behavior of split locks across CPU generations. What their measurements reveal is worth understanding at the mechanism level, because the numbers make more sense once you know what the hardware is actually doing under the hood.
Cache Locking and Its Limits
The x86 LOCK prefix appears on instructions like lock xadd, lock cmpxchg, and lock add. When your compiler lowers __atomic_fetch_add() for an x86 target, it generates one of these. The prefix instructs the CPU to guarantee that the read-modify-write is atomic: no other agent observes a partial state.
The fast path for satisfying this guarantee is cache locking. If the operand lies entirely within a single 64-byte cache line and the CPU holds that line in Modified or Exclusive state under the MESI coherence protocol, the processor can perform the entire atomic operation locally. The line is logically locked for the duration, no external coordination is needed, and the overhead is modest: on the order of 10 to 40 nanoseconds under contention.
Split locks break this path. They occur when a LOCK-prefixed instruction’s operand straddles a cache line boundary. If a 64-bit atomic add targets an address at offset 60 within a cache line, four bytes fall in one line and four bytes fall in the next. The cache locking mechanism covers a single line; it cannot guarantee atomicity across two simultaneously. The CPU falls back to a bus lock.
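/* Deliberately producing a split lock (x86-64; the misaligned cast is undefined behavior in ISO C): */
#include <stdint.h>
#include <stdlib.h>
void split_lock_once(void) {
    char *buf = aligned_alloc(64, 128); /* two cache lines; assumes the allocation succeeds */
    /* bytes 60-67 cross the 64-byte cache line boundary at offset 64 */
    uint64_t *p = (uint64_t *)(buf + 60);
    __atomic_fetch_add(p, 1, __ATOMIC_SEQ_CST);
    free(buf);
}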
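The straddling condition described above reduces to a comparison of cache line indices. The following is a minimal sketch, assuming a 64-byte line size; the name crosses_cache_line and the CACHE_LINE_SIZE constant are illustrative and not taken from any library.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE_SIZE 64u /* assumption: 64-byte lines, as on current x86-64 parts */

/* Returns true if the `size`-byte operand starting at `addr` touches two
 * cache lines, i.e. its first and last bytes fall in different lines. */
static inline bool crosses_cache_line(uintptr_t addr, size_t size)
{
    if (size == 0)
        return false;
    uintptr_t first_line = addr / CACHE_LINE_SIZE;
    uintptr_t last_line = (addr + size - 1) / CACHE_LINE_SIZE;
    return first_line != last_line;
}
```

For the example above, an 8-byte operand at offset 60 of a 64-byte-aligned buffer crosses a line, while the same operand at offset 56 or 64 does not. A debug build could assert on this check before any locked access through a pointer of uncertain origin.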
What a Bus Lock Actually Does to the System
A bus lock, in the original x86 design, meant asserting the physical LOCK# pin on the front-side bus. Every other device on the bus, including other CPUs and DMA controllers, was prevented from issuing memory cycles until the pin was released. Modern point-to-point interconnects (Intel’s DMI and UPI links, AMD’s Infinity Fabric) implement the equivalent behavior differently, but the logical effect is identical: one CPU claims exclusive access to the memory subsystem, reads both cache lines containing the split operand, performs the modification, writes both lines back, and then releases the lock.
The overhead is not a modest multiplier over the cache-lock path. The typical range for a split-locked operation is 1,000 to 10,000 nanoseconds, compared to 10-40 for an aligned cache-locked one. Depending on which ends of those ranges you compare, that is a factor of roughly 25 to 1,000. More importantly, that overhead is not local to the thread executing the split lock. Every other CPU that attempts a memory access during the bus lock window stalls. On a 64-core server with one thread issuing split locks in a tight loop, the aggregate throughput loss across all other threads extends far beyond what the offending thread’s own cycle count suggests.
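To get a feel for the aggregate cost, here is a back-of-envelope model. It is a sketch under loud assumptions, not a measurement: bus locks never overlap, and in the worst case every other core is stalled for the full duration of each lock. The function names are hypothetical.

```c
#include <stdint.h>

#define NS_PER_SEC 1000000000ull

/* Nanoseconds per wall-clock second during which the bus lock is held,
 * assuming locks never overlap. Clamped to one full second. */
static inline uint64_t bus_locked_ns_per_sec(uint64_t locks_per_sec, uint64_t lock_ns)
{
    uint64_t total = locks_per_sec * lock_ns; /* overflow ignored for realistic inputs */
    return total > NS_PER_SEC ? NS_PER_SEC : total;
}

/* Worst-case core-nanoseconds lost per second by the *other* cores,
 * assuming each of them stalls for the entire locked window. */
static inline uint64_t stalled_core_ns_per_sec(unsigned cores, uint64_t locks_per_sec,
                                               uint64_t lock_ns)
{
    if (cores < 2)
        return 0;
    return (uint64_t)(cores - 1) * bus_locked_ns_per_sec(locks_per_sec, lock_ns);
}
```

Under this model, one thread issuing 100,000 split locks per second at 5,000 ns each holds the bus for half of every second, and on a 64-core machine the other 63 cores could lose up to 31.5 core-seconds per second between them. Real cores only stall when they actually touch memory during the window, so treat this as an upper bound.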
Virtualization sharpens this problem considerably. A guest VM can issue split locks that assert bus locks on the physical host, stalling every other VM’s memory traffic. Before kernel and hypervisor support for split lock detection, a motivated guest tenant on a shared host could use this as a sustained degradation vector against co-residents, with no mechanism for the host to detect or attribute the cause.
Why x86 Has This Problem and RISC Architectures Do Not
The root cause is ISA philosophy. The x86 LOCK prefix decouples the atomicity guarantee from alignment requirements. x86 allows misaligned memory accesses in general, and for decades it extended that permissiveness to locked accesses: the hardware would make them work, just slowly.
RISC architectures took the opposite position. ARM64’s atomics use load-exclusive/store-exclusive pairs (ldxr/stxr) or the LSE instruction set extension’s cas family; both require natural alignment enforced at the ISA level, and an unaligned attempt generates a fault. RISC-V’s lr/sc reservation mechanism has the same alignment requirement. IBM POWER’s lwarx/stwcx. instructions operate under similar constraints. These architectures moved the correctness boundary earlier: the ISA mandates alignment, and programmers either comply or receive a synchronous fault.
x86 preserved the “we will handle it regardless” contract because it was introduced for a world of single-chip 8086 processors connected to shared address buses. The LOCK# pin made sense in 1978, when the 8086 shipped. It became a backward-compatibility obligation for every subsequent generation, and the cost only became visible as core counts scaled from two to sixty-four.
Intel’s Split Lock Detection Feature
Intel added SPLIT_LOCK_DETECT support with the Tremont microarchitecture (Atom-class, 2019) and Ice Lake (Core and server-class, 2019). The feature is enumerated by bit 5 of the IA32_CORE_CAPABILITIES MSR (address 0xCF), whose presence is in turn reported by CPUID leaf 7, subleaf 0, EDX bit 30. It is enabled by setting bit 29 of MSR_TEST_CTRL (MSR address 0x33).
When enabled, a split lock raises #AC, the Alignment Check exception (interrupt vector 17), rather than silently issuing a bus lock. This behavior is distinct from the ordinary alignment check: the standard #AC fires only in ring 3 with both CR0.AM and EFLAGS.AC set, for plain misaligned accesses. The split lock #AC fires in both ring 0 and ring 3 regardless of EFLAGS.AC, and fires specifically on misaligned locked accesses, not misaligned loads or stores in general. The kernel’s exception handler has to discriminate between these two reasons for the same vector, since they arrive identically from the CPU’s perspective.
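The discrimination can be modelled as a small decision function. This is a simplified sketch, not the kernel's actual handler: the enum and function names are hypothetical, and it assumes the ordinary alignment check requires CPL 3 together with CR0.AM and EFLAGS.AC (both bit 18 of their respective registers).

```c
#include <stdbool.h>
#include <stdint.h>

#define CR0_AM    (1ull << 18) /* CR0 alignment mask bit */
#define EFLAGS_AC (1ull << 18) /* EFLAGS alignment check bit */

enum ac_cause { AC_LEGACY_ALIGNMENT, AC_SPLIT_LOCK };

/* Classify an #AC fault from the interrupted context. If the ordinary
 * alignment check could have fired, the two causes are ambiguous and this
 * sketch attributes the fault to the ordinary check. Otherwise only a
 * split lock can explain it. */
static inline enum ac_cause classify_ac(unsigned cpl, uint64_t cr0, uint64_t eflags)
{
    bool legacy_possible = cpl == 3 && (cr0 & CR0_AM) && (eflags & EFLAGS_AC);
    return legacy_possible ? AC_LEGACY_ALIGNMENT : AC_SPLIT_LOCK;
}
```

Any #AC arriving from ring 0, or from ring 3 with EFLAGS.AC clear, is classified as a split lock, because the ordinary check cannot fire in those contexts.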
AMD’s processors produce the same performance degradation for split locks but have not documented an equivalent detection mechanism. The bus lock fallback operates identically; there is simply no published MSR or CPUID bit to ask AMD hardware to raise #AC on split lock attempts.
Linux Kernel Handling
Linux added split lock detection support in kernels 5.7 and 5.8 (2020), controlled by the boot parameter split_lock_detect=. The modes are:
- off: Detection disabled.
- warn: Rate-limited dmesg output, process allowed to continue.
- fatal: SIGBUS delivered to the offending userspace process; panic if the fault occurs in kernel mode.
- ratelimit: Warn with throttling to avoid log flooding.
The implementation in arch/x86/kernel/cpu/intel.c detects the CPUID feature on boot and enables it per-core via wrmsrl(MSR_TEST_CTRL, val | MSR_TEST_CTRL_SPLIT_LOCK_DETECT). The #AC handler in arch/x86/kernel/traps.c then routes the fault to the appropriate action. If a split lock occurs in ring 0, the kernel treats it as a kernel bug and panics or warns loudly, since there is no justification for kernel code to issue misaligned locked accesses.
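The MSR update itself is a plain read-modify-write of one bit. The sketch below shows only the value computation, since reading and writing an MSR requires ring 0. The constant names mirror the ones quoted above from the kernel; the bit position (29) is stated here from memory of the kernel headers and should be checked against the Intel SDM. The helper test_ctrl_with_sld is hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

#define MSR_TEST_CTRL                       0x33
#define MSR_TEST_CTRL_SPLIT_LOCK_DETECT_BIT 29 /* assumption: verify against the SDM */
#define MSR_TEST_CTRL_SPLIT_LOCK_DETECT     (1ull << MSR_TEST_CTRL_SPLIT_LOCK_DETECT_BIT)

/* Given the current MSR_TEST_CTRL value, return the value to write back
 * with split lock detection switched on or off. All other bits are kept. */
static inline uint64_t test_ctrl_with_sld(uint64_t val, bool on)
{
    return on ? (val | MSR_TEST_CTRL_SPLIT_LOCK_DETECT)
              : (val & ~MSR_TEST_CTRL_SPLIT_LOCK_DETECT);
}
```

Preserving the other bits matters because MSR_TEST_CTRL is not dedicated to this feature, which is why the kernel line quoted above ORs the flag into the existing value rather than writing a constant.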
KVM gained the ability to expose the feature to guests and enforce it at the hypervisor level, which closed the cross-VM degradation vector. A host administrator can configure the system to inject a #AC into any guest VM that issues split locks, or to deliver SIGBUS to the host process running that guest, before the bus lock propagates to physical hardware and stalls unrelated workloads on the same machine.
Where Split Locks Come From in Practice
Under normal conditions, split locks do not appear. Standard allocators return at least 8-byte-aligned memory, typical system allocators guarantee 16-byte alignment, and the C and C++ standards require that alignof(T) is respected for all types. Struct layout inserts padding to maintain member alignment automatically.
The situations where split locks surface in real code are specific:
- Structs annotated with __attribute__((packed)) suppress alignment padding and can push any member to an arbitrary offset within the struct.
- Custom arena allocators that track allocation offsets without alignment rounding produce misaligned objects without any warning.
- Pointer arithmetic into a raw byte buffer followed by a cast to an atomic type, common in network protocol parsers and binary deserialization code.
- Interop with packed binary formats where the wire layout does not match the platform’s alignment expectations.
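The packed-struct case is easy to reproduce and to guard against at compile time. The following is a sketch with hypothetical struct names, assuming a 64-bit target where uint64_t has 8-byte natural alignment. The macro ASSERT_ATOMIC_MEMBER_ALIGNED is illustrative and not part of any standard header.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical wire format: packing removes padding, so `counter`
 * sits at offset 60 and spans bytes 60-67. */
struct __attribute__((packed)) wire_header {
    uint8_t tag[60];
    uint64_t counter;
};

/* Same fields without packing: the compiler pads `counter` to offset 64. */
struct natural_header {
    uint8_t tag[60];
    uint64_t counter;
};

/* Compile-time guard: fails the build if a member that will be accessed
 * atomically is not at a multiple of its own size within the struct. */
#define ASSERT_ATOMIC_MEMBER_ALIGNED(type, member)                        \
    _Static_assert(offsetof(type, member) % sizeof(((type *)0)->member) == 0, \
                   #member " must be naturally aligned for atomic access")

ASSERT_ATOMIC_MEMBER_ALIGNED(struct natural_header, counter);
/* ASSERT_ATOMIC_MEMBER_ALIGNED(struct wire_header, counter); would not compile */
```

The guard only checks the member's offset. The struct itself still has to be placed at a suitably aligned base address for the member's absolute address to be aligned.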
The compiler does not warn about this. Sanitizers do not catch it without custom allocator instrumentation. With split_lock_detect=fatal on a staging or CI system, the resulting SIGBUS surfaces the problem immediately; without it, the code is correct but potentially orders of magnitude slower under contention, and the cause will not appear obviously in a profiler trace. A lock that should cost 20 nanoseconds costing 5,000 nanoseconds looks like cache contention or scheduler interference, not a misalignment bug.
The Broader Pattern
Split locks are a case study in the long tail of architectural compatibility decisions. x86 preserved semantics that were appropriate for a single-core, shared-bus era, and the penalty compounded as core counts scaled. Detection arrived in 2019, roughly four decades after the LOCK# pin semantics were introduced with the 8086. The kernel infrastructure to act on detection took several more years to mature, with KVM support arriving in Linux 5.12.
The constraint for systems programmers is straightforward: use standard allocators, do not apply packed attributes to structs containing members accessed atomically, and treat pointer casts into raw byte buffers with alignment skepticism. For x86 as an architecture, the debt is already accepted: RISC ISAs that enforced alignment from the beginning never accumulated this particular liability, and the cost of maintaining the permissive x86 contract continues to be paid in silicon complexity, OS kernel code, and the occasional unexplained latency spike on a production machine.
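For code that does need a custom allocator, the alignment rounding mentioned above is a few lines. This is a minimal bump-allocator sketch, not a production design: the struct arena type and arena_alloc function are hypothetical, and align is assumed to be a nonzero power of two.

```c
#include <stddef.h>
#include <stdint.h>

struct arena {
    unsigned char *base; /* start of the backing buffer */
    size_t cap;          /* total bytes available */
    size_t used;         /* bytes handed out so far, including padding */
};

/* Allocate `size` bytes whose address is a multiple of `align`.
 * Rounds the absolute address, not just the offset, so the result is
 * aligned even if `base` itself is not. Returns NULL when out of space. */
static inline void *arena_alloc(struct arena *a, size_t size, size_t align)
{
    uintptr_t start = (uintptr_t)a->base + a->used;
    uintptr_t aligned = (start + align - 1) & ~(uintptr_t)(align - 1);
    size_t offset = (size_t)(aligned - (uintptr_t)a->base);

    if (offset > a->cap || size > a->cap - offset)
        return NULL;
    a->used = offset + size;
    return a->base + offset;
}
```

Requesting each object with its own alignment, for example arena_alloc(&a, sizeof(uint64_t), _Alignof(uint64_t)), keeps every naturally aligned atomic within a single cache line, since a power-of-two-sized object of at most 64 bytes cannot straddle a 64-byte boundary when it is aligned to its own size.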