Split Locks on x86: Why a Misaligned Atomic Can Stop Every Core

Split locks are one of those x86 behaviors that most concurrent code never triggers, but when it does, the consequences are severe enough to warrant operating system intervention. The Chips and Cheese investigation covers the hardware side in detail, measuring how specific microarchitectures behave under split lock conditions. What is worth exploring further is why this mechanism exists at all, how modern hardware evolved to detect it, and what the software stack does in response.

Cache Lines and the MESI Protocol

A cache line on modern x86-64 processors is 64 bytes. The processor moves memory between RAM and caches in these 64-byte chunks, and the MESI cache coherence protocol operates at cache line granularity. When you execute a LOCK ADD or LOCK CMPXCHG, the processor needs to guarantee that no other agent in the system can observe an intermediate state for the operation. For an operand that fits within a single cache line, this is handled at the cache level: the processor obtains exclusive ownership of the cache line (the Exclusive or Modified state in MESI), holds it for the duration of the read-modify-write, and then releases it. No other core can observe a partial write because any competing request for that line is held off until the operation completes.

This breaks down when the operand spans two cache lines. A 64-bit integer placed at offset 0x3C (60 bytes) within a cache line occupies bytes 0x3C through 0x43 and crosses the boundary at 0x40. (At offset 0x38 it would end at 0x3F and still fit within the line.) The processor cannot atomically own two cache lines through the normal coherence mechanism because MESI operates on one line at a time. Instead, it falls back to asserting a bus lock, a hardware signal that predates modern cache coherence entirely and traces back to the original 8086. While the lock is held, every other processor in the system is blocked from performing memory operations.
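The boundary condition can be written as a small predicate. The following is a minimal sketch, assuming a 64-byte line; the name crosses_cache_line and the CACHE_LINE_SIZE constant are illustrative and not taken from any particular library.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: 64-byte cache lines, as on current x86-64 parts. */
#define CACHE_LINE_SIZE 64u

/* Returns true if the size-byte operand starting at addr touches two
 * different cache lines. size must be at least 1. */
static bool crosses_cache_line(uintptr_t addr, size_t size)
{
    uintptr_t first_line = addr / CACHE_LINE_SIZE;              /* line of the first byte */
    uintptr_t last_line  = (addr + size - 1) / CACHE_LINE_SIZE; /* line of the last byte  */
    return first_line != last_line;
}
```

For an 8-byte operand, crosses_cache_line(0x3C, 8) is true, while crosses_cache_line(0x38, 8) is false because the operand ends exactly at byte 0x3F.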

The System-Wide Cost

The performance cost of a split lock is not simply that one operation runs slowly. The bus lock blocks memory accesses from every core for its duration, not just from cores touching nearby memory. A single misaligned atomic operation in one thread can therefore stall every other thread on the machine until that operation completes.

Measured penalties vary by microarchitecture. On various Intel parts, a split lock can cost several hundred to over a thousand cycles, compared with a few tens of cycles at most for an equivalent uncontended aligned atomic. Beyond the operation itself, the stall inflicted on other cores is non-deterministic, depending on what those cores were doing when the bus lock was asserted. For real-time workloads or latency-sensitive systems this is not merely a performance concern; it becomes a correctness concern. A guest VM performing split locks can cause latency spikes on entirely unrelated workloads running on the same host, because the bus lock has no concept of isolation boundaries.

This system-wide effect is why the feature attracted attention from the kernel and hypervisor communities rather than just CPU architects and compiler writers.

How Split Locks Appear in Practice

The most common source is misaligned data combined with atomic operations. The packed attribute in C and C++ removes struct padding and is a frequent culprit:

struct __attribute__((packed)) record {
    uint8_t  flags;
    uint64_t timestamp;  // offset 1 from struct start, naturally misaligned
    uint32_t id;
};

If timestamp is accessed atomically and the struct happens to sit at an unfortunate position in memory, a split lock can result. The fix is to restore natural alignment:
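To make "an unfortunate position" concrete, the sketch below counts how many of the 64 possible starting offsets within a cache line would leave timestamp straddling a boundary. It assumes 64-byte lines and a compiler that accepts the GCC/Clang packed attribute; split_positions is a hypothetical helper name.

```c
#include <stddef.h>
#include <stdint.h>

struct __attribute__((packed)) record {
    uint8_t  flags;
    uint64_t timestamp;  /* offset 1 because padding is removed */
    uint32_t id;
};

/* Assumption: 64-byte cache lines. */
#define CACHE_LINE_SIZE 64u

/* Counts the struct start offsets (0..63 within a line) for which the
 * timestamp field spans two cache lines. */
static unsigned split_positions(void)
{
    unsigned count = 0;
    for (size_t base = 0; base < CACHE_LINE_SIZE; base++) {
        size_t first = base + offsetof(struct record, timestamp);
        size_t last  = first + sizeof(uint64_t) - 1;
        if (first / CACHE_LINE_SIZE != last / CACHE_LINE_SIZE)
            count++;
    }
    return count;
}
```

With the field at offset 1, the struct start offsets 56 through 62 put the field across the boundary, so the function returns 7. An allocator that hands out byte-granular positions inside a buffer would hit one of them roughly one time in nine.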

struct record {
    uint8_t  flags;
    uint8_t  _pad[7];    // padding to align timestamp to 8 bytes
    uint64_t timestamp;
    uint32_t id;
};

Using _Alignas in C11 (spelled alignas in C++11, and available under that name in C through <stdalign.h>) makes the intent cleaner and lets the compiler manage the padding:

typedef struct {
    uint8_t      flags;
    _Alignas(8)  uint64_t timestamp;
    uint32_t     id;
} record;

The _Alignas specifier is generally preferable to manual padding fields because it documents intent explicitly and survives struct member reordering during refactoring.
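The alignment requirement can also be enforced at compile time, so that a later refactor that reintroduces the problem fails the build. This is a sketch using C11 _Static_assert on the same illustrative record type; the assertion messages are placeholders.

```c
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint8_t      flags;
    _Alignas(8)  uint64_t timestamp;
    uint32_t     id;
} record;

/* An 8-byte field at an 8-byte-aligned address can never straddle a
 * 64-byte cache line, because 64 is a multiple of 8. */
_Static_assert(offsetof(record, timestamp) % 8 == 0,
               "timestamp must sit at an 8-byte-aligned offset");
_Static_assert(_Alignof(record) >= 8,
               "record itself must be at least 8-byte aligned");
```

Both conditions matter: the field offset guarantees alignment relative to the struct, and the struct alignment guarantees that any properly allocated record starts on an 8-byte boundary.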

A less obvious but significant source is CPU emulation. When QEMU emulates a non-x86 guest architecture, it must faithfully implement that architecture’s memory model, including atomic operations on potentially misaligned addresses if the guest ABI permits them. On x86 hosts, those emulated atomics become LOCK-prefixed instructions, and if the guest data crosses a cache line boundary, the host CPU performs a split lock. Wine faces the same issue with certain Windows programs that rely on relaxed alignment guarantees from the Windows x86 ABI. This is one reason split lock detection matters particularly in virtualization contexts: a single misbehaving guest can degrade latency across an entire host machine.

JIT compilers that synthesize LOCK-prefixed instructions at runtime, such as those in managed runtimes implementing compare-and-swap for object headers, can also generate split locks if they place objects without sufficient alignment guarantees.
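One way an emulator or runtime can avoid emitting a split lock is to test the host address before choosing how to perform the operation. The sketch below is hypothetical and is not how QEMU, Wine, or any specific runtime does it: naturally aligned operands take the hardware atomic (using the GCC/Clang __atomic_fetch_add builtin), and everything else goes through a global spinlock. The fallback is only atomic with respect to other callers of the same helper, which is an assumption a real implementation would have to enforce.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical global lock serializing all misaligned accesses. */
static atomic_flag slow_path_lock = ATOMIC_FLAG_INIT;

/* Adds delta to the 64-bit value at host_addr and returns the old value.
 * *used_slow_path reports which path was taken. */
static uint64_t emu_fetch_add_u64(void *host_addr, uint64_t delta,
                                  bool *used_slow_path)
{
    /* Naturally aligned: cannot cross a cache line, safe to use LOCK. */
    if ((uintptr_t)host_addr % sizeof(uint64_t) == 0) {
        *used_slow_path = false;
        return __atomic_fetch_add((uint64_t *)host_addr, delta,
                                  __ATOMIC_SEQ_CST);
    }

    /* Misaligned: avoid a LOCK-prefixed access and serialize instead. */
    *used_slow_path = true;
    while (atomic_flag_test_and_set_explicit(&slow_path_lock,
                                             memory_order_acquire))
        ; /* spin until the lock is free */

    uint64_t old_val, new_val;
    memcpy(&old_val, host_addr, sizeof old_val);
    new_val = old_val + delta;
    memcpy(host_addr, &new_val, sizeof new_val);

    atomic_flag_clear_explicit(&slow_path_lock, memory_order_release);
    return old_val;
}
```

The check is deliberately stricter than necessary: it sends every misaligned operand to the slow path, including those that happen to fit within one line, which keeps the fast path free of misaligned pointer dereferences.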

Hardware Detection in Modern Intel CPUs

For most of x86’s history, split locks were silent. They happened, they were slow, and software had no way to know unless someone instrumented the code or noticed the latency. Intel changed this with split lock detection hardware introduced in Tremont, the Atom-based microarchitecture released in 2019, and carried into Tiger Lake and later client parts.

The mechanism reuses the #AC (alignment check) exception. Classically, #AC is raised on misaligned data accesses in user mode when both the AC flag in EFLAGS and the AM bit in CR0 are set, which serves as a debugging tool. Intel extended this: when split lock detection is enabled via a model-specific register, a LOCK-prefixed instruction with an operand crossing a cache line boundary raises #AC before asserting the bus lock, independent of the EFLAGS.AC setting. The operating system catches the fault and can respond without the system-wide stall actually occurring.
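For reference, the register layout involved is small enough to sketch. The constants below are believed to match Intel's documentation and the definitions in Linux's msr-index.h (MSR_TEST_CTRL at 0x33 with the enable bit at position 29, and IA32_CORE_CAPABILITIES at 0xCF with the enumeration bit at position 5), but they should be verified against the SDM before use. The helper names are hypothetical, and actually reading or writing an MSR requires ring 0, which this sketch does not attempt.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed register numbers and bit positions; verify against the SDM. */
#define MSR_IA32_CORE_CAPABILITIES   0xCFu
#define CORE_CAPS_SPLIT_LOCK_DETECT  (UINT64_C(1) << 5)
#define MSR_TEST_CTRL                0x33u
#define TEST_CTRL_SPLIT_LOCK_DETECT  (UINT64_C(1) << 29)

/* Given the value read from IA32_CORE_CAPABILITIES, reports whether
 * split lock detection is enumerated. */
static bool sld_enumerated(uint64_t core_caps)
{
    return (core_caps & CORE_CAPS_SPLIT_LOCK_DETECT) != 0;
}

/* Given the current MSR_TEST_CTRL value, returns the value to write back
 * with detection switched on or off, leaving all other bits untouched. */
static uint64_t sld_update_test_ctrl(uint64_t test_ctrl, bool enable)
{
    if (enable)
        return test_ctrl | TEST_CTRL_SPLIT_LOCK_DETECT;
    return test_ctrl & ~TEST_CTRL_SPLIT_LOCK_DETECT;
}
```

Preserving the other bits in the read-modify-write matters because the same register carries unrelated controls.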

For virtualization, Intel added a bus lock VM exit control to VMX. When this control is set in the VMCS, a bus lock inside a guest causes a VM exit. The exit is delivered after the locked instruction has completed, so it does not prevent that particular bus lock; instead it notifies the hypervisor, which can then throttle the vCPU, log the event, or apply some other policy. KVM supports this and exposes it to userspace via the KVM_CAP_X86_BUS_LOCK_EXIT capability, giving hypervisors the ability to detect and penalize guests that generate split locks.

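AMD processors do not implement the #AC-based split lock detection described above. Split locks on AMD still incur the bus lock penalty, and there is no fault the OS can intercept before the lock is asserted. More recent AMD parts do offer bus lock detection, which reports the event through a #DB trap after the locked instruction has executed, so the OS can identify and throttle offenders but cannot prevent the stall. This difference in detection capability, rather than the underlying split lock behavior itself, is one of the more practically significant divergences between the two vendors' implementations today.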

Linux Kernel Support

Linux added split lock detection support, controlled by the split_lock_detect boot parameter and documented at kernel.org. Three modes are available:

off: detection disabled, split locks proceed with their normal bus lock penalty
warn: the kernel logs a rate-limited warning on the first split lock per task and allows the operation to continue
fatal: the kernel sends SIGBUS to the offending process

The implementation enables detection by setting the split lock detect bit (bit 29) in the MSR_TEST_CTRL model-specific register on CPUs that support the feature; it does not rely on the AC flag in EFLAGS. When user code triggers a split lock, the #AC handler fires, the kernel identifies it as a split lock fault rather than a standard misalignment fault, and responds according to the configured mode.

The warn mode is particularly useful operationally. Deployed on a production system, it surfaces split lock events in the kernel log without disrupting running processes. Once you have identified which processes generate split locks, you can make a deliberate decision about whether to enforce fatal mode or fix the alignment in the affected code.
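Before relying on warn mode, it helps to confirm that the machine has the feature at all. The sketch below assumes the kernel lists it as a split_lock_detect token in the flags line of /proc/cpuinfo; the helper name has_cpu_flag is hypothetical, and reading the file is left to the caller so the matching logic stays testable.

```c
#include <stdbool.h>
#include <string.h>

/* Returns true if flag appears as a whole whitespace-separated token in
 * flags_line, for example a "flags : fpu vme ..." line from /proc/cpuinfo. */
static bool has_cpu_flag(const char *flags_line, const char *flag)
{
    size_t n = strlen(flag);
    if (n == 0)
        return false;

    for (const char *p = strstr(flags_line, flag); p != NULL;
         p = strstr(p + 1, flag)) {
        /* Reject substring hits such as "xsplit_lock_detect". */
        bool starts = (p == flags_line) || p[-1] == ' ' || p[-1] == '\t';
        bool ends   = p[n] == '\0' || p[n] == ' ' || p[n] == '\n';
        if (starts && ends)
            return true;
    }
    return false;
}
```

A caller would read lines from /proc/cpuinfo with fgets and pass the one beginning with "flags" to has_cpu_flag(line, "split_lock_detect"). The flag indicates hardware support as seen by the kernel, not which mode is active.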

Kernel-mode split locks are handled separately and with less tolerance. A split lock in kernel code or a driver is a bug, and whenever detection is enabled the kernel treats it as a fatal error (an oops), whether the user-space policy is warn or fatal.
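The policy described in this section can be summarized as a small decision function. This is a simplified model for illustration only, not kernel code: the enum and function names are invented, and it assumes that with detection off no fault is raised at all, and that with detection on a kernel-mode split lock is always fatal.

```c
#include <stdbool.h>

/* Hypothetical names modelling the split_lock_detect= settings. */
enum sld_mode   { SLD_OFF, SLD_WARN, SLD_FATAL };

/* Hypothetical names for the outcomes described in the text. */
enum sld_action {
    ACT_NONE,              /* no fault raised, bus lock proceeds      */
    ACT_WARN_AND_CONTINUE, /* log a warning, let the task continue    */
    ACT_SIGBUS,            /* deliver SIGBUS to the offending process */
    ACT_KERNEL_OOPS        /* split lock in kernel mode: treated as a bug */
};

/* Maps the configured mode and the origin of the split lock to an action. */
static enum sld_action sld_policy(enum sld_mode mode, bool from_user)
{
    if (mode == SLD_OFF)
        return ACT_NONE;            /* detection disabled: nothing fires */
    if (!from_user)
        return ACT_KERNEL_OOPS;     /* kernel code gets no tolerance */
    return (mode == SLD_WARN) ? ACT_WARN_AND_CONTINUE : ACT_SIGBUS;
}
```

Laying it out this way makes the asymmetry visible: the boot parameter only selects between outcomes for user-space offenders.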

The Broader Pattern

Split locks went from a theoretical concern to an operational one as modern Intel processors gained the hardware to expose them. The Linux kernel now ships with detection support on capable hardware, and hypervisors can use the bus lock VM exit to protect host latency from guest-induced split locks.

This reflects a broader pattern in x86: behaviors that were always technically present but rarely encountered in practice become worth instrumenting once the hardware gets fast enough that the exceptions stand out clearly, and once the deployment environments (dense multi-tenant servers, real-time systems) become sensitive enough to care. The bus lock behavior of split locks existed in the 8086. It took forty years and modern cloud infrastructure to make it worth building a kernel subsystem around.

For application developers, the practical guidance is straightforward: do not use packed structs for data accessed atomically, align atomic variables to their natural alignment, and be aware that the position of an atomic within a struct matters if the struct itself may not be cache-line aligned. For anyone writing emulation, JIT compilers, or FFI layers that synthesize LOCK-prefixed instructions at runtime, validating alignment before issuing those instructions is the only reliable way to avoid split locks.

The Chips and Cheese investigation shows what split locks look like at the microarchitecture level, with latency numbers and cache hierarchy behavior across different CPU generations. Pairing that hardware-level perspective with the OS-level detection and handling described here gives a more complete picture of a behavior that is quiet in well-aligned code and consequential when it is not.