When Your Atomic Operation Locks the Whole Machine

The LOCK prefix on x86 is one of those primitives you learn early and then mostly forget, because in ordinary circumstances it just works. You write a compare-and-swap, the compiler emits lock cmpxchg, and the hardware enforces atomicity. What the textbooks skip is that there are two entirely different mechanisms behind that guarantee, and one of them is catastrophically expensive in ways that are nearly invisible until something goes wrong at scale.

The distinction comes down to cache lines and alignment. Modern x86 processors have 64-byte cache lines. When an atomic operation targets a location that fits entirely within a single cache line, which natural alignment guarantees, the processor implements the lock using cache coherency protocols: it acquires exclusive ownership of the cache line via MESI transitions and completes the operation without touching the memory bus. This is fast. The operation is local to the cache hierarchy and does not block other cores unless they are contending on the same line.

A split lock happens when an atomic operation spans two cache lines. The classic example is a lock cmpxchg targeting an 8-byte value at offset 60 within a 64-byte cache line: bytes 60 through 63 are in the first cache line, and bytes 0 through 3 of the next cache line hold the rest. The processor cannot hold both cache lines exclusively through MESI simultaneously. The only fallback available is the old bus lock: the processor asserts the #LOCK signal on the memory bus, freezes all other memory traffic across the entire system for the duration, performs the read-modify-write, and releases. Freezing all memory traffic system-wide is what makes a split lock fundamentally different from a cache-locked atomic operation, and it is why Chips and Cheese’s investigation into the actual hardware behavior is worth studying carefully.
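The boundary condition is easy to get wrong by a few bytes, so it helps to write it down. The following is a minimal sketch, assuming 64-byte cache lines; the helper name crosses_cache_line is hypothetical and exists only for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: 64-byte cache lines, as on current x86 parts. */
#define CACHE_LINE_SIZE 64u

/* Returns true when the `size` bytes starting at `addr` straddle a
 * cache-line boundary, i.e. when a LOCK-prefixed access to that range
 * would be a split lock. Hypothetical helper, for illustration only. */
static inline bool crosses_cache_line(uintptr_t addr, size_t size)
{
    assert(size > 0 && size <= CACHE_LINE_SIZE);
    uintptr_t first_line = addr / CACHE_LINE_SIZE;
    uintptr_t last_line  = (addr + size - 1) / CACHE_LINE_SIZE;
    return first_line != last_line;
}
```

With this definition, crosses_cache_line(60, 8) is true, while the same 8-byte access at offset 64 is not a split, because it starts exactly on the next line.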

Bus Locks in a Modern Multicore System

In the era of a single front-side bus connecting all cores to one memory controller, a bus lock was expensive, but there was at least a single physical bus to lock. Modern x86 machines have per-core L1 and L2 caches, shared L3 caches, ring buses or mesh interconnects, multiple memory controllers, and often several NUMA nodes. A bus lock in this context does not merely serialize two threads contending on one address. It serializes memory operations across all cores for the duration of the locked instruction.

The penalty numbers reflect this. A properly aligned lock xadd with no contention completes in roughly 20 to 40 cycles on a modern Intel core, dominated by cache coherency overhead. A split lock executing the same logical operation can cost hundreds to thousands of cycles, depending on the microarchitecture and the contention state of the system. More importantly, that cost is not paid solely by the thread issuing the split lock. Every thread on every core that attempts a memory operation during the bus lock stalls. You pay for your misaligned atomic with everyone else’s throughput.

Intel’s Software Developer’s Manual describes the LOCK prefix semantics without distinguishing the performance cliff between these two paths. The spec says the operation is atomic; it says nothing about which mechanism the hardware uses or what the cost difference is. The ISA contract is correctness, not performance, and the bus-lock path exists precisely to satisfy that contract when the cache-lock path is unavailable.

Detection Came Later Than It Should Have

For most of x86’s history, split locks were silent. Software executed them, paid the penalty, and the overhead was essentially invisible. No hardware counter flagged them. No exception fired. The CPU just did what the spec required and moved on.

Intel added split lock detection starting with the Tremont microarchitecture and carried it forward into Tiger Lake and subsequent generations. The mechanism is bit 29 of the IA32_TEST_CTRL MSR at address 0x33. When that bit is set, executing a split-locked instruction raises an #AC (alignment check) exception instead of performing the silent bus lock. This was designed as a diagnostic feature, not default behavior, because raising exceptions for every split lock in production software would break codebases that had unknowingly relied on the silent-but-correct path for years.
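For anyone who wants to inspect that bit directly, here is a hedged sketch of the decode step plus a reader built on the Linux msr driver, which exposes MSRs as /dev/cpu/N/msr with the file offset acting as the MSR address. The macro and function names are my own, not kernel identifiers. Reading requires root and the msr module, and the read is expected to fail on CPUs that do not implement this MSR.

```c
#define _POSIX_C_SOURCE 200809L  /* for pread() under strict -std=c11 */
#include <assert.h>
#include <fcntl.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

#define MSR_TEST_CTRL         0x33u  /* IA32_TEST_CTRL address, from the text */
#define SPLIT_LOCK_DETECT_BIT 29     /* bit position, from the text */

/* Pure decode step: is bit 29 set in a raw IA32_TEST_CTRL value? */
static inline bool split_lock_detect_enabled(uint64_t test_ctrl)
{
    return ((test_ctrl >> SPLIT_LOCK_DETECT_BIT) & 1u) != 0;
}

/* Reads one MSR on one CPU through the Linux msr driver.
 * Returns 0 on success and -1 on any failure (no permission, msr module
 * not loaded, or the MSR is not implemented on this CPU). */
static inline int read_msr(int cpu, uint32_t msr, uint64_t *value)
{
    char path[64];
    snprintf(path, sizeof path, "/dev/cpu/%d/msr", cpu);

    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    ssize_t n = pread(fd, value, sizeof *value, (off_t)msr);
    close(fd);
    return n == (ssize_t)sizeof *value ? 0 : -1;
}
```

A caller would combine the two: if read_msr(0, MSR_TEST_CTRL, &v) succeeds, split_lock_detect_enabled(v) reports whether CPU 0 currently raises #AC on split locks.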

Linux picked this up in kernel 5.7 with the split_lock_detect boot parameter. It accepts warn (log a message to dmesg and let the offending task continue), fatal (kill the offending process with SIGBUS), or off (disable detection entirely). The default posture shifted over kernel releases as maintainers gathered data on how common split locks were in real workloads.
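To find out which mode a running machine was booted with, one option is to look for the parameter in the kernel command line (on Linux, the contents of /proc/cmdline). The sketch below only parses a string handed to it; the function name is hypothetical. If the parameter is absent, the kernel's built-in default applies.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copies the value of the last `split_lock_detect=` token in a kernel
 * command line into `out` (truncating if needed).
 * Returns 1 if the parameter was found, 0 otherwise. */
static inline int split_lock_mode(const char *cmdline, char *out, size_t out_len)
{
    static const char key[] = "split_lock_detect=";
    const size_t key_len = sizeof key - 1;
    int found = 0;

    assert(out_len > 0);
    for (const char *p = cmdline; (p = strstr(p, key)) != NULL; p += key_len) {
        /* Accept the key only at the start of a whitespace-separated token. */
        if (p != cmdline && p[-1] != ' ' && p[-1] != '\t')
            continue;

        const char *val = p + key_len;
        size_t n = strcspn(val, " \t\n");   /* value ends at whitespace */
        if (n >= out_len)
            n = out_len - 1;
        memcpy(out, val, n);
        out[n] = '\0';
        found = 1;
    }
    return found;
}
```

Given "root=/dev/sda1 split_lock_detect=fatal quiet", the function writes "fatal" into the output buffer and returns 1.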

For those who want measurement without exception-based detection, Intel PMU events work on Skylake and later:

# Count split lock events for a workload.
# Event names vary by microarchitecture; confirm with: perf list | grep -i lock
perf stat -e sq_misc.split_lock ./target_binary

# Check whether the CPU and kernel expose split lock detection
grep -m1 -o split_lock_detect /proc/cpuinfo

A Cloud-Scale Problem

Split locks become a substantially worse problem in virtualized environments. A bus lock issued inside a guest VM does not stay inside the VM. The CPU has no knowledge of virtualization at the point where it decides to assert the bus lock signal. The lock propagates to the physical host, stalling memory operations across the entire physical machine, including every other VM co-located on that host.

One tenant running code with split locks can degrade throughput for all co-tenants. Cloud providers have had to treat this as a security concern as well as a performance one, because it lets a guest cause measurable system-wide slowdowns without any elevated privilege. KVM added handling for guest split locks: when detection is enabled on the host, a split lock in a guest is caught there, and the host’s split_lock_detect setting decides whether it is logged and allowed to continue or the offending VM thread is killed.

Intel and AMD Behave Differently

AMD processors do not guarantee bus-lock behavior for split atomic operations in the same way Intel’s architecture does. AMD’s handling has varied across microarchitecture generations. Some Zen-era parts handle split atomics internally within the cache coherency fabric without asserting a system-wide bus lock, which means the same misaligned code can perform acceptably on AMD and catastrophically on Intel, or vice versa depending on the generation. This is exactly the kind of implementation-specific divergence that the Chips and Cheese analysis is designed to surface: measuring real hardware rather than trusting the spec to describe performance.

Where Split Locks Come From in Practice

The practical fix is correct alignment. An atomic operation on a type of size N will never produce a split lock if the address is aligned to N bytes, provided N is a power of two and no larger than the cache line size. Mainstream C and C++ compilers give std::atomic<T> and C11 _Atomic objects at least their natural alignment automatically, so ordinary stack and heap allocations of atomic variables are safe.

The dangerous cases come from manual struct layout, packed attributes, shared memory mapped at arbitrary offsets, and pointer arithmetic that does not account for alignment:

#include <stdalign.h>
#include <stdatomic.h>
#include <stdint.h>

// Packed structs strip natural alignment padding
#pragma pack(push, 1)
struct packed_header {
    uint8_t version;
    _Atomic uint64_t sequence;  // at offset 1: can straddle a cache line
};
#pragma pack(pop)

// Let the compiler insert padding, or be explicit
struct safe_header {
    uint8_t version;
    uint8_t _pad[7];
    _Atomic uint64_t sequence;  // now at offset 8, properly aligned
};

// For cache-line isolation (also prevents false sharing)
struct cache_line_counter {
    alignas(64) _Atomic uint64_t value;
};
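Layout assumptions like these can be pinned down at compile time, so that a later edit that reintroduces packing or reorders fields fails the build instead of shipping a split lock. This is a sketch using C11 static_assert; the struct names are hypothetical stand-ins, and the checks assume a 64-byte cache line.

```c
#include <assert.h>    /* static_assert */
#include <stdalign.h>  /* alignas, alignof */
#include <stdatomic.h>
#include <stddef.h>    /* offsetof */
#include <stdint.h>

/* Hypothetical header layout, left to the compiler's default padding. */
struct checked_header {
    uint8_t version;
    _Atomic uint64_t sequence;
};

/* An 8-byte atomic at an 8-byte-aligned offset cannot straddle a line,
 * as long as the struct itself is placed at its natural alignment. */
static_assert(offsetof(struct checked_header, sequence) % sizeof(uint64_t) == 0,
              "sequence must be 8-byte aligned to avoid split locks");

/* Hypothetical counter that owns a whole cache line. */
struct padded_counter {
    alignas(64) _Atomic uint64_t value;
};

static_assert(alignof(struct padded_counter) == 64,
              "counter must start on a cache-line boundary");
static_assert(sizeof(struct padded_counter) == 64,
              "counter must occupy exactly one cache line");
```

Adding a pack(1) pragma around checked_header would move sequence to offset 1 and trip the first assertion at build time.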

Shared memory segments are another common source. mmap returns a page-aligned base address, which is safe. But casting a pointer at an arbitrary byte offset within that segment to an atomic type without checking alignment is not. If two processes share a region and each independently decides where to place an atomic control word, they need to agree on alignment explicitly, not assume it.

// Unsafe: header_size may not be a multiple of the required alignment
atomic_int *unaligned_ctrl = (atomic_int *)((char *)shm_base + header_size);

// Safe: round up to the alignment requirement before casting
// (shm_base is the page-aligned address returned by mmap)
size_t aligned_offset = (header_size + _Alignof(atomic_int) - 1)
                        & ~(_Alignof(atomic_int) - 1);
atomic_int *ctrl = (atomic_int *)((char *)shm_base + aligned_offset);
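When the offset comes from another process or from a file header, rounding up on one side is not enough; the reader should also refuse offsets that are misaligned or out of bounds. The following is one possible shape for that check, with hypothetical helper names (align_up, shm_atomic_int_at) and an assumed segment length parameter.

```c
#include <assert.h>
#include <stdalign.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Rounds `offset` up to the next multiple of `align` (a power of two). */
static inline size_t align_up(size_t offset, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (offset + align - 1) & ~(align - 1);
}

/* Returns a pointer to an atomic_int located `offset` bytes into a shared
 * segment of `seg_len` bytes, or NULL if the location is out of bounds or
 * not aligned for atomic_int. */
static inline atomic_int *shm_atomic_int_at(void *base, size_t seg_len,
                                            size_t offset)
{
    if (offset > seg_len || seg_len - offset < sizeof(atomic_int))
        return NULL;

    uintptr_t addr = (uintptr_t)base + offset;
    if (addr % alignof(atomic_int) != 0)
        return NULL;

    return (atomic_int *)addr;
}
```

The writer would place the control word at align_up(header_size, alignof(atomic_int)), and every reader would go through shm_atomic_int_at, so a disagreement about layout shows up as a NULL pointer rather than as a silent split lock.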

What the Hardware Analysis Adds

The value of empirical microarchitecture investigation, the kind Chips and Cheese regularly publishes, is that it replaces spec-reading with measurement. The x86 ISA tells you what a split lock is and that it will be atomic. It does not tell you that the penalty varies by a factor of ten or more across microarchitecture generations, that the stall propagation differs between Intel and AMD, or that contention patterns on adjacent cores interact with the bus lock in ways that compound the cost.

For anyone writing concurrent systems code, the practical conclusion is that alignment is not a style preference or a minor optimization. A misaligned atomic in a hot path is a serialization point for every core on the machine. The hardware executes it correctly and silently, which means you will not see a crash or a correctness failure, only a throughput cliff that is difficult to attribute without the right tooling. Getting the alignment right before deployment is considerably cheaper than diagnosing it afterward.