
When Atomics Cross a Cache Line: The High Cost of Split Locks on x86

Source: lobsters

Most developers working with concurrent code have a rough mental model of atomic operations: they are fast when uncontended, somewhat expensive under contention, and generally cheap compared to a mutex. That model is mostly right, but it has a blind spot. If an atomic instruction on x86 targets memory that straddles a 64-byte cache line boundary, the CPU cannot use cache-based locking at all. It falls back to a mechanism that predates multicore systems entirely: asserting the physical bus lock signal, halting every other core’s memory access until the operation completes.

That is a split lock, and the Chips and Cheese investigation into their behavior across different microarchitectures is a good excuse to go deeper on why they exist, how bad they actually are, and how surprisingly easy they are to stumble into.

Cache Line Locking vs. Bus Locking

When the x86 LOCK prefix accompanies an instruction like LOCK XADD or LOCK CMPXCHG, the CPU needs to guarantee that the read-modify-write is atomic with respect to all other processors. In modern multicore systems, this is handled through the cache coherency protocol. The cache line containing the target address is brought into the L1 cache in an exclusive state (the M or E state in MESI/MESIF), the operation executes, and no other core can observe an intermediate state because the cache coherency protocol serializes ownership. This is called a cache lock, and it is relatively cheap: it does not involve the memory bus at all, just the coherency fabric.

The problem is that a cache lock only works when the entire operand fits within a single cache line. On x86-64, cache lines are 64 bytes. A 4-byte access at offset 62 within a cache line would span bytes 62-63 of one line and bytes 0-1 of the next. Locking two cache lines simultaneously with MESI semantics is not how the protocol works; each cache line is an independent unit of ownership transfer.
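The boundary arithmetic is easy to check mechanically. The helper below is a minimal sketch, assuming 64-byte lines; the name crosses_cache_line and the CACHE_LINE_SIZE constant are invented for this illustration and do not come from any library.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE_SIZE 64u  /* assumption: 64-byte lines, as on current x86-64 parts */

/* Returns true if an access of `size` bytes starting at `addr` touches
 * more than one cache line, i.e. its first and last bytes fall in
 * different 64-byte lines. A zero-sized access never crosses. */
static bool crosses_cache_line(uintptr_t addr, size_t size)
{
    if (size == 0)
        return false;
    uintptr_t first_line = addr / CACHE_LINE_SIZE;
    uintptr_t last_line  = (addr + size - 1) / CACHE_LINE_SIZE;
    return first_line != last_line;
}
```

With this helper, crosses_cache_line(62, 4) is true (the example above), while crosses_cache_line(60, 4) is false because bytes 60-63 all sit in the same line.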

So x86 falls back to the original mechanism: the LOCK# bus signal. The CPU physically asserts this line on the memory bus, which prevents any other bus master from performing memory transactions until the signal is deasserted. On a single-core machine with a discrete bus, this was the only option, and it was fine. On a 128-core server or a desktop with hyperthreading, it is a global pause. Every hardware thread waiting on memory stalls until one misaligned atomic finishes.

How Bad Is It

The performance penalty varies by microarchitecture and by how frequently the split lock is hit, but the numbers are not subtle. Intel’s own documentation describes split-locked accesses as taking roughly 1000 cycles on older hardware compared to a handful of cycles for a well-aligned atomic in L1. Empirical benchmarks across different generations tend to show latency multipliers of 100x to 1000x for the locking thread alone, and that ignores the collateral cost to every other thread on the system.

In a server environment hosting multiple tenants or VMs, a split lock in one guest can degrade throughput across the entire host. The hypervisor itself often has to serialize VCPU execution around split locks, amplifying the damage beyond a single VM’s workload. This is one reason kernel and hypervisor developers treat split lock detection as a correctness and fairness concern, not just a performance hint.

How You Accidentally Write a Split Lock

The canonical path to a split lock is a misaligned structure containing an atomically-accessed field. Consider:

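struct __attribute__((packed)) metrics {
    char prefix[3];
    _Atomic int32_t counter;  // offset 3 only because the struct is packed
};

Without the packed attribute (a GCC/Clang extension), the compiler inserts one byte of padding and places counter at offset 4, so the misalignment needs packing or a hand-computed offset to occur. With it, the struct has an alignment requirement of 1 and may legally start at any address. An instance at address 0x3C places counter at 0x3F, where it spans 0x3F-0x42 and crosses the boundary between the cache lines at 0x00-0x3F and 0x40-0x7F. Every locked read-modify-write on that counter, such as an atomic_fetch_add that the compiler lowers to LOCK XADD, is then a split lock.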

Packed structures make this worse:

#pragma pack(push, 1)
struct packed_header {
    uint8_t  version;            // offset 0
    uint32_t sequence;           // offset 1, misaligned
    _Atomic uint64_t timestamp;  // offset 5, misaligned for an 8-byte atomic
};
#pragma pack(pop)  // restore the previous packing for later declarations

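Rust makes this harder to hit, but it is not immune. Atomic types carry an alignment equal to their size, and recent compilers reject references to misaligned fields of #[repr(packed)] structs outright, where older compilers only emitted a suppressible warning. Unsafe code that builds atomic accesses from raw pointers, hand-computed offsets, or packed layouts shared with C can still land on a misaligned address, and whether that address crosses a cache line depends on allocation alignment at runtime.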

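Custom allocators and memory pools are another source. An allocator that packs objects at byte or 4-byte granularity to minimize waste, rather than honoring each object's natural alignment, can place an 8-byte atomic at an address like 0x3C, where it spans 0x3C-0x43 and crosses a cache line boundary. A naturally aligned object cannot do this: a uint64_t at 0x38 occupies 0x38-0x3F and stays entirely within its line. Natural alignment is therefore enough to avoid split locks; full cache-line alignment is not required for that purpose.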
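For a pool allocator, the defense is to round every allocation up to the alignment the caller asks for. The following is a minimal sketch, assuming a hypothetical bump-pointer pool; struct pool and pool_alloc are names made up for this example, not part of any real allocator.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical bump-pointer pool, for illustration only. */
struct pool {
    unsigned char *base;  /* start of backing storage */
    size_t size;          /* capacity in bytes */
    size_t used;          /* bytes handed out so far, including padding */
};

/* Carves `size` bytes aligned to `align` (a power of two) out of the pool.
 * Returns NULL if `align` is invalid or the request does not fit. */
static void *pool_alloc(struct pool *p, size_t size, size_t align)
{
    if (align == 0 || (align & (align - 1)) != 0)
        return NULL;

    /* Round the current position up to the next multiple of `align`. */
    uintptr_t cur = (uintptr_t)p->base + p->used;
    uintptr_t aligned = (cur + (align - 1)) & ~(uintptr_t)(align - 1);
    size_t padding = (size_t)(aligned - cur);

    size_t remaining = p->size - p->used;
    if (padding > remaining || size > remaining - padding)
        return NULL;

    void *result = p->base + p->used + padding;
    p->used += padding + size;
    return result;
}
```

Calling pool_alloc(&p, sizeof(uint64_t), _Alignof(uint64_t)) after a 3-byte allocation skips 5 bytes of padding instead of handing out a misaligned address. The cost is a little wasted space, which is the trade the packing allocator was trying to avoid.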

Detection: From Hardware to the Kernel

For a long time, split locks were silent. The CPU handled them, paid the cost, and the programmer never knew. Intel changed this starting with Tremont and subsequent microarchitectures by adding Split Lock Detection (SLD), a hardware feature that raises an #AC (alignment check) exception when a split-locked instruction is about to execute. A related feature on later parts, bus lock detection, instead raises a #DB (debug exception) after a user-mode instruction has taken a bus lock.

Linux added support for this in kernel 5.7 via the split_lock_detect boot parameter, and later kernels added the kernel.split_lock_mitigate sysctl (/proc/sys/kernel/split_lock_mitigate). The available modes are off, warn, fatal, and ratelimit:N. In warn mode, the kernel logs a message naming the offending task and, on kernels where the mitigation is enabled, also slows that task down. In fatal mode, the process receives SIGBUS. The ratelimit:N mode applies to bus lock detection and caps the system-wide rate at N bus locks per second.

The kernel’s implementation handles an important complication: some legitimate software, including certain hypervisors and binary translators, uses split locks deliberately (usually for legacy compatibility reasons). The detection machinery therefore has to be opt-in or administrator-configurable rather than a hard fault universally.

On AMD processors, the behavior differs. AMD CPUs have generally not implemented the same bus-lock-based fallback with equal severity, and some AMD generations handle misaligned atomics differently at the microarchitectural level. The Chips and Cheese investigation is notable for poking at real hardware across vendors and generations to see where the penalties actually fall, since the architecture manuals describe the behavior but are less precise about cycles.

Catching This in Your Code

The most reliable static defense is explicit alignment. In C:

_Atomic int64_t counter __attribute__((aligned(64)));  // cache-line aligned

In Rust:

use std::sync::atomic::AtomicI64;

#[repr(align(64))]
struct AlignedCounter {
    value: AtomicI64,
}

Padding to a full cache line also prevents false sharing, which is a different problem but often addressed at the same time.
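Layout mistakes of this kind can also be caught at build time. The following is a minimal sketch, assuming C11 _Static_assert and a hypothetical struct stats; the ASSERT_NATURALLY_ALIGNED macro is a helper invented for this example, not a standard facility.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout used only for this illustration. */
struct stats {
    char tag[3];
    _Atomic int32_t hits;                /* compiler pads this to offset 4 */
    _Alignas(64) _Atomic int64_t bytes;  /* starts its own cache line */
};

/* Fails the build if `field` is not naturally aligned within `type`,
 * i.e. if its offset is not a multiple of its own size. */
#define ASSERT_NATURALLY_ALIGNED(type, field)                  \
    _Static_assert(offsetof(type, field) %                     \
                       sizeof(((type *)0)->field) == 0,        \
                   #field " is not naturally aligned")

ASSERT_NATURALLY_ALIGNED(struct stats, hits);
ASSERT_NATURALLY_ALIGNED(struct stats, bytes);
_Static_assert(offsetof(struct stats, bytes) % 64 == 0,
               "bytes should start on a cache-line boundary");
```

If someone later packs the struct, the first assertion stops the build. The check only covers offsets inside the struct: it assumes each instance is itself placed at an address that honors the struct's alignment, which is the allocator's job.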

For dynamic allocation, aligned_alloc (C11) or posix_memalign can ensure that buffers of atomics start on a cache line boundary. Rust's global allocator honors the alignment of the requested layout, which by default is the type's align_of. That natural alignment is already enough to rule out split locks; for atomics written from multiple threads, raising it to 64 with #[repr(align(64))] additionally avoids false sharing.
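As a sketch of the C side, the helper below allocates an array of counters with one counter per cache line. It assumes 64-byte lines and a C11 library that provides aligned_alloc; struct padded_counter and alloc_counters are hypothetical names chosen for this example.

```c
#include <stdatomic.h>
#include <stdint.h>
#include <stdlib.h>

#define CACHE_LINE_SIZE 64  /* assumption: 64-byte cache lines */

/* One counter per cache line: the alignment pads sizeof to 64 bytes,
 * which keeps neighbouring counters on separate lines. */
struct padded_counter {
    _Alignas(CACHE_LINE_SIZE) _Atomic int64_t value;
};

/* Allocates `n` zero-initialised counters on a cache-line boundary.
 * Returns NULL on failure; release the result with free(). */
static struct padded_counter *alloc_counters(size_t n)
{
    /* sizeof(struct padded_counter) is a multiple of 64, so the total
     * size satisfies C11's rule that it be a multiple of the alignment. */
    if (n == 0 || n > SIZE_MAX / sizeof(struct padded_counter))
        return NULL;

    struct padded_counter *c =
        aligned_alloc(CACHE_LINE_SIZE, n * sizeof *c);
    if (c == NULL)
        return NULL;

    for (size_t i = 0; i < n; i++)
        atomic_init(&c[i].value, 0);
    return c;
}
```

A caller would use it as alloc_counters(4), then atomic_fetch_add(&c[1].value, 1) and so on. Spending a full line per counter is deliberate here: it addresses false sharing as well as split locks, at the price of 56 bytes of padding per counter.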

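At runtime, Intel's performance monitoring units can help narrow this down. MEM_INST_RETIRED.SPLIT_LOADS and MEM_INST_RETIRED.SPLIT_STORES count retired loads and stores that cross a cache line, whether or not they are locked, and many Intel cores also expose a lock-specific event, SQ_MISC.SPLIT_LOCK. Event names vary by microarchitecture, so check perf list on the target machine. perf stat -e mem_inst_retired.split_stores on a Linux system counts line-crossing stores and does not require SLD-capable hardware. A non-zero count in performance-sensitive code is worth investigating.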

Why This Still Matters

Split locks feel like a historical artifact. They date from an era when the memory bus was a physical shared resource that needed explicit arbitration. In practice, they are still a live concern for several reasons.

First, the instruction set preserves backward compatibility absolutely. Any code that ran on an 8086 can still run on a modern Xeon, and the semantics of LOCK on a misaligned operand are defined. The CPU will never silently drop the lock guarantee, so the fallback is permanent.

Second, serialization bugs from misaligned atomics are not always obvious in testing. A single-threaded test or a test with low thread counts may never hit the cache line boundary in a way that causes measurable slowness. The problem manifests under load, on specific allocation patterns, or on hardware with particular NUMA topologies.

Third, code that uses atomics for performance-sensitive coordination, such as lock-free queues, reference counting in runtimes, or metrics collection, is exactly the code that cannot afford a 1000-cycle penalty per operation. The performance model the developer used when designing the algorithm assumed cache-locked atomics.

The x86 architecture’s commitment to compatibility means that split locks are not going away. Hardware detection makes them visible, and modern kernels make them auditable, but the underlying mechanism is an immovable consequence of designing a bus-lock signal before cache coherency protocols existed and then building 40 years of software on top of that foundation.
