When Cache Lines Collide: The Real Cost of x86 Split Locks

The x86 LOCK prefix makes a clean semantic promise: make this read-modify-write operation atomic with respect to all other processors. What the ISA does not advertise is that the cost of honoring that promise varies by several orders of magnitude depending on where in memory your data happens to live.

A recent investigation at Chips and Cheese looked at what happens on x86-64 when a LOCK-prefixed instruction crosses a cache line boundary, a condition the architecture calls a split lock. The findings illustrate something that most concurrent programming guides skip entirely: the LOCK prefix is not a uniform primitive. It is a spectrum, and the cheap end and the expensive end are not close to each other.

Cache Lines and the MESI Protocol

Modern x86-64 processors move data between memory and cache in 64-byte cache lines. When a core performs a locked atomic operation on data that sits within a single line, the hardware relies on a MESI-style (Modified, Exclusive, Shared, Invalid) cache coherency protocol to provide atomicity without locking the memory bus. The core takes exclusive ownership of the line (the Exclusive or Modified state), performs the operation, and the protocol guarantees no other core can hold a conflicting copy in the meantime. This path is fast. A LOCK XCHG on well-aligned data typically completes in 40 to 100 nanoseconds when the line has to be fetched from another core, roughly on par with a cross-core cache miss, and faster still when the line is already held locally.

MESI handles atomicity cleanly because it makes a fundamental assumption: the operation targets a single cache line. The moment that assumption breaks, the hardware falls back to a mechanism that predates cache coherency by decades.

What Split Locks Actually Do

When a LOCK-prefixed instruction accesses memory that spans two cache lines, the CPU cannot use cache coherency alone to guarantee atomicity. Both cache lines would need to be locked simultaneously, and MESI has no concept of a multi-line atomic transaction. So the processor falls back to asserting the LOCK# signal, the original mechanism for enforcing atomicity across multiple processors before coherency protocols existed.
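
The condition itself is plain address arithmetic. As a minimal sketch, assuming 64-byte lines, the check might look like the following; spans_two_cache_lines and CACHE_LINE_SIZE are illustrative names, not part of any library:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed line size; 64 bytes matches the x86-64 parts discussed here. */
#define CACHE_LINE_SIZE 64u

static_assert((CACHE_LINE_SIZE & (CACHE_LINE_SIZE - 1)) == 0,
              "cache line size must be a power of two");

/* Hypothetical helper: true if the `size`-byte access starting at `addr`
 * touches more than one cache line. `size` must be at least 1. */
bool spans_two_cache_lines(uintptr_t addr, size_t size)
{
    uintptr_t first_line = addr / CACHE_LINE_SIZE;
    uintptr_t last_line  = (addr + size - 1) / CACHE_LINE_SIZE;
    return first_line != last_line;
}
```

A 4-byte access at offset 60 within a line stays inside it, while the same access at offset 61 spills one byte into the next line. With a LOCK prefix, that second case is a split lock.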

On a modern multi-core system, a bus lock stalls all other cores from accessing memory for the duration of the operation. Not just the two affected cache lines. The entire memory subsystem. Every thread on every core waits.

The performance difference is significant. Where an aligned atomic costs 40 to 100 nanoseconds, a split lock can run into the microseconds. On a loaded system with many threads, the impact compounds: each split lock from one thread serializes memory access for all others, producing a bottleneck that grows worse as core count increases. A bug that was nearly invisible on a dual-core desktop in 2008 can cripple throughput on a 64-core server.

The Long History of Bus Locking

Bus locking is not a new fallback. It is the original mechanism. The Intel 8086, introduced in 1978, had a physical LOCK pin that could freeze the external bus during a read-modify-write cycle. Early symmetric multiprocessing systems relied entirely on this approach. Cache coherency protocols emerged later as a way to achieve atomicity with much lower overhead, and over time they became the standard path for locked operations on well-aligned data.

The expectation, once MESI became ubiquitous, was that bus locking would be rare in practice. That expectation held reasonably well through the era of handwritten assembly and careful low-level C. It became less reliable as atomic operations moved up the abstraction stack.

C++11 introduced std::atomic. Java’s java.util.concurrent package made lock-free data structures approachable. Rust’s std::sync::atomic built atomics into the language’s safety model. All of these abstractions give programmers a clean semantic interface without exposing the underlying alignment requirements. A developer writing std::atomic<uint32_t> has no obvious reason to think about which byte of a cache line that integer occupies.

Packed structs, certain compiler layouts, JVM object layouts, and misaligned heap allocations can all produce split locks without any indication in source code. The instruction executes. The result is correct. The only symptom is performance degradation, often spread across threads in a way that makes the root cause hard to trace without careful measurement.
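
To make the packed-struct case concrete, here is a rough sketch that scans an array of 5-byte packed records and reports the first element whose counter would straddle a line. It assumes 64-byte lines, a cache-line-aligned array base, and the GCC/Clang packed attribute; packed_record and first_straddling_index are hypothetical names used only for illustration:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE_SIZE 64u  /* assumed line size */

/* Hypothetical record: the packed attribute removes all padding. */
struct __attribute__((packed)) packed_record {
    uint8_t  flag;
    uint32_t counter;
};

static_assert(sizeof(struct packed_record) == 5, "expected a 5-byte record");

/* For an array of `count` records whose base is cache-line aligned, return
 * the index of the first element whose `counter` crosses a line boundary,
 * or -1 if no element does. */
long first_straddling_index(size_t count)
{
    for (size_t i = 0; i < count; i++) {
        size_t start = i * sizeof(struct packed_record)
                     + offsetof(struct packed_record, counter);
        size_t end = start + sizeof(uint32_t) - 1;  /* last byte of the field */
        if (start / CACHE_LINE_SIZE != end / CACHE_LINE_SIZE)
            return (long)i;
    }
    return -1;
}
```

Under these assumptions, element 12 is the first offender: its counter occupies bytes 61 through 64, so a locked operation on it would touch two lines even though nothing in the source looks unusual.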

Intel’s Detection Hardware

Intel eventually addressed this by adding a hardware detection feature, introduced with the Tremont microarchitecture and present in subsequent client and server designs. When enabled, the CPU raises an alignment check exception (#AC) on a split lock attempt instead of silently taking the bus lock, giving the OS a chance to respond.

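The Linux kernel integrated support for this capability in the 5.7 release cycle, with patches contributed by Intel engineers. The kernel exposes control through the split_lock_detect boot parameter, which is set on the kernel command line rather than toggled at runtime. Newer kernels add a kernel.split_lock_mitigate sysctl, but it only controls whether warn mode deliberately slows the offending task. The boot parameter accepts these values:

# Log a rate-limited kernel warning when user space takes a split lock
# (the default on CPUs that support detection)
split_lock_detect=warn

# Send SIGBUS to the offending process
split_lock_detect=fatal

# Cap bus locks at N per second system-wide (requires bus lock detection hardware)
split_lock_detect=ratelimit:N

# Disable detection
split_lock_detect=off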

The warn mode is useful during development. Run your workload, then check dmesg for split lock reports. The kernel logs the offending process name and PID, which gives you a starting point for investigation.

AMD processors do not implement the same MSR-based split lock trap, and only recent AMD designs add a comparable bus lock detection feature. On AMD hardware without it, split locks still incur the serialization penalty but happen silently. The performance cost is real regardless of whether the hardware can tell you about it.

Finding Split Locks Without the Trap

On Intel microarchitectures that expose the relevant PMU counters, perf can surface split memory accesses even without the kernel trap:

perf stat -e mem_inst_retired.split_loads,mem_inst_retired.split_stores ./your_binary

Split loads and split stores do not necessarily imply split locks, since the LOCK prefix is required for the worst-case bus lock behavior. But frequent split accesses in hot paths indicate alignment problems worth investigating, and they often correlate with locked operations on the same data.

For Java workloads, the JVM’s object layout can place fields at offsets that cause alignment issues, depending on how the runtime arranges fields and objects. The -XX:+PrintFieldLayout JVM flag (available only in some builds and versions) can help reveal field layout. Tools like JOL (Java Object Layout) let you inspect the in-memory layout of specific classes at runtime.

Fixing the Problem at the Source

Once you identify a split lock, the fix is usually straightforward. The goal is ensuring that the atomic variable lives entirely within one cache line:

#include <stdint.h>

// Problem: packed struct pushes the counter to an odd offset
struct __attribute__((packed)) bad_layout {
    uint8_t  flag;
    uint32_t counter;  // starts at byte offset 1, misaligned; can straddle a line boundary
};

// Fix: natural alignment through explicit padding
struct good_layout {
    uint8_t  flag;
    uint8_t  _pad[3];
    uint32_t counter;  // now at byte offset 4, naturally aligned
};

// Fix: enforce cache-line alignment on the struct
struct __attribute__((aligned(64))) aligned_layout {
    uint8_t  flag;
    uint32_t counter;  // compiler pads this to offset 4; each instance starts its own line
};
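
Layout fixes like these are easy to undo by accident when a field is added later. One way to guard against that, sketched here under the assumption of a C11 compiler with the GCC/Clang attribute syntax used above, is to pin the layout with compile-time assertions:

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

struct good_layout {
    uint8_t  flag;
    uint8_t  _pad[3];
    uint32_t counter;
};

struct __attribute__((aligned(64))) aligned_layout {
    uint8_t  flag;
    uint32_t counter;
};

/* A naturally aligned 4-byte field cannot straddle a 64-byte line, because
 * 64 is a multiple of 4. Break the build if that property is ever lost. */
static_assert(offsetof(struct good_layout, counter) % alignof(uint32_t) == 0,
              "good_layout.counter is not naturally aligned");
static_assert(offsetof(struct aligned_layout, counter) % alignof(uint32_t) == 0,
              "aligned_layout.counter is not naturally aligned");
static_assert(alignof(struct aligned_layout) == 64,
              "aligned_layout must start on a cache line boundary");
```

These checks cost nothing at runtime. They cover only the type's own layout, so the containing allocation must still honor the type's alignment.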

For heap allocations, standard C11 provides aligned_alloc, and POSIX provides posix_memalign:

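#include <stdlib.h>

// POSIX: returns 0 on success, an error number on failure
void *buf = NULL;
if (posix_memalign(&buf, 64, size) != 0)
    buf = NULL;

// C11: size should be a multiple of the alignment
void *buf2 = aligned_alloc(64, (size + 63) / 64 * 64);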

In C++, alignas works at the type level:

#include <atomic>
#include <cstdint>

struct alignas(64) hot_counter {
    std::atomic<std::uint32_t> value;
    // sizeof is rounded up to 64, so no two instances share a cache line
};

For shared-memory IPC scenarios where you cannot control the allocator, you may need to add runtime alignment checks or document the alignment requirement as a contract for callers.
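
A runtime check of that kind can be very small. The sketch below assumes the atomic's size is a power of two no larger than a 64-byte line, in which case natural alignment is enough to rule out a split; is_naturally_aligned is a hypothetical helper name, not an existing API:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE_SIZE 64u  /* assumed line size */

static_assert((CACHE_LINE_SIZE & (CACHE_LINE_SIZE - 1)) == 0,
              "cache line size must be a power of two");

/* Hypothetical helper: true if `p` is aligned to `size`, where `size` is a
 * power of two no larger than a cache line. An object that passes this check
 * cannot straddle a line boundary. */
bool is_naturally_aligned(const void *p, size_t size)
{
    if (size == 0 || (size & (size - 1)) != 0 || size > CACHE_LINE_SIZE)
        return false;
    return ((uintptr_t)p & (size - 1)) == 0;
}
```

A caller mapping a shared region could run this on each counter address before the first atomic access and reject the mapping, or fall back to a mutex, when it fails.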

Why This Persists

Split locks are the kind of bug that survives for years because they are invisible at the level where most debugging happens. The program computes the right answer. Tests pass. The issue only appears as throughput degradation under load, and even then the symptom, a high-contention bottleneck, looks like many other concurrency problems.

The kernel detection feature and microarchitecture-level analysis like the Chips and Cheese piece are valuable precisely because they make the invisible visible. Knowing that the x86 LOCK prefix has a fast path and a catastrophically slow fallback, and knowing what triggers the fallback, changes how you approach data structure layout in concurrent code.

Alignment discipline was once the automatic habit of anyone writing in assembly or low-level C, baked in by necessity. As atomic operations moved into higher-level abstractions, that discipline became easy to lose. The hardware has not become more forgiving on this front. The number of cores sharing that serialized bus has only grown. The gap between getting the semantics right and getting the performance right is wider than it looks.