
Split Locks on x86: The Performance Penalty Hiding in Your Struct Layout

Source: lobsters

A recent investigation at Chips and Cheese measures the microarchitectural behavior of split locks across several x86 processors. The numbers are striking, but the more interesting story is how this hardware behavior ended up where it did: a correctness guarantee from the 1980s, preserved through every generation of the architecture, that silently degraded performance in production systems until relatively recent kernel and CPU changes made split locks detectable at all.

What the LOCK Prefix Normally Does

When you write an atomic operation in C:

__atomic_fetch_add(&counter, 1, __ATOMIC_SEQ_CST);

the compiler emits something like LOCK XADD [mem], eax. The LOCK prefix guarantees that the read-modify-write is atomic across all processors. On modern multi-core hardware, the CPU implements this through cache locking: the processor acquires exclusive ownership of the relevant cache line via the MESI protocol, performs the operation, and releases it. No other core can access the line in between. The cost is modest, typically 20 to 100 cycles depending on cache state and contention.

This mechanism has one hard constraint. The entire operand must fit within a single cache line. x86 cache lines are 64 bytes, aligned to 64-byte boundaries in physical memory.

The Problem at the Boundary

A split lock occurs when a LOCK-prefixed memory operation spans two adjacent cache lines. Consider an int64_t stored at offset 60 within a cache-line-aligned struct. Bytes 60 through 63 fall in the first cache line; bytes 64 through 67 fall in the next. Accessing that value without atomics is fine in single-threaded code. Adding a LOCK prefix creates a situation the cache coherency protocol cannot handle: there is no mechanism to simultaneously hold exclusive ownership of two independent cache lines.

The processor’s fallback is bus locking. On early x86, the LOCK prefix would assert the LOCK# pin on the front-side bus, physically preventing any other bus master from accessing memory until the operation completed. Modern CPUs simulate equivalent behavior on their ring and interconnect fabrics. The important point is that this serializes memory access at a much coarser granularity than a single cache line. Other cores cannot complete their own memory transactions during the bus lock window.

The performance gap is large. A cache-locked atomic on a warm cache line might cost 40 cycles. A split lock on the same data can cost hundreds to several thousand cycles. In a tight loop, one thread performing split locks continuously can reduce system-wide memory throughput for every other thread on the machine.

The Alignment Problem Is Not Always Obvious

Direct misalignment is easy to spot. The subtler form appears in struct layout:

struct __attribute__((packed)) message {
    uint8_t  type;
    uint8_t  flags;
    uint16_t length;
    uint8_t  payload[56];   // 60 bytes used so far
    int64_t  sequence;      // offset 60 when packed, spans bytes 60-67
};

Because the struct is packed, the sequence field sits at offset 60, so even if the struct begins at a cache-line-aligned address, the field crosses the 64-byte boundary. An atomic increment on sequence becomes a split lock. The field is not even naturally aligned (offset 60 is not divisible by 8). Without the packed attribute, an x86-64 compiler would insert four bytes of padding and place sequence at offset 64, but packed structs, wire-format overlays, and manual layout decisions produce exactly this scenario.

The fix is to ensure atomic fields are aligned both to their natural size and, if possible, placed so they do not straddle a cache line:

struct __attribute__((aligned(64))) shared_counter {
    _Atomic int64_t value;
    char _pad[56];  // prevent false sharing with adjacent allocations
};

In C++, alignas(64) on the struct or field serves the same purpose. In Rust, the standard AtomicU64 type requires natural (8-byte) alignment, so safe code in an ordinary or repr(C) struct cannot misalign it; the risk comes from unsafe code that builds an atomic from a raw pointer into a packed structure or byte buffer.

Intel Adds Hardware Visibility

For most of the multi-core era, split locks were a silent problem. The operation completed correctly, just slowly. Attributing latency to split locks required correlating performance counter data with source-level alignment information, which meant knowing what to look for in advance.

Starting with Ice Lake (server) and Tiger Lake (client) around 2020, Intel added split lock detection to the hardware. Support is enumerated by bit 5 of the IA32_CORE_CAPABILITIES MSR (address 0xCF), and the feature is enabled by setting bit 29 of MSR_TEST_CTRL (address 0x33). Once enabled, the processor raises an #AC (Alignment Check, exception vector 17) whenever a LOCK-prefixed instruction would generate a split lock. The operation does not execute. The kernel catches the exception and decides what to do with it.

This is a meaningful change. Instead of a performance anomaly with no hardware signal, you now have a precise interrupt at the exact instruction that caused the problem.

Linux’s Kernel-Side Handling

The Linux kernel added split lock detection support in version 5.7 (released in May 2020), controlled via the split_lock_detect boot parameter:

split_lock_detect=off        # disable entirely
split_lock_detect=warn       # log the offender, allow it to continue
split_lock_detect=fatal      # send SIGBUS to the process
split_lock_detect=ratelimit:N  # bus lock detection only: cap at N bus locks per second

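Some kernels may also expose a runtime interface through debugfs. The path below is not present on every kernel, so check that it exists before scripting against it: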

echo warn > /sys/kernel/debug/x86/split_lock_detect

The default behavior in most distributions is to warn and allow the split lock to proceed, which preserves backward compatibility at the cost of the performance penalty. In testing environments where you want to catch and eliminate split-locking code, fatal mode is the right choice.

You can also count bus locks using the perf toolchain without kernel #AC handling:

perf stat -e bus-lock ./your_program

The bus-lock event (backed by BUS_LOCK.SELF in Intel’s performance event catalog) counts bus lock acquisitions, which includes split locks. Event names vary between microarchitectures, so run perf list to confirm what your CPU exposes before relying on this exact name. Seeing non-zero counts on a program that should only be performing well-aligned atomics is a strong indicator that something is misaligned.

Virtualization Makes This Worse

Split locks are a nuisance in native code. In virtualized environments they become a shared-resource problem. When a guest VM performs a split lock, the host must handle the resulting #AC exception or emulate the instruction’s behavior. Either path has overhead, and the bus serialization that occurs still affects the physical host’s memory interconnect, meaning one misbehaving guest can degrade other guests or the host itself.

KVM went through multiple iterations of split lock handling as the hardware feature matured. The current approach allows the VMM to configure whether split locks in guests raise #AC inside the guest or are handled transparently by the hypervisor. The KVM documentation covers the current state. Cloud operators running dense multi-tenant workloads have a real interest in enabling guest-facing split lock detection so noisy tenants can be identified and the code fixed.

Why This Behavior Still Exists

The reason split locks produce bus-lock behavior rather than simply faulting is backward compatibility. The Intel Software Developer’s Manual has long specified that locked operations on misaligned addresses complete correctly, whatever the performance penalty. Removing that guarantee would break binary compatibility with software built against those semantics.

This is a recurring pattern in x86’s evolution. A behavior that made reasonable sense in the context of an 8086 with a simple synchronous bus becomes an awkward constraint in a system with NUMA topology, hundreds of cores, and cache-coherent interconnects. The architecture accumulates these sediment layers. The response is usually to add detection and mitigation on top rather than break the guarantee, which is why split lock detection arrived as an opt-in hardware feature four decades after the original behavior was specified.

What makes the Chips and Cheese investigation worth reading is that it moves past the specification-level description and measures what different microarchitectures actually do: how long the bus lock window lasts, what gets serialized, and whether the behavior differs between client and server parts. The spec tells you that split locks fall back to bus locking; the microarchitectural data tells you how much of your machine you are stalling when one happens.

Finding Them in Practice

For most codebases, the practical checklist is:

  • Audit any struct that contains atomic fields and has manually specified layout or padding
  • Enable split_lock_detect=warn in staging and check kernel logs under load
  • Run perf stat -e bus-lock on performance-sensitive workloads
  • Use _Alignas(64) or alignas(64) on atomic fields that are written under high contention

Split locks do not fail in any obvious way. Code containing them ships, passes tests, and runs in production for years. The performance cost is real but diffuse enough that it rarely gets attributed correctly without specifically looking for it.
