Cache Lines, Single Writers, and the Hardware Contract Your Code Ignores

The term mechanical sympathy comes from Formula 1. Jackie Stewart, the Scottish driver who won three world championships in the late 1960s and early 70s, used it to describe the relationship the best drivers had with their cars: they understood the machine well enough to work with it rather than fight it. Martin Thompson borrowed the phrase for software through his Mechanical Sympathy blog, writing about what it means for code to cooperate with the hardware it runs on. That framing has resurfaced in a new article by Caer Sanders on Martin Fowler’s site, which distills the practice into four everyday principles: predictable memory access, cache line awareness, single-writer, and natural batching.

The four principles are not new discoveries. They’ve been lurking in Thompson’s LMAX Disruptor work, in Ulrich Drepper’s 2007 paper “What Every Programmer Should Know About Memory”, and in the documentation of every major HPC library for decades. What the Sanders article does is bring them down to earth, making them applicable outside high-frequency trading or kernel development. That’s worth expanding on, because these principles do surface in ordinary software, and the underlying hardware reasons are usually skipped.

The Cost Model Underneath

Modern CPUs are not fast in an absolute sense. They are fast for certain access patterns and slow for others, and the gap between those regimes is large. An L1 cache hit costs roughly 4 cycles on a modern x86-64 core. An L2 hit is around 12. L3 ranges from 40 to 50 cycles. A main memory access, where the CPU must wait on DRAM, costs 200 to 300 cycles, sometimes more on multi-socket NUMA systems where the memory lives on a different physical node. An access that also misses the TLB pays for a page table walk on top of that, and one that triggers a page fault can cost thousands of cycles or more.
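To see how these latencies combine, a rough model weights each level by how often an access is served there. The sketch below uses the cycle counts quoted above; the function name expected_cycles and the hit fractions passed to it are illustrative assumptions, not measurements.

```cpp
#include <array>
#include <cstddef>

// Approximate latencies in cycles for L1, L2, L3, and DRAM, taken from the
// figures quoted above. Real values vary by microarchitecture.
constexpr std::array<double, 4> kLatencyCycles = {4.0, 12.0, 45.0, 250.0};

// Expected cycles per access, given the fraction of accesses served by each
// level. The fractions are hypothetical inputs and should sum to 1.
constexpr double expected_cycles(const std::array<double, 4>& served_fraction) {
    double total = 0.0;
    for (std::size_t i = 0; i < served_fraction.size(); ++i) {
        total += served_fraction[i] * kLatencyCycles[i];
    }
    return total;
}
```

With assumed fractions of 95% L1, 3% L2, 1% L3, and 1% DRAM, the model gives about 7.1 cycles per access. Shift that to 70%, 10%, 10%, and 10% and the average rises to 33.5 cycles, almost all of it from the DRAM term.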

These numbers aren’t incidental to hardware design; they reflect physics. SRAM that fits on-die close to execution units is fast, expensive, and small. DRAM that lives on a separate chip is large but slow. The cache hierarchy is a bet that locality holds: if you accessed address X recently, you’ll probably access addresses near X soon. The four principles are four ways to make that bet pay off.

Predictable Memory Access

Hardware prefetch engines watch your memory access patterns and speculatively load cache lines before you request them. A sequential scan through an array, stride 1, is the ideal case. The prefetcher detects the pattern within a few iterations and begins pulling data ahead of the execution pipeline, hiding most of the DRAM latency. Stride-N patterns, accessing every Nth element, are still detectable but waste more of each fetched line as N grows: once the stride exceeds the cache line size, every access touches a new line and uses only a fraction of it. Very large or irregular strides eventually defeat the prefetcher altogether. Pointer chasing, following a linked list through scattered heap nodes, is the worst case: each access reveals the address of the next, so the prefetcher has no useful signal until the previous load completes. Traversing a heap-allocated linked list whose nodes are scattered in memory can be 10x slower than traversing a contiguous array of the same length.
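The two extremes can be put side by side in a small sketch. The function names are hypothetical, and the code shows only the access patterns, not a benchmark. A freshly built std::list often has its nodes allocated next to each other, so the gap tends to show up only once the heap is fragmented and the data no longer fits in cache.

```cpp
#include <cstdint>
#include <list>
#include <numeric>
#include <vector>

// Contiguous traversal: addresses advance by a fixed stride, which the
// hardware prefetcher can detect and run ahead of.
std::int64_t sum_contiguous(const std::vector<std::int64_t>& values) {
    return std::accumulate(values.begin(), values.end(), std::int64_t{0});
}

// Pointer chasing: the address of each node is known only after the previous
// node has been loaded, so the loads are serialized.
std::int64_t sum_linked(const std::list<std::int64_t>& values) {
    std::int64_t total = 0;
    for (std::int64_t v : values) {
        total += v;
    }
    return total;
}
```

Both functions compute the same result; only the memory layout behind the iteration differs.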

This shapes which data structures make sense for performance-critical code. Consider two ways to lay out particle simulation data:

// Array of Structs
struct Particle {
    float x, y, z;
    float vx, vy, vz;
    float mass;
    int   active;
};
Particle particles[N];

// Struct of Arrays
struct Particles {
    float x[N], y[N], z[N];
    float vx[N], vy[N], vz[N];
    float mass[N];
    int   active[N];
};

A loop that updates only positions using velocities will, under the Array of Structs layout, load mass and active into cache on every iteration even though it never reads them. Those fields consume cache lines and displace data the loop actually needs. The Struct of Arrays layout lets the loop stride through only x, y, z, vx, vy, vz, keeping the working set small and the prefetcher happy. The performance difference in tight numerical loops can easily be 3-5x. This is the design choice behind the Entity Component System architecture used by the Bevy game engine and the EnTT library.
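A minimal sketch of the position update under the Struct of Arrays layout might look like the following. It swaps the fixed-size arrays for std::vector so the particle count can be chosen at run time. The names ParticlesSoA and integrate_positions, and the simple one-step update, are assumptions for illustration.

```cpp
#include <cstddef>
#include <vector>

// Struct-of-Arrays layout using std::vector instead of fixed-size arrays
// (an assumption made here so the count can be chosen at run time).
struct ParticlesSoA {
    std::vector<float> x, y, z;
    std::vector<float> vx, vy, vz;
    std::vector<float> mass;
    std::vector<int>   active;

    explicit ParticlesSoA(std::size_t n)
        : x(n), y(n), z(n), vx(n), vy(n), vz(n), mass(n, 1.0f), active(n, 1) {}
};

// Advance positions by one time step. Only the six position and velocity
// arrays are read or written; mass and active are never touched by this loop.
void integrate_positions(ParticlesSoA& p, float dt) {
    const std::size_t n = p.x.size();
    for (std::size_t i = 0; i < n; ++i) {
        p.x[i] += p.vx[i] * dt;
        p.y[i] += p.vy[i] * dt;
        p.z[i] += p.vz[i] * dt;
    }
}
```

Because each array is contiguous and accessed with stride 1, this loop is also a straightforward candidate for compiler auto-vectorization.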

Cache Line Awareness

A CPU cache does not transfer individual bytes. It transfers cache lines: 64 bytes on x86-64, 64 bytes on most ARM chips, but 128 bytes on Apple Silicon M-series processors. Reading a single byte brings in the surrounding 63 bytes alongside it. Usually this is beneficial, reflecting spatial locality. The problem arises in concurrent programs when two threads write to distinct variables that happen to share a cache line.

Cache coherency protocols in the MESI family (Modified, Exclusive, Shared, Invalid; Intel uses the MESIF variant, AMD uses MOESI) require that when one core writes to a cache line, every other core holding a copy of that line must invalidate it. When thread 1 writes to field a and thread 2 writes to adjacent field b, and both live on the same 64-byte cache line, each write invalidates the other core’s copy. The line bounces across the core-to-core interconnect continuously, even though the threads are writing to logically independent data. This is false sharing, and it can reduce throughput by a factor of 10 or more on highly contended data.

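The fix is padding. Java 8 introduced the @Contended annotation for this purpose, originally as sun.misc.Contended; since JDK 9 it lives in the internal package jdk.internal.vm.annotation. The JVM ignores it on application classes unless started with -XX:-RestrictContended:

// JDK 9+ location. Application code must export the internal package at
// compile time and run with -XX:-RestrictContended for this to take effect.
@jdk.internal.vm.annotation.Contended
class PaddedLong {
    volatile long value;
}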

C++17 added std::hardware_destructive_interference_size, declared in <new>, as the portable way to get the cache line width at compile time:

#include <atomic>
#include <new>

struct alignas(std::hardware_destructive_interference_size) Counter {
    std::atomic<long> value;
};

The LMAX Disruptor pads its Sequence class extensively for exactly this reason. Producer and consumer sequence numbers live in different cache lines so writes from one side don’t invalidate cache state on the other.
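The same idea applies to any array of per-thread counters. The sketch below assumes a 64-byte line (kAssumedLineSize) and uses a hypothetical helper, count_in_parallel. It is not the Disruptor code, and it relies on C++17 so that std::vector honors the over-aligned element type.

```cpp
#include <atomic>
#include <cstddef>
#include <thread>
#include <vector>

// Assumed line size for padding. 64 matches the x86-64 figure above; 128
// would be needed on hardware with 128-byte lines.
constexpr std::size_t kAssumedLineSize = 64;

// alignas rounds sizeof up to the alignment as well, so adjacent elements
// of an array of PaddedCounter land on separate cache lines.
struct alignas(kAssumedLineSize) PaddedCounter {
    std::atomic<long> value{0};
};

// Runs `threads` workers, each incrementing only its own counter, then
// returns the sum of all counters. Hypothetical helper for illustration.
long count_in_parallel(std::size_t threads, long increments_per_thread) {
    std::vector<PaddedCounter> counters(threads);
    std::vector<std::thread> workers;
    for (std::size_t t = 0; t < threads; ++t) {
        workers.emplace_back([&counters, t, increments_per_thread] {
            for (long i = 0; i < increments_per_thread; ++i) {
                counters[t].value.fetch_add(1, std::memory_order_relaxed);
            }
        });
    }
    for (auto& w : workers) {
        w.join();
    }
    long total = 0;
    for (const auto& c : counters) {
        total += c.value.load(std::memory_order_relaxed);
    }
    return total;
}
```

Removing the alignas leaves the result unchanged but packs several counters into one line, which is the false-sharing layout described above.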

The Apple Silicon point deserves separate mention. Code tuned on x86 with 64-byte padding assumptions can silently reintroduce false sharing on M-series hardware where lines are 128 bytes. The std::hardware_destructive_interference_size constant is supposed to handle this portably, but it was only standardized in C++17, and plenty of codebases still use hardcoded 64-byte alignment.
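One hedged way to avoid a hardcoded 64 is to use the standard constant when the library provides it and fall back to a conservative value otherwise. The names kCacheLinePad and SequenceSlot are hypothetical, and the 128-byte fallback is an assumption that trades extra padding for safety on both line sizes.

```cpp
#include <cstddef>
#include <new>

// Prefer the standard constant when the library defines it; otherwise fall
// back to 128 bytes, which covers both 64-byte and 128-byte lines.
#ifdef __cpp_lib_hardware_interference_size
constexpr std::size_t kCacheLinePad = std::hardware_destructive_interference_size;
#else
constexpr std::size_t kCacheLinePad = 128;
#endif

// A value that occupies its own padded region.
struct alignas(kCacheLinePad) SequenceSlot {
    long value = 0;
};

static_assert(sizeof(SequenceSlot) % kCacheLinePad == 0,
              "each slot must fill a whole number of padded regions");
```

The constant is fixed when the code is compiled. It reflects what the compiler assumes about the target, not the machine the binary eventually runs on.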

Single Writer

The single-writer principle follows from cache coherency analysis. Every write to a shared cache line, regardless of locking, generates coherency traffic if any other core has a copy. The more cores that write to the same line, the more traffic flows through the interconnect, and the more each write costs in latency.

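Martin Thompson articulated the principle clearly: if a single thread owns writes to a piece of data, write contention on that data disappears. The owning thread never has to win the cache line back from a competing writer, so it writes at close to full speed. Other threads read. As long as the writer publishes with release semantics and readers load with acquire semantics, no lock is needed, and the line moves only from the writer to its readers rather than bouncing between competing writers.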
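A minimal sketch of a single-writer value, assuming C++11 atomics, is shown below. The class name SingleWriterValue is hypothetical. The rule that only one thread calls the writing methods is a convention the type does not enforce.

```cpp
#include <atomic>
#include <cstdint>

// A value with exactly one writing thread and any number of readers.
class SingleWriterValue {
public:
    // Owner thread only: a plain store with release ordering.
    void publish(std::uint64_t v) {
        value_.store(v, std::memory_order_release);
    }

    // Owner thread only: no fetch_add is needed, because no other thread
    // ever writes value_, so a separate load and store cannot lose updates.
    void increment() {
        value_.store(value_.load(std::memory_order_relaxed) + 1,
                     std::memory_order_release);
    }

    // Any thread: acquire ordering pairs with the release stores above, so
    // data written before a publish is visible once this load observes it.
    std::uint64_t read() const {
        return value_.load(std::memory_order_acquire);
    }

private:
    std::atomic<std::uint64_t> value_{0};
};
```

If a second thread ever called increment(), updates could be lost, which is why ownership has to be part of the design, not an afterthought.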

The Disruptor implements this structurally. Each producer owns exactly one sequence counter; only that producer writes it. Each consumer owns exactly one consumer sequence; only that consumer writes it. The ring buffer entries themselves are written by exactly one producer per slot and read by each consumer once. The resulting throughput, measured at tens of millions of events per second on commodity hardware in the original LMAX benchmarks, flows directly from eliminating write contention.
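The ownership structure can be illustrated with a much smaller single-producer, single-consumer ring. This is a sketch of the general idea, not the Disruptor implementation: the name SpscRing, the power-of-two capacity requirement, and the 64-byte alignment are assumptions made here.

```cpp
#include <array>
#include <atomic>
#include <cstddef>
#include <optional>

// Single-producer, single-consumer ring. Capacity must be a power of two so
// that index wrapping can use a bit mask.
template <typename T, std::size_t Capacity>
class SpscRing {
    static_assert(Capacity > 0 && (Capacity & (Capacity - 1)) == 0,
                  "Capacity must be a power of two");
public:
    // Producer thread only. Returns false if the ring is full.
    bool try_push(const T& item) {
        const std::size_t tail = tail_.load(std::memory_order_relaxed);
        const std::size_t head = head_.load(std::memory_order_acquire);
        if (tail - head == Capacity) return false;
        slots_[tail & (Capacity - 1)] = item;
        tail_.store(tail + 1, std::memory_order_release);  // publish the slot
        return true;
    }

    // Consumer thread only. Returns std::nullopt if the ring is empty.
    std::optional<T> try_pop() {
        const std::size_t head = head_.load(std::memory_order_relaxed);
        const std::size_t tail = tail_.load(std::memory_order_acquire);
        if (head == tail) return std::nullopt;
        T item = slots_[head & (Capacity - 1)];
        head_.store(head + 1, std::memory_order_release);  // free the slot
        return item;
    }

private:
    std::array<T, Capacity> slots_{};
    // head_ is written only by the consumer, tail_ only by the producer.
    // The 64-byte alignment is an assumed line size that keeps them apart.
    alignas(64) std::atomic<std::size_t> head_{0};
    alignas(64) std::atomic<std::size_t> tail_{0};
};
```

Each index has exactly one writer, so neither side needs a compare-and-swap or a lock; each only reads the index owned by the other side.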

The broader implication is that reaching for a mutex is sometimes a sign that ownership hasn’t been thought through. A mutex says: multiple threads can write this, but only one at a time. The single-writer principle says: redesign so that exactly one thread writes this, and the mutex disappears. The latter is usually faster and always simpler to reason about, because it eliminates a class of potential deadlocks and contention scenarios rather than managing them.

Actor systems, including Erlang’s process model and Akka on the JVM, build on this idea at a higher level. Each actor owns its mutable state exclusively. External parties send messages rather than touching the state directly. The model trades some memory overhead for structural elimination of data races.

Natural Batching

Batching amortizes fixed costs over variable amounts of work. The fixed costs depend on context. A Linux system call costs on the order of a few hundred nanoseconds under normal conditions, more once Meltdown and Spectre mitigations such as kernel page-table isolation and IBRS are enabled. A network round trip can be microseconds to milliseconds depending on topology. A PostgreSQL fsync, which must wait for the storage device to confirm durability, takes 5 to 10 milliseconds on a spinning disk and still costs measurable time on NVMe. Running 1000 single-row inserts, each awaited individually, can be 100x slower than a single bulk insert of 1000 rows.
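The amortization can be written as a small cost model. The function total_cost_ns and all of its inputs are hypothetical; the model ignores queuing, overlap, and any per-batch size limits.

```cpp
#include <cstdint>

// Total time to process `items` when each batch pays `fixed_ns` once (for a
// syscall, round trip, or fsync) and each item adds `per_item_ns`.
// batch_size must be non-zero.
constexpr std::uint64_t total_cost_ns(std::uint64_t items,
                                      std::uint64_t batch_size,
                                      std::uint64_t fixed_ns,
                                      std::uint64_t per_item_ns) {
    const std::uint64_t batches = (items + batch_size - 1) / batch_size;  // ceiling
    return batches * fixed_ns + items * per_item_ns;
}
```

With an assumed 5 ms fixed cost and 10 µs of work per row, 1000 rows committed one at a time cost about 5.01 s in this model. The same rows in a single batch cost 15 ms, because the fixed cost is paid once instead of 1000 times.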

The Linux kernel has added batching APIs specifically because the demand is universal. sendmmsg() lets you send multiple UDP datagrams in one syscall. io_uring goes further: it provides submission and completion rings shared between userspace and the kernel, so an application can queue many I/O operations and submit them with a single io_uring_enter call. The design resembles the Disruptor pattern: each ring is a lock-free buffer with a single writer per index, the application advancing the submission tail and the kernel advancing the completion tail, which keeps overhead minimal.

The Sanders article distinguishes natural batching from forced batching, and the distinction matters. Forced batching collects work into a buffer and flushes on a timer or when the buffer fills, introducing latency in exchange for throughput. Natural batching processes all currently available work in a single pass: when you reach an I/O boundary, you send everything that’s waiting rather than sending items one by one as they arrive. The Disruptor’s consumer loop does this; it drains the ring to the current published sequence in one batch rather than processing entries one at a time and sleeping between each.
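A minimal sketch of natural batching is an inbox whose consumer takes everything that has accumulated so far in one call. For brevity it uses a mutex-protected vector where the Disruptor uses a lock-free ring, and the class name BatchingInbox is hypothetical.

```cpp
#include <mutex>
#include <utility>
#include <vector>

// Inbox that hands the consumer everything queued so far in one call. There
// is no timer and no minimum batch size: the batch is whatever accumulated
// while the consumer was busy with the previous one.
template <typename T>
class BatchingInbox {
public:
    // Any producer thread.
    void post(T item) {
        std::lock_guard<std::mutex> lock(mutex_);
        pending_.push_back(std::move(item));
    }

    // Consumer thread: swap out the whole pending list under the lock, so
    // the batch can be processed afterwards without holding the lock.
    std::vector<T> drain() {
        std::vector<T> batch;
        std::lock_guard<std::mutex> lock(mutex_);
        batch.swap(pending_);
        return batch;
    }

private:
    std::mutex mutex_;
    std::vector<T> pending_;
};
```

A consumer loop would call drain(), perform one write or flush for the whole batch, and repeat. Under light load the batches are small and latency stays low; under heavy load they grow on their own and the fixed cost is spread over more items.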

The Principles Reinforce Each Other

Each principle addresses a distinct layer of the hardware cost model. Predictable access works with the prefetcher. Cache line awareness controls the granularity of coherency traffic. Single-writer eliminates that traffic for mutable state. Natural batching amortizes the fixed cost of I/O and OS boundaries.

They also compose. A design that assigns ownership clearly, one thread per data structure, naturally enables batching: the owner can drain an entire inbox of pending operations before writing results, applying both principles simultaneously. Padding the sequence numbers in that design prevents the owner’s writes from false-sharing with readers’ observations of those numbers, applying the cache line principle on top.

None of this requires writing in C or working on a trading system. A Go developer choosing slices over linked lists, a Rust developer structuring game components in contiguous arrays, a Java developer padding shared counters, a Node.js developer batching database writes: all of them are applying mechanical sympathy. The hardware cost model is the same regardless of language. The question is whether the code acknowledges it.