Mechanical Sympathy From First Principles: One Hardware Constraint, Four Design Rules

The phrase “mechanical sympathy” comes from Formula 1 racing. Jackie Stewart, three-time world champion in the late 1960s and early 70s, used it to describe how the best drivers understood their cars deeply enough to work with them rather than against them. A driver does not need to be a mechanical engineer, but must feel what the car wants to do.

Martin Thompson borrowed the phrase around 2011, applied it to software at LMAX, and built the LMAX Disruptor to prove the idea worked in production. His Mechanical Sympathy blog became the canonical home for these ideas. Caer Sanders’ recent article on Martin Fowler’s site distills the practice into four everyday principles: predictable memory access, cache line awareness, single-writer, and natural batching.

These four are not independent observations. They all reduce to one hardware fact. Understanding that fact makes the principles obvious rather than arbitrary, and it tells you where they apply and where they do not.

The fact is this: CPUs do not load individual bytes from RAM. They load fixed-size blocks called cache lines. On x86-64 and most ARM processors a cache line is 64 bytes, and that is the unit of transfer between main memory and every level of the cache hierarchy. IBM POWER and Apple Silicon M-series use 128-byte lines, which matters and will come up later.

The cache hierarchy gives that fact its teeth:

L1 cache: ~1 nanosecond
L2 cache: ~4-12 nanoseconds
L3 cache: ~13-50 nanoseconds
DRAM: ~60-100 nanoseconds

The gap between L1 and DRAM is roughly two orders of magnitude: 60-100x by the figures above. A modern CPU executes hundreds of instructions in the time it waits on a single main memory fetch. The four mechanical sympathy principles are all strategies for staying in cache and minimizing how often you pay DRAM latency.

Predictable Memory Access

The hardware prefetcher is circuitry that monitors access patterns and speculatively loads cache lines before they are requested. When you iterate sequentially through an array, the prefetcher recognizes the pattern immediately and keeps the CPU fed with data. When you follow pointers through a linked list, each dereference arrives at an address the prefetcher cannot predict, because the address is only known after the previous load completes. The CPU stalls.

The performance gap is not subtle. Sequential array traversal versus random pointer-chasing produces 8-25x throughput differences for working sets that fit in L3. For large datasets that overflow L3, the difference reaches 100-200x. std::vector versus std::list in C++ and Vec<T> versus Box<Node<T>> chains in Rust are the most common expressions of this in everyday code.
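As a minimal sketch of the two access patterns, the functions below compute the same sum over contiguous and node-based storage. The function names are hypothetical and chosen for illustration. The source code is nearly identical, so any difference in speed comes from where the elements live in memory.

```cpp
#include <cstdint>
#include <list>
#include <numeric>
#include <vector>

// Contiguous storage: neighbouring elements share cache lines, and the
// address of element i+1 is known before element i has been loaded.
std::int64_t sum_contiguous(const std::vector<std::int64_t>& values) {
    return std::accumulate(values.begin(), values.end(), std::int64_t{0});
}

// Node-based storage: every node is a separate heap allocation, and the
// address of the next node is known only after the current node has loaded.
std::int64_t sum_linked(const std::list<std::int64_t>& values) {
    return std::accumulate(values.begin(), values.end(), std::int64_t{0});
}
```

Timing both functions over the same few million values, with the list nodes allocated in shuffled order, is a quick way to see the gap on your own hardware.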

The structural equivalent in data layout is the choice between array-of-structs and struct-of-arrays:

// Array-of-Structs: a position update loop loads mass and type
// even though it never touches them
struct Particle {
    float x, y, z;
    float vx, vy, vz;
    float mass;       // never read during position update
    int type;         // never read during position update
};
Particle particles[N];

// Struct-of-Arrays: mass and type never enter cache during position updates
struct Particles {
    float x[N], y[N], z[N];
    float vx[N], vy[N], vz[N];
    float mass[N];
    int   type[N];
};

The algorithm is identical in both cases; the data layout determines whether each 64-byte cache line delivers only useful bytes or spends a quarter of them (8 bytes of every 32-byte Particle) on fields the current loop never reads. Mike Acton’s CppCon 2014 talk on data-oriented design measured 10-50x differences in game engine inner loops from layout changes alone. Entity Component System architectures, such as those in the Bevy engine and the EnTT library, are built around this observation.
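The position update discussed above might look like the following sketch. It assumes a struct-of-arrays layout backed by std::vector so the particle count can be chosen at run time; ParticlesSoA and update_positions are illustrative names, not part of any library.

```cpp
#include <cstddef>
#include <vector>

// Struct-of-arrays layout with run-time size (illustrative type).
struct ParticlesSoA {
    std::vector<float> x, y, z;
    std::vector<float> vx, vy, vz;
    std::vector<float> mass;  // untouched by the position update
    std::vector<int> type;    // untouched by the position update

    explicit ParticlesSoA(std::size_t n)
        : x(n), y(n), z(n), vx(n), vy(n), vz(n), mass(n), type(n) {}
};

// Advance every position by velocity * dt. Only six of the eight arrays
// are read or written, so mass and type never occupy cache lines here.
void update_positions(ParticlesSoA& p, float dt) {
    const std::size_t n = p.x.size();
    for (std::size_t i = 0; i < n; ++i) {
        p.x[i] += p.vx[i] * dt;
        p.y[i] += p.vy[i] * dt;
        p.z[i] += p.vz[i] * dt;
    }
}
```

The loop streams through six contiguous arrays in step, which is also the shape hardware prefetchers and auto-vectorizers handle best.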

Cache Line Awareness

The MESI coherence protocol governs how multiple CPU cores share cache state. Each cache line exists in one of four states: Modified (one core has the only valid copy after a write), Exclusive (one core has the only copy, clean), Shared (multiple cores have clean copies), and Invalid (stale, must be fetched).

Reads in the Shared state generate no coherence traffic. Writes do: the writing core must broadcast an invalidation to every other core holding a copy, wait for acknowledgments, and then proceed. This is the expensive path, and it explains false sharing.

False sharing occurs when two threads write to different variables that happen to live on the same 64-byte cache line. The variables are logically independent, but the cache line is the unit of coherence. The MESI protocol does not know that core A wrote bytes 0-7 and core B wrote bytes 8-15; it only sees that two cores are writing to the same 64-byte region and forces serialization. The two threads end up with effective single-thread throughput even though they never touch the same data.

// Bad: requests and errors share a cache line
class Counters {
    volatile long requests;
    volatile long errors;
}

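// Fixed: each counter owns its cache line.
// @Contended asks the JVM to pad the field. It lives in sun.misc on JDK 8
// and in jdk.internal.vm.annotation on JDK 9+. Outside the JDK's own classes
// the JVM ignores it unless started with -XX:-RestrictContended.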
class ContendedCounter {
    @jdk.internal.vm.annotation.Contended
    volatile long value;
}

In C++17, std::hardware_destructive_interference_size provides a portable constant. In Rust, the crossbeam-utils crate offers CachePadded<T>, and #[repr(align(64))] handles it manually. One complication: Apple Silicon M-series processors use 128-byte cache lines. Code padded to a hard-coded 64 bytes silently reintroduces false sharing on those machines. The C++17 constant is designed to avoid this, but it is fixed at compile time by the toolchain and target, so verify its value on each platform you ship to.
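A C++ counterpart to the padded Java counter could be sketched as follows. It uses the standard constant when the library advertises it and otherwise falls back to 128 bytes, which is an assumption chosen here to cover both 64-byte and 128-byte lines. kLineSize, PaddedCounter, and Counters are illustrative names.

```cpp
#include <atomic>
#include <cstddef>
#include <new>

// Prefer the standard constant when the toolchain provides it; the
// 128-byte fallback is an assumption, not a measured value.
#ifdef __cpp_lib_hardware_interference_size
constexpr std::size_t kLineSize = std::hardware_destructive_interference_size;
#else
constexpr std::size_t kLineSize = 128;
#endif

// alignas rounds both the alignment and the size up to a full line,
// so two adjacent counters can never share one.
struct alignas(kLineSize) PaddedCounter {
    std::atomic<long> value{0};
};

struct Counters {
    PaddedCounter requests;  // written by one thread
    PaddedCounter errors;    // written by another thread
};

static_assert(sizeof(PaddedCounter) == kLineSize, "one counter per line");
static_assert(sizeof(Counters) == 2 * kLineSize, "no line is shared");
```

The static_asserts turn the layout assumption into a compile-time check, so a change to the struct that reintroduces sharing fails the build instead of quietly costing throughput.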

The performance impact of false sharing scales badly with core count. More cores means more holders of a cache line, more invalidations to broadcast, more acknowledgments to collect. What looks like a minor inefficiency on a 4-core laptop can serialize throughput on a 64-core server. Linux’s perf c2c tool identifies this at symbol granularity by counting cross-core cache-to-cache transfers, and it often surfaces problems that show up nowhere in CPU utilization metrics.

The Single-Writer Principle

Martin Thompson articulated the single-writer principle directly from the MESI analysis: if exactly one thread owns writes to a piece of data, write contention for that data disappears. The cache line never bounces between competing writers, so the owning thread needs no lock and no atomic read-modify-write. Readers still re-fetch the line after a write invalidates their Shared copies, but that traffic flows in one direction only, from the writer to the readers.

The LMAX Disruptor is the proof of concept. Each sequence counter has exactly one writer. In the single-producer configuration, the producer claims slots through a cursor it alone writes. Each consumer maintains its own sequence counter, exclusively owned. These counters are cache-padded against each other to prevent false sharing. The result: approximately 25 million messages per second on commodity hardware, compared to roughly 4-5 million for java.util.concurrent.ArrayBlockingQueue. The difference comes from eliminating coherence traffic, not from algorithmic changes.

Single-writer does not eliminate the need for memory ordering. Readers must observe writes correctly. Acquire/release semantics on atomic operations enforce visibility without mutual exclusion. In Java, volatile provides happens-before guarantees without the weight of locking. Rust’s ownership system makes the same constraint explicit at the type level: &mut T means exclusive mutable access, compiler-enforced, while &T allows shared reads. The language semantics and the hardware coherence model describe the same requirement in different vocabularies.
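In C++ terms, the acquire/release pairing described above can be sketched like this. SingleWriterSequence is a hypothetical class written for illustration; it is loosely modeled on the idea of a Disruptor sequence, not taken from the Disruptor's source.

```cpp
#include <atomic>
#include <cstdint>

// A value published by exactly one writer thread and read by any number
// of reader threads. No lock and no read-modify-write instruction is used.
class SingleWriterSequence {
public:
    // Call from the owning writer thread only. Because nobody else writes,
    // a plain load followed by a store can replace fetch_add.
    void advance() {
        const std::int64_t next = value_.load(std::memory_order_relaxed) + 1;
        value_.store(next, std::memory_order_release);
    }

    // Call from any thread. The acquire load pairs with the release store,
    // so writes the writer made before advance() are visible to a reader
    // that observes the new value.
    std::int64_t get() const {
        return value_.load(std::memory_order_acquire);
    }

private:
    std::atomic<std::int64_t> value_{-1};  // -1 means nothing published yet
};
```

The correctness of advance() depends entirely on the single-writer rule: with two writers, the load-then-store would lose updates, and a contended fetch_add would be needed instead.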

Higher-level manifestations of the principle appear in actor systems (Erlang, Akka), where each actor exclusively owns its mutable state, and in per-thread data structures that merge results periodically rather than updating shared counters continuously.
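The per-thread variant can be sketched as a sharded counter in which each worker owns one slot and any thread may merge the slots on demand. ShardedCounter and Slot are illustrative names, and the 128-byte alignment is an assumption chosen to cover both 64-byte and 128-byte cache lines.

```cpp
#include <array>
#include <atomic>
#include <cstddef>
#include <cstdint>

// One slot per worker, each on its own cache line (128 is an assumption).
struct alignas(128) Slot {
    std::atomic<std::uint64_t> count{0};
};

template <std::size_t Workers>
class ShardedCounter {
public:
    // Worker `id` must be the only thread that ever writes slots_[id].
    void increment(std::size_t id) {
        auto& c = slots_[id].count;
        c.store(c.load(std::memory_order_relaxed) + 1,
                std::memory_order_release);
    }

    // Any thread may merge. The result is a snapshot that can lag
    // behind increments still in flight.
    std::uint64_t total() const {
        std::uint64_t sum = 0;
        for (const auto& s : slots_) {
            sum += s.count.load(std::memory_order_acquire);
        }
        return sum;
    }

private:
    std::array<Slot, Workers> slots_{};
};
```

The trade is explicit: writers never contend, and readers pay for the merge and accept a slightly stale total.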

Natural Batching

Every expensive operation has a fixed overhead component that costs nearly the same regardless of payload size. A Linux syscall costs hundreds of nanoseconds baseline. A network round trip costs microseconds to milliseconds. A PostgreSQL fsync costs 5-10 milliseconds on spinning disk, which caps a commit-per-item design at 100-200 operations per second. Paying these costs per batch rather than per item is often the difference between handling a few hundred operations per second and handling many thousands.

The Disruptor’s consumer loop demonstrates what Sanders calls natural batching:

long availableSequence = sequenceBarrier.waitFor(nextSequence);
while (nextSequence <= availableSequence) {
    event = ringBuffer.get(nextSequence);
    eventHandler.onEvent(event, nextSequence, nextSequence == availableSequence);
    nextSequence++;
}

The consumer checks once how far ahead the producer is, then drains everything in a tight loop before checking again. Under high load, natural batch sizes grow automatically. Under low load, latency stays minimal because there is no timer-based delay waiting to fill a buffer. The batching emerges from the system’s actual state rather than being imposed by configuration.
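The same drain-everything-available shape can be sketched outside the Disruptor. The Mailbox class below is hypothetical and deliberately simplified: it uses a mutex rather than the Disruptor's lock-free sequences, so it illustrates only the batching behavior, not the coherence properties.

```cpp
#include <mutex>
#include <utility>
#include <vector>

// Producers push one item at a time; the consumer takes everything that
// has accumulated in a single operation.
template <typename T>
class Mailbox {
public:
    void push(T item) {
        std::lock_guard<std::mutex> lock(mutex_);
        pending_.push_back(std::move(item));
    }

    // Returns every pending item and leaves the mailbox empty. The batch
    // size is whatever built up while the consumer was busy: one item
    // under light load, many under heavy load. No timer is involved.
    std::vector<T> drain() {
        std::vector<T> batch;
        std::lock_guard<std::mutex> lock(mutex_);
        batch.swap(pending_);
        return batch;
    }

private:
    std::mutex mutex_;
    std::vector<T> pending_;
};
```

A consumer would call drain() in a loop, process the whole batch, and then pay any fixed cost (a flush, a commit, a syscall) once per batch rather than once per item.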

Linux’s io_uring applies the same idea to I/O: queue arbitrary operations into a shared submission ring, then flush them with a single io_uring_enter call. The fixed syscall cost is paid once per batch instead of once per operation, so the per-operation share shrinks as batches grow. The mechanism is a lock-free ring buffer with single-writer semantics, the Disruptor pattern expressed as kernel infrastructure.

SIMD vectorization is a batching effect at the instruction level. When a compiler can issue AVX2 or AVX-512 instructions, a simple arithmetic loop processes 8 or 16 floats per instruction rather than one. A non-inlined function call inside the loop usually prevents this, because the compiler cannot see the callee's side effects and must assume it may read or write any memory. With inlining enabling AVX2, per-element costs drop from roughly 1.0 ns to 0.08-0.12 ns.
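A loop shaped for auto-vectorization might look like the sketch below. The helper is defined in the same translation unit so the optimizer can inline it. The function names are illustrative, and whether the loop is actually vectorized depends on the compiler and flags, so treat that as something to verify rather than assume.

```cpp
#include <cstddef>
#include <vector>

// Visible to the optimizer at the call site, so it can be inlined and
// shown to have no side effects.
inline float scale_and_offset(float v, float scale, float offset) {
    return v * scale + offset;
}

// A branch-free loop over contiguous floats with a fixed trip count.
// Built with optimization enabled (for example -O3 -mavx2 on GCC or
// Clang), this is the kind of loop compilers can typically vectorize.
void transform(std::vector<float>& data, float scale, float offset) {
    float* p = data.data();
    const std::size_t n = data.size();
    for (std::size_t i = 0; i < n; ++i) {
        p[i] = scale_and_offset(p[i], scale, offset);
    }
}
```

GCC's -fopt-info-vec and Clang's -Rpass=loop-vectorize report which loops were vectorized, which is more reliable than inferring it from timings.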

What Happens When All Four Apply at Once

Go’s Green Tea garbage collector is a production example of all four principles addressing the same bottleneck simultaneously. The old GC maintained a worklist of heap object addresses. Marking required following each address to a scattered heap location: random access, unpredictable, cache-hostile. At production scale, roughly 35% of GC marking CPU time stalled waiting on main memory.

The redesign replaced the object-address worklist with a page-level FIFO queue. Marking scans contiguous per-page bitmaps sequentially, which the prefetcher handles without difficulty. Each bitmap has a single designated writer. The GC processes all marks in a page before moving to the next, batching the work naturally. On machines with AVX-512, a VGF2P8AFFINEQB instruction processes 64 bytes of bitmap data at once.

The results: typical workloads see 10% reduction in GC marking CPU time. Allocation-heavy workloads see up to 40% reduction. No algorithm changed; only the data layout and access pattern changed. It is available as GOEXPERIMENT=greenteagc in Go 1.25, and it ships as default behavior in Go 1.26. The gains compound because all four principles addressed the same constraint from different angles rather than different constraints.

Measurement Before Optimization

Sanders’ article frames these principles as guides for everyday work rather than as techniques to reach for after a performance crisis. That framing is correct. A data layout decision made early is much cheaper to get right than to fix later. But the principles also need measurement to apply correctly.

perf stat -e cache-references,cache-misses gives a quick miss ratio; more than 5-10% on compute-heavy workloads usually indicates a layout problem worth investigating. perf c2c identifies false sharing at symbol granularity. Cachegrind provides source-line granularity without hardware counters, which makes it usable in CI and virtual machines. Google Benchmark’s range sweeps reveal the inflection points where L1, L2, and L3 capacity limits change throughput.
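Where none of those tools is available, a working-set sweep can be approximated by hand. The sketch below uses std::chrono instead of Google Benchmark and times dependent loads through a random cycle; make_cycle and ns_per_hop are illustrative names, and the numbers it produces are rough indications rather than rigorous measurements.

```cpp
#include <chrono>
#include <cstddef>
#include <numeric>
#include <random>
#include <utility>
#include <vector>

// Build a random single-cycle permutation (Sattolo's algorithm), so that
// following next[i] visits every index once before returning to the start.
std::vector<std::size_t> make_cycle(std::size_t n, unsigned seed = 42) {
    std::vector<std::size_t> next(n);
    std::iota(next.begin(), next.end(), std::size_t{0});
    if (n < 2) return next;
    std::mt19937_64 rng(seed);
    for (std::size_t i = n - 1; i > 0; --i) {
        std::uniform_int_distribution<std::size_t> pick(0, i - 1);
        std::swap(next[i], next[pick(rng)]);
    }
    return next;
}

// Average nanoseconds per dependent load. Each load's address comes from
// the previous load, so the prefetcher cannot run ahead.
double ns_per_hop(const std::vector<std::size_t>& next, std::size_t hops) {
    if (next.empty() || hops == 0) return 0.0;
    std::size_t pos = 0;
    const auto start = std::chrono::steady_clock::now();
    for (std::size_t h = 0; h < hops; ++h) pos = next[pos];
    volatile std::size_t sink = pos;  // keeps the loop from being removed
    (void)sink;
    const auto stop = std::chrono::steady_clock::now();
    const std::chrono::duration<double, std::nano> elapsed = stop - start;
    return elapsed.count() / static_cast<double>(hops);
}
```

Calling ns_per_hop on cycles that range from a few kilobytes to a few hundred megabytes and plotting the results should show steps near each cache level's capacity, though the exact positions depend on the machine.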

Ulrich Drepper’s 2007 paper “What Every Programmer Should Know About Memory” remains the technical foundation beneath all of this. It is long and detailed, and most of what Sanders describes traces back to sections in that document. The principles are not new; what changes is how prominently hardware constraints factor into everyday design decisions as software keeps finding new ways to leave performance on the table.