Cache Lines, Coherence, and the Hardware Physics Behind Fast Software

Source: martinfowler

The term “mechanical sympathy” was borrowed from Formula 1. Jackie Stewart, three-time world champion, used it to describe how a great driver understood the car deeply enough to work with it rather than against it. Martin Thompson imported the phrase into software engineering around 2011, arguing that developers who understood their hardware would write fundamentally better code than those who treated it as an abstraction to be ignored.

Caer Sanders’ recent piece on martinfowler.com distills mechanical sympathy into four everyday principles: predictable memory access, cache line awareness, single-writer, and natural batching. The article is worth reading on its own terms. What it reasonably does not do is drill into why the hardware behaves this way in the first place. Understanding the underlying model makes the principles easier to apply consistently, and easier to recognize when a design is quietly violating them.

The Cache Hierarchy Is the Central Variable

Modern CPUs do not primarily bottleneck on computation. They bottleneck on memory. The processor-memory speed gap, sometimes called the memory wall, has been widening for decades. Cache hierarchies exist to bridge it, but they only help when access patterns cooperate.

A rough picture of latencies on a typical modern x86 processor:

Level       Latency                          Size
L1 cache    1-4 cycles (~0.5-1 ns)           32-64 KB
L2 cache    10-25 cycles (~5-12 ns)          256 KB-1 MB
L3 cache    40-100 cycles (~20-50 ns)        4-32 MB
DRAM        200-300+ cycles (~100 ns)        GBs

A cache miss to DRAM costs on the order of 100 to 200 times what an L1 hit costs. A core could have executed hundreds of instructions in the time it spends waiting on a single miss. All four principles in Sanders’ framework are, at root, strategies for staying in the upper levels of that hierarchy.
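To make the cost of a small miss rate concrete, here is a minimal sketch of the arithmetic. The function name and the two constants are illustrative assumptions; the constants are round numbers taken from the rough table above, not measurements.

```rust
/// Illustrative round numbers from the rough latency table (assumptions,
/// not measurements of any particular CPU).
pub const L1_NS: f64 = 1.0;
pub const DRAM_NS: f64 = 100.0;

/// Hypothetical helper: expected latency of one memory access in nanoseconds.
/// `levels` holds (fraction_of_accesses, latency_ns) pairs; the fractions
/// are expected to sum to 1.0.
pub fn expected_latency_ns(levels: &[(f64, f64)]) -> f64 {
    levels
        .iter()
        .map(|(fraction, latency)| fraction * latency)
        .sum()
}
```

Under these assumptions, a workload that hits L1 99% of the time and DRAM 1% of the time averages about 2 ns per access, while one that misses 5% of the time averages about 6 ns. A few percent of misses can dominate the average.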

Predictable Memory Access

The CPU includes a hardware prefetcher: circuitry that monitors your access patterns and begins fetching memory before you explicitly request it. When it can predict your next address, the data is already in cache by the time the load executes, hiding most or all of the latency. When it cannot, you stall.

Sequential access through a contiguous array is the easiest pattern for the prefetcher to handle. Constant-stride access is also predictable within limits. What defeats the prefetcher completely is pointer chasing: following a linked list, traversing a pointer-based tree, or accessing heap-allocated objects through random addresses. Each access depends on the value retrieved by the previous one, so there is no address to predict in advance.
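The difference between the two access patterns can be sketched in a few lines of Rust. Everything here is hypothetical illustration: `Node`, `list_in_order`, and the two sum functions are names invented for this example, and `order` is assumed to be a permutation of the node indices.

```rust
/// Sums values stored contiguously. Every address is known in advance,
/// so a hardware prefetcher can run ahead of the loop.
pub fn sum_sequential(values: &[u64]) -> u64 {
    values.iter().sum()
}

/// A linked list stored in a Vec: each node records the index of the next.
pub struct Node {
    pub value: u64,
    pub next: Option<usize>,
}

/// Sums by following `next` links. The location of each node is known only
/// after the previous node has been loaded, which is a dependent load chain.
pub fn sum_chased(nodes: &[Node], head: Option<usize>) -> u64 {
    let mut total = 0;
    let mut cursor = head;
    while let Some(index) = cursor {
        let node = &nodes[index];
        total += node.value;
        cursor = node.next;
    }
    total
}

/// Links the nodes in the traversal order given by `order`.
/// Assumption: `order` is a permutation of `0..values.len()`.
pub fn list_in_order(values: &[u64], order: &[usize]) -> (Vec<Node>, Option<usize>) {
    let mut nodes: Vec<Node> = values
        .iter()
        .map(|&value| Node { value, next: None })
        .collect();
    for pair in order.windows(2) {
        nodes[pair[0]].next = Some(pair[1]);
    }
    (nodes, order.first().copied())
}
```

Both functions compute the same sum. With a shuffled `order` and a data set much larger than the caches, the chased version would be expected to run noticeably slower, though the size of the gap depends on the machine and should be measured rather than assumed.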

This is the hardware reason behind the Array-of-Structs versus Struct-of-Arrays decision. Consider particle positions in a simulation:

// Array-of-Structs
struct Particle {
    float x, y, z;
    float vx, vy, vz;
    float mass;
    float lifetime;
};
Particle particles[N];

// Struct-of-Arrays
struct Particles {
    float x[N], y[N], z[N];
    float vx[N], vy[N], vz[N];
    float mass[N];
    float lifetime[N];
};

A position update loop adds velocity to position. In the AoS layout each Particle is 32 bytes, so a 64-byte cache line holds two particles, and a quarter of every line loaded (mass and lifetime) is never touched during that loop. In the SoA layout, the same loop streams through the packed x, y, z, vx, vy, vz arrays sequentially. You load only the data you need, the prefetcher can follow each stream, and the loop is easier for the compiler to vectorize.
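The position update over a Struct-of-Arrays layout might look like the following in Rust. This is a sketch under assumptions: it uses `Vec<f32>` per field instead of fixed-size arrays, and `Particles::new`, `integrate`, and the `dt` time step are names introduced here for illustration.

```rust
/// Struct-of-Arrays layout: one contiguous Vec per field.
pub struct Particles {
    pub x: Vec<f32>,
    pub y: Vec<f32>,
    pub z: Vec<f32>,
    pub vx: Vec<f32>,
    pub vy: Vec<f32>,
    pub vz: Vec<f32>,
    pub mass: Vec<f32>,
    pub lifetime: Vec<f32>,
}

impl Particles {
    /// Creates `n` particles with every field set to zero.
    pub fn new(n: usize) -> Self {
        Particles {
            x: vec![0.0; n],
            y: vec![0.0; n],
            z: vec![0.0; n],
            vx: vec![0.0; n],
            vy: vec![0.0; n],
            vz: vec![0.0; n],
            mass: vec![0.0; n],
            lifetime: vec![0.0; n],
        }
    }

    /// Adds velocity * dt to position. Only the six position and velocity
    /// arrays are read or written; `mass` and `lifetime` are never loaded.
    pub fn integrate(&mut self, dt: f32) {
        for (x, vx) in self.x.iter_mut().zip(&self.vx) {
            *x += vx * dt;
        }
        for (y, vy) in self.y.iter_mut().zip(&self.vy) {
            *y += vy * dt;
        }
        for (z, vz) in self.z.iter_mut().zip(&self.vz) {
            *z += vz * dt;
        }
    }
}
```

Each loop walks two arrays front to back, which is the access pattern that prefetchers and auto-vectorizers handle best.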

Cache Line Awareness

The CPU transfers data between cache and main memory in fixed-size chunks called cache lines. On modern x86 processors and most ARM processors a cache line is 64 bytes; some designs, including Apple silicon, use 128 bytes. Two variables that fall within the same line share it, and this has a consequential implication for concurrent code.

If two threads write to different variables that occupy the same cache line, they will thrash each other’s caches even though they share no logical state. This is false sharing, and it is one of the most common sources of mysterious parallel performance regression.

// False sharing: counter1 and counter2 on the same cache line
class Counters {
    volatile long counter1;  // bytes 0-7
    volatile long counter2;  // bytes 8-15
}

Thread A increments counter1. Thread B increments counter2. The MESI-family protocols that govern cache coherence on x86 allow only one core to hold a cache line in Modified state at a time. Every write by Thread B invalidates the line in Thread A’s cache, and Thread A must fetch it again before its next access. Two logically independent counters end up serializing through the coherence protocol.

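Padding resolves it, at least in principle:

class PaddedCounter {
    volatile long value;
    // 7 x 8 bytes of trailing padding so that value plus padding spans 64 bytes.
    // The JVM may reorder fields, so manual padding is a hint, not a guarantee.
    long p1, p2, p3, p4, p5, p6, p7;
}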

Java 8 introduced @Contended to express this intent, originally as sun.misc.Contended; since JDK 9 it lives under jdk.internal.vm.annotation, and using it outside JDK classes requires the -XX:-RestrictContended flag. In C/C++ and Rust, alignas(64) and #[repr(align(64))] provide equivalent layout control. The LMAX Disruptor, the canonical reference implementation for mechanical sympathy applied to a ring buffer, applies this padding throughout its data structures and documents the performance difference explicitly in its design materials.
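As a sketch of the Rust form, the two counters can be placed on separate lines with `#[repr(align(64))]`. The type and function names are hypothetical, and the 64-byte figure is an assumption about the target CPU.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::thread;

/// One counter per cache line. The 64-byte alignment is an assumption about
/// the target; the compiler also rounds the struct size up to 64 bytes.
#[repr(align(64))]
pub struct PaddedCounter {
    pub value: AtomicU64,
}

/// Two counters that cannot share a 64-byte line.
pub struct Counters {
    pub a: PaddedCounter,
    pub b: PaddedCounter,
}

/// Two threads each increment their own counter `iterations` times.
pub fn count_in_parallel(iterations: u64) -> (u64, u64) {
    let counters = Counters {
        a: PaddedCounter { value: AtomicU64::new(0) },
        b: PaddedCounter { value: AtomicU64::new(0) },
    };
    thread::scope(|scope| {
        scope.spawn(|| {
            for _ in 0..iterations {
                counters.a.value.fetch_add(1, Ordering::Relaxed);
            }
        });
        scope.spawn(|| {
            for _ in 0..iterations {
                counters.b.value.fetch_add(1, Ordering::Relaxed);
            }
        });
    });
    (
        counters.a.value.load(Ordering::Relaxed),
        counters.b.value.load(Ordering::Relaxed),
    )
}
```

Removing the `#[repr(align(64))]` attribute leaves the results unchanged but lets both counters land on one line. Timing the two variants on real hardware is the reliable way to see whether false sharing matters for a given workload.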

Single-Writer

The single-writer principle is the discipline of ensuring that only one thread ever writes to a given piece of data. Multiple concurrent readers are fine. Having two writers coordinate on shared memory forces the cache coherence protocol to arbitrate, and that arbitration has a measurable cost.

When a core wants to write to a cache line, it must acquire exclusive ownership. On a multicore system this means invalidating every other core’s copy, whether by broadcast or through a directory, and waiting for acknowledgement before the write can proceed. The round-trip time for this coherence traffic scales with core count and interconnect topology, and while it is not catastrophic on small systems, it compounds under contention.

The Disruptor avoids this by assigning clear ownership: the producer holds sole write responsibility for the claim cursor, each consumer holds sole write responsibility for its own sequence counter, and these values are padded so they occupy isolated cache lines. The consequence is that the only coherence traffic generated is from consumers reading the producer’s updates, which is broadcast-friendly read traffic rather than write invalidation.
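The ownership split can be illustrated with a much simpler structure than the Disruptor itself. The following is a hypothetical single-producer, single-consumer ring in safe Rust, not the Disruptor's implementation. `Ring`, `Padded`, `try_publish`, and `try_consume` are names invented here, and the one-producer, one-consumer contract is a convention the caller must honor, since the types do not enforce it.

```rust
use std::sync::atomic::{AtomicU64, AtomicUsize, Ordering};

/// Keeps a value on its own cache line (64 bytes is an assumption).
#[repr(align(64))]
struct Padded<T>(T);

/// Minimal single-producer, single-consumer ring of u64 values.
pub struct Ring {
    slots: Vec<AtomicU64>,
    /// Written only by the producer: number of items published so far.
    published: Padded<AtomicUsize>,
    /// Written only by the consumer: number of items consumed so far.
    consumed: Padded<AtomicUsize>,
}

impl Ring {
    pub fn with_capacity(capacity: usize) -> Self {
        assert!(capacity > 0, "capacity must be non-zero");
        Ring {
            slots: (0..capacity).map(|_| AtomicU64::new(0)).collect(),
            published: Padded(AtomicUsize::new(0)),
            consumed: Padded(AtomicUsize::new(0)),
        }
    }

    /// Producer side only. Returns false when the ring is full.
    pub fn try_publish(&self, value: u64) -> bool {
        let published = self.published.0.load(Ordering::Relaxed); // our own counter
        let consumed = self.consumed.0.load(Ordering::Acquire); // read-only here
        if published - consumed == self.slots.len() {
            return false;
        }
        self.slots[published % self.slots.len()].store(value, Ordering::Relaxed);
        // Release makes the slot write visible before the new count is.
        self.published.0.store(published + 1, Ordering::Release);
        true
    }

    /// Consumer side only. Returns None when the ring is empty.
    pub fn try_consume(&self) -> Option<u64> {
        let consumed = self.consumed.0.load(Ordering::Relaxed); // our own counter
        let published = self.published.0.load(Ordering::Acquire); // read-only here
        if consumed == published {
            return None;
        }
        let value = self.slots[consumed % self.slots.len()].load(Ordering::Relaxed);
        // Release tells the producer the slot may be reused.
        self.consumed.0.store(consumed + 1, Ordering::Release);
        Some(value)
    }
}
```

Each counter has exactly one writer, so neither side ever needs a compare-and-swap. The sketch ignores counter overflow, which a production implementation would have to handle.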

This principle has an implication that cuts against intuition: per-thread local state that gets occasionally merged can outperform continuously updated shared state, even when the shared approach seems simpler. Many concurrent data structures apply this pattern internally, through mechanisms like striped counters, per-thread write buffers, and the per-thread bookkeeping of epoch-based reclamation, regardless of whether the principle is named explicitly.
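A minimal sketch of the local-then-merge pattern, using a hypothetical `count_even` function: each worker counts into its own local value, and the partial results are combined once at the end, so no memory location has more than one writer while the work is running.

```rust
use std::thread;

/// Counts even numbers in `data` using up to `workers` threads.
/// Each thread accumulates privately; results are merged after joining.
pub fn count_even(data: &[u32], workers: usize) -> usize {
    let workers = workers.max(1);
    // Ceiling division; `chunks` panics on zero, so keep the length at least 1.
    let chunk_len = ((data.len() + workers - 1) / workers).max(1);
    thread::scope(|scope| {
        // Spawn every worker first so they all run concurrently.
        let handles: Vec<_> = data
            .chunks(chunk_len)
            .map(|chunk| {
                scope.spawn(move || chunk.iter().filter(|&&v| v % 2 == 0).count())
            })
            .collect();
        // The merge step: one addition per worker.
        handles
            .into_iter()
            .map(|handle| handle.join().unwrap())
            .sum::<usize>()
    })
}
```

The alternative, a single shared atomic counter incremented on every match, gives the same answer but makes every increment a potential coherence transaction.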

Natural Batching

Processing one item at a time is expensive relative to processing many. Cache line overhead is amortized across every item on that line. Loop setup cost, branch predictor state, and instruction pipeline overhead are all amortized over the batch size. Batching is how software voluntarily increases the granularity of its interactions with hardware.

The pattern appears across system design at every layer. TCP’s Nagle algorithm batches small socket writes into fewer, larger packets. Databases accumulate writes in a log buffer before flushing to storage. GPU kernels are designed around batching by necessity: a single kernel launch covers thousands of parallel operations. B-trees pack many keys into disk-aligned nodes to amortize I/O cost. Message consumers acknowledge in bulk rather than per-message.

In application code, the batching opportunity is often visible at the API shape before you profile anything. A function that processes one record is harder to make cache-efficient than one accepting a slice. A database call per row costs orders of magnitude more than a bulk insert. Choosing a slice-oriented API over an item-oriented one is a design-time decision with runtime consequences.
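One way to let batch size follow load is for the consumer to block for a single item and then take whatever else has already queued up. The sketch below uses the standard library's `mpsc` channel; `next_batch` and `max_batch` are hypothetical names, and the result is meant to feed a slice-oriented sink such as a bulk insert.

```rust
use std::sync::mpsc::Receiver;

/// Blocks until one item is available, then drains whatever else is already
/// queued, up to `max_batch` items in total. Returns None once every sender
/// has been dropped and the queue is empty.
pub fn next_batch<T>(receiver: &Receiver<T>, max_batch: usize) -> Option<Vec<T>> {
    let first = receiver.recv().ok()?;
    let mut batch = vec![first];
    while batch.len() < max_batch {
        match receiver.try_recv() {
            Ok(item) => batch.push(item),
            Err(_) => break, // nothing more queued right now
        }
    }
    Some(batch)
}
```

Under light load the batches contain one item and latency stays low. Under heavy load they grow toward `max_batch` and the per-call overhead is amortized, without any timer or tuning knob beyond the cap.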

What Rust Surfaces That Other Languages Hide

Rust does not automate these principles, but it makes them more auditable. The ownership model lets the compiler assume that a mutable reference is not aliased, something a C compiler cannot assume without restrict, and that enables optimization passes that would otherwise be forbidden. The #[repr(C)] and #[repr(align(N))] attributes give explicit, verifiable control over layout. The borrow checker encodes the single-writer principle in a form the compiler enforces: &mut T is an exclusive reference that permits writes; &T is a shared read reference.

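The atomic API in Rust takes a memory ordering as an explicit argument at every operation: Relaxed, Acquire, Release, AcqRel, SeqCst. There is no implicit default, so you have to reason about visibility semantics per operation rather than accepting blanket sequential consistency everywhere. SeqCst is the conservative choice, but it is not free: on x86 a SeqCst store typically compiles to a locked exchange or a fenced store, whereas Release stores and Acquire loads are ordinary moves, and on ARM the cost depends on which barrier or ordered instructions the compiler selects. For a producer-consumer handoff, a Release store paired with an Acquire load is sufficient and usually cheaper.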
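A minimal sketch of that handoff, with a hypothetical `handoff` function: the producer writes a payload and then sets a flag with Release; the consumer waits on the flag with Acquire and only then reads the payload.

```rust
use std::sync::atomic::{AtomicBool, AtomicU64, Ordering};
use std::thread;

/// Passes `value` from a producer thread to a consumer thread and returns
/// what the consumer observed.
pub fn handoff(value: u64) -> u64 {
    let payload = AtomicU64::new(0);
    let ready = AtomicBool::new(false);
    thread::scope(|scope| {
        scope.spawn(|| {
            payload.store(value, Ordering::Relaxed);
            // Release: everything written above is visible to any thread
            // that later observes `ready == true` with an Acquire load.
            ready.store(true, Ordering::Release);
        });
        let consumer = scope.spawn(|| {
            // Acquire pairs with the Release store above.
            while !ready.load(Ordering::Acquire) {
                thread::yield_now();
            }
            payload.load(Ordering::Relaxed)
        });
        consumer.join().unwrap()
    })
}
```

The payload accesses can stay Relaxed because the Release/Acquire pair on `ready` already orders them. If both flag operations were downgraded to Relaxed, the consumer would no longer be guaranteed to see the payload.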

Libraries like crossbeam apply these principles explicitly. crossbeam-utils provides a CachePadded wrapper, crossbeam-channel uses cache-line-padded queue indices in its implementation, and the design choices are traceable in the source rather than hidden behind a runtime.

When This Matters

None of this is relevant if your bottleneck is a network round trip or a database query. A web request handler that spends most of its time waiting on I/O will not meaningfully benefit from false-sharing elimination in the routing layer. Mechanical sympathy pays off in tight loops, in systems where tail latency is a product requirement, in infrastructure code that others will depend on at scale, and in compute-bound workloads where you are genuinely measuring memory throughput.

For everyone else, it is useful background that occasionally surfaces as the explanation for an unexpected performance cliff. The four principles Sanders names in the martinfowler.com article are a compact vocabulary for the question worth asking when something is surprisingly slow: is this design working with the hardware, or asking the hardware to compensate for the design?
