Cache Lines, Single Writers, and the Hardware Physics Behind Fast Software

Source: martinfowler

The phrase “mechanical sympathy” comes from Formula 1 racing. Jackie Stewart, three-time world champion, used it to describe what separated great drivers from merely fast ones: the best could feel what the car was doing beneath them, working with its mechanical behavior rather than fighting it. Martin Thompson borrowed the term around 2011 to describe the same quality in software engineers. The best systems programmers, he argued, understand what the hardware is doing underneath their code. His blog became foundational reading in the Java high-performance community.

Caer Sanders has now distilled that body of thinking into four everyday principles on Martin Fowler’s site: predictable memory access, cache line awareness, single-writer, and natural batching. The framing is practical rather than theoretical, aimed at working engineers who want to build these habits into their daily decision-making rather than waiting for a performance crisis.

The value of naming these principles clearly is that each one maps directly to a specific hardware mechanism. Understanding why each rule exists makes it easier to apply correctly, and easier to recognize when the rule is being violated.

The Memory Hierarchy Is Not Uniform

Everything in mechanical sympathy flows from one foundational fact: memory access is not uniform. The gap between an L1 cache hit (~0.5 ns) and a DRAM access (~100 ns) is roughly 200x. Between L1 and a network round-trip, it is six orders of magnitude. Software that treats all memory access as equivalent is leaving most of the hardware’s speed on the table.

Modern CPUs compensate with hardware prefetchers. These circuits watch for sequential or strided memory access patterns and begin loading cache lines before the program requests them. When code walks a contiguous array, the prefetcher keeps the L1 cache stocked and the processor rarely stalls. When code follows pointer chains, every dereference is a potentially new random address, and the prefetcher gives up. The processor stalls at DRAM latency for each node.

This is why the choice between a linked list and an array matters beyond algorithmic complexity. Two data structures with identical O(n) traversal costs can differ by 10-100x in wall-clock time depending on whether the prefetcher can help. The canonical example is array-of-structs versus struct-of-arrays layout:

// Array of structs: each 16-byte Particle keeps x next to y, z, and mass,
// so a loop that reads only 'x' uses a quarter of every cache line it loads
struct Particle { float x, y, z, mass; };
Particle particles[N];
float total_x = 0.0f;
for (int i = 0; i < N; i++) total_x += particles[i].x;

// Struct of arrays: all x values are contiguous, so the loop is a pure sequential stream
struct Particles { float x[N], y[N], z[N], mass[N]; };
Particles soa;
float total_x_soa = 0.0f;
for (int i = 0; i < N; i++) total_x_soa += soa.x[i];

The struct-of-arrays version can be 2-4x faster on hot loops because each 64-byte cache line carries sixteen useful floats instead of four. Game engines and scientific computing code have used this layout for decades; it is only recently that it has become standard vocabulary in general application development.
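The same layout choice can be sketched in Java, where the contrast is sharper because an array of objects is an array of references to separately allocated objects. The class and method names below are illustrative assumptions, not taken from any library, and the sketch demonstrates the layout rather than measuring it.

```java
// Sketch only: all names here are hypothetical.
final class ParticleLayoutDemo {

    // Array-of-objects layout: each element is a reference to its own heap object,
    // so reading x for every particle may touch a different cache line each time.
    static final class Particle {
        final float x, y, z, mass;
        Particle(float x, float y, float z, float mass) {
            this.x = x; this.y = y; this.z = z; this.mass = mass;
        }
    }

    // Struct-of-arrays layout: every x value lives in one contiguous float[].
    static final class Particles {
        final float[] x, y, z, mass;
        Particles(int n) {
            x = new float[n]; y = new float[n]; z = new float[n]; mass = new float[n];
        }
    }

    static Particle[] buildAos(int n) {
        Particle[] out = new Particle[n];
        for (int i = 0; i < n; i++) out[i] = new Particle(i, 0f, 0f, 1f);
        return out;
    }

    static Particles buildSoa(int n) {
        Particles out = new Particles(n);
        for (int i = 0; i < n; i++) { out.x[i] = i; out.mass[i] = 1f; }
        return out;
    }

    // Follows one reference per particle before reading x.
    static double sumX(Particle[] particles) {
        double total = 0;
        for (Particle p : particles) total += p.x;
        return total;
    }

    // Walks a single contiguous array: the access pattern a prefetcher handles best.
    static double sumX(Particles particles) {
        double total = 0;
        for (float v : particles.x) total += v;
        return total;
    }

    public static void main(String[] args) {
        int n = 1_000;
        System.out.println(sumX(buildAos(n)) + " " + sumX(buildSoa(n)));
    }
}
```

Both methods compute the same sum; only the memory access pattern differs. Any speed difference should be measured with a benchmark harness such as JMH rather than assumed.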

Cache Lines and the False Sharing Trap

The atomic unit of transfer between cache levels is not a byte or a word. It is a cache line, 64 bytes on current x86 processors. When a CPU core modifies any byte within a cache line, the coherence protocol (a MESI variant on x86) marks that entire line as modified on that core and invalidated on all others. Any other core that subsequently needs any byte in that line must wait for the owning core to hand the line over.

False sharing occurs when two threads on different cores write to different variables that happen to occupy the same cache line. Neither thread is sharing data in any logical sense, but the hardware treats the line as the unit of ownership, so their writes are serialized through the coherence bus regardless.

// counter0 and counter1 are adjacent fields, so they will almost certainly
// share one 64-byte cache line
public class Counters {
    volatile long counter0 = 0;
    volatile long counter1 = 0;
}
// Thread A increments counter0, Thread B increments counter1
// Result: throughput can be several times worse than single-threaded (around 10x
// in some measurements), because every write invalidates the other core's copy
// and forces it to fetch the line again

The fix is ensuring each field occupies its own cache line. Java 8 introduced @sun.misc.Contended (moved to @jdk.internal.vm.annotation.Contended in Java 9), which instructs the JVM to pad the annotated field automatically, 128 bytes by default; application classes need -XX:-RestrictContended for the annotation to take effect. C++17 provides std::hardware_destructive_interference_size, typically 64. The LMAX Disruptor is meticulous about this: its producer sequence, each consumer’s sequence counter, and the ring buffer’s own fields are all padded to eliminate false sharing. This is one reason the Disruptor reports mean latencies around 52 ns compared to roughly 32,000 ns for java.util.concurrent.ArrayBlockingQueue under equivalent load, as documented in the original Disruptor paper.
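Where the annotation is not available, padding can be added by hand. The following is a minimal sketch of the class-hierarchy padding style associated with the Disruptor, assuming 64-byte cache lines and 8-byte longs. The class names are hypothetical, and field layout is ultimately the JVM's decision, so this is a best-effort technique rather than a guarantee.

```java
// Sketch only: names are hypothetical; assumes 64-byte cache lines.
final class PaddedCounterDemo {

    // Seven longs (56 bytes) placed before the hot field.
    static class LhsPadding { long p1, p2, p3, p4, p5, p6, p7; }

    // The only field that is actually read and written.
    static class Value extends LhsPadding { volatile long value; }

    // Seven more longs after the hot field. Splitting the fields across a class
    // hierarchy discourages the JVM from reordering padding away from the value.
    static class PaddedCounter extends Value {
        long p9, p10, p11, p12, p13, p14, p15;

        // Not atomic: correct only when exactly one thread calls it.
        void increment() { value = value + 1; }

        long get() { return value; }
    }

    // Two threads, each incrementing its own padded counter.
    static long[] runTwoWriters(int incrementsPerThread) {
        PaddedCounter a = new PaddedCounter();
        PaddedCounter b = new PaddedCounter();
        Thread ta = new Thread(() -> {
            for (int i = 0; i < incrementsPerThread; i++) a.increment();
        });
        Thread tb = new Thread(() -> {
            for (int i = 0; i < incrementsPerThread; i++) b.increment();
        });
        ta.start();
        tb.start();
        try {
            ta.join();
            tb.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return new long[] { a.get(), b.get() };
    }

    public static void main(String[] args) {
        long[] totals = runTwoWriters(1_000_000);
        System.out.println(totals[0] + " " + totals[1]);
    }
}
```

The padding costs over a hundred bytes per counter, so it is worth applying only to fields that are both hot and written by different threads.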

Single-Writer: Eliminating Coordination by Design

The single-writer principle states that any given piece of mutable data should be written by exactly one thread. Reads can come from anywhere, but ownership of mutation is singular. This eliminates write-write contention without requiring locks or compare-and-swap operations.

The cost of CAS under contention is often underestimated. On a single thread, AtomicLong.getAndIncrement() runs at a few nanoseconds per operation. With eight threads contending on the same counter, total throughput can drop below what a single thread achieves: every update needs exclusive ownership of the counter's cache line, so the line bounces from core to core, and any update implemented as a CAS loop must also retry each time it loses the race. The shared counter becomes a bottleneck whose per-thread throughput falls as the thread count rises.

Single-writer architecture removes this entirely. The Disruptor assigns each field exactly one owning thread. The producer claims sequence numbers via a single CAS (or no CAS at all in single-producer mode), then writes its slot without further coordination. Each consumer writes only its own sequence counter. Readers are free. The actor model formalizes the same idea at a higher level: each actor owns its state, all mutations flow through a single-threaded message loop, and no actor ever writes another’s state directly. Erlang and Akka both rest on this foundation.
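As a rough illustration of single-writer ownership outside the Disruptor, the sketch below gives each writer thread its own counter slot and lets any reader sum the slots. The class name, the stride of 8 longs (assuming 64-byte cache lines), and the demo method are assumptions made for this example.

```java
import java.util.concurrent.atomic.AtomicLongArray;

// Sketch only: names are hypothetical; assumes 64-byte cache lines.
final class SingleWriterCounters {
    // Slots are spaced 8 longs (64 bytes) apart so two writers never share a line.
    private static final int STRIDE = 8;

    private final AtomicLongArray slots;
    private final int writers;

    SingleWriterCounters(int writers) {
        this.writers = writers;
        this.slots = new AtomicLongArray(writers * STRIDE);
    }

    // Must be called only by the thread that owns writerId.
    // Because there is one writer per slot, a read followed by an ordered store
    // is enough: no CAS and no retry loop.
    void increment(int writerId) {
        int idx = writerId * STRIDE;
        slots.lazySet(idx, slots.get(idx) + 1);
    }

    // Any thread may read. The result is a snapshot that may lag in-flight writes.
    long sum() {
        long total = 0;
        for (int w = 0; w < writers; w++) total += slots.get(w * STRIDE);
        return total;
    }

    // Starts one thread per slot, waits for all of them, and returns the total.
    static long runDemo(int writers, int incrementsPerWriter) {
        SingleWriterCounters counters = new SingleWriterCounters(writers);
        Thread[] threads = new Thread[writers];
        for (int w = 0; w < writers; w++) {
            final int id = w;
            threads[w] = new Thread(() -> {
                for (int i = 0; i < incrementsPerWriter; i++) counters.increment(id);
            });
            threads[w].start();
        }
        for (Thread t : threads) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return counters.sum();
    }

    public static void main(String[] args) {
        System.out.println(runDemo(4, 100_000));
    }
}
```

The JDK's LongAdder applies a related striping idea with contention handling built in; the hand-rolled version here exists only to make the ownership rule visible.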

The practical implication for application code is to be deliberate about write ownership. A shared configuration object modified from multiple threads is a violation of single-writer semantics even when guarded by a ReadWriteLock. A better design pushes writes through a single owner and distributes the updated value via a reference swap, letting readers observe the new state without coordination.
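One way to sketch that reference-swap design is shown below. An immutable Config is replaced wholesale by a single owner thread and published through a volatile field. The class names, the fields of Config, and the choice of a single-thread executor as the owner are all assumptions made for illustration.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.UnaryOperator;

// Sketch only: names and fields are hypothetical.
final class SingleWriterConfig {

    // Immutable snapshot: readers can use it without locks once they hold a reference.
    static final class Config {
        final String endpoint;
        final int timeoutMs;
        Config(String endpoint, int timeoutMs) {
            this.endpoint = endpoint;
            this.timeoutMs = timeoutMs;
        }
    }

    // The single owner: every mutation runs on this one thread.
    private final ExecutorService owner = Executors.newSingleThreadExecutor(r -> {
        Thread t = new Thread(r, "config-owner");
        t.setDaemon(true);
        return t;
    });

    // The volatile write publishes the new snapshot; readers need only a volatile read.
    private volatile Config current;

    SingleWriterConfig(Config initial) {
        this.current = initial;
    }

    // Callable from any thread with no coordination.
    Config read() {
        return current;
    }

    // Hands the change to the owner thread and waits until it has been applied.
    void update(UnaryOperator<Config> change) {
        try {
            owner.submit(() -> { current = change.apply(current); }).get();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        }
    }

    void close() {
        owner.shutdown();
    }
}
```

A caller might write `cfg.update(c -> new SingleWriterConfig.Config(c.endpoint, 250))` from any thread. Because the change runs on the owner thread, two concurrent updates are applied one after the other instead of racing.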

Natural Batching: Amortizing the Fixed Costs

Every operation with a fixed setup cost benefits from batching. Syscalls carry roughly 1,000 ns of kernel-transition overhead regardless of payload size, though the exact figure varies with CPU and kernel configuration. A batch of 1,000 INSERT statements can travel in the same single network round-trip that one INSERT would otherwise consume. POSIX writev() and Linux sendmmsg() exist precisely to push this amortization down to the OS interface. These are all instances of the same principle: amortize the fixed cost across more work.
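The arithmetic behind amortization is simple enough to write down. In the sketch below, the 1,000 ns fixed cost is the syscall figure quoted above, while the 50 ns of per-item work is an invented number used only to make the shape of the curve visible.

```java
// Sketch only: the per-item cost used in main() is an assumed figure.
final class BatchCost {

    // Effective cost of one item when a single fixed cost is shared by the whole batch.
    static double perItemNanos(double fixedNanos, double workNanos, int batchSize) {
        if (batchSize <= 0) throw new IllegalArgumentException("batchSize must be positive");
        return fixedNanos / batchSize + workNanos;
    }

    public static void main(String[] args) {
        for (int batch : new int[] { 1, 10, 100, 1000 }) {
            System.out.println(batch + " items per call: "
                    + perItemNanos(1_000, 50, batch) + " ns per item");
        }
    }
}
```

Under these assumed numbers, most of the benefit arrives by a batch size of around 100, after which the per-item work dominates.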

The natural batching principle, as Thompson and the Disruptor team observed, does not always require explicit batching code. When a consumer processes items from a shared queue, it can read the latest published sequence once and drain all available items in a tight loop before reading the sequence again. Under any non-trivial load, multiple items will be available. The batch size emerges organically from the production rate.

// Disruptor consumer: drain all available items before re-checking sequence
long availableSequence = sequenceBarrier.waitFor(nextSequence);
while (nextSequence <= availableSequence) {
    T event = ringBuffer.get(nextSequence);
    handler.onEvent(event, nextSequence, nextSequence == availableSequence);
    nextSequence++;
}
// One volatile read of availableSequence amortized across the entire batch

This loop is a sequential scan over a pre-allocated array. The prefetcher runs at full speed. The single volatile read is amortized over every item in the batch. Kafka’s consumer design follows the same pattern: each poll() call retrieves a batch of records from the local buffer, and the fetch from the broker happens in the background. The application processes a batch per iteration of its event loop rather than one record per blocking call. PostgreSQL’s group commit applies the same logic to fsync(): multiple transactions’ log records are written to disk together, spreading the flush cost (5-10 ms on rotating disks) across dozens of concurrent writers.
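The same drain-what-is-available pattern can be approximated with a standard java.util.concurrent queue, without the Disruptor. The sketch below blocks for the first item and then takes whatever else is already waiting; the class name, method name, and maxBatch parameter are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Sketch only: names are hypothetical.
final class NaturalBatcher {

    // Returns at least one item (unless interrupted) and at most maxBatch items.
    static <T> List<T> nextBatch(BlockingQueue<T> queue, int maxBatch) {
        if (maxBatch < 1) throw new IllegalArgumentException("maxBatch must be at least 1");
        List<T> batch = new ArrayList<>();
        try {
            // Block only while the queue is empty.
            batch.add(queue.take());
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return batch; // empty batch signals interruption to the caller
        }
        // Non-blocking: grabs whatever accumulated while the consumer was busy.
        queue.drainTo(batch, maxBatch - 1);
        return batch;
    }

    public static void main(String[] args) {
        BlockingQueue<Integer> queue = new ArrayBlockingQueue<>(16);
        for (int i = 1; i <= 5; i++) queue.add(i);
        System.out.println(nextBatch(queue, 100));
    }
}
```

No batch size is configured beyond the upper bound. When the consumer keeps up, batches contain one item; when it falls behind, they grow on their own.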

The Principles Reinforce Each Other

What makes mechanical sympathy a coherent practice rather than a collection of disconnected tips is that these four principles are mutually reinforcing. Sequential memory access works best when data is laid out predictably, which is easier when one writer controls the layout. Natural batching over sequential data means the prefetcher assists with each batch. Cache line isolation prevents batching throughput from being undermined by coherence traffic on metadata fields.

The Disruptor demonstrates this integration concretely. Its ring buffer is a contiguous, pre-allocated array (predictable memory access). Fields owned by different threads are padded to their own cache lines (cache line awareness). Each mutable field has exactly one writing thread (single-writer). Consumers drain all available slots per iteration rather than one at a time (natural batching). The result is a concurrent queue that sustains tens of millions of messages per second with sub-microsecond latency on commodity hardware.

Ulrich Drepper’s What Every Programmer Should Know About Memory (2007) remains the most thorough technical treatment of the hardware side. The Disruptor technical paper from 2011 is the best demonstration of all four principles integrated into a real system.

The habit to build is asking four questions before writing any concurrent or performance-sensitive code: which thread owns each piece of mutable data, whether memory access will be sequential or random, whether operations can be drained in batches, and whether any fields will sit adjacent to fields owned by other threads. Those questions, applied consistently, account for most of the gap between code that uses the hardware and code that ignores it.
