
Writing Code That Works With Hardware, Not Against It

Source: martinfowler

The term mechanical sympathy was borrowed from motorsport. Jackie Stewart, the Formula 1 champion, argued that a driver doesn’t need to be a mechanical engineer, but should understand how the car works well enough to feel when something is wrong and avoid fighting the machine. Martin Thompson adapted the phrase for software in his Mechanical Sympathy blog and later in the design of the LMAX Disruptor, a high-throughput inter-thread messaging library built entirely around how modern CPUs actually behave.

Caer Sanders has written a companion piece on Martin Fowler’s site that distills mechanical sympathy into four everyday principles. Rather than restate them, I want to go one level deeper on each one — explaining the hardware mechanism behind the principle and showing where it surfaces in code that isn’t obviously systems-level.

Why the hardware gap exists

Modern CPUs are extraordinarily fast. A contemporary x86 core can execute several instructions per nanosecond. Main memory, by contrast, takes roughly 60-100 nanoseconds per access. That gap — roughly two orders of magnitude — is bridged by the cache hierarchy: L1 cache at around 4 cycles, L2 at 12, L3 at 40, and DRAM sitting far behind everything else.

The CPU does not wait passively for data. It runs a hardware prefetcher that watches your access patterns and loads data into cache before you ask for it — but only if those patterns are predictable. When they are not, you pay the full latency penalty, and the execution pipeline stalls.

Four principles emerge from this reality.

Predictable memory access

The prefetcher recognizes sequential and strided access patterns. If you iterate over a contiguous array, the prefetcher can load ahead and hide most of the latency. If you chase pointers through a linked list, each node's address is known only after the previous node has been loaded, so the prefetcher has nothing to run ahead on.

This is the reason array-based data structures almost always outperform pointer-linked ones in practice, even when the algorithmic complexity is identical. A linked-list traversal of O(n) steps may incur O(n) cache misses, while the same traversal over a contiguous array (such as an int[]) is a single sequential scan that the prefetcher can stay ahead of.

A concrete example in Java:

// Cache-friendly: sequential scan, prefetcher works well
int[] values = new int[1_000_000];
long sum = 0;
for (int v : values) sum += v;

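// Cache-hostile: pointer chasing, each node at an arbitrary address
// (Node is a simple class with an int value and a Node next field)
Node head = buildLinkedList(1_000_000);
long listSum = 0;  // distinct name: sum is already declared above
for (Node n = head; n != null; n = n.next) listSum += n.value;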

Benchmarks of this pattern typically show the array version running 3-10x faster for large sizes, though the exact ratio depends on heap layout and cache sizes. The difference is not algorithmic; the hardware can help with one traversal and not the other.
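
To try the comparison yourself, here is a minimal self-contained sketch. The TraversalSketch class and its helper names are my own, and the Node type is a hypothetical stand-in for the one used above. Treat the timings as rough: a single un-warmed pass is not a real benchmark (a harness such as JMH would be the proper tool), and nodes allocated back to back often end up adjacent on the heap, which understates the gap you would see in a long-running program.

```java
// Sketch: sum the same values from a contiguous array and from a linked list.
final class TraversalSketch {
    // Hypothetical node type, standing in for the Node used in the article.
    static final class Node {
        final int value;
        Node next;
        Node(int value) { this.value = value; }
    }

    static int[] buildArray(int n) {
        int[] values = new int[n];
        for (int i = 0; i < n; i++) values[i] = i;
        return values;
    }

    static Node buildLinkedList(int n) {
        Node head = null;
        // Build back to front so the list holds 0, 1, ..., n-1 in order.
        for (int i = n - 1; i >= 0; i--) {
            Node node = new Node(i);
            node.next = head;
            head = node;
        }
        return head;
    }

    // Sequential scan over contiguous memory.
    static long sumArray(int[] values) {
        long sum = 0;
        for (int v : values) sum += v;
        return sum;
    }

    // Pointer chase: each load depends on the previous one.
    static long sumList(Node head) {
        long sum = 0;
        for (Node n = head; n != null; n = n.next) sum += n.value;
        return sum;
    }

    public static void main(String[] args) {
        int n = 1_000_000;
        int[] values = buildArray(n);
        Node head = buildLinkedList(n);

        long t0 = System.nanoTime();
        long arraySum = sumArray(values);
        long t1 = System.nanoTime();
        long listSum = sumList(head);
        long t2 = System.nanoTime();

        System.out.printf("array: sum=%d in %d us%n", arraySum, (t1 - t0) / 1_000);
        System.out.printf("list:  sum=%d in %d us%n", listSum, (t2 - t1) / 1_000);
    }
}
```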

Cache line awareness

CPUs do not load individual bytes; they load cache lines, typically 64 bytes at a time on x86. This has two consequences worth keeping in mind.

The first is spatial locality. If you access array[i] in an array of 8-byte longs, the other seven longs on the same 64-byte line arrive with it essentially for free. Struct-of-arrays layouts exploit this: if you have a million objects but a hot loop only touches one field, storing that field's values together in a contiguous array means the loop loads only what it needs. Array-of-structs layouts interleave unneeded fields into every cache line, wasting bandwidth.
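
The two layouts can be sketched in Java as follows. The particle example and every name in it are hypothetical, chosen only for illustration. Note that in Java an array of objects is really an array of references, so the array-of-structs version also pays a pointer dereference per element on top of the wasted cache-line space.

```java
// Sketch: summing one field under two different memory layouts.
final class LayoutSketch {
    // Array-of-structs style: every particle is its own heap object,
    // so reading mass also drags x, y and z into the cache.
    static final class Particle {
        final double x, y, z, mass;
        Particle(double x, double y, double z, double mass) {
            this.x = x; this.y = y; this.z = z; this.mass = mass;
        }
    }

    // Struct-of-arrays style: one primitive array per field,
    // so a loop over mass touches only mass values.
    static final class Particles {
        final double[] x, y, z, mass;
        Particles(int n) {
            x = new double[n]; y = new double[n];
            z = new double[n]; mass = new double[n];
        }
    }

    static double totalMassAos(Particle[] particles) {
        double total = 0;
        for (Particle p : particles) total += p.mass;
        return total;
    }

    static double totalMassSoa(Particles particles) {
        double total = 0;
        for (double m : particles.mass) total += m;
        return total;
    }

    public static void main(String[] args) {
        int n = 4;
        Particle[] aos = new Particle[n];
        Particles soa = new Particles(n);
        for (int i = 0; i < n; i++) {
            aos[i] = new Particle(i, i, i, i + 1);
            soa.mass[i] = i + 1;
        }
        System.out.println("AoS total mass: " + totalMassAos(aos));
        System.out.println("SoA total mass: " + totalMassSoa(soa));
    }
}
```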

The second consequence is false sharing. When two threads write to different variables that happen to occupy the same cache line, the coherence protocol forces those threads to contend — each write invalidates the other’s cached copy, triggering cross-core traffic. The threads are not sharing data logically, but they are sharing hardware.

public class Counters {
    // These two longs share a cache line. Writers on separate cores
    // will thrash the coherence protocol.
    long counter1;
    long counter2;
}

public class PaddedCounters {
    long counter1;
    // 7 longs (56 bytes) of padding push counter2 onto a different 64-byte line.
    // This assumes the JVM keeps fields in declaration order, which it need not do.
    long p1, p2, p3, p4, p5, p6, p7;
    long counter2;
}

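Java 8 introduced @sun.misc.Contended (moved to the internal package jdk.internal.vm.annotation.Contended in Java 9) precisely to handle this. The JVM inserts padding around annotated fields automatically, although for application classes the annotation is ignored unless the JVM is started with -XX:-RestrictContended. The LMAX Disruptor gets the same effect by padding its sequence counters by hand.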
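
If you want to observe the effect, the following rough harness runs two threads against an unpadded and a manually padded pair of counters. All names are my own. The fields are volatile so that the JIT cannot keep them in registers, and the padding only works if the JVM lays the fields out in declaration order, which is an assumption rather than a guarantee. Expect noisy numbers; this is a sketch, not a benchmark.

```java
// Sketch: two threads, each incrementing its own counter.
final class FalseSharingSketch {
    static final class Unpadded {
        volatile long a;
        volatile long b; // very likely on the same cache line as a
    }

    static final class Padded {
        volatile long a;
        long p1, p2, p3, p4, p5, p6, p7; // padding; assumes declaration order is kept
        volatile long b;
    }

    // Runs both actions on separate threads and returns elapsed nanoseconds.
    static long runPair(Runnable first, Runnable second, long iterations) {
        Thread t1 = new Thread(() -> { for (long i = 0; i < iterations; i++) first.run(); });
        Thread t2 = new Thread(() -> { for (long i = 0; i < iterations; i++) second.run(); });
        long start = System.nanoTime();
        t1.start();
        t2.start();
        try {
            t1.join();
            t2.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return System.nanoTime() - start;
    }

    public static void main(String[] args) {
        long n = 50_000_000L;
        Unpadded u = new Unpadded();
        Padded p = new Padded();
        // Each field has exactly one writer, so the non-atomic ++ is safe here.
        long unpaddedNs = runPair(() -> u.a++, () -> u.b++, n);
        long paddedNs = runPair(() -> p.a++, () -> p.b++, n);
        System.out.printf("unpadded: %d ms%n", unpaddedNs / 1_000_000);
        System.out.printf("padded:   %d ms%n", paddedNs / 1_000_000);
    }
}
```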

Single-writer

This principle comes directly from Thompson’s work on the Disruptor. The MESI coherence protocol (Modified, Exclusive, Shared, Invalid) governs how caches across cores stay consistent. When a core wants to write to a cache line, it must first acquire exclusive ownership, which means sending invalidation messages to every other core holding a copy and waiting for acknowledgment.

If multiple threads write to the same cache line, they generate a continuous stream of invalidations. Even if they use atomic operations or locks correctly, they are paying coherence overhead on every write.

The single-writer principle sidesteps this contention: each piece of data has exactly one designated writer. Other threads may read it freely. Readers do have to re-fetch the line after each update, but a line held in the Shared state can be read repeatedly without further coherence traffic, and no two threads ever compete for ownership of the same location.
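
As a minimal illustration of the idea, here is a value with one designated writer and one reader. The class and method names are hypothetical, and this is not the Disruptor's API. Because nobody else writes, the writer needs no compare-and-swap; an ordered store via AtomicLong.lazySet is enough to publish each value.

```java
import java.util.concurrent.atomic.AtomicLong;

// Sketch: one thread writes, any number of threads read.
final class SingleWriterSketch {
    private final AtomicLong published = new AtomicLong(0);

    // Writer side: must only ever be called from one thread.
    // An ordered store suffices because there is no competing writer.
    void publish(long value) {
        published.lazySet(value);
    }

    // Reader side: safe from any thread.
    long read() {
        return published.get();
    }

    // Publishes 1..n from one thread while another thread watches.
    // Returns true if the reader never saw the value go backwards
    // and the final value is n.
    static boolean demo(long n) {
        SingleWriterSketch s = new SingleWriterSketch();
        boolean[] monotonic = { true };
        Thread writer = new Thread(() -> {
            for (long i = 1; i <= n; i++) s.publish(i);
        });
        Thread reader = new Thread(() -> {
            long last = 0;
            while (last < n) {
                long now = s.read();
                if (now < last) monotonic[0] = false;
                last = now;
            }
        });
        reader.start();
        writer.start();
        try {
            writer.join();
            reader.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return monotonic[0] && s.read() == n;
    }

    public static void main(String[] args) {
        System.out.println("reader saw a consistent sequence: " + demo(1_000_000L));
    }
}
```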

In the Disruptor's single-producer configuration, each sequence counter is written by exactly one producer or one consumer. The result is that throughput holds up as cores are added rather than collapsing under write contention. Gil Tene’s presentations on JVM concurrency and Martin Thompson’s talks from the GOTO conferences are good sources if you want the full breakdown.

This principle also appears at the architectural level in the Actor model and in Rust’s ownership system. Rust makes single-writer a compile-time guarantee: the &mut T type means exclusive mutable access, and the borrow checker ensures that no two mutable references to the same value are live at the same time. The hardware principle and the language semantics map onto each other directly.

Natural batching

Modern CPUs have SIMD units (SSE, AVX on x86; NEON on ARM) that can apply one instruction to multiple data elements simultaneously. AVX-512 operates on 512-bit registers, meaning sixteen 32-bit floats per instruction. The compiler can autovectorize tight loops, but only when the loop body is simple and the data is laid out contiguously.
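
For a sense of what "simple and contiguous" looks like, here is a loop of the kind a JIT or ahead-of-time compiler can often vectorize: unit stride over primitive arrays, no branches in the body, and no dependency between iterations. The names are my own, and whether vectorization actually happens depends on the compiler and CPU, so take it as an illustration of the shape rather than a guarantee.

```java
// Sketch: a loop shape that is friendly to auto-vectorization.
final class VectorizableLoopSketch {
    // out[i] depends only on a[i] and b[i], never on another iteration.
    // Assumes all three arrays have the same length.
    static void scaleAndAdd(float[] a, float[] b, float[] out, float scale) {
        for (int i = 0; i < out.length; i++) {
            out[i] = a[i] * scale + b[i];
        }
    }

    public static void main(String[] args) {
        float[] a = { 1f, 2f, 3f };
        float[] b = { 10f, 20f, 30f };
        float[] out = new float[3];
        scaleAndAdd(a, b, out, 2f);
        System.out.println(java.util.Arrays.toString(out));
    }
}
```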

Beyond SIMD, batching amortizes fixed per-operation costs. A syscall, a memory allocation, a lock acquisition — each has overhead that dominates when you pay it per item. Paying it once per batch of items shifts the cost into irrelevance.

// Write one byte at a time -- pays write syscall overhead per byte
for (int i = 0; i < n; i++) {
    write(fd, &data[i], 1);
}

// Write in one call -- same bytes, a fraction of the overhead
// (real code should check the return value and retry after a short write)
write(fd, data, n);
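
The same idea carries over to Java, where BufferedOutputStream does the batching for you. In the sketch below, CountingSink is a hypothetical stand-in for an expensive destination such as a file descriptor; it simply counts how many times it is called, which makes the amortization visible without touching the operating system.

```java
import java.io.BufferedOutputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;

// Sketch: count how often the underlying sink is hit with and without buffering.
final class BatchingSketch {
    // Hypothetical sink: each call stands in for one costly operation.
    static final class CountingSink extends OutputStream {
        final ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        int calls;

        @Override public void write(int b) {
            calls++;
            bytes.write(b);
        }

        @Override public void write(byte[] buf, int off, int len) {
            calls++;
            bytes.write(buf, off, len);
        }
    }

    // One sink call per byte.
    static CountingSink writePerByte(byte[] data) {
        CountingSink sink = new CountingSink();
        for (byte b : data) sink.write(b);
        return sink;
    }

    // Same per-byte loop, but the buffer forwards bytes in large chunks.
    static CountingSink writeBuffered(byte[] data) {
        CountingSink sink = new CountingSink();
        try (OutputStream out = new BufferedOutputStream(sink, 8192)) {
            for (byte b : data) out.write(b);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sink;
    }

    public static void main(String[] args) {
        byte[] data = new byte[100_000];
        System.out.println("per-byte sink calls: " + writePerByte(data).calls);
        System.out.println("buffered sink calls: " + writeBuffered(data).calls);
    }
}
```

Both versions deliver identical bytes; only the number of trips to the sink changes.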

The Disruptor batches at the consumer level too. A consumer doesn’t claim one slot at a time; it claims all available slots up to the producer’s current sequence. Under load, this means consumers process items in natural clusters, and the branch predictor sees a tight loop rather than sporadic individual dispatches.
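
The drain-everything-available pattern can be sketched with a small single-producer, single-consumer ring buffer. To be clear, this is my own simplified illustration with hypothetical names, not the Disruptor's implementation or API: it has no padding, no wait strategies, and supports only one thread on each side.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.LongConsumer;

// Sketch: a consumer that processes every published slot in one pass.
final class BatchDrainSketch {
    private final long[] slots;
    private final int mask;
    // Last published sequence; written only by the producer.
    private final AtomicLong producerSeq = new AtomicLong(-1);
    // Last consumed sequence; written only by the consumer.
    private final AtomicLong consumerSeq = new AtomicLong(-1);

    BatchDrainSketch(int capacity) {
        if (Integer.bitCount(capacity) != 1) {
            throw new IllegalArgumentException("capacity must be a power of two");
        }
        slots = new long[capacity];
        mask = capacity - 1;
    }

    // Producer side. Returns false when the buffer is full.
    boolean offer(long value) {
        long next = producerSeq.get() + 1;
        if (next - consumerSeq.get() > slots.length) return false;
        slots[(int) (next & mask)] = value;
        producerSeq.set(next); // publish after the slot is written
        return true;
    }

    // Consumer side. Handles everything published so far as one batch
    // and returns the batch size.
    int drain(LongConsumer handler) {
        long available = producerSeq.get();
        int count = 0;
        for (long seq = consumerSeq.get() + 1; seq <= available; seq++) {
            handler.accept(slots[(int) (seq & mask)]);
            count++;
        }
        if (count > 0) consumerSeq.set(available); // release the slots in one step
        return count;
    }

    public static void main(String[] args) {
        BatchDrainSketch queue = new BatchDrainSketch(8);
        for (long i = 1; i <= 5; i++) queue.offer(i);
        int batch = queue.drain(v -> System.out.println("handled " + v));
        System.out.println("batch size: " + batch);
    }
}
```

The point to notice is that the consumer reads the producer's sequence once and then runs a plain loop over the slots, so the cost of the cross-thread read is shared by the whole batch.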

Interfaces like io_uring on Linux generalize this into a submission queue model for file and network I/O: you enqueue multiple operations and submit them with a single syscall, so the fixed syscall cost is amortized across the whole batch.

These principles show up at every layer

It is tempting to think mechanical sympathy is only relevant when writing C++ or tuning a JVM runtime. In practice, the same principles propagate upward.

A Python data pipeline that repeatedly appends to a list inside a loop, then passes the list to NumPy, is violating natural batching: the work should flow into NumPy as a bulk operation from the start. A Node.js service that maintains per-request state scattered across many small objects is violating predictable access. A Go program whose goroutines each write to their own field of one shared struct is risking false sharing, because adjacent fields usually land on the same cache line.

The hardware does not know what language you are using. It executes instructions and fetches memory, and it rewards patterns that let it do that efficiently.

Understanding the mechanism behind each principle — why the prefetcher needs sequential access, why cache lines make padding necessary, why single-writer eliminates coherence traffic, why batching amortizes fixed costs — gives you enough grounding to recognize when your code is working with the hardware and when it is working against it. You do not need to be a chip designer to write software with mechanical sympathy, but you should understand the machine well enough to feel when something is wrong.
