Writing Software That the CPU Actually Wants to Run

There is a quote from Formula 1 driver Jackie Stewart that Martin Thompson borrowed when he named his now-famous blog: “You don’t have to be an engineer to be a racing driver, but you do have to have Mechanical Sympathy.” The idea is that understanding your machinery, even at a conceptual level, makes you better at using it. Thompson applied this to software. If you understand what the CPU and memory subsystem are actually doing, you can write code that works with them rather than against them.

This idea got concrete form in the LMAX Disruptor project around 2011, a high-performance inter-thread messaging library that squeezed extraordinary throughput from commodity hardware by being deliberate about memory layout, write ownership, and access patterns. The Mechanical Sympathy Principles article on Martin Fowler’s site by Caer Sanders distills this work into four everyday principles: predictable memory access, cache line awareness, single-writer, and natural batching.

These principles are not new, but they are routinely rediscovered, usually right after someone runs a profiler and finds that their “fast” code is spending most of its time waiting on memory. So it is worth understanding why each one works.

The Memory Hierarchy Is the Whole Story

Modern CPUs are extraordinarily fast at computation. What they are not fast at is fetching data from arbitrary locations in RAM. The gap between processor speed and memory speed has widened steadily for decades, and the CPU designers’ answer has been a hierarchy of caches.

On a typical x86-64 server in 2026, you are looking at roughly:

  • L1 cache: ~1-2 ns latency, 32-64 KB per core
  • L2 cache: ~4-7 ns latency, 256 KB to 1 MB per core
  • L3 cache: ~15-40 ns latency, 4-64 MB shared across cores
  • Main memory (DRAM): ~60-80 ns latency, gigabytes to terabytes of capacity

A cache miss to DRAM costs you roughly 200-300 CPU cycles on a modern processor running at 3-4 GHz. In a tight loop that can dominate everything else. All four mechanical sympathy principles are ultimately strategies for keeping the data you need in L1 or L2 cache and for stopping it from bouncing between cores.
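The cycle figure is just latency multiplied by clock rate. A minimal sketch of that arithmetic, where `miss_cycles` and `print_dram_stall_range` are hypothetical helper names and the inputs are the rough figures listed above, not measurements:

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: converts a latency in nanoseconds into CPU cycles.
   A clock running at f GHz completes f cycles per nanosecond. */
double miss_cycles(double latency_ns, double clock_ghz) {
    assert(latency_ns >= 0.0 && clock_ghz > 0.0);
    return latency_ns * clock_ghz;
}

/* Prints the stall range implied by the rough DRAM figures above. */
void print_dram_stall_range(void) {
    printf("60 ns at 3 GHz: %.0f cycles\n", miss_cycles(60.0, 3.0));
    printf("80 ns at 4 GHz: %.0f cycles\n", miss_cycles(80.0, 4.0));
}
```

With those inputs the range works out to 180 to 320 cycles, which is where the rough 200-300 figure comes from.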

Predictable Memory Access: Stride Matters

CPUs have hardware prefetchers. These circuits observe your memory access patterns and speculatively fetch data into cache before you need it. The prefetcher is very good at detecting regular strides: if you access address N, then N+64, then N+128, it will fetch N+192 before you ask.

What it cannot do is predict random access. A linked list is the canonical example. Each node stores a pointer to the next node, and following that pointer means jumping to an arbitrary location in memory. Every node traversal is a potential cache miss.

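#include <stddef.h>

// Linked list traversal: pointer chasing, cache-hostile
struct Node { int value; struct Node* next; };
long sum_list(const struct Node* head) {
    long sum = 0;
    // Each step must wait for the previous node's `next` pointer to load
    for (const struct Node* n = head; n; n = n->next) { sum += n->value; }
    return sum;
}

// Array traversal: sequential, cache-friendly
long sum_array(const int* arr, size_t n) {
    long sum = 0;
    // Addresses advance by a fixed stride, so the prefetcher can run ahead
    for (size_t i = 0; i < n; i++) { sum += arr[i]; }
    return sum;
}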

The array version can be 5-10x faster on datasets that do not fit in cache, purely because of cache behavior: both loops are O(n), so there is no algorithmic difference. This also informs the Array-of-Structs vs Struct-of-Arrays debate. If you have a loop that only processes one field of a struct, a Struct-of-Arrays layout means that field is packed sequentially in memory, and you load only what you use. An Array-of-Structs layout drags every field of each struct through the cache even if you touch only one, so most of each cache line you fetch is wasted bandwidth.
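To make the layout difference concrete, here is a minimal sketch of the same reduction over both layouts. The particle type, its field names, and the `MAX_PARTICLES` capacity are assumptions made up for illustration:

```c
#include <assert.h>
#include <stddef.h>

#define MAX_PARTICLES 1024  /* hypothetical fixed capacity for the sketch */

/* Array-of-Structs: the four fields of one particle sit together, so a
   loop over `mass` alone still pulls x, y and z through the cache. */
struct ParticleAoS { double x, y, z, mass; };

double total_mass_aos(const struct ParticleAoS* p, size_t n) {
    double total = 0.0;
    for (size_t i = 0; i < n; i++) total += p[i].mass;  /* 32-byte stride */
    return total;
}

/* Struct-of-Arrays: each field is its own contiguous array, so the same
   loop reads densely packed doubles. */
struct ParticlesSoA {
    double x[MAX_PARTICLES];
    double y[MAX_PARTICLES];
    double z[MAX_PARTICLES];
    double mass[MAX_PARTICLES];
};

double total_mass_soa(const struct ParticlesSoA* p, size_t n) {
    assert(n <= MAX_PARTICLES);
    double total = 0.0;
    for (size_t i = 0; i < n; i++) total += p->mass[i];  /* 8-byte stride */
    return total;
}
```

Assuming 64-byte cache lines, each line fetched by the first loop holds two useful values, while each line fetched by the second holds eight. Both loops use a regular stride, so the prefetcher helps either way; the difference is how much of the fetched data is actually used.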

Cache Lines and False Sharing

Caches do not move individual bytes; they move cache lines, which are 64 bytes wide on x86-64 and most ARM processors (Apple's M-series chips use 128-byte lines). When the CPU reads a single 4-byte int, it loads the other 60 bytes of that line too.

This is usually a benefit: loading nearby data for free. But it creates a problem called false sharing in multithreaded code. If two threads are writing to different variables that happen to live on the same cache line, the cache coherence protocol (MESI and its variants, such as MESIF on Intel and MOESI on AMD) forces the cores to bounce that line back and forth, even though they are logically writing to independent data.

// False sharing: counters[0] and counters[1] are 8 bytes apart,
// so they almost certainly share a cache line
long[] counters = new long[2];
// Thread 0 writes counters[0], Thread 1 writes counters[1]
// Despite no logical dependency, the writes serialize on the cache line

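The fix is padding. Java 8 introduced the @Contended annotation (sun.misc.Contended, later moved to jdk.internal.vm.annotation.Contended), which asks the JVM to pad a field or class so that it does not share a cache line with its neighbors. Outside the JDK's own classes it takes effect only when the JVM is started with -XX:-RestrictContended. The Disruptor has historically done this manually with long p1, p2, p3, p4, p5, p6, p7 padding fields around its hot values. In Rust, you can use #[repr(align(64))] on a struct to guarantee cache line alignment. In C++, alignas(64) does the same.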
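The same padding idea in C11 might look like the sketch below. The struct names, the `same_cache_line` helper, and the `CACHE_LINE` constant are illustrative assumptions; in particular, 64 is an assumed line size, not something the code detects.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 64  /* assumption: 64-byte lines; some CPUs use 128 */

/* Unpadded: the counters are adjacent 8-byte fields, so they will
   usually land on the same cache line. */
struct CountersShared {
    int64_t a;  /* written by thread 0 */
    int64_t b;  /* written by thread 1 */
};

/* Padded: alignas places each counter at the start of its own line. */
struct CountersPadded {
    alignas(CACHE_LINE) int64_t a;  /* written by thread 0 */
    alignas(CACHE_LINE) int64_t b;  /* written by thread 1 */
};

static_assert(sizeof(struct CountersPadded) == 2 * CACHE_LINE,
              "each counter should occupy a full line");

/* Returns 1 if both addresses fall in the same CACHE_LINE-sized block. */
int same_cache_line(const void* p, const void* q) {
    return (uintptr_t)p / CACHE_LINE == (uintptr_t)q / CACHE_LINE;
}
```

The cost is visible in the `static_assert`: the padded struct takes 128 bytes to hold 16 bytes of data, which is why padding is worth applying only to fields that different threads really do write concurrently.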

The important thing to remember is that false sharing is a purely concurrent phenomenon. It appears only when threads on different cores write to the same line at around the same time. A single-threaded program cannot suffer from it, and padding there only wastes cache.

Single-Writer: Ownership as a Performance Principle

The single-writer principle states that only one thread should ever write to a given piece of data. Multiple readers are fine; multiple writers are not.

The reason is the same cache coherence story. When a thread writes to a cache line, the coherence protocol invalidates every other core's copy of that line. If two threads both write to the same line, it ping-pongs between their caches, and each write stalls until its core regains exclusive ownership. The Disruptor avoids this by giving every mutable value exactly one writer: in its single-producer configuration only the producer thread writes to the ring buffer and its cursor, and each consumer advances only its own sequence number, which other threads merely read.
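A minimal sketch of that ownership split, using C11 atomics. This is not the Disruptor's code; `SpscRing`, `ring_push`, and `ring_pop` are hypothetical names, and the sketch assumes exactly one producer thread and one consumer thread:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define RING_SIZE 8
static_assert((RING_SIZE & (RING_SIZE - 1)) == 0, "RING_SIZE must be a power of two");

/* Hypothetical single-producer, single-consumer ring.
   Every field has exactly one writer. */
struct SpscRing {
    _Atomic uint64_t tail;  /* next slot to write; written only by the producer */
    _Atomic uint64_t head;  /* next slot to read; written only by the consumer */
    int slots[RING_SIZE];   /* each slot is written only by the producer */
};

/* Producer side: returns false if the ring is full. */
bool ring_push(struct SpscRing* r, int value) {
    uint64_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    uint64_t head = atomic_load_explicit(&r->head, memory_order_acquire);
    if (tail - head == RING_SIZE) return false;
    r->slots[tail & (RING_SIZE - 1)] = value;
    /* Release makes the slot write visible before the new tail is. */
    atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
    return true;
}

/* Consumer side: returns false if the ring is empty. */
bool ring_pop(struct SpscRing* r, int* out) {
    uint64_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
    uint64_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);
    if (head == tail) return false;
    *out = r->slots[head & (RING_SIZE - 1)];
    /* Release tells the producer the slot may now be reused. */
    atomic_store_explicit(&r->head, head + 1, memory_order_release);
    return true;
}
```

Because no value has two writers, neither side needs a lock or a compare-and-swap loop; plain loads and stores with acquire/release ordering are enough. In this sketch `head` and `tail` still sit next to each other, so a production version would also pad them onto separate cache lines.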

This principle has a direct structural analog in modern language design. Rust's ownership model enforces single-writer at compile time: &mut T is an exclusive reference, and you cannot have two of them to the same value simultaneously. This is not a coincidence. Ownership systems and mechanical sympathy point at the same underlying truth: mutable shared state is expensive, whether the cost is measured in bugs or in cache coherence traffic.

Go's channels and actor-model systems like Erlang or Akka arrive at the same place from the other direction: pass data between goroutines or actors by transfer, not by sharing. Erlang enforces this, since its processes share no heap and messages are copied between them; Go and Akka rely largely on convention. The hardware cost of ignoring it is the same regardless of what language you are in.

Natural Batching: Letting the Hardware Amortize Work

Batching is often thought of as a throughput optimization, and it is, but the hardware-level reason it works is underappreciated. Processing items one at a time means paying setup costs repeatedly: acquiring locks, flushing queues, crossing cache boundaries. Batching amortizes those costs across many items.

The Disruptor's ring buffer makes batching natural rather than designed. If the producer runs faster than the consumer, the consumer finds several entries waiting the next time it checks the producer's published sequence. It processes all of them without further coordination and then updates its own sequence once. The batching emerges from the system dynamics rather than being explicitly coded.
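The consumer side of that pattern can be sketched in a few lines of C11. This is an illustration and not the Disruptor's implementation: `BatchRing`, `publish`, and `drain_batch` are hypothetical names, a single producer and single consumer are assumed, and the producer's full-ring check is left out for brevity.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define BATCH_RING_SIZE 1024
static_assert((BATCH_RING_SIZE & (BATCH_RING_SIZE - 1)) == 0, "size must be a power of two");

struct BatchRing {
    _Atomic uint64_t published;  /* entries written so far; producer-owned */
    _Atomic uint64_t consumed;   /* entries handled so far; consumer-owned */
    int slots[BATCH_RING_SIZE];
};

/* Producer side. Overwrite protection is omitted: this sketch assumes the
   consumer never falls more than BATCH_RING_SIZE entries behind. */
void publish(struct BatchRing* r, int value) {
    uint64_t seq = atomic_load_explicit(&r->published, memory_order_relaxed);
    r->slots[seq & (BATCH_RING_SIZE - 1)] = value;
    atomic_store_explicit(&r->published, seq + 1, memory_order_release);
}

/* Consumer side: one acquire load discovers everything published so far.
   The whole run is then processed with no further coordination, and the
   consumer's sequence is stored once. Returns the batch size. */
size_t drain_batch(struct BatchRing* r, long* sum) {
    uint64_t done = atomic_load_explicit(&r->consumed, memory_order_relaxed);
    uint64_t available = atomic_load_explicit(&r->published, memory_order_acquire);
    for (uint64_t s = done; s < available; s++) {
        *sum += r->slots[s & (BATCH_RING_SIZE - 1)];
    }
    atomic_store_explicit(&r->consumed, available, memory_order_release);
    return (size_t)(available - done);
}
```

Nothing in `drain_batch` chooses a batch size. If the consumer keeps up, each call handles one entry; if it falls behind, the next call handles however many have accumulated, and the two atomic operations are spread across all of them.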

The same principle applies broadly. Database write-ahead logs batch commits. Network I/O uses writev() to send multiple buffers in one syscall. NVMe queues batch commands. The common thread is that the overhead of initiating an operation is high and fixed; filling the operation with more data is cheap at the margin.
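As one concrete case, here is a sketch contrasting two write() calls with a single POSIX writev() call. The function names and the header/body split are assumptions for illustration; `write` and `writev` themselves are the standard POSIX calls.

```c
#include <assert.h>
#include <string.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Per-item version: one syscall per part. Returns bytes written or -1. */
ssize_t send_unbatched(int fd, const char* header, const char* body) {
    assert(header != NULL && body != NULL);
    ssize_t a = write(fd, header, strlen(header));
    if (a < 0) return -1;
    ssize_t b = write(fd, body, strlen(body));
    if (b < 0) return -1;
    return a + b;
}

/* Batched version: both parts go to the kernel in one syscall.
   Returns bytes written or -1 on error (see errno). */
ssize_t send_batched(int fd, const char* header, const char* body) {
    assert(header != NULL && body != NULL);
    struct iovec parts[2];
    parts[0].iov_base = (void*)header;  /* writev only reads these buffers */
    parts[0].iov_len  = strlen(header);
    parts[1].iov_base = (void*)body;
    parts[1].iov_len  = strlen(body);
    return writev(fd, parts, 2);
}
```

Both functions put the same bytes on the descriptor; the batched one pays the user-to-kernel transition once. Like write(), writev() may write fewer bytes than requested, so real code would check the return value and retry with the remainder.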

In application code, this translates to preferring bulk APIs over per-item APIs when latency permits, using output buffering, and structuring work to process runs of related items together rather than interleaving unrelated work.

Where Modern Runtimes Help and Where They Do Not

Compilers and managed runtimes absorb some of this complexity. LLVM's auto-vectorizer can rewrite scalar loops into SIMD instructions if the memory access pattern is predictable. The JVM's JIT will hoist loop-invariant loads and eliminate redundant barriers. V8's hidden classes (which V8 calls maps and other engines call shapes) exist partly to keep object field layout predictable so property access can be compiled to a fixed offset rather than a hash lookup.

But compilers cannot fix fundamental layout problems. If your data structure requires pointer chasing, no amount of optimization will eliminate those cache misses. If two threads share a cache line, the hardware coherence protocol will serialize them regardless of what the compiler does. These are physical constraints, not code quality issues.

This is the core of what Caer Sanders is pointing at in the Fowler article. Mechanical sympathy is not about micro-optimization or writing unreadable code. It is about understanding which choices are cheap and which are expensive at the hardware level, so that the default choices you make are aligned with how the machine actually works. The principles are simple. The discipline is applying them before the profiler tells you that you have to.
