The phrase “mechanical sympathy” comes from Jackie Stewart, the Formula 1 world champion who argued that a great driver doesn’t need to be a mechanic, but does need to understand the car well enough to drive with it rather than against it. Martin Thompson borrowed the phrase for software, and it describes something specific: the gap between code that treats hardware as an opaque abstraction and code that understands what the hardware is doing at a physical level.
Caer Sanders’s piece on martinfowler.com distills this philosophy into four actionable principles: predictable memory access, awareness of cache lines, single-writer, and natural batching. Each of these maps directly onto a hardware behavior that your code either works with or fights against. This post digs into the mechanisms behind each principle, because understanding why they matter makes them much easier to apply consistently.
The Latency Hierarchy You’re Working Against
Before the principles make sense, the numbers do. Modern CPUs operate on a memory hierarchy with radically different latency at each level. On a typical Intel or AMD processor in 2024, an L1 cache hit costs roughly 4 cycles, about 1 ns. L2 runs around 12 cycles. L3 lands between 30 and 45 cycles depending on size and whether you’re on the right NUMA socket. Main memory (DRAM) sits somewhere between 60 and 100 ns. At 3.5 GHz that works out to roughly 210 to 350 cycles for a single miss, during which the core may have nothing useful to do.
Jonas Bonér maintains a widely-cited gist with these numbers, derived from Jeff Dean’s original latency slides. The specific figures shift with each CPU generation, but the ratios stay roughly constant. L1 is about 100x faster than DRAM. That ratio is the reason all four principles below exist.
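The conversion between nanoseconds and cycles is plain multiplication, since 1 GHz means one cycle per nanosecond. A minimal sketch, where the class and method names are made up for illustration and the 3.5 GHz clock is the figure used in the text:

```java
// Converts a latency in nanoseconds to core cycles at a given clock speed.
// LatencyMath and cyclesFor are illustrative names, not part of any library.
final class LatencyMath {
    // cycles = nanoseconds * GHz, because 1 GHz is one cycle per nanosecond.
    static long cyclesFor(double nanos, double clockGhz) {
        return Math.round(nanos * clockGhz);
    }

    public static void main(String[] args) {
        double clockGhz = 3.5; // clock speed assumed in the text
        System.out.println("DRAM  60 ns = " + cyclesFor(60, clockGhz) + " cycles");
        System.out.println("DRAM 100 ns = " + cyclesFor(100, clockGhz) + " cycles");
    }
}
```

Run against the DRAM range quoted above, this gives the 210 and 350 cycle bounds for a 3.5 GHz core.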
Predictable Memory Access
The hardware prefetcher is one of the most powerful optimizations a CPU performs automatically. It watches your memory access patterns and loads cache lines ahead of your current position when the pattern is detectable. Sequential array traversal is the canonical example: iterate forward through a large array and the prefetcher stays well ahead, keeping cache misses rare even on data that doesn’t fit in L3.
Random access breaks this completely. A linked list traversal where node pointers scatter across memory can miss the cache on nearly every node once the list outgrows the cache, because each address is known only after the previous load completes. There is no pattern for the prefetcher to learn. Ulrich Drepper’s What Every Programmer Should Know About Memory from 2007 remains the deepest treatment of this topic, and its core finding holds: pointer-chasing through a large dataset can be an order of magnitude slower than streaming through a flat array of equivalent size.
The practical implication is that data structure choice is also a performance choice. std::vector versus std::list is not just an API preference; it’s a decision about whether the CPU’s prefetcher can help. In Java, iterating an array or ArrayList typically beats iterating a LinkedList by a wide margin once the data outgrows the cache. The layout of your data determines whether sequential access is even possible, and sequential access determines whether the prefetcher does useful work.
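To make the contrast concrete, here is a rough sketch that sums the same values from a long[] and from a LinkedList. The class name and the element count are arbitrary choices, and the timing in main is a crude illustration rather than a rigorous benchmark; a harness such as JMH would be needed for numbers worth quoting.

```java
import java.util.LinkedList;
import java.util.List;

// TraversalSketch is an illustrative name; both methods compute the same sum.
final class TraversalSketch {
    // Contiguous storage: consecutive elements sit next to each other in memory.
    static long sumArray(long[] values) {
        long sum = 0;
        for (long v : values) {
            sum += v;
        }
        return sum;
    }

    // Node-based storage: each step follows a reference to a separately
    // allocated node, and then another reference to a boxed Long.
    static long sumList(List<Long> values) {
        long sum = 0;
        for (long v : values) {
            sum += v;
        }
        return sum;
    }

    public static void main(String[] args) {
        int n = 1_000_000; // arbitrary size, chosen only to exceed small caches
        long[] array = new long[n];
        List<Long> list = new LinkedList<>();
        for (int i = 0; i < n; i++) {
            array[i] = i;
            list.add((long) i);
        }
        long t0 = System.nanoTime();
        long arraySum = sumArray(array);
        long t1 = System.nanoTime();
        long listSum = sumList(list);
        long t2 = System.nanoTime();
        System.out.println("array: sum=" + arraySum + " in " + (t1 - t0) / 1_000 + " us");
        System.out.println("list:  sum=" + listSum + " in " + (t2 - t1) / 1_000 + " us");
    }
}
```

A list built in one pass like this may still end up fairly contiguous on the heap, so the gap tends to be larger for lists that were built up and mutated over time.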
Cache Line Awareness
Every read or write operates at the granularity of a cache line: 64 bytes on x86-64 and on most ARM64 server cores, including AWS Graviton. Apple M-series is the notable exception, with 128-byte lines. When your code reads a single long, the CPU loads the entire line containing that field into cache. This is beneficial when adjacent fields are accessed next; it’s harmful when two threads on different cores write to different variables that happen to share a cache line.
This is false sharing. The cache coherence protocol (MESI and its variants, such as MESIF on Intel and MOESI on AMD) treats the entire 64-byte line as the unit of ownership. If core A writes to bytes 0-7 of a line and core B writes to bytes 8-15 of the same line, the protocol forces a round trip: B’s write invalidates A’s copy, A must re-fetch the line before its next write, and the cycle repeats. The hardware does this even though neither thread logically depends on what the other is writing.
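Whether two variables can falsely share comes down to integer division of their byte offsets by the line size. A small sketch of that arithmetic, assuming a 64-byte line and offsets measured from a line-aligned base (CacheLineMath and its methods are hypothetical names):

```java
// Illustrative arithmetic only: which cache line does a byte offset fall on?
final class CacheLineMath {
    static final int LINE_BYTES = 64; // assumed line size

    // Offsets are assumed to be non-negative and relative to a line-aligned base.
    static long lineIndex(long byteOffset) {
        return byteOffset / LINE_BYTES;
    }

    static boolean sameLine(long offsetA, long offsetB) {
        return lineIndex(offsetA) == lineIndex(offsetB);
    }

    public static void main(String[] args) {
        // Bytes 0-7 and bytes 8-15: same line, so writers on two cores contend.
        System.out.println("0 and 8 share a line:  " + sameLine(0, 8));
        // 64 bytes apart: different lines, so the writes are independent.
        System.out.println("0 and 64 share a line: " + sameLine(0, 64));
    }
}
```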
Martin Thompson’s false sharing blog post from 2011 benchmarked this on a 4-core Nehalem machine and observed roughly a 70x throughput collapse between padded and unpadded versions of the same counter array. The fix is straightforward: pad each independently-written variable to fill a full cache line.
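Java 8 introduced @sun.misc.Contended to handle this more cleanly. The annotation moved to jdk.internal.vm.annotation.Contended in JDK 9, and application code needs -XX:-RestrictContended for it to take effect. The manual approach, which works everywhere, adds dummy long fields to reach 64 bytes: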
// Without padding: counter0 and counter1 will usually share a 64-byte cache line
class Counters {
    volatile long counter0 = 0; // first 8 bytes after the object header
    volatile long counter1 = 0; // next 8 bytes, same cache line
}

// With manual padding. The JVM is free to reorder fields, so check the
// actual layout with a tool such as JOL before relying on it.
class PaddedCounter {
    volatile long counter0 = 0;
    long p1, p2, p3, p4, p5, p6, p7; // 56 bytes of padding
    volatile long counter1 = 0; // 64 bytes after counter0, so on a different cache line
}
The JDK uses @Contended internally in LongAdder’s cell array and in ForkJoinPool’s work-stealing queues. In C and C++, alignas(64) accomplishes the same thing at the struct or variable level. The cache line principle also governs struct field ordering more broadly: hot fields that are accessed together belong together in memory, while cold fields belong at the end of the struct or in a separate allocation so they don’t evict hot data.
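For a plain contended counter, the simplest way to benefit from that internal padding is to use LongAdder directly rather than padding by hand. A minimal sketch, with an arbitrary thread count and a hypothetical helper name (countWith):

```java
import java.util.concurrent.atomic.LongAdder;

// LongAdderSketch and countWith are illustrative names.
final class LongAdderSketch {
    // Starts the given number of threads, each incrementing a shared LongAdder,
    // and returns the total once all of them have finished.
    static long countWith(int threads, int incrementsPerThread) {
        LongAdder adder = new LongAdder();
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(() -> {
                for (int i = 0; i < incrementsPerThread; i++) {
                    adder.increment(); // under contention, spreads updates across padded cells
                }
            });
            workers[t].start();
        }
        try {
            for (Thread worker : workers) {
                worker.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for workers", e);
        }
        return adder.sum(); // exact here because every writer has finished
    }

    public static void main(String[] args) {
        System.out.println(countWith(4, 1_000_000));
    }
}
```

The trade-off is that sum() has to walk the cells, so LongAdder suits counters that are written often and read rarely.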
Single-Writer
The single-writer principle, articulated by Martin Thompson in a 2011 post, is: for any piece of mutable state, designate exactly one thread as the writer. Other threads may read freely.
This sounds like a simple concurrency rule, but its hardware justification goes deeper than just avoiding races. Write operations are expensive in the cache coherence model. A write requires the writing core to acquire exclusive ownership of the cache line via an RFO (Read For Ownership), which involves broadcasting an invalidation to all other cores holding the line, waiting for acknowledgments, and then completing the write. With multiple writers, this invalidation traffic becomes a bottleneck that scales with core count rather than with actual work.
The LMAX Disruptor is the canonical engineering application of this principle. Designed to process 6 million financial transactions per second on a single JVM thread, the Disruptor’s ring buffer assigns a sequence number padded to a full cache line, written by exactly one producer. Consumers advance their own sequence numbers, also padded, and never write to the producer’s sequence. The original benchmarks showed the Disruptor achieving roughly 25 million ops/second in a 1-producer/1-consumer configuration, compared to about 4 million with ArrayBlockingQueue. That 6x difference comes almost entirely from eliminating lock contention and false sharing on the sequence cursors.
// Simplified sketch in the style of the Disruptor's Sequence class:
// the value plus its padding fills one 64-byte cache line
class Sequence {
    static final long INITIAL_VALUE = -1L;
    private volatile long value = INITIAL_VALUE;
    // 7 longs of padding = 56 bytes to fill the rest of the cache line.
    // Current versions of the library pad on both sides of the value.
    private long p2, p3, p4, p5, p6, p7, p8;
}
The single-writer principle doesn’t eliminate the need to communicate between threads; it shapes how that communication is structured. Producers write to their own state; consumers observe it. Ownership is explicit in the design, not implicitly managed by locks that serialize access at runtime.
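A minimal sketch of that ownership rule for a counter: one thread owns the writes, and anyone may read. The class name and the runDemo helper are hypothetical, and the sketch leaves out the cache-line padding discussed earlier.

```java
// Illustrative single-writer counter: exactly one thread may call increment().
final class SingleWriterCounter {
    private volatile long value = 0;

    // Call only from the single owning writer thread. Because nobody else
    // writes, a plain read-modify-write is enough; no CAS loop or lock is needed.
    void increment() {
        value = value + 1;
    }

    // Safe from any thread: the volatile read observes the writer's latest store.
    long get() {
        return value;
    }

    // Demo helper: one writer thread increments, the caller reads after it finishes.
    static long runDemo(int increments) {
        SingleWriterCounter counter = new SingleWriterCounter();
        Thread writer = new Thread(() -> {
            for (int i = 0; i < increments; i++) {
                counter.increment();
            }
        });
        writer.start();
        try {
            writer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for writer", e);
        }
        return counter.get();
    }

    public static void main(String[] args) {
        System.out.println(runDemo(1_000_000));
    }
}
```

If a second thread ever called increment(), updates could be lost, which is why the ownership has to be enforced by the design rather than by the class itself.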
Natural Batching
Batching is often framed as a throughput optimization for I/O or database interactions. The mechanical sympathy view of it is more fundamental: batching exploits hardware behavior at multiple levels of the memory hierarchy simultaneously.
At the DRAM level, memory is organized into rows. Activating a row costs roughly 30 ns, but subsequent accesses within the same activated row are significantly cheaper. Sequential access tends to stay within the row that is already open. Random access bounces between rows, paying the activation penalty repeatedly.
At the CPU level, tight loops over contiguous data enable the compiler and CPU to apply SIMD vectorization. SSE, AVX2, and AVX-512 instructions process 16, 32, or 64 bytes per instruction respectively. For vectorization to apply, data must be contiguous, the loop must be predictable, and there must be no aliasing concerns the compiler can’t resolve. Batching work into tight loops over flat arrays is what makes this possible.
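The loop shape those conditions describe looks like the sketch below: flat primitive arrays, a simple counted loop, and no dependency between iterations. Whether HotSpot’s JIT actually emits SIMD instructions for it depends on the JVM version and the CPU, so treat this as a vectorization-friendly shape rather than a guarantee. The names are illustrative.

```java
// A loop shaped so that a JIT or compiler has a chance to vectorize it.
final class VectorFriendlyLoop {
    // Computes out[i] = a[i] * scale + b[i] for every element.
    // Assumes a, b, and out all have the same length.
    static void scaleAndAdd(float[] a, float[] b, float scale, float[] out) {
        for (int i = 0; i < out.length; i++) {
            out[i] = a[i] * scale + b[i]; // each iteration is independent of the others
        }
    }

    public static void main(String[] args) {
        float[] a = {1f, 2f, 3f, 4f};
        float[] b = {10f, 20f, 30f, 40f};
        float[] out = new float[a.length];
        scaleAndAdd(a, b, 2f, out);
        System.out.println(java.util.Arrays.toString(out));
    }
}
```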
The Disruptor’s BatchEventProcessor demonstrates natural batching in practice. When a consumer falls behind the producer, rather than processing one event and checking for more, it reads the current available sequence, processes all events up to that sequence in a tight loop, and then checks again. This keeps event data hot in cache and amortizes the cost of sequence checks and wait strategy overhead across multiple events.
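The shape of that loop can be sketched in a few lines. This is a single-threaded illustration with made-up names (BatchingConsumerSketch, publish, drain), not the Disruptor’s API: it has no wait strategy, no memory barriers, and no protection against the producer wrapping past the consumer.

```java
import java.util.function.LongConsumer;

// Illustrative ring buffer showing the batching loop only.
final class BatchingConsumerSketch {
    private final long[] ring;
    private final int mask;
    private long published = -1; // last sequence the producer has made available
    private long consumed = -1;  // last sequence the consumer has processed

    // Size must be a power of two so that (sequence & mask) maps to a slot.
    BatchingConsumerSketch(int sizePowerOfTwo) {
        ring = new long[sizePowerOfTwo];
        mask = sizePowerOfTwo - 1;
    }

    void publish(long event) {
        long next = published + 1;
        ring[(int) (next & mask)] = event;
        published = next;
    }

    // Reads the producer's cursor once, then handles everything up to it in a
    // tight loop. Returns the number of events handled in this batch.
    int drain(LongConsumer handler) {
        long available = published; // one cursor check for the whole batch
        int handled = 0;
        for (long seq = consumed + 1; seq <= available; seq++) {
            handler.accept(ring[(int) (seq & mask)]);
            handled++;
        }
        consumed = available;
        return handled;
    }

    public static void main(String[] args) {
        BatchingConsumerSketch buffer = new BatchingConsumerSketch(8);
        for (long event = 1; event <= 5; event++) {
            buffer.publish(event);
        }
        int handled = buffer.drain(event -> System.out.println("event " + event));
        System.out.println("batch size: " + handled);
    }
}
```

The batch size is never configured. It is simply however far the producer got ahead, which is what makes the batching natural: a consumer that keeps up handles one event at a time, and one that falls behind catches up in larger batches.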
In Go, the runtime’s scheduler applies similar logic when stealing work: an idle processor takes half of another processor’s run queue in one operation rather than one goroutine at a time. In Rust, iterator chains composed with map, filter, and collect are lazy, so they run as a single pass over the data without materializing intermediate collections, and the optimizer can usually inline the whole chain into one tight loop.
How These Principles Connect
The four principles reinforce each other rather than standing independently. Predictable access patterns let the prefetcher work. Cache-line awareness ensures that prefetched data isn’t evicted by unrelated writes or contaminated by false sharing between cores. Single-writer eliminates the coherence traffic that turns concurrent write access into a serialization point. Natural batching amortizes the overhead that remains after the first three principles are applied.
The deepest version of this insight is in Drepper’s memory paper: software design that ignores hardware behavior can perform arbitrarily worse than an equivalent design that respects it, with no change to the algorithm itself, simply by changing how memory is laid out and who writes to it.
Most of the time, none of this is the bottleneck. But when it is, the gap between code that understands hardware and code that doesn’t can span an order of magnitude, and the improvements come from design choices made before a single line of optimization is written.