
Cooperating With the Machine: Four Principles of Hardware-Aware Software Design

Source: martinfowler

The idea that programmers should understand their hardware goes back to the beginning of computing, but it fell out of fashion as abstraction layers multiplied. Martin Thompson brought it back into focus with a term borrowed from Formula 1 driver Jackie Stewart: mechanical sympathy. Stewart meant that a driver does not need to be a mechanic, but understanding how the car behaves makes the driver faster. Thompson applied the same logic to software: you do not need to design CPUs to write high-performance code, but knowing how the hardware behaves lets you write code that cooperates with it instead of fighting it.

A recent Martinfowler.com article by Caer Sanders distills this into four everyday principles: predictable memory access, cache line awareness, single-writer, and natural batching. Each one maps to a specific hardware behavior, and understanding the hardware makes the principles feel less like rules and more like conclusions you would reach yourself if you stared at the architecture long enough.

The Memory Hierarchy Is Not Flat

Modern CPUs have multiple levels of cache between the processor and main memory, and the performance differences between those levels are large enough to dominate the runtime of data-intensive code.

Level       Approximate latency        Typical size
L1 cache    ~1 ns / 4 cycles           32–64 KB
L2 cache    ~4 ns / 12 cycles          256 KB – 1 MB
L3 cache    ~10–40 ns / 40 cycles      4–32 MB
DRAM        ~60–100 ns / 200 cycles    GBs

By the figures above, a cache miss that goes all the way to DRAM costs on the order of 50 to 100 times what an L1 hit costs. If a tight loop misses to DRAM frequently, the work the CPU does between those accesses barely affects throughput. The bottleneck is not the processor; it is memory latency.
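A rough model makes the point concrete. The sketch below computes an expected latency per access as a weighted average over the levels of the hierarchy. The latencies are the approximate figures from the table (with midpoints picked for L3 and DRAM), the hit rates are invented for illustration, and the model ignores overlapping misses, so treat the output as an order-of-magnitude guide rather than a measurement. The class and method names are hypothetical.

```java
// Back-of-the-envelope model: expected latency per memory access.
// Hit rates passed in are illustrative inputs, not measured values.
class AccessCostModel {
    // Approximate latencies in nanoseconds; L3 and DRAM use midpoints of the table's ranges.
    static final double L1_NS = 1.0;
    static final double L2_NS = 4.0;
    static final double L3_NS = 25.0;
    static final double DRAM_NS = 80.0;

    /**
     * Each argument is the fraction of all accesses served by that level.
     * Whatever is left over is assumed to go to DRAM.
     */
    static double expectedNanos(double l1Hits, double l2Hits, double l3Hits) {
        double dram = 1.0 - l1Hits - l2Hits - l3Hits;
        if (l1Hits < 0 || l2Hits < 0 || l3Hits < 0 || dram < -1e-9) {
            throw new IllegalArgumentException("fractions must be non-negative and sum to at most 1");
        }
        return l1Hits * L1_NS + l2Hits * L2_NS + l3Hits * L3_NS
                + Math.max(dram, 0.0) * DRAM_NS;
    }

    public static void main(String[] args) {
        // A loop that mostly hits L1 versus one where 40% of accesses reach DRAM.
        System.out.printf("cache-friendly: %.2f ns per access%n", expectedNanos(0.95, 0.03, 0.01));
        System.out.printf("cache-hostile:  %.2f ns per access%n", expectedNanos(0.40, 0.10, 0.10));
    }
}
```

With these invented hit rates the model gives roughly 2 ns per access for the first loop and roughly 35 ns for the second. Almost all of the second figure comes from the DRAM term, which is why reducing misses tends to matter more than shaving instructions.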

This hierarchy is the foundation of all four principles. Every principle Sanders identifies is either exploiting the cache or protecting it from interference.

Predictable Memory Access

The hardware prefetcher is a circuit inside the CPU that watches memory access patterns and speculatively loads cache lines before they are requested. It works well on sequential access and fixed strides. It fails on pointer chasing.

This is why traversing an array is so much faster than traversing a linked list at scale. Both do the same logical work, but the array gives the prefetcher a predictable stride of one element at a time, while the linked list sends it to a different address on every step. The address of the next node is not known until the current node has been loaded, so each step is a dependent load to a location the prefetcher could not have anticipated.

// Prefetcher works: sequential, predictable stride
long array_sum = 0;
for (int i = 0; i < N; i++) array_sum += array[i];

// Prefetcher gives up: pointer chasing, each address depends on the previous load
long list_sum = 0;
Node* cur = head;
while (cur) { list_sum += cur->value; cur = cur->next; }

The performance gap in benchmarks is routinely 5–10x for large datasets. At small sizes both structures fit in L1 and the difference does not matter. At large sizes the array stays near memory bandwidth limits while the linked list hits DRAM on almost every node.
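The same comparison can be sketched in Java. This is a crude timing harness rather than a rigorous benchmark (a tool such as JMH would be the better choice for real numbers), and the class and method names are made up for illustration. One caveat: nodes allocated back to back, as they are here, tend to sit near each other on the heap, so this setup probably understates the gap you would see with a long-lived list whose nodes were allocated at different times.

```java
// Sketch: the same summation over two memory layouts.
class TraversalDemo {
    static final class Node {
        final int value;
        Node next;
        Node(int value) { this.value = value; }
    }

    /** Sequential walk over contiguous memory. */
    static long sumArray(int[] values) {
        long sum = 0;
        for (int v : values) sum += v;
        return sum;
    }

    /** Pointer chasing: each load depends on the one before it. */
    static long sumList(Node head) {
        long sum = 0;
        for (Node cur = head; cur != null; cur = cur.next) sum += cur.value;
        return sum;
    }

    /** Builds a singly linked list holding the same values in the same order. */
    static Node toList(int[] values) {
        Node head = null;
        for (int i = values.length - 1; i >= 0; i--) {
            Node n = new Node(values[i]);
            n.next = head;
            head = n;
        }
        return head;
    }

    public static void main(String[] args) {
        int n = 2_000_000;
        int[] values = new int[n];
        for (int i = 0; i < n; i++) values[i] = i % 7;
        Node head = toList(values);

        // Early rounds mostly measure JIT warm-up; look at the later ones.
        for (int round = 0; round < 5; round++) {
            long t0 = System.nanoTime();
            long a = sumArray(values);
            long t1 = System.nanoTime();
            long b = sumList(head);
            long t2 = System.nanoTime();
            System.out.printf("array %d us, list %d us, sums equal: %b%n",
                    (t1 - t0) / 1_000, (t2 - t1) / 1_000, a == b);
        }
    }
}
```

The absolute timings depend heavily on the machine, the JVM, and the heap layout, so the useful signal is the ratio between the two columns as n grows past the cache sizes.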

The struct-of-arrays (SoA) pattern follows from the same reasoning. If you have a collection of objects with many fields but you only process one field across all of them in a given loop, grouping all values of that field into a contiguous array lets the prefetcher do its job and fits more useful data into each cache line.

// Array of Structs: wastes cache when only 'x' is needed
struct Particle { float x, y, z, mass; };
Particle particles[N];

// Struct of Arrays: every cache line packed with relevant data
struct Particles {
    float x[N]; float y[N]; float z[N]; float mass[N];
};
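In Java the contrast is sharper still, because an array of objects is really an array of references to separately allocated objects. The sketch below shows both layouts side by side with a loop that only needs x. All names are hypothetical and the code is meant to illustrate the layout, not to serve as a benchmark.

```java
// Sketch: array-of-objects versus one primitive array per field.
class ParticleLayouts {
    // Array-of-structs flavour: every particle is its own heap object.
    static final class Particle {
        float x, y, z, mass;
    }

    // Struct-of-arrays flavour: each field lives in its own contiguous array.
    static final class Particles {
        final float[] x, y, z, mass;
        Particles(int n) {
            x = new float[n]; y = new float[n]; z = new float[n]; mass = new float[n];
        }
    }

    /** Copies an array of Particle objects into the struct-of-arrays layout. */
    static Particles toStructOfArrays(Particle[] src) {
        Particles out = new Particles(src.length);
        for (int i = 0; i < src.length; i++) {
            out.x[i] = src[i].x;
            out.y[i] = src[i].y;
            out.z[i] = src[i].z;
            out.mass[i] = src[i].mass;
        }
        return out;
    }

    /** Follows one reference per element; y, z and mass ride along in the same cache line. */
    static float sumX(Particle[] ps) {
        float sum = 0f;
        for (Particle p : ps) sum += p.x;
        return sum;
    }

    /** Streams through one float array; a 64-byte line carries 16 values of x. */
    static float sumX(Particles ps) {
        float sum = 0f;
        for (float v : ps.x) sum += v;
        return sum;
    }

    /** Test data: x counts up from 0, mass is 1. */
    static Particle[] sample(int n) {
        Particle[] ps = new Particle[n];
        for (int i = 0; i < n; i++) {
            Particle p = new Particle();
            p.x = i;
            p.mass = 1f;
            ps[i] = p;
        }
        return ps;
    }

    public static void main(String[] args) {
        Particle[] aos = sample(1_000);
        Particles soa = toStructOfArrays(aos);
        System.out.println("same result from both layouts: " + (sumX(aos) == sumX(soa)));
    }
}
```

The trade-off runs the other way when a loop touches every field of one particle at a time: then the array-of-structs layout keeps related values together and the struct-of-arrays version has to pull from four separate arrays.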

Cache Line Awareness and False Sharing

The cache works in fixed-size units called cache lines: 64 bytes on most modern x86 and ARM64 hardware, with some exceptions such as Apple's M-series chips, which use 128-byte lines. When any byte in a 64-byte region is accessed, the entire line moves between cache levels. This is an optimization for spatial locality, but it creates a concrete problem in concurrent code.

False sharing occurs when two threads write to different variables that happen to occupy the same 64-byte cache line. Logically they are independent, but the cache coherence protocol (MESI and its variants on x86) treats them as competing for ownership of the same line. Every write by Thread A invalidates the copy in Thread B’s cache, and vice versa. The CPU spends cycles transferring and invalidating a cache line for a write the other thread does not care about. The performance penalty in tight concurrent loops can be 2–40x depending on contention level.

// Two counters in adjacent fields: almost certainly on the same cache line
public class FalseSharing {
    volatile long counter1 = 0;
    volatile long counter2 = 0;  // typically laid out 8 bytes after counter1
}

// Padded so the counter does not share a line with the fields that follow it
public class PaddedCounter {
    volatile long counter;
    long p1, p2, p3, p4, p5, p6, p7; // 56 bytes of padding; the JVM is free to reorder fields
}

Java 8 added the @Contended annotation (sun.misc.Contended, moved to jdk.internal.vm.annotation in JDK 9; application code needs -XX:-RestrictContended for it to take effect) to express this intent without manual padding. The LMAX Disruptor pioneered the manual padding approach before the annotation existed, explicitly wrapping its Sequence cursor in padding fields to ensure it never shares a cache line with neighboring data.
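Field layout is ultimately up to the JVM, so one way to see the effect without depending on it is to use array slots, whose spacing is fixed. The sketch below has two threads each incrementing their own slot of an AtomicLongArray, first in adjacent slots and then in slots 128 bytes apart. The class name, the stride, and the iteration count are assumptions made for this illustration, and the timings it prints are only indicative.

```java
import java.util.concurrent.atomic.AtomicLongArray;

// Sketch: two independent counters, placed close together or far apart.
class FalseSharingDemo {
    // 16 longs = 128 bytes: enough separation for 64-byte and 128-byte lines.
    static final int STRIDE = 16;

    /** Runs two threads, each bumping only its own slot. Returns elapsed nanoseconds. */
    static long run(AtomicLongArray counters, int slotA, int slotB, long iterations) {
        Thread ta = new Thread(() -> bump(counters, slotA, iterations));
        Thread tb = new Thread(() -> bump(counters, slotB, iterations));
        long start = System.nanoTime();
        ta.start();
        tb.start();
        try {
            ta.join();
            tb.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for workers", e);
        }
        return System.nanoTime() - start;
    }

    static void bump(AtomicLongArray counters, int slot, long iterations) {
        for (long i = 0; i < iterations; i++) counters.incrementAndGet(slot);
    }

    public static void main(String[] args) {
        long n = 5_000_000L;
        for (int round = 0; round < 3; round++) {
            // Slots 0 and 1 are 8 bytes apart and very likely share a line.
            long adjacent = run(new AtomicLongArray(2 * STRIDE), 0, 1, n);
            // Slots 0 and STRIDE are 128 bytes apart and cannot share one.
            long spaced = run(new AtomicLongArray(2 * STRIDE), 0, STRIDE, n);
            System.out.printf("adjacent %d ms, spaced %d ms%n",
                    adjacent / 1_000_000, spaced / 1_000_000);
        }
    }
}
```

Both configurations do exactly the same logical work and produce the same counts. Any difference in the two columns comes from where the slots sit in memory, which is the whole point of the exercise.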

The Single-Writer Principle

The MESI cache coherence protocol allows multiple cores to hold a cache line in Shared state simultaneously, provided none of them is writing. The moment one core writes, it must acquire exclusive Modified state, and every other core holding that line in Shared or Exclusive state must invalidate its copy. This round-trip is an inter-core coordination cost that scales poorly as core count grows.

Martin Thompson formalized the single-writer principle as the observation that if only one thread ever writes to a given memory location, the contention for ownership of its cache line disappears. No other core ever tries to take the line into Modified state, so it never bounces between writers. Readers still pull copies into Shared state, and the writer pays to invalidate them on its next write, but that cost is typically much smaller than two writers fighting over the same line.

The Disruptor builds this into its architecture. Each ring buffer slot is written by exactly one producer. Each sequence counter is owned by exactly one consumer. There are no shared mutable locations contested between multiple writers. Cache lines stay warm, and coherence traffic stays low.

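The principle also eliminates the need for locks in many cases. Locks exist to serialize concurrent writes to shared state. If state is never written by more than one thread, the lock serves no purpose. The writer still has to publish its updates with the right memory ordering (a volatile or ordered store in Java) so that readers see them, but that is far cheaper than a contended lock. This is not a trick specific to low-level systems code; it applies any time you can partition your data so that each region has a single owner.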
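A small example of partitioning by owner is a counter split into one slot per thread. In the sketch below each worker increments only its own slot with a plain read followed by an ordered store, and a reader sums the slots when it wants a total. The class name, the slot spacing, and the use of lazySet are choices made for this illustration, not something taken from the Disruptor or from Sanders' article.

```java
import java.util.concurrent.atomic.AtomicLongArray;

// Sketch: a counter partitioned so that every slot has exactly one writer.
class SingleWriterCounters {
    // Space owners 128 bytes apart so they do not share a cache line.
    private static final int STRIDE = 16;

    private final AtomicLongArray slots;
    private final int owners;

    SingleWriterCounters(int owners) {
        this.owners = owners;
        this.slots = new AtomicLongArray(owners * STRIDE);
    }

    /** Must only be called by the thread that owns this index. */
    void increment(int owner) {
        int i = owner * STRIDE;
        // No compare-and-swap needed: nobody else writes this slot.
        // lazySet is an ordered store, so readers eventually see the new value.
        slots.lazySet(i, slots.get(i) + 1);
    }

    /** Any thread may call this; the result is a snapshot, not an atomic total. */
    long total() {
        long sum = 0;
        for (int o = 0; o < owners; o++) sum += slots.get(o * STRIDE);
        return sum;
    }

    /** Demo: starts one worker per slot and returns the total after all finish. */
    static long countWith(int threads, long perThread) {
        SingleWriterCounters counters = new SingleWriterCounters(threads);
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            final int owner = t;
            workers[t] = new Thread(() -> {
                for (long i = 0; i < perThread; i++) counters.increment(owner);
            });
            workers[t].start();
        }
        try {
            for (Thread w : workers) w.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for workers", e);
        }
        return counters.total();
    }

    public static void main(String[] args) {
        System.out.println("total = " + countWith(4, 1_000_000L));
    }
}
```

The design gives up an exact instantaneous total in exchange for writes that never contend. The JDK's LongAdder makes a similar trade with striped cells, although its cells can be shared between threads and are updated with compare-and-swap.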

Natural Batching

When a consumer in a pipeline is waiting for new work, it can check whether anything is available, take one item, process it, and check again. Or it can check once, take everything available up to the current sequence, process it in a tight loop, and then check again.

The second approach is natural batching. It is not a throughput optimization in the sense of doing more computation per unit time. It is a cost amortization strategy. The cost of a memory barrier, a cache miss on the sequence counter, or a syscall is paid once per batch rather than once per item. If the batch has one item, there is no saving. If it has a hundred items, the fixed cost shrinks to a rounding error.

The Disruptor’s BatchEventProcessor drains all available events in a single loop before waiting again:

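// Block until at least one event is available; the result may be well past nextSequence
long availableSequence = sequenceBarrier.waitFor(nextSequence);
// Drain everything published so far without going back to the barrier
while (nextSequence <= availableSequence) {
    event = ringBuffer.get(nextSequence);
    // The third argument tells the handler when it has reached the end of the batch
    handler.onEvent(event, nextSequence, nextSequence == availableSequence);
    nextSequence++;
}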

One memory barrier check, one spin-wait, then a tight sequential loop through pre-allocated contiguous memory. The prefetcher handles the sequential access pattern. The batch loop keeps the CPU working in L1 territory for the duration of the drain.
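The same shape works with standard library queues. The sketch below blocks for one item and then takes whatever else has already accumulated, using BlockingQueue.take and BlockingQueue.drainTo from java.util.concurrent. The class name, the method name, and the batch limit are assumptions made for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Sketch: natural batching on top of a plain BlockingQueue.
class NaturalBatching {
    /**
     * Waits for one item, then grabs whatever else is already queued, up to maxBatch.
     * Returns an empty list if the calling thread is interrupted while waiting.
     */
    static <T> List<T> nextBatch(BlockingQueue<T> queue, int maxBatch) {
        List<T> batch = new ArrayList<>(maxBatch);
        try {
            batch.add(queue.take());        // block only when there is nothing to do
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return batch;                   // empty: the caller can treat this as shutdown
        }
        queue.drainTo(batch, maxBatch - 1); // never blocks: takes what is already there
        return batch;
    }

    public static void main(String[] args) {
        BlockingQueue<Integer> queue = new ArrayBlockingQueue<>(16);
        for (int i = 0; i < 10; i++) queue.add(i);

        // The batch size adapts to the backlog: large when behind, small when caught up.
        while (!queue.isEmpty()) {
            List<Integer> batch = nextBatch(queue, 4);
            System.out.println("batch of " + batch.size() + ": " + batch);
        }
    }
}
```

With ten queued items and a limit of four, the loop in main handles batches of 4, 4, and 2. Any per-batch cost, such as a flush or a commit, would be paid three times instead of ten.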

The Disruptor as a Unified Case Study

The LMAX Disruptor is worth studying here because it applies all four principles simultaneously and has documented benchmarks. The LMAX team's published results show throughput of over 25 million events per second for a single producer, single consumer configuration, compared to roughly 5 million for java.util.concurrent.ArrayBlockingQueue under the same conditions.

The ring buffer is a pre-allocated fixed-size array: predictable memory access. Sequence counters are padded so that each sits on its own cache line: cache line awareness. Each slot has exactly one producer and each sequence has exactly one owner: single-writer. Consumers drain all available events in one pass: natural batching.
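To show how the pieces fit together, here is a deliberately small single-producer, single-consumer ring. It is a sketch written for this article, not the Disruptor's code: it carries long values instead of event objects, it spins with Thread.yield instead of using a pluggable wait strategy, and it leaves out cache line padding around the two sequence counters to stay short. Every name in it is hypothetical.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.LongConsumer;

// Sketch: a minimal single-producer, single-consumer ring buffer.
class SpscRing {
    private final long[] slots;                               // pre-allocated, contiguous
    private final int mask;
    private final AtomicLong published = new AtomicLong(-1);  // written only by the producer
    private final AtomicLong consumed = new AtomicLong(-1);   // written only by the consumer

    SpscRing(int capacity) {
        if (capacity <= 0 || Integer.bitCount(capacity) != 1) {
            throw new IllegalArgumentException("capacity must be a power of two");
        }
        slots = new long[capacity];
        mask = capacity - 1;
    }

    /** Producer thread only. Spins while the ring is full. */
    void publish(long value) {
        long next = published.get() + 1;
        // Wait until the slot we are about to overwrite has been consumed.
        while (next - consumed.get() > slots.length) Thread.yield();
        slots[(int) (next & mask)] = value;
        published.lazySet(next); // ordered store: the slot write is visible before the sequence
    }

    /** Consumer thread only. Handles everything published so far; returns the batch size. */
    int drain(LongConsumer handler) {
        long next = consumed.get() + 1;
        long available = published.get();   // one volatile read per batch
        for (long seq = next; seq <= available; seq++) {
            handler.accept(slots[(int) (seq & mask)]);
        }
        if (available >= next) consumed.lazySet(available); // one ordered write per batch
        return (int) (available - next + 1);
    }

    /** Demo: pushes 0..count-1 through the ring on two threads, returns the consumer's sum. */
    static long transfer(int capacity, long count) {
        SpscRing ring = new SpscRing(capacity);
        long[] sum = new long[1];
        Thread consumer = new Thread(() -> {
            long seen = 0;
            while (seen < count) {
                int n = ring.drain(v -> sum[0] += v);
                if (n == 0) Thread.yield();
                seen += n;
            }
        });
        consumer.start();
        for (long i = 0; i < count; i++) ring.publish(i);
        try {
            consumer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for consumer", e);
        }
        return sum[0];
    }

    public static void main(String[] args) {
        System.out.println("sum = " + transfer(1024, 1_000_000L));
    }
}
```

Each counter has one writer, the consumer touches the shared sequence once per batch rather than once per item, and the data sits in one contiguous array. A production version would at least pad the two counters apart and choose a waiting strategy suited to the workload.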

None of these in isolation is sufficient. The gains compound because the CPU can now do what it was designed to do: stream through contiguous memory with the prefetcher running ahead, without fighting other cores for cache line ownership, without paying per-item barrier costs, without heap allocation pressure from a blocking queue.

Why This Matters Outside High-Frequency Trading

The Disruptor emerged from the specific demands of a financial exchange, but the principles apply any time code spends most of its time processing data in loops. Game engines processing entity components, database engines scanning indexes, network servers demultiplexing packets, image processing pipelines: all of them benefit from the same analysis.

The abstraction layers of modern software tend to obscure these patterns. Heap-allocated objects spread across memory. Queues that allocate per-item. Thread pools that write shared counters on every task dispatch. These are all reasonable defaults in contexts where throughput is not the binding constraint, but they are worth questioning when it is.

What Sanders’ article on Martinfowler.com does well is frame this as a set of principles with everyday applicability rather than as arcane optimization lore reserved for systems programmers. You do not need to rewrite your hot path in assembly to benefit from cache line awareness. You need to know that 64 bytes is the unit, that false sharing has a real cost, and that a few padding fields or a @Contended annotation can eliminate it. That knowledge pays for itself the first time you profile a concurrent workload and watch the cache miss counter drop after a structural change.

Understanding the hardware does not mean writing to the metal. It means making informed choices at the design level, before the profiler even runs.
