Working With the Machine: Four Principles for Cache-Conscious Code

The phrase “mechanical sympathy” comes from Formula 1 driver Jackie Stewart, who said the best drivers understand their machines deeply enough to work with them rather than force them. Martin Thompson borrowed the phrase to describe the same orientation applied to software. Caer Sanders’ article on Martin Fowler’s site distills this into four practical principles: predictable memory access, awareness of cache lines, single-writer, and natural batching.

These aren’t exotic ideas reserved for financial trading infrastructure. They describe the physical reality of every general-purpose computer your code runs on, and understanding them explains a category of performance problems that profiling tools can identify but can’t fix for you.

The Memory Hierarchy Is Not Flat

Modern CPUs can execute multiple floating-point operations per clock cycle. At 3 GHz, that’s billions of arithmetic operations per second. The problem is that compute capacity means nothing if the CPU spends most of its time waiting on memory.

The latency gap between CPU registers and DRAM has widened steadily for decades. Today, an L1 cache hit costs roughly 1-4 cycles. An L2 hit costs around 12. An L3 hit costs around 40 or more. A main memory access costs 200-300 cycles, or roughly 60-100 nanoseconds at 3 GHz. These are order-of-magnitude differences, and they are the central constraint in a surprising number of real workloads.
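The cycle and nanosecond figures above are related by simple arithmetic, sketched here. The function names are illustrative, and the issue width of 4 operations per cycle used in the usage note is an assumption chosen for the example, not a measurement.

```rust
/// Convert a latency in CPU cycles to nanoseconds at a given clock rate.
/// A clock of `clock_ghz` GHz completes `clock_ghz` cycles per nanosecond.
pub fn cycles_to_ns(cycles: f64, clock_ghz: f64) -> f64 {
    cycles / clock_ghz
}

/// Rough count of simple operations a core could have issued during a stall,
/// given an assumed sustained issue rate (`ops_per_cycle` is hypothetical).
pub fn ops_forgone(stall_cycles: u64, ops_per_cycle: u64) -> u64 {
    stall_cycles * ops_per_cycle
}
```

At 3 GHz, `cycles_to_ns(300.0, 3.0)` gives 100 ns, and under the assumed issue width `ops_forgone(200, 4)` gives 800 operations that a single 200-cycle miss could have overlapped with.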

The CPU does not fetch individual bytes from memory. It fetches cache lines: 64 bytes on modern x86 and most ARM64 cores (some, such as Apple’s M-series, use 128). When you access one element of a struct, you’ve loaded the rest of its cache line whether you asked for it or not. This is an opportunity when your data is laid out to take advantage of it, and a tax when it isn’t.

Predictable Memory Access

The hardware prefetcher is a circuit that watches your memory access patterns and loads upcoming cache lines before your code requests them. It’s effective at recognizing sequential and strided patterns. It fails completely at pointer chasing.

// Sequential: prefetcher loads ahead effectively
for (int i = 0; i < N; i++) {
    sum += array[i];
}

// Pointer chasing: each address is unknown until the previous load completes
Node *n = head;
while (n) {
    sum += n->value;
    n = n->next;  // can't prefetch; must wait
}

The matrix traversal benchmark makes this concrete. On a 1024x1024 matrix of 64-bit doubles stored in row-major order, column-major traversal produces roughly 4-8x lower throughput than row-major traversal on typical hardware, despite performing identical arithmetic. Every column-major access steps 1024 elements (8 KiB) forward in memory, so each one lands on a different cache line and uses only 8 of its 64 bytes. The result is a near-100% cache miss rate.
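A minimal sketch of the two traversals, assuming the matrix is stored as a flat row-major slice. The function names are made up for illustration. Both functions do the same additions and differ only in loop order.

```rust
/// Sum a row-major `n x n` matrix by walking each row left to right.
/// Consecutive iterations touch adjacent addresses, so each 64-byte
/// cache line serves eight `f64` loads.
pub fn sum_row_major(m: &[f64], n: usize) -> f64 {
    let mut sum = 0.0;
    for row in 0..n {
        for col in 0..n {
            sum += m[row * n + col];
        }
    }
    sum
}

/// Sum the same matrix by walking each column top to bottom.
/// Consecutive iterations are `n` elements (n * 8 bytes) apart,
/// so each load lands on a different cache line.
pub fn sum_col_major(m: &[f64], n: usize) -> f64 {
    let mut sum = 0.0;
    for col in 0..n {
        for row in 0..n {
            sum += m[row * n + col];
        }
    }
    sum
}
```

To compare them, wrap each call in `std::time::Instant::now()` and `elapsed()` on a 1024x1024 input and build in release mode. The ratio you see will depend on the machine and on how much of the matrix fits in cache.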

This same reasoning drives the Struct of Arrays pattern that appears in game engines, physics simulations, and anything touching SIMD:

// Array of Structs: iterating positions loads velocities and masses too
struct Particle { x: f32, y: f32, z: f32, vx: f32, vy: f32, vz: f32, mass: f32 }
let particles: Vec<Particle> = Vec::new(); // populated elsewhere

// Struct of Arrays: position data is contiguous; irrelevant fields stay cold
struct Particles {
    x: Vec<f32>, y: Vec<f32>, z: Vec<f32>,
    vx: Vec<f32>, vy: Vec<f32>, vz: Vec<f32>,
    mass: Vec<f32>,
}

If your physics loop reads velocities and writes positions, the SoA layout loads only what the loop needs. The data arrives pre-packed for SIMD vectorization, and the compiler can often auto-vectorize without any intrinsics.
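As a sketch of that physics loop, here is one way an integration step might look over the SoA layout. The `integrate` method and the `dt` parameter are assumptions added for illustration. The struct is repeated so the block stands alone.

```rust
pub struct Particles {
    pub x: Vec<f32>, pub y: Vec<f32>, pub z: Vec<f32>,
    pub vx: Vec<f32>, pub vy: Vec<f32>, pub vz: Vec<f32>,
    pub mass: Vec<f32>,
}

impl Particles {
    /// Advance every position by velocity * dt.
    /// Each loop streams through two contiguous arrays; `mass` is never touched.
    pub fn integrate(&mut self, dt: f32) {
        for (x, vx) in self.x.iter_mut().zip(&self.vx) {
            *x += *vx * dt;
        }
        for (y, vy) in self.y.iter_mut().zip(&self.vy) {
            *y += *vy * dt;
        }
        for (z, vz) in self.z.iter_mut().zip(&self.vz) {
            *z += *vz * dt;
        }
    }
}
```

Zipped iterators over plain `f32` slices are a shape that optimizers commonly vectorize, though whether that happens for a given build is something to confirm in the generated assembly rather than assume.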

Cache Line Awareness

Understanding that the unit of memory transfer is 64 bytes leads directly to understanding false sharing. Two variables that occupy the same cache line will produce coherence traffic when different threads write to each, even if they’re logically independent.

// Both counters almost certainly share a cache line.
// Every write by Thread A invalidates Thread B's cached copy.
class Counters {
    long counterA;
    long counterB;
}

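The fix is to separate them by 64 bytes. Java has the @Contended annotation for this purpose (available via jdk.internal.vm.annotation.Contended in JDK 9+, previously sun.misc.Contended). It is an internal API: outside the JDK’s own classes, the JVM ignores it unless you run with -XX:-RestrictContended. In C, you pad manually: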

struct __attribute__((aligned(64))) Counter {
    long value;
    char _pad[64 - sizeof(long)];  // 56 bytes of padding on 64-bit
};
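The same idea can be sketched in Rust with `#[repr(align(64))]`, which raises the alignment of the type and rounds its size up to match. The type and function names are illustrative, and the 64-byte line size is an assumption about the target hardware.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::thread;

/// A counter that occupies a full (assumed 64-byte) cache line by itself.
#[repr(align(64))]
pub struct PaddedCounter(pub AtomicU64);

/// Two independent counters that can no longer share a cache line.
pub struct Counters {
    pub a: PaddedCounter,
    pub b: PaddedCounter,
}

/// Two threads, each writing only its own counter.
pub fn bump_in_parallel(iters: u64) -> (u64, u64) {
    let counters = Counters {
        a: PaddedCounter(AtomicU64::new(0)),
        b: PaddedCounter(AtomicU64::new(0)),
    };
    // Scoped threads are joined before `scope` returns.
    thread::scope(|s| {
        s.spawn(|| {
            for _ in 0..iters {
                counters.a.0.fetch_add(1, Ordering::Relaxed);
            }
        });
        s.spawn(|| {
            for _ in 0..iters {
                counters.b.0.fetch_add(1, Ordering::Relaxed);
            }
        });
    });
    (
        counters.a.0.load(Ordering::Relaxed),
        counters.b.0.load(Ordering::Relaxed),
    )
}
```

With the attribute in place, `std::mem::size_of::<Counters>()` is 128. Removing the attribute and timing `bump_in_parallel` again is a simple way to see whether false sharing matters on a given machine.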

The LMAX Disruptor applies this systematically. Every sequence number in the ring buffer is padded to its own cache line. The producer cursor, each consumer’s read position, and the ring buffer’s internal state all occupy separate lines so that no two threads ever write to the same cache line during normal operation. This eliminates false sharing between producer and consumer threads; the coherence traffic that remains is the traffic needed to publish sequence numbers from one thread to another.

The inverse of false sharing, grouping hot fields together so they share cache lines, is equally important. Hot/cold splitting keeps frequently accessed fields in one struct and rarely touched fields in another. When you load an object to read its frequently accessed state, you don’t pay to pull in the rarely accessed fields.
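A hedged sketch of hot/cold splitting, with all type and field names invented for the example. The hot struct holds only what a per-frame loop reads and writes, plus an index into a separate cold array.

```rust
/// Fields touched on every update (hypothetical). 28 bytes, so two
/// entries fit in one 64-byte cache line.
pub struct EntityHot {
    pub position: [f32; 3],
    pub velocity: [f32; 3],
    pub cold_index: u32,
}

/// Fields touched rarely (hypothetical), stored out of the hot path.
pub struct EntityCold {
    pub name: String,
    pub description: String,
    pub created_at_unix: u64,
}

pub struct World {
    pub hot: Vec<EntityHot>,
    pub cold: Vec<EntityCold>,
}

impl World {
    /// The hot loop streams through `hot` only; no cold data is loaded.
    pub fn step(&mut self, dt: f32) {
        for e in &mut self.hot {
            for axis in 0..3 {
                e.position[axis] += e.velocity[axis] * dt;
            }
        }
    }

    /// The occasional lookup pays one extra indirection.
    pub fn name_of(&self, i: usize) -> &str {
        &self.cold[self.hot[i].cold_index as usize].name
    }
}
```

The trade-off is the extra indirection in `name_of`, which is acceptable only because that path is assumed to be rare.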

Single-Writer

This is the most architecturally significant of the four principles. The MESI coherence protocol (Modified, Exclusive, Shared, Invalid) ensures that all CPUs agree on memory contents, but it does so by invalidating cached copies when any CPU writes to a location. Multiple threads writing to the same cache line generate a continuous stream of invalidations that travel across the CPU interconnect, serializing what looks like parallel work.

The single-writer principle says to eliminate this contention by design: each piece of mutable state should have exactly one writer. Readers can be concurrent and cheap; writes carry the coherence cost.

The Disruptor demonstrates this at scale. In its single-producer configuration, one thread owns the write cursor and each slot is written by exactly one producer. Consumers track their own read positions without coordination. On that path there are no locks, no compare-and-swap loops, no synchronized blocks. (The multi-producer mode does use compare-and-swap to claim slots.) When Martin Thompson published benchmark results in 2011, the Disruptor achieved roughly 25 million messages per second in common configurations, compared to around 6 million for Java’s ArrayBlockingQueue, with meaningfully lower tail latency.

In Rust, the ownership system enforces this at compile time. &mut T is a compiler-verified guarantee that no other reference to the same data, shared or mutable, is live anywhere in the program, on any thread. The compiler uses that aliasing guarantee to generate more aggressive code; you get both safety and performance from the same constraint.

The principle doesn’t require lock-free data structures. The question to ask before reaching for a mutex or atomic is whether you can restructure the system so each piece of state has a single owner at any given time. Often the answer is yes, and the restructuring produces simpler code as a side effect.
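One way such a restructuring can look, as a sketch: instead of many threads incrementing one shared counter, each worker owns a private counter for its slice of the input and the results are combined once after the workers finish. The function name and the choice of counting even numbers are arbitrary.

```rust
use std::thread;

/// Count even values using `workers` threads with no shared mutable state.
pub fn count_even(data: &[u32], workers: usize) -> u64 {
    let workers = workers.max(1);
    // Ceiling division; `chunks` panics on a chunk length of zero.
    let chunk_len = ((data.len() + workers - 1) / workers).max(1);

    thread::scope(|s| {
        let handles: Vec<_> = data
            .chunks(chunk_len)
            .map(|chunk| {
                s.spawn(move || {
                    // `local` has exactly one writer: this thread.
                    let mut local: u64 = 0;
                    for &v in chunk {
                        if v % 2 == 0 {
                            local += 1;
                        }
                    }
                    local
                })
            })
            .collect();

        // The only cross-thread communication is one return value per worker.
        handles
            .into_iter()
            .map(|h| h.join().expect("worker panicked"))
            .sum::<u64>()
    })
}
```

There is no mutex or atomic here. The input is shared read-only, and the borrow checker rejects any attempt to hand the same `local` to two workers.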

Natural Batching

When you must cross an expensive boundary, whether that’s a cache miss, a thread handoff, a syscall, or a network hop, cross it with as much work as possible. The fixed cost of crossing amortizes over batch size.

Batching improves more than just throughput. It reduces the frequency of expensive operations, which reduces variance in your latency distribution. Systems that make many small, frequent handoffs tend to have high tail latency because any crossing can coincide with an unlucky moment: an evicted cache line, a context switch, a scheduler delay.

This pattern appears throughout the stack. BufWriter in Rust and BufferedWriter in Java exist because an unbuffered writer issues one underlying write, typically one syscall, per call, however small; buffering turns hundreds of small writes into one. TCP’s Nagle algorithm batches small writes into a single segment. Database write-behind caches accumulate mutations and flush them in sequential runs, converting random I/O into sequential I/O.
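The buffering effect can be observed without touching the kernel. In this sketch, `CountingWriter` is a hypothetical stand-in for a file or socket that records how many times its `write` method is called, and the 4096-byte buffer capacity is an arbitrary choice.

```rust
use std::io::{self, BufWriter, Write};

/// Stand-in for an expensive sink: counts calls to `write`.
pub struct CountingWriter {
    pub calls: usize,
    pub bytes: usize,
}

impl Write for CountingWriter {
    fn write(&mut self, buf: &[u8]) -> io::Result<usize> {
        self.calls += 1;
        self.bytes += buf.len();
        Ok(buf.len())
    }
    fn flush(&mut self) -> io::Result<()> {
        Ok(())
    }
}

/// Write `n` single bytes straight to the sink: one underlying write each.
pub fn write_unbuffered(n: usize) -> io::Result<CountingWriter> {
    let mut w = CountingWriter { calls: 0, bytes: 0 };
    for _ in 0..n {
        w.write_all(b"x")?;
    }
    Ok(w)
}

/// Write `n` single bytes through a 4096-byte buffer.
pub fn write_buffered(n: usize) -> io::Result<CountingWriter> {
    let sink = CountingWriter { calls: 0, bytes: 0 };
    let mut w = BufWriter::with_capacity(4096, sink);
    for _ in 0..n {
        w.write_all(b"x")?;
    }
    w.flush()?;
    // Recover the sink so its counters can be inspected.
    w.into_inner().map_err(|e| e.into_error())
}
```

For 1000 one-byte writes, the unbuffered sink sees 1000 calls and the buffered sink sees one. Against a real file, each of those calls would be a syscall.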

For in-process threading, batching pairs naturally with the other principles. If a producer hands a batch of items to a consumer, the consumer can process them in a tight sequential loop with predictable access patterns, getting full use of each cache line it loads. The producer and consumer also cross the handoff boundary less frequently, reducing coherence pressure on the coordination mechanism.
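A sketch of a consumer that batches whatever has accumulated, using a standard library channel as the handoff. The function name is illustrative, and summing stands in for real per-item work. The consumer blocks for the first item, then drains everything already queued before processing the batch in one tight loop.

```rust
use std::sync::mpsc::Receiver;

/// Consume until the channel closes. Returns (sum of items, batches processed).
pub fn drain_in_batches(rx: Receiver<u64>) -> (u64, usize) {
    let mut total = 0u64;
    let mut batches = 0usize;
    let mut batch: Vec<u64> = Vec::new();

    // Block until at least one item is available or all senders are gone.
    while let Ok(first) = rx.recv() {
        batch.clear();
        batch.push(first);
        // Take everything already waiting, without blocking.
        while let Ok(next) = rx.try_recv() {
            batch.push(next);
        }
        // Process the whole batch sequentially over a contiguous buffer.
        total += batch.iter().sum::<u64>();
        batches += 1;
    }
    (total, batches)
}
```

Batch size adapts to load by itself: an idle consumer sees batches of one, while a consumer that falls behind picks up larger batches and catches up. A production version would likely cap the batch size to bound latency. That cap is left out here.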

Why These Principles Age Well

These ideas have circulated in performance-sensitive communities since at least the early LMAX work and Martin Thompson’s Mechanical Sympathy blog, which has covered these topics since 2011. The reason they’re still worth articulating clearly in 2024 is partly that diffusion through the broader engineering community takes time, and partly that high-level languages actively insulate developers from the hardware details that motivate them.

The hardware reality has changed less than the abstraction layer above it. Cache lines are still 64 bytes on most mainstream hardware. DRAM latency is still two orders of magnitude higher than L1. The coherence protocol still serializes multiple writers to the same line. NUMA architectures on multi-socket servers add another dimension, where accessing remote socket memory costs considerably more than local memory.

When you hit a performance wall and your profiler shows cache miss rates above 10-15%, or when a concurrent system degrades under contention rather than scaling, these four principles are the diagnostic vocabulary. They point at the structural changes that help: access patterns that go sequential, data layouts that separate hot from cold, write ownership that eliminates coherence traffic, and batch sizes that amortize crossing costs.

The hardware hasn’t gotten more forgiving. Learning to work with it, rather than assuming the abstraction layers will compensate, remains one of the more durable skills in systems work.