
What the CPU Expects From Your Code: Four Principles of Mechanical Sympathy

Source: martinfowler

The term mechanical sympathy comes from Formula 1 driver Jackie Stewart, who argued that a great driver should understand the car well enough to work with it rather than against it. Martin Thompson brought the phrase into software, applying it to the relationship between programs and their hardware. The idea has circulated for years in high-performance systems circles, but Caer Sanders’s recent article on martinfowler.com distills it into four concrete, everyday principles: predictable memory access, cache line awareness, single-writer, and natural batching.

These are not obscure optimizations reserved for HFT systems or game engines. They show up in Linux’s io_uring, in the LMAX Disruptor, in database buffer pools, and in modern ECS game architectures. The patterns are general. Understanding where they come from, and what hardware realities they respond to, makes them much easier to apply with intention.

The Memory Hierarchy Is Not Flat

Modern CPUs execute instructions at a rate that DRAM cannot keep pace with. An L1 cache hit costs roughly 4 cycles. An L2 hit is around 12 cycles. L3 lands around 40 cycles. A main memory access is 200 to 300 cycles, sometimes more on NUMA systems where the memory is on a different socket. Multiply any of those latency differences by the number of cache misses your data structures generate per second, and you have most of your performance story.

Cache lines are the unit of transfer between memory and cache. On x86-64 and most ARM systems, a cache line is 64 bytes. When you read a single byte from RAM, you pay for 64 bytes of transfer. When you write a single byte to a cold cache line, the line must first be fetched, modified, and eventually written back. The memory system is inherently chunk-oriented. All four of the principles Sanders describes are responses to that single underlying reality.
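To make the chunk-oriented cost concrete, here is a minimal sketch that counts how many cache lines a byte range spans. It assumes the 64-byte line size quoted above; `CACHE_LINE_SIZE` and `cache_lines_touched` are hypothetical names introduced for illustration, not part of any library.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed line size; 64 bytes matches the figure quoted in the text. */
#define CACHE_LINE_SIZE 64u

/* Number of cache lines covered by the byte range [addr, addr + len). */
static size_t cache_lines_touched(uintptr_t addr, size_t len)
{
    assert(len > 0);
    uintptr_t first = addr / CACHE_LINE_SIZE;
    uintptr_t last = (addr + len - 1) / CACHE_LINE_SIZE;
    return (size_t)(last - first + 1);
}
```

An 8-byte value that starts at offset 60 straddles two lines, so reading it can cost two line transfers instead of one. That is one reason compilers align fields by default.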

Predictable Memory Access

Modern CPUs include hardware prefetchers: circuits that monitor your memory access patterns and begin fetching cache lines before you ask for them. They work well for sequential access through an array and can handle simple regular strides. They fail entirely at pointer chasing, because each dereference depends on data that has not yet arrived from memory.

This is the concrete reason a linked list traversal is almost always slower than an array traversal on modern hardware, even though both are O(n). The nodes of a typical heap-allocated linked list are scattered across memory. Each next pointer introduces a dependency: the prefetcher cannot predict the next address until it reads the current node, which requires that cache line to arrive from memory first. You serialize on memory latency at every step.

// Sequential: the prefetcher handles this well
for (int i = 0; i < n; i++) {
    process(array[i]);
}

// Pointer chase: the prefetcher cannot help
Node *current = head;
while (current) {
    process(current->value);
    current = current->next; // each dereference may miss cache
}

Data structure choice is often a memory access pattern choice. If you iterate over every element of a collection, a contiguous array will outperform a linked list in practice even when the theoretical complexity is identical.

Array-of-structs (AoS) versus struct-of-arrays (SoA) is the more subtle version of this trade-off. If a struct has ten similarly sized fields and a given processing loop touches only one of them, SoA layout means the loop’s working set is roughly ten times smaller, each fetched cache line holds roughly ten times more useful data, and your memory bandwidth goes further. The Entity Component System pattern in game development makes this explicit: components of the same type are stored in contiguous typed arrays so that systems iterating over one component type traverse memory sequentially. Sander Mertens’s ECS FAQ covers the access pattern reasoning in detail.
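The layout difference can be sketched in C as follows. The particle types and the `total_mass_*` functions are hypothetical, chosen only to show a loop that reads one field out of eight.

```c
#include <assert.h>
#include <stddef.h>

#define N_PARTICLES 1024

/* Array-of-structs: every element carries all eight fields together. */
struct particle_aos {
    float x, y, z;
    float vx, vy, vz;
    float mass;
    float charge;
};

/* Struct-of-arrays: each field lives in its own contiguous array. */
struct particles_soa {
    float x[N_PARTICLES], y[N_PARTICLES], z[N_PARTICLES];
    float vx[N_PARTICLES], vy[N_PARTICLES], vz[N_PARTICLES];
    float mass[N_PARTICLES];
    float charge[N_PARTICLES];
};

/* Strides over whole structs, but uses only 4 of every 32 bytes fetched. */
static float total_mass_aos(const struct particle_aos *p, size_t n)
{
    float total = 0.0f;
    for (size_t i = 0; i < n; i++)
        total += p[i].mass;
    return total;
}

/* Walks one dense array: every byte of every fetched line is useful. */
static float total_mass_soa(const struct particles_soa *p, size_t n)
{
    assert(n <= N_PARTICLES);
    float total = 0.0f;
    for (size_t i = 0; i < n; i++)
        total += p->mass[i];
    return total;
}
```

Both functions compute the same sum. The difference is how much memory each one streams through the cache to do it: with eight `float` fields, the AoS loop pulls in eight times as many bytes as the SoA loop.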

Cache Line Awareness

Cache line awareness has two sides: avoiding false sharing and maximizing true sharing.

False sharing is the more famous problem. Two threads write to different variables that happen to occupy the same 64-byte cache line. Each write forces that line to be invalidated in the other core’s cache by the coherence protocol (MESI, for Modified, Exclusive, Shared, Invalid, or one of its variants). The line bounces between cores on every write, and each bounce can cost tens to hundreds of cycles, so throughput collapses even though neither thread reads what the other writes. The variables are logically independent but physically coupled.

// False sharing: both counters share one cache line
struct {
    long counter_a;  // thread A writes this
    long counter_b;  // thread B writes this
} shared_unpadded;

// No false sharing: the counters sit 64 bytes apart, so they cannot share a line
struct {
    long counter_a;
    char _pad_a[64 - sizeof(long)]; // 56 bytes when long is 8 bytes
    long counter_b;
    char _pad_b[64 - sizeof(long)];
} shared_padded;

In C++17, std::hardware_destructive_interference_size is a compile-time constant giving the minimum offset between two objects needed to avoid false sharing, which in practice is the cache line size. In Rust, #[repr(align(64))] on a wrapper struct achieves the same separation. In Java, the @Contended annotation (with -XX:-RestrictContended to enable it outside the JDK) adds padding around annotated fields. The LMAX Disruptor’s Sequence class is the canonical example: it pads the sequence value on both sides so that no producer or consumer sequence counter shares a 64-byte cache line with any other.
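C11 can express the same separation without manual padding arithmetic. The following is a sketch that assumes a 64-byte line; `struct padded_counters` and `CACHE_LINE_SIZE` are hypothetical names.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>

/* Assumed line size; adjust for targets with larger lines. */
#define CACHE_LINE_SIZE 64

/* alignas on each member makes the compiler insert the padding and also
 * aligns the whole struct, so counter_a itself starts on a line boundary. */
struct padded_counters {
    alignas(CACHE_LINE_SIZE) long counter_a; /* thread A writes this */
    alignas(CACHE_LINE_SIZE) long counter_b; /* thread B writes this */
};

/* Compile-time checks: the layout cannot silently regress. */
static_assert(offsetof(struct padded_counters, counter_b) == CACHE_LINE_SIZE,
              "counter_b must start on its own cache line");
static_assert(sizeof(struct padded_counters) == 2 * CACHE_LINE_SIZE,
              "each counter should own exactly one line");
```

The static assertions are the useful part of the design. Padding that depends on hand-counted bytes tends to break silently when someone adds a field.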

The other side is true sharing in the sense of spatial locality: data that is read together should live together. If a struct has fields A, B, and C that are always accessed as a group, and field D that is rarely touched, keep A, B, and C adjacent so a single cache line fetch covers all three. This is hot/cold field splitting by another name. The same reasoning applies to database row layout, network packet headers, and anything else with a hot read path that only touches a subset of the total data.
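A sketch of hot/cold splitting in C, using a hypothetical order record: the fields the hot loop reads stay packed together, and the rarely used fields move behind a pointer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Cold data: touched only on rare paths such as auditing or display. */
struct order_cold {
    char customer_note[200];
    uint64_t created_at;
    uint64_t audit_id;
};

/* Hot data: everything the pricing loop reads, kept adjacent. */
struct order_hot {
    uint64_t id;
    int64_t price;
    uint32_t quantity;
    uint32_t flags;
    struct order_cold *cold; /* followed only when the cold fields are needed */
};

static_assert(sizeof(struct order_hot) <= 64,
              "hot fields should fit within one 64-byte cache line");

/* The hot loop walks a dense array and never dereferences `cold`. */
static int64_t total_notional(const struct order_hot *orders, size_t n)
{
    int64_t total = 0;
    for (size_t i = 0; i < n; i++)
        total += orders[i].price * (int64_t)orders[i].quantity;
    return total;
}
```

On a typical 64-bit target the hot struct is 32 bytes, so two orders fit in each cache line. Inlining the 200-byte note would spread every order across several lines.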

Single-Writer

When multiple threads write to the same cache line, the coherence protocol must coordinate. Every write from one core invalidates the copies in every other cache. As core count increases, this broadcast invalidation becomes increasingly expensive. The single-writer principle says: for any piece of shared state, designate exactly one thread as the writer. Readers may be many; writers must be one.

This is the central design insight of the LMAX Disruptor. In its simplest and fastest configuration, the ring buffer has a single producer writing entries. Consumers read entries but never write to the entries or to the producer’s cursor; they only advance their own sequence counters. Each sequence counter, whether producer-owned or consumer-owned, is written by exactly one thread. The result is a queue that achieves tens of millions of operations per second on commodity hardware without a single lock.
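The single-writer discipline can be sketched as a single-producer, single-consumer ring buffer with C11 atomics. This is an illustrative simplification, not the Disruptor’s implementation; all names are hypothetical and a 64-byte line is assumed.

```c
#include <assert.h>
#include <stdalign.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define RING_CAPACITY 8u /* must be a power of two */
#define CACHE_LINE_SIZE 64

static_assert((RING_CAPACITY & (RING_CAPACITY - 1u)) == 0,
              "capacity must be a power of two");

struct spsc_ring {
    alignas(CACHE_LINE_SIZE) atomic_size_t tail; /* written only by the producer */
    alignas(CACHE_LINE_SIZE) atomic_size_t head; /* written only by the consumer */
    alignas(CACHE_LINE_SIZE) int64_t slots[RING_CAPACITY];
};

static void spsc_init(struct spsc_ring *r)
{
    atomic_init(&r->tail, 0);
    atomic_init(&r->head, 0);
}

/* Producer side: reads head, writes a slot and tail. Returns false if full. */
static bool spsc_push(struct spsc_ring *r, int64_t value)
{
    size_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    size_t head = atomic_load_explicit(&r->head, memory_order_acquire);
    if (tail - head == RING_CAPACITY)
        return false;
    r->slots[tail & (RING_CAPACITY - 1u)] = value;
    /* Release: the slot write becomes visible before the new tail does. */
    atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
    return true;
}

/* Consumer side: reads tail and a slot, writes head. Returns false if empty. */
static bool spsc_pop(struct spsc_ring *r, int64_t *out)
{
    size_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
    size_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);
    if (head == tail)
        return false;
    *out = r->slots[head & (RING_CAPACITY - 1u)];
    /* Release: the slot read completes before the slot is handed back. */
    atomic_store_explicit(&r->head, head + 1, memory_order_release);
    return true;
}
```

No compare-and-swap or lock appears anywhere, because no variable has two writers. The two counters sit on separate cache lines so that the producer’s stores to `tail` do not invalidate the line holding `head`.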

Contrast this with java.util.concurrent.ArrayBlockingQueue, which guards both enqueue and dequeue with a single ReentrantLock. Acquiring and releasing a lock is itself a write to shared state (the lock’s internal state variable), which triggers cache line invalidation on the same paths. The Disruptor does not avoid synchronization by being clever about visibility; it avoids shared writes entirely.

Single-writer maps onto other scales too. Nginx and Redis use a single event loop per worker process or thread, which means all state for a given loop is owned by one thread with no contention. In database write-ahead logging, a common design has a single log writer thread that flushes entries from per-thread private buffers to the durable log file. The per-thread accumulation is single-writer; the flush step is serialized through one owner.
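The same idea applies to something as small as a statistics counter. The sketch below gives each thread its own slot, so every slot has exactly one writer. The names, the fixed `MAX_THREADS`, and the 64-byte line size are assumptions.

```c
#include <assert.h>
#include <stdalign.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_THREADS 4
#define CACHE_LINE_SIZE 64

/* One slot per thread, each on its own cache line. */
struct counter_slot {
    alignas(CACHE_LINE_SIZE) _Atomic uint64_t value;
};

static_assert(sizeof(struct counter_slot) == CACHE_LINE_SIZE,
              "each slot should occupy exactly one cache line");

struct sharded_counter {
    struct counter_slot slots[MAX_THREADS];
};

static void sharded_counter_init(struct sharded_counter *c)
{
    for (size_t i = 0; i < MAX_THREADS; i++)
        atomic_init(&c->slots[i].value, 0);
}

/* Must be called only by the thread that owns slot `thread_index`. */
static void sharded_counter_add(struct sharded_counter *c,
                                size_t thread_index, uint64_t n)
{
    assert(thread_index < MAX_THREADS);
    /* Plain load then store, not fetch_add: with one writer per slot,
     * no atomic read-modify-write is required. */
    uint64_t cur = atomic_load_explicit(&c->slots[thread_index].value,
                                        memory_order_relaxed);
    atomic_store_explicit(&c->slots[thread_index].value, cur + n,
                          memory_order_relaxed);
}

/* Any thread may read; the sum is a slightly stale snapshot, not a barrier. */
static uint64_t sharded_counter_read(struct sharded_counter *c)
{
    uint64_t total = 0;
    for (size_t i = 0; i < MAX_THREADS; i++)
        total += atomic_load_explicit(&c->slots[i].value, memory_order_relaxed);
    return total;
}
```

The trade-off is that readers do more work and see an approximate total, which is usually acceptable for metrics and rarely acceptable for anything that must be exact at a point in time.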

At the hardware level, CPUs include write-combining buffers that accumulate stores to write-combining memory, and non-temporal stores, before flushing them to the memory bus as a burst. If those writes scatter across many cache lines, you occupy many write-combining buffers at once and risk partial flushes. A single stream of writes to a single region fills one buffer cleanly. Single-writer thinking applies even when the “thread” in question is the CPU’s own write pipeline.

Natural Batching

The overhead of coordination, I/O, or synchronization is typically fixed per operation rather than proportional to data size. Sending one message at a time across a socket means paying syscall cost, context switch overhead, and protocol framing for every message individually. Sending a hundred messages in one batch pays that cost once for a hundred items. The principle is simple, but the word “natural” is the important part of Sanders’s framing.
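The amortization described above can be sketched with the POSIX `writev` call, which submits several buffers in one syscall. `send_batch` and `MAX_BATCH` are hypothetical names, and the sketch ignores partial writes, which a production caller would need to handle.

```c
#include <assert.h>
#include <string.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

#define MAX_BATCH 16

/* Sends `count` NUL-terminated messages with a single writev() call.
 * Returns the number of bytes written, or -1 on error. */
static ssize_t send_batch(int fd, const char *const *messages, int count)
{
    struct iovec iov[MAX_BATCH];
    assert(count > 0 && count <= MAX_BATCH);
    for (int i = 0; i < count; i++) {
        /* The cast drops const; writev only reads from these buffers. */
        iov[i].iov_base = (void *)messages[i];
        iov[i].iov_len = strlen(messages[i]);
    }
    /* One user-to-kernel transition for the whole batch. */
    return writev(fd, iov, count);
}
```

Calling `write` once per message would produce the same bytes on the wire but cross the kernel boundary `count` times instead of once.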

Natural batching means designing the system so that batches emerge from its structure without requiring explicit batch-size configuration or accumulation loops scattered through the code. The Disruptor implements this with its BatchEventProcessor. When the consumer finishes one event and checks for more, it reads the current maximum available sequence number and processes everything up to that point in a single loop. The batch size is “however much arrived while I was processing the previous batch.” There is no tuning knob.

// Natural batch: process everything available in one pass
long availableSequence = sequenceBarrier.waitFor(nextSequence);
while (nextSequence <= availableSequence) {
    T event = ringBuffer.get(nextSequence);
    eventHandler.onEvent(event, nextSequence, nextSequence == availableSequence);
    nextSequence++;
}
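The same drain-what-is-available shape can be sketched in C. This is a deliberately simplified stand-in: an append-only log that never wraps, with hypothetical names throughout, rather than the Disruptor’s ring buffer and barrier.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define LOG_CAPACITY 1024

struct event_log {
    int events[LOG_CAPACITY];
    atomic_size_t published; /* written by the producer only */
    size_t consumed;         /* private to the consumer */
};

typedef void (*event_handler)(int event, size_t sequence,
                              bool end_of_batch, void *ctx);

/* Producer: write the slot, then publish it with a release store. */
static void publish(struct event_log *log, int event)
{
    size_t seq = atomic_load_explicit(&log->published, memory_order_relaxed);
    assert(seq < LOG_CAPACITY);
    log->events[seq] = event;
    atomic_store_explicit(&log->published, seq + 1, memory_order_release);
}

/* Consumer: read the published count once, then process everything up to
 * it. The batch size is whatever accumulated since the previous call. */
static size_t drain_available(struct event_log *log,
                              event_handler handle, void *ctx)
{
    size_t available = atomic_load_explicit(&log->published,
                                            memory_order_acquire);
    size_t batch = available - log->consumed;
    while (log->consumed < available) {
        size_t seq = log->consumed;
        handle(log->events[seq], seq, seq + 1 == available, ctx);
        log->consumed++;
    }
    return batch;
}

/* Example handler: accumulates events into a long passed through ctx. */
static void sum_handler(int event, size_t sequence, bool end_of_batch,
                        void *ctx)
{
    (void)sequence;
    (void)end_of_batch;
    *(long *)ctx += event;
}
```

A slow consumer automatically sees larger batches and a fast one sees smaller batches, with no configuration in between. The `end_of_batch` flag lets a handler defer expensive work, such as a flush, to the last event of each batch.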

Linux’s io_uring applies the same logic to kernel I/O. You submit a batch of operations to the submission queue ring and wait for results on the completion queue ring. The kernel processes multiple operations per entry into kernel space, so at high submission rates the number of user-to-kernel transitions per operation approaches zero. Jens Axboe’s design document for io_uring describes how this amortization is a first-class design goal, not an incidental benefit. Traditional read and write syscalls pay the user-to-kernel boundary cost once per call; io_uring pays it once per batch.

TCP’s Nagle algorithm is another instance: while previously sent data is still unacknowledged, small writes accumulate in the send buffer until either the outstanding ACK arrives or a full segment’s worth of data is ready, and then they go out as one segment. TCP_NODELAY disables this when latency matters more than throughput. The underlying choice is always the same: how many units of work per unit of fixed overhead.
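Toggling that trade-off on a socket is a single `setsockopt` call. The helper names below are hypothetical; the option and its level are the standard POSIX ones.

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Enables (nonzero) or disables (0) TCP_NODELAY on a TCP socket.
 * Returns 0 on success, -1 on error with errno set by setsockopt. */
static int set_tcp_nodelay(int fd, int enabled)
{
    int flag = enabled ? 1 : 0;
    return setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &flag, sizeof flag);
}

/* Returns 1 if TCP_NODELAY is set, 0 if not, -1 on error. */
static int get_tcp_nodelay(int fd)
{
    int flag = 0;
    socklen_t len = sizeof flag;
    if (getsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &flag, &len) != 0)
        return -1;
    return flag != 0;
}
```

A latency-sensitive sender that sets this option usually needs to do its own batching in user space, for example by assembling a full message before each write, or it trades one form of overhead for many tiny segments.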

Batching also compounds the benefits of the other principles. Processing 64 items from a contiguous array in one tight loop means the prefetcher has already fetched upcoming items into L1 cache by the time you reach them. Switching context between items, yielding to a scheduler, acquiring a lock, or dispatching through a callback layer risks evicting that warm cache state. Keeping the hot path in a single loop preserves the prefetcher’s work.

How the Principles Compose

The four principles reinforce each other, and the LMAX Disruptor is the clearest example of all four working together deliberately. Predictable memory access fills the cache with useful data. Cache line awareness ensures that one core’s useful writes do not invalidate another’s useful reads. Single-writer eliminates the coherence traffic that would otherwise accompany coordination. Natural batching amortizes whatever coordination remains over enough work to make the per-item cost negligible.

You can find subsets of the pattern throughout high-performance systems. PostgreSQL’s shared buffer pool uses a clock-sweep replacement algorithm whose sweep hand advances through the buffer array in circular order, so the search for a victim buffer is itself a sequential walk. ClickHouse’s columnar storage layout is a direct application of SoA thinking: storing each column contiguously means analytical queries that touch two of twenty columns fetch roughly one-tenth of the data a row store would require, assuming similarly sized columns. The Linux kernel’s per-CPU data structures get single-writer semantics by disabling preemption while a CPU updates its own copy, with explicit synchronization when data must move between CPUs.

Mechanical sympathy is sometimes presented as an advanced specialization, something you reach for only when a profiler tells you to. Sanders’s framing in the Fowler article pushes back on that usefully. These are everyday principles, applicable during ordinary design decisions: whether to use a linked list or an array, how to lay out fields in a struct, whether a piece of state needs a shared counter or a per-thread one, whether a processing function should handle one item or drain a queue. The CPU is fast. Memory is the constraint. These four principles are mostly about feeding the CPU correctly from the start, rather than discovering much later that you were not.
