The 64-Byte Root Cause: How Every Mechanical Sympathy Principle Follows from Cache Line Hardware
Source: martinfowler
Everything Reduces to 64 Bytes
Modern CPUs do not move individual bytes or words between RAM and cache. They move 64-byte blocks, called cache lines, and this one fact is the root cause of nearly every performance pathology in concurrent and data-intensive code. Caer Sanders documented four principles of mechanical sympathy in a Martin Fowler article: predictable memory access, awareness of cache lines, single-writer, and natural batching. Each one is a direct consequence of how the hardware manages those 64-byte transfers.
The term itself comes from F1 driver Jackie Stewart, who said you do not have to be an engineer to be a racing driver, but you do have to have mechanical sympathy. Martin Thompson borrowed the phrase and applied it to software at LMAX, where his team built the Disruptor, a ring buffer that processed around 25 million messages per second compared to 4-5 million for ArrayBlockingQueue, using no tricks beyond a rigorous application of these four principles. Thompson’s mechanical sympathy blog remains one of the better places to follow this thread. Ulrich Drepper’s 2007 paper What Every Programmer Should Know About Memory is the canonical technical reference behind all of it.
The Cache Hierarchy You Are Actually Programming Against
Before getting into the four principles, the latency numbers matter. L1 cache hits cost roughly 1 nanosecond, or about 4-5 clock cycles. L2 is around 4ns (12-14 cycles). L3 ranges from 15-30ns (40-75 cycles). A main memory access costs 60-100ns, which translates to 200-300 cycles. RAM is approximately 100x slower than L1.
The CPU’s response to this gap is the prefetcher: hardware logic that watches access patterns and speculatively loads cache lines before the code requests them. When the prefetcher wins, you pay L1 latency. When it loses, you pay RAM latency. The four principles are, in essence, four different ways to keep the prefetcher winning and to avoid forcing the cache coherency protocol to intervene.
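A back-of-the-envelope model shows how quickly a falling hit rate erodes that advantage. The sketch below uses only the rough figures quoted above (1 ns for L1, 100 ns for RAM) and deliberately ignores L2 and L3; the class and method names are illustrative, not from any library.

```java
// Simplified model: every access is either an L1 hit or a trip to RAM.
// The latencies are the rough figures from the text, not measurements.
final class AccessCostSketch {
    static final double L1_NS = 1.0;
    static final double RAM_NS = 100.0;

    // Average nanoseconds per access for a given L1 hit rate in [0, 1].
    static double averageNs(double l1HitRate) {
        return l1HitRate * L1_NS + (1.0 - l1HitRate) * RAM_NS;
    }

    public static void main(String[] args) {
        for (double hitRate : new double[] {1.00, 0.99, 0.90, 0.50}) {
            System.out.printf("hit rate %.2f -> %.2f ns per access%n",
                    hitRate, averageNs(hitRate));
        }
    }
}
```

In this model, dropping from a 99% hit rate to 90% takes the average access from about 1.99 ns to about 10.9 ns, more than five times slower, even though nine accesses in ten still hit L1.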
Predictable Memory Access
Sequential memory access is the pattern the prefetcher was designed for. When you scan a contiguous array, each cache line load brings in 8 adjacent 64-bit values, and the prefetcher issues the next load before the current one finishes. Pointer-chasing (linked lists, trees, hash maps that use separate chaining with scattered node allocations) defeats this. Each node dereference lands at an address the prefetcher cannot predict, and the next address is not even known until the current load completes, so once the working set outgrows the caches most accesses go all the way to RAM.
The throughput difference between sequential array scans and random pointer-chasing is 8-25x, depending on data size and access pattern. The range is wide because it depends on whether the working set fits in L2 or L3, but even at the optimistic end an 8x slowdown is a heavy price to pay for a layout choice.
In Java, this argues for arrays of primitives or flat value objects over linked structures. In C++, std::vector over std::list. In Rust, Vec<T> over Box<Node<T>> chains. The language does not matter; the layout in memory does.
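The contrast can be sketched in Java as follows. This is a minimal illustration, not a benchmark: the class name, the hand-rolled Node type, and the shuffling step (used so that traversal order does not match allocation order) are all assumptions made for the example, and a single wall-clock timing like this is crude. A harness such as JMH would be needed for trustworthy numbers.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

final class TraversalSketch {
    // One heap object per element; 'next' is the pointer the CPU must chase.
    static final class Node {
        final long value;
        Node next;
        Node(long value) { this.value = value; }
    }

    // Contiguous layout: eight longs per 64-byte line, addresses fully predictable.
    static long sumArray(long[] values) {
        long sum = 0;
        for (long v : values) sum += v;
        return sum;
    }

    // Linked layout: the next address is unknown until the current load completes.
    static long sumList(Node head) {
        long sum = 0;
        for (Node n = head; n != null; n = n.next) sum += n.value;
        return sum;
    }

    // Links the nodes in shuffled order so traversal jumps around the heap.
    static Node buildShuffledList(long[] values, long seed) {
        List<Node> nodes = new ArrayList<>(values.length);
        for (long v : values) nodes.add(new Node(v));
        Collections.shuffle(nodes, new Random(seed));
        for (int i = 0; i + 1 < nodes.size(); i++) nodes.get(i).next = nodes.get(i + 1);
        return nodes.isEmpty() ? null : nodes.get(0);
    }

    public static void main(String[] args) {
        long[] values = new long[1 << 20];
        for (int i = 0; i < values.length; i++) values[i] = i;
        Node head = buildShuffledList(values, 42L);

        long t0 = System.nanoTime();
        long a = sumArray(values);
        long t1 = System.nanoTime();
        long b = sumList(head);
        long t2 = System.nanoTime();

        System.out.printf("array: sum=%d in %d us%n", a, (t1 - t0) / 1_000);
        System.out.printf("list:  sum=%d in %d us%n", b, (t2 - t1) / 1_000);
    }
}
```

Both traversals compute the same sum; only the memory layout differs, which is the point of the comparison.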
Awareness of Cache Lines
If predictable access is about keeping the prefetcher happy, cache line awareness is about not wasting the transfers it performs. The 64-byte line is the unit of transfer and also the unit of coherency: the hardware tracks ownership per line, not per variable. If two logically independent variables share a line, every write to either one invalidates the whole line in every other core’s cache. This is false sharing, and it triggers the MESI protocol’s cache coherency machinery at full cost.
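Whether two fields can falsely share is simple arithmetic on their byte offsets. The sketch below assumes 64-byte lines and offsets measured from a 64-byte-aligned base address; both are assumptions, since the JVM does not promise that objects start on a line boundary, and the helper names are hypothetical.

```java
// Maps byte offsets to cache line indices under an assumed 64-byte line size.
final class CacheLineMath {
    static final int LINE_BYTES = 64; // assumption: typical x86 line size

    // Index of the cache line that contains the given byte offset.
    static long lineIndex(long byteOffset) {
        return byteOffset / LINE_BYTES;
    }

    // True if the two offsets fall on the same line and can therefore falsely share.
    static boolean shareLine(long offsetA, long offsetB) {
        return lineIndex(offsetA) == lineIndex(offsetB);
    }

    public static void main(String[] args) {
        System.out.println(shareLine(0, 8));   // two adjacent longs
        System.out.println(shareLine(0, 64));  // one full line apart
        System.out.println(shareLine(56, 64)); // adjacent longs straddling a boundary
    }
}
```

Two adjacent longs at offsets 0 and 8 land on the same line, while offsets 56 and 64 do not, even though they are just as close together. Because the real base address is unknown, padding schemes separate hot fields by a full line's worth of bytes rather than relying on where a boundary happens to fall.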
The performance impact is severe. False sharing between two threads writing to variables on the same cache line can degrade throughput by 10-40x, because the line bounces between cores’ L1 caches with every write, and each bounce costs a full cross-core coherency round trip.
The fix in each language is to ensure hot, independently written variables occupy separate cache lines.
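Java: before Java 8 you hand-padded with dummy longs. Java 8 added @Contended (JEP 142), which asks the JVM to insert padding automatically. It shipped as sun.misc.Contended and moved to jdk.internal.vm.annotation in Java 9. For application classes the annotation is ignored unless the JVM is started with -XX:-RestrictContended, and on recent JDKs the internal package typically has to be exported to your code at compile time as well.
    // Bad: x and y are adjacent and will almost certainly share a cache line,
    // so concurrent writers to x and y cause false sharing.
    class BadCounters {
        volatile long x;
        volatile long y;
    }
    // Better: manual padding on both sides of value. The JVM may reorder
    // fields within a single class, so this layout is not guaranteed.
    class PaddedCounter {
        long p1, p2, p3, p4, p5, p6, p7;
        volatile long value;
        long p9, p10, p11, p12, p13, p14, p15;
    }
    // Best: let the JVM insert the padding.
    // Needs -XX:-RestrictContended when used outside the JDK's own classes.
    class ContendedCounter {
        @jdk.internal.vm.annotation.Contended
        volatile long value;
    }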
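C++17: std::hardware_destructive_interference_size is the standard way to query the cache line size at compile time, although not every standard library ships it yet:
    #include <atomic>
    #include <new>  // std::hardware_destructive_interference_size
    struct alignas(std::hardware_destructive_interference_size) Counter {
        std::atomic<long> value{0};
    };
    // Or with a hardcoded constant when you know the target platform
    // (named differently so both definitions can coexist in one file):
    struct alignas(64) Counter64 {
        std::atomic<long> value{0};
    };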
Rust: the crossbeam-utils crate provides CachePadded<T>, or you can use #[repr(align(64))] directly:
    use crossbeam_utils::CachePadded;
    struct Counters {
        x: CachePadded<std::sync::atomic::AtomicI64>,
        y: CachePadded<std::sync::atomic::AtomicI64>,
    }
    // Without a dependency:
    #[repr(align(64))]
    struct AlignedCounter {
        value: std::sync::atomic::AtomicI64,
    }
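Returning to Java, a rough harness along the following lines can make the effect visible on your own machine. It is a sketch under several assumptions: the class names are hypothetical, the padded layout relies on the JVM placing superclass fields before subclass fields (typical for HotSpot, but not guaranteed by the language), and a single wall-clock measurement is noisy, so a harness such as JMH is the better tool for real numbers. Each field has exactly one writer thread, which is why the plain increment on a volatile field is safe here.

```java
final class FalseSharingSketch {
    // Two hot fields declared side by side; they very likely share a cache line.
    static final class Adjacent {
        volatile long a;
        volatile long b;
    }

    // Padding via a class hierarchy: 56 bytes of longs sit between a and b.
    static class HotA { volatile long a; }
    static class Padding extends HotA { long p1, p2, p3, p4, p5, p6, p7; }
    static final class Separated extends Padding { volatile long b; }

    // Runs both tasks on their own threads and returns elapsed nanoseconds.
    static long runPair(Runnable first, Runnable second) {
        Thread t1 = new Thread(first);
        Thread t2 = new Thread(second);
        long start = System.nanoTime();
        t1.start();
        t2.start();
        try {
            t1.join();
            t2.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return System.nanoTime() - start;
    }

    // One thread increments a, the other increments b, n times each.
    static long timeAdjacent(Adjacent c, long n) {
        return runPair(
                () -> { for (long i = 0; i < n; i++) c.a++; },
                () -> { for (long i = 0; i < n; i++) c.b++; });
    }

    static long timeSeparated(Separated c, long n) {
        return runPair(
                () -> { for (long i = 0; i < n; i++) c.a++; },
                () -> { for (long i = 0; i < n; i++) c.b++; });
    }

    public static void main(String[] args) {
        long n = 50_000_000L;
        System.out.printf("adjacent:  %d ms%n", timeAdjacent(new Adjacent(), n) / 1_000_000);
        System.out.printf("separated: %d ms%n", timeSeparated(new Separated(), n) / 1_000_000);
    }
}
```

Both variants do identical work and produce identical counts. Any difference in elapsed time comes from where the two fields sit relative to a cache line boundary.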
Single-Writer
The single-writer principle takes cache line awareness one step further: beyond padding to prevent false sharing, you arrange the system so that each cache line is owned by exactly one writer at a time. This eliminates write contention entirely rather than merely reducing its cost.
The Disruptor’s Sequence class is the canonical implementation. The sequence tracks the published position in the ring buffer, and it is written by exactly one producer or one consumer. The class isolates the counter on its own cache line with explicit padding on both sides:
    class LhsPadding {
        protected long p1, p2, p3, p4, p5, p6, p7;
    }
    class Value extends LhsPadding {
        protected volatile long value;
    }
    class RhsPadding extends Value {
        protected long p9, p10, p11, p12, p13, p14, p15;
    }
    public final class Sequence extends RhsPadding {
        // value has 56 bytes of padding on each side, so no other hot field
        // can land on its 64-byte line wherever the object happens to start.
        // Left padding guards against whatever precedes value in memory;
        // right padding guards against the next object in the same region.
        // The class hierarchy pins the field order, which a single class would not.
    }
The volatile on value provides the happens-before guarantee without a full lock. Readers can observe the sequence without acquiring write ownership of the line. Each write does still invalidate the copies held in readers’ caches, and each reader pays one coherency miss to fetch the new value. What the single writer removes is the expensive part: because only one thread ever writes the line, exclusive ownership never ping-pongs between competing writers, and no write has to be retried.
This pattern applies wherever you have a variable that one thread writes and many threads read. The single writer never contends with another writer, so it needs neither a lock nor a compare-and-swap loop, and readers pay at most one cache line refill per update.
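A minimal sketch of the pattern, separate from the Disruptor's own code: one thread owns a volatile counter and advances it with a plain read-then-write, while several readers poll it. The names OwnedSequence and demo are hypothetical. The plain increment is correct only because there is a single writer; with several writers you would need something like AtomicLong.incrementAndGet, which makes the cores compete for exclusive ownership of the line.

```java
import java.util.concurrent.atomic.AtomicLong;

final class SingleWriterSketch {
    // Only the owning thread may call advance(); any thread may call get().
    static final class OwnedSequence {
        private volatile long value = 0;

        // Not atomic as a read-modify-write; safe only with exactly one writer.
        void advance() { value = value + 1; }

        long get() { return value; }
    }

    // Runs one writer (the calling thread) and several readers. Returns true if
    // no reader ever saw the value go backwards and the final value is correct.
    static boolean demo(int readers, long updates) {
        OwnedSequence seq = new OwnedSequence();
        AtomicLong violations = new AtomicLong(); // shared only for error reporting
        Thread[] threads = new Thread[readers];
        for (int i = 0; i < readers; i++) {
            threads[i] = new Thread(() -> {
                long last = 0;
                while (last < updates) {
                    long seen = seq.get();
                    if (seen < last) violations.incrementAndGet();
                    last = seen;
                }
            });
            threads[i].start();
        }
        for (long i = 0; i < updates; i++) seq.advance();
        try {
            for (Thread t : threads) t.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return violations.get() == 0 && seq.get() == updates;
    }

    public static void main(String[] args) {
        System.out.println("consistent: " + demo(3, 1_000_000L));
    }
}
```

The readers never write to the sequence, so the only traffic on its cache line is the writer's updates flowing outward.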
Natural Batching
Natural batching emerges from the ring buffer’s structure rather than from explicit batching logic. The ring buffer is a fixed-size contiguous array with sequential layout, exactly what the prefetcher wants. When a consumer processes events, it reads sequentially through the ring, one cache line at a time, and the prefetcher runs ahead loading the next lines before the consumer reaches them.
The more interesting property is what happens under load. When a producer outpaces a consumer, events accumulate in the ring. When the consumer catches up, it finds multiple events ready to process and handles them in a tight sequential loop:
    // The consumer checks how far ahead the producer is and
    // processes everything available in one sequential pass.
    // (Simplified fragment: event is a local declared outside the loop.)
    long availableSequence = sequenceBarrier.waitFor(nextSequence);
    while (nextSequence <= availableSequence) {
        event = ringBuffer.get(nextSequence);
        // The third argument is the end-of-batch flag: true only for the last event.
        eventHandler.onEvent(event, nextSequence, nextSequence == availableSequence);
        nextSequence++;
    }
There is no external batching configuration. The batch size is determined by how much the producer has advanced while the consumer was processing the previous batch. Under high load, batches grow automatically, and because the scan runs through contiguous ring buffer slots, the later entries of a batch are likely to be in cache already by the time the loop reaches them. Under low load, latency stays minimal because the consumer processes each event as it arrives. The system self-tunes.
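The same behavior can be reproduced in a self-contained sketch. The code below is not the Disruptor's API: it is a hypothetical single-producer, single-consumer ring of longs with busy-spin waiting, written only to show the batch size emerging from the gap between the two sequences. It uses Thread.onSpinWait, which requires Java 9 or later.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.LongConsumer;

final class NaturalBatchingSketch {
    static final int SIZE = 1024;     // power of two, so index = sequence & MASK
    static final int MASK = SIZE - 1;

    private final long[] slots = new long[SIZE]; // contiguous storage
    private volatile long published = -1;        // written only by the producer
    private volatile long consumed = -1;         // written only by the consumer

    // Producer: wait until the slot is free, fill it, then publish the sequence.
    void publish(long value) {
        long next = published + 1;
        while (next - consumed > SIZE) Thread.onSpinWait(); // ring is full
        slots[(int) (next & MASK)] = value;
        published = next; // volatile write makes the slot visible to the consumer
    }

    // Consumer: spin until something is ready, then handle everything published
    // so far in one sequential pass. Returns the size of that batch.
    int drain(LongConsumer handler) {
        long next = consumed + 1;
        long available;
        while ((available = published) < next) Thread.onSpinWait();
        for (long s = next; s <= available; s++) {
            handler.accept(slots[(int) (s & MASK)]);
        }
        consumed = available; // releases the slots back to the producer
        return (int) (available - next + 1);
    }

    // Publishes 0..events-1 from a producer thread and drains on the caller's
    // thread, returning the sequence of batch sizes that occurred.
    static List<Integer> run(int events, LongConsumer handler) {
        NaturalBatchingSketch ring = new NaturalBatchingSketch();
        Thread producer = new Thread(() -> {
            for (long i = 0; i < events; i++) ring.publish(i);
        });
        producer.start();
        List<Integer> batchSizes = new ArrayList<>();
        int seen = 0;
        while (seen < events) {
            int n = ring.drain(handler);
            batchSizes.add(n);
            seen += n;
        }
        try {
            producer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return batchSizes;
    }

    public static void main(String[] args) {
        long[] sum = {0};
        List<Integer> batches = run(1_000_000, v -> sum[0] += v);
        int largest = 0;
        for (int b : batches) largest = Math.max(largest, b);
        System.out.printf("batches=%d largest=%d sum=%d%n", batches.size(), largest, sum[0]);
    }
}
```

Nothing in the code sets a batch size. The number of batches and the size of the largest one vary from run to run with thread scheduling, which is exactly the self-tuning the text describes.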
The Single Underlying Mechanism
All four principles reduce to the same hardware constraint. The CPU cannot move less than one cache line between memory and cache. It cannot follow an unpredictable access pattern without risking full RAM latency on every miss. One core cannot write to a cache line without invalidating every other core’s copy via MESI. Mechanical sympathy, as Thompson framed it, means writing code that works with these constraints rather than against them.
The Disruptor’s throughput numbers are what happen when all four principles apply simultaneously to the same hot path. Predictable access keeps the prefetcher ahead of the consumer. Cache line padding prevents producer and consumer sequences from sharing lines. Single-writer ensures no two threads race to write any given line. Natural batching turns load spikes into sequential bursts that the prefetcher handles efficiently. Each principle individually buys something; applied together on one data structure, they compound.
These constraints are not going away. Cache lines have been 64 bytes on x86 and on most ARM cores for long enough that it is a safe default, though not a universal constant: IBM POWER and Apple’s M-series chips use 128-byte lines, which is one reason to query the size where the language lets you rather than hardcode it. The prefetcher will keep rewarding sequential access. The MESI protocol and its variants will keep penalizing false sharing. Understanding why the four principles work is more durable than memorizing the patterns, because the underlying mechanics justify every new pattern you encounter going forward.