Four Principles, One Protocol: The MESI Theory Behind Mechanical Sympathy

When Caer Sanders distills mechanical sympathy into four principles in a piece on Martin Fowler’s site, they present them as separate practices: predictable memory access, cache line awareness, single-writer, and natural batching. That framing helps with learning them, but it can obscure that all four are responses to the same hardware mechanism: the MESI cache coherence protocol.

Understanding MESI does not make the four principles redundant; it makes them easier to apply consistently, and it makes violations easier to spot in unfamiliar contexts.

The State Machine Behind Every Memory Access

MESI governs how caches across cores stay consistent without a central memory authority. Each 64-byte cache line on each core is in one of four states at any moment:

Modified: this core has written to the line and holds the only valid copy.
Exclusive: this core has the only copy, and it is clean, matching main memory.
Shared: multiple cores hold clean copies. Reads generate no coherence traffic.
Invalid: this core’s copy is stale. Any access requires a fetch.

Reads in Shared state are free. Writes are expensive: a core wanting to write a Shared line must broadcast an invalidation to every other holder, wait for acknowledgments, and only then proceed. That round trip scales in cost with core count and interconnect topology.

The four mechanical sympathy principles are strategies for staying in favorable MESI states and avoiding expensive transitions. All four reduce to one principle: minimize Shared-to-Invalid transitions by controlling who writes what and when.

Predictable Access Keeps Shared State Affordable

The hardware prefetcher watches access patterns and loads cache lines before they are needed, but only when those patterns are regular enough to predict. Sequential and strided access is predictable; pointer-chasing is not. When the prefetcher cannot help, every cache miss falls through to DRAM at 60 to 100 nanoseconds, versus around 1 nanosecond for L1.

Sequential access through a large dataset typically runs 100 to 200 times faster per element than random access on the same data. The prefetcher keeps the pipeline fed in the sequential case; it cannot in the random case.

Data layout has the most leverage here. Consider a simulation updating particle positions with an Array of Structs layout:

struct Particle {
    float x, y, z;
    float vx, vy, vz;
    float mass;
    int type;  // 32 bytes per particle, two particles per 64-byte cache line
};
Particle particles[N];

A loop that updates only x, y, z loads vx, vy, vz, mass, and type into every cache line and discards them. Half the memory bandwidth moves data the loop never uses. Structure of Arrays fixes this:

struct Particles {
    float x[N], y[N], z[N];
    float vx[N], vy[N], vz[N];
    float mass[N];
    int   type[N];
};

The x array delivers 16 floats per cache line, all consumed by the position update loop. Mike Acton’s CppCon 2014 talk on Data-Oriented Design documented throughput differences of 10 to 50 times in game engine inner loops from layout changes alone. The algorithm is identical between the two layouts; the data layout determines how much of each cache line the inner loop actually uses.

The coherence protocol treats the 64-byte cache line as its unit of ownership. Two threads writing to different variables that share a cache line both trigger Shared-to-Invalid transitions, even if they never logically share data. The protocol does not inspect the contents; it only tracks writes to the line.

struct Counters {
    std::atomic<int64_t> requests;  // offset 0
    std::atomic<int64_t> errors;    // offset 8, same cache line as requests
};

Thread A increments requests. Thread B’s copy of the line is immediately Invalid. Thread B must re-fetch before incrementing errors. Thread A’s next increment repeats the cycle. Both threads are serialized by the coherence protocol despite operating on separate fields. Measured slowdowns under this pattern run 5 to 20 times, scaling worse as core count increases because more cores hold stale copies that need invalidation.

The fix is alignment to cache line boundaries:

#ifdef __cpp_lib_hardware_interference_size
    constexpr std::size_t cache_line = std::hardware_destructive_interference_size;
#else
    constexpr std::size_t cache_line = 64;
#endif

struct alignas(cache_line) PaddedCounter {
    std::atomic<int64_t> value{0};
};

PaddedCounter counters[MAX_THREADS];  // each counter on its own line

Java’s @Contended annotation (as @jdk.internal.vm.annotation.Contended) handles this automatically; the JVM inserts padding around annotated fields. Diagnosing false sharing in production rather than in microbenchmarks requires perf c2c, which reports cross-core cache-to-cache transfers at symbol granularity:

perf c2c record -g -- ./your_binary
perf c2c report --stdio

The inverse also applies: placing a lock and the data it protects on the same cache line means the thread that acquires the lock already has the protected data in L1. C++17’s std::hardware_constructive_interference_size enforces this co-location.

Single-Writer Is Coherence Protocol Steering

If false sharing is the pathological case of two writers contending a line, single-writer is the architectural response: designate exactly one thread as the writer to each piece of data. Other threads read freely in Shared state with no coherence traffic. The writer keeps the line in Modified or Exclusive, where writes are cheap.

The LMAX Disruptor implements this directly. Each sequence counter has exactly one owner, one producer or one consumer. No two threads compete to write the same location. The Disruptor achieves millions of inter-thread messages per second on commodity hardware using a ring buffer and this single principle applied consistently.

Rust’s ownership system encodes single-writer as a compile-time invariant. The &mut T type guarantees exclusive mutable access; the borrow checker ensures no two owners hold a mutable reference simultaneously. Aliased mutable state is both the source of memory safety bugs and the source of coherence overhead, and Rust’s design eliminates both at the same enforcement point. The language semantics and the hardware requirement map onto each other directly.

One qualification: single-writer does not eliminate the need for memory barriers. A reader must observe the writer’s writes in the correct order. Acquire/release semantics on atomic operations enforce visibility without requiring mutual exclusion; they are the mechanism that makes single-writer coherent from the reader’s perspective.

Natural Batching Amortizes Transition Costs

Every expensive operation has a fixed overhead component. Syscalls, lock acquisitions, memory allocations, network round trips: each costs roughly the same regardless of how much work it delivers. Batching amortizes that fixed cost across more work items, reducing how often any expensive transition is triggered.

The Disruptor implements this on the consumer side: a consumer claims all available slots up to the producer’s current sequence rather than one at a time. Under load, consumers process items in natural clusters that the branch predictor handles as a tight inner loop rather than as sporadic individual dispatches.

Linux io_uring generalizes batching into a system call interface: enqueue multiple I/O operations into a submission ring, then submit them with a single syscall. Per-operation overhead drops from microseconds to nanoseconds for workloads that issue many small requests.

At the instruction level, batching enables auto-vectorization. Compilers vectorize tight loops over contiguous data when the loop body is simple and contains no function calls creating alias-analysis barriers. A single non-inlined call inside a loop prevents vectorization entirely, because the compiler cannot reason about aliasing or side effects past it. With inlining enabling AVX2, a simple arithmetic loop can drop from roughly 1.0 ns per element to 0.08 to 0.12 ns, an order of magnitude difference from the optimizer seeing across the function boundary.

A Case Study: Go’s Green Tea GC

Go 1.24’s garbage collector applied all four principles simultaneously to the marking phase. The original implementation maintained a work list of object addresses. Processing that list required following each pointer to the target object through randomly placed heap allocations. Measured at production scale, 35% of GC mark CPU time was stalled on main memory, a direct consequence of the random access pattern.

The redesign replaced the object-address work list with a page-level FIFO queue. The GC marks by scanning contiguous span bitmaps rather than chasing individual pointers. Sequential bitmap access enables the hardware prefetcher. Consecutive bits in a bitmap share cache lines, delivering hundreds of object states per load. Each span bitmap has one designated writer. The GC processes all marks in a span before moving to the next, forming a natural batch.

The result is a 10 to 40% reduction in GC CPU time across all Go workloads regardless of application type. No algorithm changed; only the data layout and access pattern changed. The four principles compounded because they all addressed the same bottleneck: cache miss cost during traversal. Each one contributed by steering the GC’s memory access pattern toward favorable MESI states.

Beyond Single-Socket: NUMA

On multi-socket servers, the mechanical sympathy analysis extends. NUMA (Non-Uniform Memory Access) adds a layer above the cache hierarchy: memory attached to one socket costs 30 to 60 nanoseconds more per access from a different socket than from memory local to that socket. For workloads whose hot data exceeds L3 capacity (typically 32 to 96 MB per socket on current processors), remote DRAM accesses pay that penalty constantly.

The MESI protocol still governs coherence across sockets; the costs are higher and the optimization surface includes which NUMA node holds each allocation. Predictable access remains the first concern, but topology-aware allocation becomes an additional degree of freedom. numactl --hardware shows the distance matrix; numastat -p <pid> shows the per-NUMA-node memory distribution at runtime for a running process.

The Underlying Unity

All four principles trace back to the same observation: the hardware memory subsystem is a state machine with cheap states and expensive transitions. Predictable access keeps the prefetcher in a state where it can help. Cache line awareness prevents unnecessary Shared-to-Invalid transitions from unrelated writes. Single-writer keeps each line in Exclusive or Modified rather than cycling through Shared. Batching reduces how often any expensive operation is triggered.

The original article notes that you do not need to be a chip designer to benefit from these principles. The more useful framing is that you need to understand one thing about the chip: cache coherence is a state machine, and software pays for the transitions it forces. Learning the four principles individually is sufficient for most performance work. Understanding their shared source in MESI makes it easier to recognize violations in unfamiliar situations and to apply the underlying logic when the specific rule does not obviously fit.