The Hardware Contract: Four Principles for Writing Software That Works With the Machine

There is a term that keeps resurfacing in high-performance systems work: mechanical sympathy. It comes from racing driver Jackie Stewart, who said that to be a great driver you did not need to be a great mechanic, but you needed enough mechanical sympathy to understand what the car was doing. Martin Thompson borrowed it for software in the early 2010s while building the LMAX Disruptor, a lock-free inter-thread messaging library. At the time, the LMAX exchange built around it was reported to process six million orders per second on a single business-logic thread. The idea was simple: your code runs on hardware, and hardware has opinions. Ignoring those opinions does not make them go away.

A recent article on Martin Fowler’s site by Caer Sanders distills this into four everyday principles: predictable memory access, awareness of cache lines, single-writer, and natural batching. These are not exotic optimizations. They are baseline habits that prevent you from accidentally fighting the hardware on every hot path.

This post goes deeper on each one, with concrete examples and the hardware reasoning behind them.

Why Modern Hardware Has Such a Wide Gap Between Its Best and Worst Case

Before the principles, you need to understand the numbers. On a typical x86-64 machine in 2025, an L1 cache hit costs roughly 1 nanosecond. L2 is around 4ns. L3 is 10 to 40ns depending on cache size and topology. A main memory access is 60 to 100ns. That is not a linear scale. Going from L1 to RAM is a 60x to 100x slowdown, and it happens silently: the CPU stalls, waiting for the memory controller to fetch data.

The prefetcher exists to hide this latency. It watches your memory access patterns and speculatively loads data before you ask for it. If you access memory sequentially, the prefetcher keeps up. If you scatter reads across memory unpredictably, it cannot, and you pay full DRAM latency on every miss.

This is the hardware contract. Sequential, predictable, localized access is fast. Scattered, pointer-chasing, random access is slow. Mechanical sympathy is about writing code that honors that contract.
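To make the contract concrete, here is a minimal sketch that sums the same values twice: once front to back, and once in a shuffled order. The function names and the small xorshift shuffle are hypothetical helpers written for this illustration, not from any library. The timings depend heavily on the machine, the array size, and compiler flags, so treat the result as something to measure yourself rather than a fixed ratio.

```rust
use std::time::{Duration, Instant};

/// Sum the values by walking the slice front to back (prefetcher-friendly).
pub fn sum_sequential(data: &[u64]) -> u64 {
    data.iter().copied().fold(0u64, u64::wrapping_add)
}

/// Sum the same values, visiting them in the order given by `order`.
/// When `order` is shuffled, each load lands somewhere the prefetcher cannot guess.
pub fn sum_in_order(data: &[u64], order: &[usize]) -> u64 {
    order.iter().fold(0u64, |acc, &i| acc.wrapping_add(data[i]))
}

/// Build a shuffled permutation of 0..n with a small xorshift generator
/// (hypothetical helper, used only to avoid an external crate).
pub fn shuffled_indices(n: usize, seed: u64) -> Vec<usize> {
    let mut idx: Vec<usize> = (0..n).collect();
    let mut state = seed | 1; // xorshift state must be non-zero
    for i in (1..n).rev() {
        state ^= state << 13;
        state ^= state >> 7;
        state ^= state << 17;
        let j = (state % (i as u64 + 1)) as usize;
        idx.swap(i, j);
    }
    idx
}

/// Time both traversals over `n` elements and return (sequential, scattered).
pub fn compare(n: usize) -> (Duration, Duration) {
    let data: Vec<u64> = (0..n as u64).collect();
    let order = shuffled_indices(n, 0x9E37_79B9_7F4A_7C15);

    let start = Instant::now();
    let a = sum_sequential(&data);
    let sequential = start.elapsed();

    let start = Instant::now();
    let b = sum_in_order(&data, &order);
    let scattered = start.elapsed();

    assert_eq!(a, b); // same work, different access order
    (sequential, scattered)
}
```

Calling compare with an n large enough that the data no longer fits in L3 (tens of millions of elements on most desktops) and building in release mode is where the gap between the two traversals should become visible.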

Principle 1: Predictable Memory Access

The most impactful change many codebases can make is switching from Array-of-Structs (AoS) to Struct-of-Arrays (SoA) layout for hot data. Consider a particle system:

// Array of Structs - common default
struct Particle {
    float x, y, z;
    float vx, vy, vz;
    float mass;
    int alive;
};
Particle particles[100000];

// Struct of Arrays - cache-friendly for computation
struct ParticleSystem {
    float x[100000];
    float y[100000];
    float z[100000];
    float vx[100000];
    float vy[100000];
    float vz[100000];
    float mass[100000];
    int alive[100000];
};

If your physics loop only touches x, y, z, vx, vy, vz each frame, the AoS layout still pulls mass and alive into cache with every particle. Assuming 4-byte float and int, that is a quarter of each 32-byte struct, so a quarter of your cache capacity and memory bandwidth goes to data you are not using. The SoA layout lets you stream through only the arrays you need, and each array is a simple sequential stream that the prefetcher can follow cleanly.

This matters more than it used to because SIMD (Single Instruction Multiple Data) instructions, which are how modern CPUs do floating-point work fast, work best on contiguous, same-typed data. The SoA layout makes vectorization straightforward; the AoS layout often prevents it entirely or forces the compiler to emit expensive gather/scatter instructions.
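As a rough sketch of what an SoA update loop can look like, the following Rust version keeps one Vec per field and integrates positions one axis at a time. The Particles type, its constructor defaults, and the advance helper are assumptions made for this illustration; whether the inner loop actually vectorizes depends on the compiler and optimization level.

```rust
/// Struct-of-Arrays particle storage: one contiguous Vec per field.
/// (Illustrative type; field names mirror the C example.)
pub struct Particles {
    pub x: Vec<f32>,
    pub y: Vec<f32>,
    pub z: Vec<f32>,
    pub vx: Vec<f32>,
    pub vy: Vec<f32>,
    pub vz: Vec<f32>,
    pub mass: Vec<f32>,   // never touched by `integrate`
    pub alive: Vec<bool>, // never touched by `integrate`
}

impl Particles {
    /// Create `n` particles at the origin, at rest, with unit mass.
    pub fn new(n: usize) -> Self {
        Particles {
            x: vec![0.0; n],
            y: vec![0.0; n],
            z: vec![0.0; n],
            vx: vec![0.0; n],
            vy: vec![0.0; n],
            vz: vec![0.0; n],
            mass: vec![1.0; n],
            alive: vec![true; n],
        }
    }

    /// Advance every position by velocity * dt, one axis at a time.
    /// Only six of the eight arrays are ever loaded into cache here.
    pub fn integrate(&mut self, dt: f32) {
        advance(&mut self.x, &self.vx, dt);
        advance(&mut self.y, &self.vy, dt);
        advance(&mut self.z, &self.vz, dt);
    }
}

/// A tight loop over two contiguous f32 slices: the shape that
/// auto-vectorizers and hardware prefetchers handle best.
fn advance(pos: &mut [f32], vel: &[f32], dt: f32) {
    for (p, v) in pos.iter_mut().zip(vel) {
        *p += *v * dt;
    }
}
```

The design choice worth noticing is that advance knows nothing about particles. It only sees two flat slices, which is what keeps the loop simple enough for the compiler to optimize.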

Game engines like Bevy are built on ECS (Entity Component System) architectures, which store components of the same type together in contiguous arrays. That builds an SoA-style layout into the system design rather than leaving it as an optimization afterthought.

Principle 2: Cache Line Awareness

Cache lines are 64 bytes wide on x86 (128 bytes on Apple Silicon). The CPU never loads a single byte from memory; it loads the whole line. This creates two problems.

The first is false sharing: two threads writing to different variables that happen to sit on the same cache line. Each write invalidates the line in the other core’s cache via the cache coherence protocol (MESI or one of its variants), forcing that core to re-fetch the line before its next access. The threads are not sharing data logically, but the hardware thinks they are.

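// False sharing: adjacent longs in one array share cache lines
long[] counters = new long[THREAD_COUNT];

// Each thread writes counters[threadId] heavily,
// so the threads keep invalidating each other's cache lines

// Fixed: give each counter its own padded object
// Java 8: sun.misc.Contended; Java 9+: jdk.internal.vm.annotation.Contended
// Outside the JDK it only takes effect with -XX:-RestrictContended
@Contended
static class PaddedCounter {
    volatile long value;
    // JVM pads around the fields (128 bytes by default)
}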

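Java’s @Contended annotation (or manual padding in C) tells the runtime to isolate the annotated field or class from its neighbors so it does not share a cache line with them. The LMAX Disruptor gets the same effect by manually padding its sequence counters with unused long fields. The JDK’s LongAdder spreads updates across @Contended cells for exactly this reason and outperforms AtomicLong significantly under contention.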
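The same fix can be sketched in Rust with an alignment attribute instead of an annotation. This is a minimal illustration, assuming a 64-byte cache line (on hardware with 128-byte lines you would raise the alignment); PaddedCounter and count_in_parallel are hypothetical names for this example.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::thread;

/// One counter per cache line. The 64 here is an assumption about the
/// target's line size; the alignment also rounds the struct size up to 64.
#[repr(align(64))]
pub struct PaddedCounter {
    pub value: AtomicU64,
}

/// Spawn `threads` workers that each increment their own padded counter
/// `increments` times, then return the final value of every counter.
pub fn count_in_parallel(threads: usize, increments: u64) -> Vec<u64> {
    let counters: Vec<PaddedCounter> = (0..threads)
        .map(|_| PaddedCounter { value: AtomicU64::new(0) })
        .collect();

    // Scoped threads may borrow `counters`; each worker writes only its own
    // element, and no two elements share a 64-byte line.
    thread::scope(|s| {
        for counter in &counters {
            s.spawn(move || {
                for _ in 0..increments {
                    counter.value.fetch_add(1, Ordering::Relaxed);
                }
            });
        }
    });

    counters
        .iter()
        .map(|c| c.value.load(Ordering::Relaxed))
        .collect()
}
```

Removing the repr(align(64)) attribute leaves the results identical and packs eight counters into each 64-byte line, which is a simple way to measure what false sharing costs on a given machine.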

The second problem is the inverse: failing to pack related data together. If two fields are always read together but sit in different cache lines, you pay two cache misses instead of one. Struct layout order is not free. In C, the compiler keeps fields in declaration order (modulo alignment padding). Rust’s default representation lets the compiler reorder fields, so you need #[repr(C)] if you want declaration order preserved. Either way, arranging for frequently co-accessed fields to sit adjacent is a real optimization.

In Rust, you can audit your struct layouts with std::mem::size_of and the std::mem::offset_of! macro (stable since Rust 1.77), or use the memoffset crate on older toolchains.
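A small sketch of such a layout audit follows. It assumes Rust 1.77 or newer for offset_of! and a target where u64 is 8-byte aligned (x86-64 or AArch64, for example). The two structs are hypothetical and use #[repr(C)] so that declaration order is what actually gets measured.

```rust
use std::mem::{offset_of, size_of};

/// Declaration order forces padding after each u8 so `total` stays aligned.
#[repr(C)]
pub struct Scattered {
    pub flag_a: u8,
    pub total: u64,
    pub flag_b: u8,
}

/// Same fields with the widest first: the two flags share one 8-byte slot.
#[repr(C)]
pub struct Packed {
    pub total: u64,
    pub flag_a: u8,
    pub flag_b: u8,
}

/// Format the size and field offsets of both layouts for inspection.
pub fn layout_report() -> String {
    format!(
        "Scattered: size={} flag_a@{} total@{} flag_b@{}\n\
         Packed:    size={} total@{} flag_a@{} flag_b@{}",
        size_of::<Scattered>(),
        offset_of!(Scattered, flag_a),
        offset_of!(Scattered, total),
        offset_of!(Scattered, flag_b),
        size_of::<Packed>(),
        offset_of!(Packed, total),
        offset_of!(Packed, flag_a),
        offset_of!(Packed, flag_b),
    )
}
```

Under those assumptions the reordered struct shrinks from 24 bytes to 16, so four of them fit in a 64-byte line instead of two and a fraction.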

Principle 3: Single-Writer

The single-writer principle says that for any given piece of mutable state, only one thread should ever write to it. Reads can be distributed, but writes must be exclusive, and that exclusivity comes from ownership rather than from locks.

Locks are slow not primarily because of acquisition overhead, but because of what contention forces: cache line bouncing. The line holding the lock word, and the lines holding the data it protects, have to be invalidated and re-fetched across cores on every write. The lock itself becomes a bottleneck independent of what it protects.

The Disruptor solves this by partitioning ownership. The ring buffer is pre-allocated, so no garbage is created, and no locks are taken. A producer claims a sequence number, writes into the slot that sequence gives it exclusive ownership of, then publishes. With a single producer, claiming is a plain increment; with multiple producers, it is an atomic compare-and-swap on a shared cursor. Consumers read sequentially. Because the ring buffer’s sequence counters are the only contended state, and those are padded to avoid false sharing, the throughput scales cleanly.

// Disruptor single-producer example (simplified)
long sequence = ringBuffer.next();  // claim slot
try {
    DataEvent event = ringBuffer.get(sequence);
    event.value = computedValue;    // write to owned slot
} finally {
    ringBuffer.publish(sequence);   // make visible to consumers
}

The same principle applies at larger scales. Actor systems like Akka and Erlang’s processes enforce single-writer by construction: each actor owns its mailbox and state, and no two actors share mutable state. Go encourages similar discipline through channels ("share memory by communicating"), although the language does not enforce it. The hardware benefit is the same: a piece of state has one home, one core’s cache that owns it, and far less coherence traffic.
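Here is a minimal sketch of single-writer by ownership using only the Rust standard library: one owner thread holds the mutable map, and every other thread sends it messages instead of taking a lock. The function name and the message shape are assumptions for this illustration. The channel itself is still a shared, synchronized structure, so this shows the ownership pattern rather than a Disruptor-grade queue.

```rust
use std::collections::HashMap;
use std::sync::mpsc;
use std::thread;

/// Run `producers` threads that each send `per_producer` increments to a
/// single owner thread, and return the totals the owner accumulated.
pub fn run_single_writer(producers: usize, per_producer: u64) -> HashMap<usize, u64> {
    let (tx, rx) = mpsc::channel::<(usize, u64)>();

    // The owner thread is the only code that ever mutates `totals`.
    let owner = thread::spawn(move || {
        let mut totals: HashMap<usize, u64> = HashMap::new();
        for (id, amount) in rx {
            *totals.entry(id).or_insert(0) += amount;
        }
        totals // the loop ends once every sender has been dropped
    });

    // Producers never touch the map; they only describe the change they want.
    let handles: Vec<_> = (0..producers)
        .map(|id| {
            let tx = tx.clone();
            thread::spawn(move || {
                for _ in 0..per_producer {
                    tx.send((id, 1)).expect("owner thread is alive");
                }
            })
        })
        .collect();
    drop(tx); // drop the original sender so the owner can finish

    for handle in handles {
        handle.join().expect("producer panicked");
    }
    owner.join().expect("owner panicked")
}
```

Because the map never leaves the owner thread, it needs no Mutex and no atomic fields; the compiler's ownership rules enforce what the principle asks for.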

In data systems, this shows up as the single-writer log. In Kafka and similar log-based systems, all writes to a partition go through that partition’s leader and are appended to a log, not for simplicity alone but because sequential appends are about as fast as storage gets.

Principle 4: Natural Batching

Batching amortizes fixed costs over multiple operations. The fixed cost is usually a round trip: a syscall, a network flush, a disk write, a lock acquisition. If you pay that cost once per item, you pay it as many times as you have items. If you pay it once per batch, you pay it far fewer times.

The most direct form is I/O batching. Linux’s write() syscall has meaningful overhead per call. Using writev() to submit multiple buffers in one call, or buffering writes and flushing periodically, reduces that overhead. The same applies to database clients: submitting ten inserts in a single transaction is dramatically faster than ten separate auto-committed inserts because you pay the commit latency once instead of ten times.

At the CPU level, batching enables prefetching and branch prediction to work better. If you process 1000 items in a tight loop, the CPU learns the pattern. If you process one item, yield to the scheduler, come back, and process the next, you have destroyed the hardware’s ability to prefetch or predict.

This is why BufferedReader exists in Java, why BufWriter exists in Rust, and why virtually every high-throughput system buffers before flushing. The default unbuffered API is honest about what it does, but it is not appropriate for most write-heavy code paths.

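use std::fs::File;
use std::io::{self, BufWriter, Write};

// Unbuffered: one write syscall per line
fn write_unbuffered(lines: &[String]) -> io::Result<()> {
    let mut file = File::create("output.txt")?;
    for line in lines {
        file.write_all(line.as_bytes())?;  // syscall each time
    }
    Ok(())
}

// Buffered: roughly one syscall per 8 KiB (BufWriter's current default capacity)
fn write_buffered(lines: &[String]) -> io::Result<()> {
    let mut writer = BufWriter::new(File::create("output.txt")?);
    for line in lines {
        writer.write_all(line.as_bytes())?;  // copied into the buffer, flushed when full
    }
    writer.flush()  // flush explicitly so a late write error is not lost on drop
}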

Natural batching also connects to the single-writer principle. When one thread owns a write path, it can accumulate work and flush it in a batch without coordination overhead. Multiple writers contending to flush produce less throughput than one writer batching and flushing on its own schedule.
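One way to sketch this pattern is a writer loop that blocks for the first item and then takes whatever else has already queued up, so batch size follows load without any timer. The names next_batch and run_writer, and the choice of a standard-library channel, are assumptions for this illustration rather than an established API.

```rust
use std::sync::mpsc::{Receiver, TryRecvError};

/// Block for one item, then take whatever else is already queued, up to `max`.
/// Returns None once the channel is closed and empty.
/// A batch always holds at least the one item that was waited for.
pub fn next_batch<T>(rx: &Receiver<T>, max: usize) -> Option<Vec<T>> {
    let first = rx.recv().ok()?;
    let mut batch = vec![first];
    while batch.len() < max {
        match rx.try_recv() {
            Ok(item) => batch.push(item),
            // Nothing more is waiting right now: flush what we have.
            Err(TryRecvError::Empty) | Err(TryRecvError::Disconnected) => break,
        }
    }
    Some(batch)
}

/// Drain the channel batch by batch, calling `flush` once per batch.
/// Returns how many flushes were needed.
pub fn run_writer<T>(rx: Receiver<T>, max: usize, mut flush: impl FnMut(&[T])) -> usize {
    let mut flushes = 0;
    while let Some(batch) = next_batch(&rx, max) {
        flush(&batch); // one fixed cost (syscall, commit, network send) per batch
        flushes += 1;
    }
    flushes
}
```

Under light load each batch holds a single item and latency stays low; under heavy load the batches grow toward max and the fixed cost is spread over more items. The cap exists so one burst cannot delay a flush indefinitely.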

These Principles Compose

The most interesting thing about mechanical sympathy as a discipline is that its principles reinforce each other. Single-writer reduces coherence traffic, which means cache lines stay warm. Predictable memory access enables prefetching, which hides the latency you do pay. Natural batching reduces the number of times you touch any given lock or syscall boundary. Cache line awareness prevents the hardware-level sharing that undermines single-writer.

You rarely implement all four simultaneously from scratch. But when a hot path is slow and profiling shows cache misses, lock contention, or syscall overhead, these principles give you a vocabulary for diagnosing what went wrong and a menu of structural changes that can fix it.

The gap between hardware capability and software performance in typical systems is large, not because the hardware is hard to reason about, but because the default abstractions (linked lists, shared mutable state, per-operation I/O) make it easy to write code that happens to be the worst case for the machine underneath. Mechanical sympathy is the habit of checking, before those patterns harden, whether you are building with the machine or against it.