When the Bottleneck Lives in the Cache: C++ Performance from the Hardware Up
Source: isocpp
The cache is the bottleneck. Not the algorithm, not the network, not the database. For a wide class of C++ performance problems, the CPU is already fast enough; what it is waiting on is data to arrive from memory. Understanding this shifts how you read profiler output, how you design data structures, and what you reach for first when a workload underperforms.
Michelle D’Souza’s CppCon 2025 talk “Cache Me Maybe” covers this territory for a broad audience, working through cache fundamentals, access pattern optimization, and the specific bug known as false sharing. This post takes those starting points further, grounding each technique in the hardware mechanism and filling in the measurement workflow you need to apply them in production code.
The Hardware Model
Modern CPUs have three levels of cache between the processor cores and main memory. The numbers vary by microarchitecture, but the order of magnitude is consistent across recent Intel and AMD processors:
| Level | Typical size | Access latency |
|---|---|---|
| L1 data | 32–64 KB per core | ~4 cycles (~1 ns) |
| L2 | 256 KB – 2 MB per core | ~12 cycles (~4 ns) |
| L3 | 6–96 MB shared | ~40 cycles (~13 ns) |
| DRAM | — | ~200–300 cycles (~60–100 ns) |
The gap between L1 and DRAM is roughly 50 to 75 times in latency. This is not a minor detail to optimize around at the margins; it is the dominant performance variable for memory-bound code.
The other key fact is that memory moves between DRAM and cache in 64-byte units called cache lines, regardless of how many bytes the code actually requests. Loading a single int costs the same as loading 16 of them, because the CPU fetches the whole line either way. Every technique for cache-friendly code is a consequence of this one mechanism: waste fewer bytes per cache line on data you do not need, and do not force cache lines to travel unnecessarily.
False Sharing: The Silent Multithreaded Penalty
False sharing is the most reliably destructive cache bug in concurrent C++ code. Two threads write to different variables, never touching the same data, but their variables happen to live on the same 64-byte cache line. The CPU’s cache coherence protocol, MESI, requires that when one core writes to a cache line, all other cores holding copies of that line must invalidate them. A write on core A causes core B’s copy to be invalidated; core B’s next write requires fetching the line back from core A. The two threads are now effectively serialized, with each write incurring a cross-core round trip.
The code looks innocent:
struct Counters {
std::atomic<int64_t> requests; // offset 0
std::atomic<int64_t> errors; // offset 8
};
Two independent counters, updated by separate threads. Because both fit within 64 bytes, they share a cache line, and every increment on one invalidates the other. Benchmarks consistently show 5 to 20 times slowdown under this pattern, scaling worse as core count increases.
The fix is padding. C++17 provides std::hardware_destructive_interference_size in <new> for exactly this purpose:
#ifdef __cpp_lib_hardware_interference_size
constexpr std::size_t cache_line = std::hardware_destructive_interference_size;
#else
constexpr std::size_t cache_line = 64;
#endif
struct alignas(cache_line) PaddedCounter {
std::atomic<int64_t> value{0};
};
PaddedCounter counters[MAX_THREADS];
Each counter now occupies its own cache line. The coherence traffic disappears.
Note that Clang 17 and earlier do not implement hardware_destructive_interference_size, so the fallback to a hard-coded 64 is necessary for portable code. GCC 12 and later implement it but emit a warning (-Winterference-size) if the value is baked into ABI-sensitive contexts, since the constant is technically target-dependent. The #ifdef guard handles both cases cleanly.
The inverse of false sharing is constructive interference: placing a mutex and the data it protects on the same cache line, so the thread that acquires the lock already has the data in L1. std::hardware_constructive_interference_size (also typically 64) and alignas enforce co-location for this pattern.
Data Layout: AoS vs SoA
Array of Structs groups all fields for one object together in memory. Structure of Arrays groups all values of the same field together. The choice between them is not a stylistic one; it determines how many useful bytes land in each cache line load.
// AoS: all Particle fields together
struct Particle {
float x, y, z;
float vx, vy, vz;
float mass;
int type; // 32 bytes total, two per cache line
};
std::vector<Particle> particles(N);
// SoA: each field in its own array
struct Particles {
std::vector<float> x, y, z;
std::vector<float> vx, vy, vz;
std::vector<float> mass;
std::vector<int> type;
};
When a physics simulation loop advances particle positions, it reads x, y, z, vx, vy, vz and ignores mass and type. In AoS layout, every cache line load pulls in mass and type too, wasting a portion of the available bandwidth on data the loop discards. In SoA layout, the x array delivers 16 floats per cache line, all of them used, and mass is never loaded at all for this loop.
The throughput difference on bandwidth-bound loops runs to 2 to 8 times depending on how much of the struct the inner loop ignores. Particle systems, game entity components, and columnar data processing are natural fits for SoA. Code that processes all fields of a single object at once, such as serialization or per-object decision trees, may favor AoS because one or two cache line loads bring in everything needed.
A hybrid layout, AoSoA, packs a fixed-width block of SoA data into each group, matching the width of SIMD registers. Processing 8 float x-values packed into a single AVX2 register means the inner loop is both cache-friendly and fully vectorized. Mike Acton’s CppCon 2014 talk on Data-Oriented Design makes this case in detail with game engine examples that show 10 to 50 times variance from data layout alone.
Access Patterns and the Hardware Prefetcher
The hardware prefetcher detects sequential access patterns and loads cache lines ahead of the instruction stream automatically. Iterating a std::vector<int> sequentially keeps the L1 miss rate near zero for large arrays; the prefetcher stays ahead of the loop. Random access through a linked list or a pointer-heavy tree provides no predictable pattern, so nearly every node traversal becomes an L3 miss or a DRAM access.
The performance gap between sequential and random access on a 100 MB dataset typically runs to 100 to 200 times in latency per element. This is why replacing std::list with std::vector, and accepting the O(n) copy cost for midpoint insertions, often yields dramatic speedups in practice. The algorithmic complexity argument for linked lists stops being valid once the working set exceeds L3 size and every pointer dereference becomes a memory round trip. Chandler Carruth’s CppCon 2014 talk benchmarks exactly this tradeoff, showing vector winning over list up to surprisingly large element counts.
The same logic applies to hash maps. std::unordered_map uses chained collision resolution, meaning every lookup traverses at least one extra pointer hop. Open-addressing implementations like absl::flat_hash_map store all data in a flat array; cache line utilization is higher and pointer chasing is eliminated. Benchmarks from the Abseil project show 2 to 4 times lookup speedup in cache-sensitive workloads.
For cases where the access pattern is known but not sequential, software prefetching can help:
for (int i = 0; i < n; i++) {
if (i + 16 < n)
__builtin_prefetch(data + i + 16, 0, 1);
process(data[i]);
}
The prefetch distance should be calibrated to the ratio of memory latency to loop iteration time. At roughly 200 cycles for a DRAM load and 3 cycles per iteration, prefetching 64 or more elements ahead may be warranted. This technique is narrow in applicability; the hardware prefetcher handles the common case, and manual prefetching in the wrong place adds noise without helping. Measure before adding it.
Measuring Cache Performance
No optimization work is trustworthy without measurement. Linux perf is the most accessible starting point:
# Count cache misses across the whole program
perf stat -e cache-references,cache-misses,LLC-load-misses ./your_program
# Sample which functions are responsible for LLC misses
perf record -e LLC-load-misses -g ./your_program
perf report --sort=dso,symbol
The ratio of cache-misses to cache-references gives a quick read on how cache-friendly a workload is overall. A ratio above 5 to 10 percent on a compute-heavy workload usually indicates a layout or access pattern problem worth investigating.
For structured benchmarking, Google Benchmark’s sweep pattern characterizes cache behavior by varying the working set size:
BENCHMARK(BM_Scan)->Range(1 << 10, 1 << 26);
Running the benchmark from 1 KB to 64 MB will show clear inflection points at L1, L2, and L3 boundaries. Throughput drops sharply each time the working set exceeds a cache level. This sweep pattern is the standard tool for understanding where a workload sits in the hierarchy.
Cachegrind (part of Valgrind) simulates cache behavior without requiring hardware performance counters, making it useful in virtual machines and CI environments where perf is unavailable:
valgrind --tool=cachegrind ./your_program
cg_annotate cachegrind.out.<pid> --auto=yes
Its simulated cache model is not cycle-accurate, but it identifies miss hotspots at the source line level reliably. Intel VTune’s “Memory Access” analysis provides a more detailed breakdown, including the fraction of cycles attributed to L1, L2, L3, and DRAM latency per function, and distinguishes local from remote memory accesses on NUMA systems.
The Priority Order
Cache optimization has a natural priority order based on impact and effort. Eliminating false sharing comes first; it causes order-of-magnitude regressions in concurrent code and the fix is a one-line alignas. Moving from node-based containers to contiguous storage is second, particularly for workloads that traverse large collections. Switching to SoA data layouts is third, valuable for large-scale data processing loops but requiring more structural change to the codebase. Software prefetching comes last, narrow in applicability and requiring careful measurement to use correctly.
Profiling first with perf stat -e cache-misses takes less than a minute and usually reveals which category the problem falls into. Ulrich Drepper’s “What Every Programmer Should Know About Memory” remains the deepest publicly available treatment of this hardware model, quantifying each effect with its own benchmark suite.
The hardware has not fundamentally changed in the ways that matter here. Cache lines are still 64 bytes on x86-64. DRAM is still 50 to 75 times slower than L1. The MESI protocol still serializes writes to shared cache lines. The techniques D’Souza covers at CppCon have been valid for two decades, and they remain valid because the physics of memory access has not moved.