· 6 min read ·

Branch Prediction Has a Working Set Problem, and Measuring It Is Instructive

Source: lobsters

Daniel Lemire’s recent post asking how many branches your CPU can predict poses a question most developers never think to ask. The answer is not a single number, it varies by microarchitecture, and the way you measure it reveals how branch prediction hardware is structured in a way that maps almost exactly onto the data cache hierarchy.

What Branch Prediction Hardware Stores

A branch predictor is not a single structure. It is several cooperating units:

  • Branch Target Buffer (BTB): maps each branch instruction’s program counter to its predicted target address. Without a BTB entry, the CPU cannot start fetching instructions along the predicted path and must wait for instruction decode to determine where to go next.
  • Branch History Table (BHT/PHT): stores directional predictions, taken vs. not taken, typically as 2-bit saturating counters indexed by some combination of the branch PC and recent branch history.
  • Return Address Stack (RAS): a small hardware stack, typically 8 to 32 entries, tracking return addresses for function calls.
  • Indirect Branch Predictor: handles computed jumps where the target varies at runtime.

The BTB is the capacity bottleneck. If the BTB cannot hold an entry for a given branch site, the CPU falls back to some default prediction and pays the full frontend penalty when it turns out to be wrong. On a modern out-of-order core with a 15-to-20 stage pipeline, a misprediction costs somewhere between 15 and 20 cycles, which is expensive in any tight loop.

The Measurement Methodology

The standard approach to finding BTB capacity is to construct a micro-benchmark with exactly N distinct branch sites, all visited in a hot loop, and measure branch mispredictions as N increases. In C, the skeleton looks something like this:

#include <stdint.h>

void run_n_branches(int n, int iterations, volatile int *counters) {
    for (int iter = 0; iter < iterations; iter++) {
        for (int i = 0; i < n; i++) {
            // Each array element lives at a different address,
            // so each branch here is at a unique PC after unrolling
            if (counters[i] & 1) {
                counters[i]++;
            }
        }
    }
}

In practice this requires care. A naive loop like the one above may have its branches fused or optimized by the compiler into a smaller number of distinct sites. Ensuring N distinct branch PCs requires either hand-written assembly, programmatically generated code, or computed gotos. The benchmark sweeps N from a small value up to tens of thousands, measuring branch-misses per branch executed using perf stat -e branch-misses,branches or hardware performance counters read via rdpmc.

The signal is a phase transition. Below the BTB capacity, misprediction rate sits near zero. The predictor has a live entry for every branch site and predicts each one correctly. As N crosses the BTB capacity, entries start evicting to make room for new ones. The misprediction rate climbs, eventually approaching what you would expect from random prediction on a 50/50 taken distribution.

This is structurally identical to the working set transition you observe when measuring cache capacity: smooth performance up to the threshold, then a cliff.

Numbers Across Microarchitectures

The capacity varies considerably, and it is tiered. An L1 BTB is small but adds no fetch-cycle overhead on a hit; an L2 BTB is larger but costs a cycle or two; beyond that, the frontend stalls waiting for a target.

On Intel’s Golden Cove cores (Alder Lake and Raptor Lake P-cores), Intel’s optimization reference manual documents the L1 BTB at 12,288 entries. Gracemont E-cores are smaller. Skylake-era parts had roughly 4,096 entries at L1. AMD’s Zen 4 substantially expanded its BTB over Zen 3; Agner Fog’s microarchitecture guide documents the capacity progression across AMD generations in detail, and the trend is consistently upward.

Apple Silicon sits in a different category. Empirical measurements of the M-series Firestorm and Everest cores suggest BTB capacities considerably larger than contemporary x86 parts. This is consistent with Apple’s design philosophy of allocating transistor area to reduce frontend latency. When the BTB can absorb tens of thousands of distinct branch sites, a whole class of performance pathology that affects x86 workloads simply does not appear.

Where This Matters in Real Code

For ordinary application code, BTB capacity is invisible. A function with a few conditionals, a loop or two, and some early returns is nowhere near the limit on any modern CPU. The capacity cliff becomes relevant in specific structural situations.

Bytecode interpreters are the clearest case. An interpreter using direct-threaded dispatch via computed gotos has one branch site per opcode handler. A rich bytecode set with 200 opcodes means 200 distinct branch PCs in the hot dispatch loop. When the interpreter handles a diverse mix of opcodes in a tight loop, it starts competing with the BTB’s capacity. CPython’s ceval.c main loop, YARV in Ruby, and similar interpreter implementations all sit in this regime. It is part of why specializing interpreters that narrow the opcode mix for common type patterns can outperform generic ones even without better branch prediction per se: they reduce the number of distinct branch sites visited in the hot path.

JIT compilers face an analogous problem. A JIT that generates many small methods, each containing several branches, can accumulate a total count of live branch sites in hot code that exceeds BTB capacity on smaller microarchitectures. Aggressive inlining, which JITs like V8’s Turbofan and SpiderMonkey’s Ion pursue for other reasons, has the side effect of consolidating branch sites and reducing BTB pressure. The code gets larger in instruction bytes, but the number of distinct branch PCs in the working set of the hot path decreases.

Virtual dispatch in C++ contributes when a hot event-processing or simulation loop calls through many distinct vtable entries. Each indirect call site is a separate BTB entry. An object system with hundreds of active polymorphic call sites in a tight update loop can saturate the BTB on Intel Skylake-era hardware even though it fits comfortably on Apple M-series or AMD Zen 4.

Diagnosing BTB Pressure in Practice

perf stat -e branch-misses,branches ./binary gives you a misprediction rate for the whole process. A rate above a few percent in a loop you expect to be well-predicted is a signal worth investigating. perf record -e branch-misses -g ./binary followed by perf report identifies which specific branch sites are mispredicting most.

For structured capacity experiments, Travis Downs’ uarch-bench provides infrastructure for exactly this kind of hardware counter sweep across microarchitectures. Agner Fog’s test programs serve as a well-validated baseline.

Once you identify a BTB-pressured hot path, the fixes are structurally similar to data cache fixes: reduce the working set of distinct branch sites. That means inlining call sites, narrowing the range of code paths visited in the critical loop, or restructuring dispatch from an indirect jump over many handlers to a more direct call graph with fewer distinct PCs.

The Cache Analogy Is More Than Superficial

The BTB behaves like a set-associative cache in a precise technical sense. It has a capacity, an associativity (typically 4-way or 8-way for L1), a replacement policy, and evictions under pressure. Exceeding its capacity produces a performance cliff that mirrors cache thrashing. The analysis tools, the fixes, and the mental model all transfer.

Most performance analysis focuses on data access patterns and instruction throughput. The front-end branch predictor sits upstream of all of that, and when its working set overflows, it silently degrades throughput in ways that do not show up in straightforward profiling. Knowing the capacity of the BTB on your target hardware, and knowing what code structure can overflow it, is the kind of microarchitecture knowledge that pays off specifically in the places where general performance intuition fails.

Was this interesting?