Branch Predictor Capacity: Measuring the Limit Inside Your CPU

Branch predictors have finite capacity, and most programs never think about it. Daniel Lemire ran an experiment to measure that limit directly, posting the results, and the shape of the answer is worth understanding in detail.

The core observation is this: branch prediction accuracy stays high up to a certain number of distinct branches, then degrades sharply at one or more thresholds. Those thresholds are not arbitrary. They correspond to the capacity boundaries of the hardware structures inside the CPU that store branch information. Mapping them tells you something concrete about the hardware that Intel and AMD rarely document officially.

The Branch Target Buffer Is a Cache

Modern CPUs predict branches by storing information in structures that behave like caches. The Branch Target Buffer (BTB) maps instruction addresses to predicted target addresses. A companion structure, the Pattern History Table (PHT) or its more sophisticated descendants, tracks the outcome history of each branch, whether it was taken or not-taken. Both structures have finite size, organized in multiple levels much like L1/L2/L3 caches.

Intel’s Golden Cove microarchitecture, used in Alder Lake and Raptor Lake P-cores, has a small, very fast L0 BTB intended primarily for tight loops, a larger L1 BTB, and an L2 BTB with thousands of entries. AMD Zen 4 uses a similar multi-level organization. The exact sizes are not published in official microarchitecture documentation; they are reverse-engineered through benchmarks of precisely the kind Lemire describes. Agner Fog’s microarchitecture guides have done this systematically across CPU generations for years, using the same technique.

When the working set of branches in your code exceeds the BTB’s capacity, the CPU faces something analogous to a cache miss. It cannot predict the target before fetching the next instruction, so it either guesses wrong or stalls. The performance cost is a pipeline flush: all speculatively executed instructions behind the mispredicted branch get discarded. On modern x86 cores this costs roughly 15 to 20 cycles. On a CPU running at 4 GHz, 20 cycles is 5 nanoseconds. Across millions of branches per second, that accumulates quickly.

How the Probing Experiment Works

A typical capacity-probing benchmark works by generating N distinct conditional branches and measuring how prediction accuracy changes as N increases:

// Random outcomes stored in array of size N
for (int i = 0; i < ITERATIONS; i++) {
    int idx = i % N;
    if (outcomes[idx]) {
        sum += data[idx];
    }
}

You then read hardware performance counters to get the misprediction rate. On Intel processors, BR_MISP_RETIRED.ALL_BRANCHES gives you the count directly. On Linux:

perf stat -e branches,branch-misses ./probe N

Plotting misprediction rate against N produces a step function, not a smooth curve. Accuracy stays near 100% across a plateau, then drops sharply, recovers to a new plateau, then drops again. Each step corresponds to a BTB level being saturated. The first drop tells you the L1 BTB size. The second tells you the L2 BTB size. At some point the CPU is working entirely from its fallback predictor, which for branches without history looks like a coin flip.

TAGE and the Prediction Algorithm

Raw BTB capacity is only part of the picture. The prediction algorithm determines how well the CPU uses whatever history it has stored. For decades, CPUs used two-bit saturating counters per branch: the branch was predicted taken when the counter was above the midpoint, not-taken below. This handles strongly biased branches well but fails on correlated or complex outcome patterns.

Modern processors use TAGE, the Tagged Geometric history length predictor, developed by André Seznec and Pierre Michaud and described in detail in their 2006 paper. TAGE maintains multiple component tables indexed by different lengths of the global branch history, from a short window of recent branches to hundreds of branches deep. Each component uses a tag derived from the program counter and history to reduce aliasing between branches that hash to the same entry. The final prediction comes from the longest-history component that matches, giving TAGE the ability to track complex correlation patterns that older predictors miss entirely.

The consequence is that running out of prediction accuracy is not purely about BTB size. You can have all your branches resident in the BTB but still see high misprediction rates if the outcome patterns require more history context than any TAGE component tracks, or if alias patterns in the table work against you. The two limits, storage and history depth, are distinct and produce different signatures in the kind of experiment Lemire ran.

The Reorder Buffer Interaction

Branch prediction capacity interacts with the Reorder Buffer (ROB), the structure that holds in-flight instructions while out-of-order execution proceeds. Intel Golden Cove has a 512-entry ROB. AMD Zen 4 has a 320-entry ROB. The number of in-flight branches at any moment is bounded by how many instructions fit in the ROB.

This makes branch misprediction on a deep out-of-order machine disproportionately expensive. When a misprediction is detected, hundreds of in-flight instructions may be squashed, not just the handful behind the branch in program order. The combination of deeper ROBs and higher IPC targets means that prediction accuracy has become more important over time, not less, even as branch predictors themselves have become more sophisticated.

What This Means for Real Code

Real code has natural branch working sets. An interpreter’s dispatch loop, a parser, a polymorphic virtual dispatch site, a finite state machine: each has a characteristic number of branches that either fits inside a BTB level or does not.

Consider a bytecode interpreter with N opcodes:

while (pc < end) {
    switch (*pc++) {
        case OP_ADD:  sum += *sp++;  break;
        case OP_SUB:  sum -= *sp++;  break;
        case OP_JMP:  pc = target;   break;
        /* ... up to N opcodes ... */
    }
}

Each case label is a branch target. If N stays below the L1 BTB capacity, the CPU predicts nearly all of them correctly. Push N past the L2 BTB boundary and a large fraction of dispatch operations pay the misprediction penalty. This is part of why computed gotos, which dispatch through a table of direct label addresses, can outperform switch in interpreter loops: the structure gives the branch predictor more consistent targets to learn, even when the opcode count is similar.

JIT compilers treat BTB pressure as a first-class concern. Tiered JIT systems like V8’s Maglev and Turbofan generate more specialized code at higher tiers, which reduces the number of branches needed per operation and makes remaining branches more predictable. The performance improvement from specialization is partly about eliminating work, but it is also about reducing the branch working set to fit inside the faster BTB levels.

Profile-guided optimization helps for the same reason. By profiling actual execution, the compiler can pack hot-path branches into a smaller code region, reducing effective BTB pressure on the paths that matter. Rarely taken branches get pushed to cold sections of the binary where their BTB pressure does not affect the hot path.

Profiling Branch Predictor Pressure

For practical diagnosis, perf stat gives you the aggregate picture:

perf stat -e branches,branch-misses,branch-load-misses ./binary

For precise attribution to specific branch sites, perf record with precise sampling:

perf record -e branch-misses:pp -g ./binary
perf report

Intel’s VTune and AMD’s uProf expose per-branch misprediction rates and can flag hot branch sites automatically. On Linux without those tools, perf annotate shows misprediction rates inline with the assembly.

What Lemire’s experiment adds beyond raw profiling is a capacity characterization: not just how many misses, but at what branch working set size accuracy starts to degrade. That distinction matters when you’re deciding whether to split a hot function, inline a call, unroll a loop, or restructure dispatch logic. The profiler tells you the symptom; the capacity curve tells you whether you’re close to a hardware boundary or well inside it.

The branch predictor is a hardware resource with measurable, finite capacity. Treating it the same way you’d treat the L1 cache, knowing roughly what fits and what doesn’t, leads to better intuitions about where CPU time actually goes and what structural changes to code are worth making.