· 6 min read ·

Branch Predictor Capacity: The Microarchitectural Limit Hidden in Every Hot Loop

Source: lobsters

The CPU’s branch predictor is one of the most consequential pieces of microarchitecture in modern processors, and it has a hard capacity limit that most software developers never encounter until they hit it. Daniel Lemire’s recent benchmark measures the point at which a modern CPU’s branch predictor stops being able to track a program’s control flow, producing a sharp performance cliff visible in cycles-per-branch measurements.

The benchmark is simple in concept: create a loop with N distinct conditional branches, increase N, and observe when performance degrades. The degradation comes from the Branch Target Buffer (BTB), a cache-like structure that maps branch instruction addresses to their predicted targets. When N exceeds the BTB’s capacity, entries for recent branches evict entries for earlier ones, and prediction accuracy collapses.

The Predictor Hierarchy

Modern branch predictors are not a single structure. They operate as a multi-level hierarchy, analogous to L1/L2/L3 caches but for control flow prediction. Intel’s Golden Cove microarchitecture (12th and 13th generation Core) has an L1 BTB of roughly 1,024 entries with one-cycle latency, backed by an L2 BTB of approximately 12,288 entries. AMD’s Zen 5, released in 2024, goes further: a 1,024-entry L1, an 8,192-entry L2, and a third level of approximately 32,768 entries that Zen 4 first introduced. Apple Silicon’s M-series processors publish no official specs, but independent analysis by Chips & Cheese and Anandtech places the M1 Firestorm P-core’s L2 BTB at approximately 10,000 to 12,000 entries, with each level adding latency.

The practical result is a staircase of performance. Code whose branch working set fits in the L1 BTB runs at peak prediction bandwidth. Spilling into the L2 adds one to two cycles of front-end latency per miss. Spilling beyond the L2 capacity causes full mispredictions, each costing 18 to 22 cycles on x86 and 14 to 18 cycles on Apple’s pipeline.

TAGE: The Design Behind Modern Branch Prediction

The taken/not-taken prediction is handled by a separate structure from the BTB. The dominant design in modern shipping CPUs is TAGE (Tagged Geometric History Length), proposed by André Seznec and Pierre Michaud in their 2006 paper. TAGE maintains a base predictor plus N tagged tables, where each table uses a history length from a geometric series: table T_1 correlates with 2 bits of recent branch history, T_2 with 4, T_3 with 8, and so on. Prediction comes from the longest-history table with a matching tag for the current (PC, history) pair.

This design lets TAGE correlate a branch outcome with events that happened dozens of other branches earlier. A branch that is always taken after a specific sequence of 32 prior branches is predicted correctly even when simpler predictors fail. The TAGE-SC-L variant that won the Championship Branch Prediction competition in 2014 uses history lengths up to 640 bits. Shipping hardware is more conservative: Intel’s Golden Cove uses up to approximately 96 bits of global history, Zen 4 around 256 bits, and analysis of Apple’s M1 suggests roughly 200 bits.

Two conditions combine to produce the predictor’s failure mode: the number of distinct branch sites exceeds the tagged tables’ capacity for differentiation, and branch outcomes depend on correlations extending beyond the history window the predictor tracks. Lemire’s benchmark creates the aliasing condition directly by introducing more branch sites than the hardware can distinguish.

Measuring the Saturation Point

Finding the BTB saturation point follows a methodology Travis Downs documented in his work on CPU microarchitectural limits. You create a controlled benchmark with exactly N distinct branch sites, measure cycles per branch across a range of N values, and look for the inflection points.

// Sweep N to find BTB saturation.
// data[i] values are random; branch outcomes are data-dependent.
uint64_t branch_sweep(int N, const int *data, int len) {
    uint64_t sum = 0;
    for (int iter = 0; iter < len; iter++) {
        int i = iter % N;
        if (data[i] > 0) sum += data[i];
    }
    return sum;
}

A valid benchmark places the N branches at fixed, separated memory addresses so the BTB sees them as N distinct instruction addresses rather than a single address with varying data. The BTB is indexed by the program counter of the branch instruction, not by runtime values, so the distinction matters: you are stressing the address-indexed lookup, not the correlation history. Once N exceeds the L1 BTB, cycles per iteration shows the first step. Exceeding the L2 produces the sharper transition. Tracking mispredictions via perf stat -e branch-misses confirms which transition corresponds to which level.

At a 4 GHz clock with a 20-cycle misprediction penalty, a branch executing every 4 cycles at 50% misprediction rate contributes roughly 10 cycles of wasted work per execution. A loop with four operations per branch running at that rate achieves approximately 30% of its theoretical throughput ceiling. Tracking IPC alongside misprediction rate separates BTB misses, which add front-end latency, from full mispredictions, which flush the pipeline.

Where This Appears in Real Code

Interpreters are the canonical case. A bytecode dispatch loop executes one indirect branch per instruction, each targeting one of many handler addresses. CPython’s eval loop, the JVM’s interpreter tier, and SQL query evaluation engines all execute millions of such branches per second. When the number of distinct bytecodes encountered in a running program exceeds the indirect branch predictor capacity (ITTAGE, the indirect-target variant of TAGE), prediction accuracy drops for the entire dispatch path.

Spectre mitigations made this substantially worse after 2018. Retpoline converted each indirect branch into a call targeting a spin loop, defeating ITTAGE entirely. Performance regressions on indirect-branch-heavy workloads ranged from 5% to over 30% depending on the specific workload. Intel’s enhanced IBRS, available from Tiger Lake onward, restored hardware prediction while maintaining privilege-level isolation, but software on older hardware or under certain hypervisors continues to pay the retpoline cost.

Large switch statements trigger similar pressure. A switch over 200 cases compiles to a jump table, and the indirect branch through that table requires ITTAGE to correctly predict the target. Linux kernel dispatch paths, graphics driver command processors, and state machines with many states all generate this load. The BOLT binary optimizer, developed at Meta, addresses BTB pressure from the code layout side: it reorders basic blocks so that hot branch targets cluster within a smaller address range, reducing BTB footprint without source changes. Measured speedups on large binaries like Clang and MySQL have ranged from 10% to 20%.

What Developers Can Do

Profile-guided optimization shifts the tradeoff in useful ways. PGO-informed compilation knows which branches are hot and can apply cmov (conditional move) transformations to branches with high misprediction rates, trading the prediction penalty for a data dependency. For branches that are highly predictable in production, the compiler leaves them as branches; for branches where the predictor consistently fails, the data dependency of cmov is cheaper than a 20-cycle misprediction flush.

AutoFDO, Google’s approach of sampling perf branch records without instrumentation overhead, achieves similar results in production. Combined with BOLT for layout optimization, these tools recover a significant fraction of BTB-related losses without requiring changes to application source code. The combination is particularly effective for large server binaries where the branch footprint of the hot path exceeds a single BTB level.

There is also the option of restructuring the code itself. Computed gotos (a GCC extension used by CPython’s ceval loop in threaded mode) place each dispatch target at a known address, improving ITTAGE accuracy compared to a central switch-based dispatch. Lua’s VM uses the same technique. The tradeoff is portability: computed gotos are not standard C, and the benefit disappears if the ITTAGE is already saturated from too many distinct targets.

The numbers Lemire published make the capacity cliffs concrete in a way that vendor marketing never does. Processors are described by clock speed, core count, and memory bandwidth. The BTB depth and TAGE history length appear on no standard benchmark chart, yet they determine whether an interpreter running on Zen 5 outperforms the same code on Zen 4, and whether running BOLT on a production binary is worth the engineering effort. Measuring the predictor’s behavior rather than assuming it is unlimited is the prerequisite for understanding where your code stands relative to the hardware beneath it.

Was this interesting?