Branch Prediction Has a Capacity Limit, and You Can Measure It

The branch predictor is one of the most consequential performance subsystems in a modern CPU, and one of the least discussed outside of microarchitecture circles. Most developers know that branch mispredictions are expensive, roughly 15 to 20 cycles on modern x86, but fewer think carefully about what happens when your code simply has more branches than the predictor can track simultaneously.

Lemire’s recent post asks a direct question: how many branches can your CPU actually predict at once? The answer is a specific number, and it is measurable.

What the Branch Predictor Actually Stores

The branch predictor is not a single monolithic structure. It is a hierarchy of hardware tables, and each layer has a capacity that can be exhausted.

The Branch Target Buffer (BTB) stores mappings from branch instruction addresses to their likely targets. When the CPU fetches instructions, it consults the BTB before it even decodes whether the current instruction is a branch. If the BTB has no entry for the current program counter, the CPU either stalls waiting for the instruction to decode, or makes a default guess and pays a penalty when that guess is wrong.

On top of the BTB sits a directional predictor that answers the taken-or-not-taken question. Modern CPUs predominantly use TAGE (Tagged Geometric history length) predictors, introduced by Seznec and Michaud in their 2006 paper. TAGE maintains multiple prediction tables, each indexed by a combination of the branch’s program counter and a suffix of the global branch history of a different length. Shorter history tables catch simple local patterns; longer history tables capture complex correlations between branches that executed hundreds of instructions apart. Intel’s Golden Cove microarchitecture (used in Alder Lake and Raptor Lake performance cores) implements a TAGE-family predictor capable of exploiting history lengths reaching hundreds of branches deep.
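To make the "geometric" part of the name concrete, here is a minimal sketch of how a series of history lengths can grow by a constant ratio from one table to the next. The function name, the starting length of 4, and the ratio of 2 are illustrative assumptions, not the parameters of any shipping predictor.

```c
#include <stddef.h>

/* Fill lengths[0..count-1] with history lengths that double from one
 * table to the next. min_len and count are illustrative assumptions. */
void geometric_history_lengths(int *lengths, size_t count, int min_len) {
    int len = min_len;
    for (size_t i = 0; i < count; i++) {
        lengths[i] = len;
        len *= 2;  /* common ratio of 2; real designs tune this ratio */
    }
}
```

With eight tables starting at a length of 4, the last table covers 512 branches of history, which is how a small number of tables can span both short and very long patterns.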

Both the BTB and the directional predictor have finite capacity. According to microarchitecture analyses documented on uops.info and in Agner Fog’s optimization manuals, Intel Raptor Lake has an L1 BTB of roughly 512 entries and an L2 BTB around 12,000 entries. AMD Zen 4 has an L1 BTB of approximately 1,024 entries with an L2 in the range of 6,500 to 7,000 entries. Apple does not publish microarchitecture documentation, but performance analyses of the M-series chips suggest a large effective BTB capacity with a small, fast L1 tier.

The Capacity Cliff

When a piece of code has more unique branch instructions in its hot working set than fit in the BTB, prediction quality degrades. The predictor evicts entries for branches it has seen before, and when those branches come around again, it has lost their history and must re-learn from scratch.

This degradation is not gradual or smooth. There is a hard cliff in the throughput curve. Below the capacity threshold, the predictor learns all branches quickly and maintains high accuracy. Beyond it, cold misses accumulate and performance drops sharply.
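The cliff can be reproduced with a toy model. The sketch below assumes a 64-entry, fully associative table with least-recently-used replacement, and made-up branch addresses that execute in a fixed cyclic order. Real BTBs are set-associative and use other replacement policies, so real hardware shows a steep drop rather than this idealized all-or-nothing result.

```c
#include <stdint.h>

#define TOY_BTB_CAPACITY 64  /* assumed size, far smaller than real hardware */

/* Hit rate in percent over the final pass when n_branches distinct branch
 * addresses execute in a fixed cyclic order for `passes` passes. */
int toy_btb_hit_percent(int n_branches, int passes) {
    uint64_t tag[TOY_BTB_CAPACITY];
    uint64_t last_used[TOY_BTB_CAPACITY];
    int valid[TOY_BTB_CAPACITY] = {0};
    uint64_t clock = 0;
    int hits = 0;

    if (n_branches <= 0) return 0;
    for (int pass = 0; pass < passes; pass++) {
        hits = 0;  /* only the final pass's count is reported */
        for (int b = 0; b < n_branches; b++) {
            uint64_t pc = 0x400000u + 16u * (uint64_t)b;  /* made-up address */
            int slot = -1;
            for (int s = 0; s < TOY_BTB_CAPACITY; s++) {
                if (valid[s] && tag[s] == pc) { slot = s; break; }
            }
            if (slot >= 0) {
                hits++;
            } else {
                /* Miss: take the first empty slot, else the LRU entry. */
                slot = 0;
                for (int s = 0; s < TOY_BTB_CAPACITY; s++) {
                    if (!valid[s]) { slot = s; break; }
                    if (last_used[s] < last_used[slot]) slot = s;
                }
                valid[slot] = 1;
                tag[slot] = pc;
            }
            last_used[slot] = ++clock;
        }
    }
    return hits * 100 / n_branches;
}
```

In this model, 64 branches hit every time after warm-up, while 65 branches never hit: each new entry evicts exactly the branch that is needed next.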

Lemire’s benchmark constructs this situation deliberately. The approach is to build a loop with N distinct conditional branches, all with deterministic outcomes (so a perfect predictor would never miss), and measure how loop throughput changes as N increases. For small N, all branches fit in the BTB and the predictor masters them. At some threshold, throughput drops sharply because the BTB can no longer hold the full working set.

The thresholds map directly onto the BTB tier boundaries. A performance step at N around 512 corresponds to L1 BTB overflow on Intel; a larger step at N around 12,000 corresponds to L2 overflow. On Zen 4, the steps appear at different N values reflecting that architecture’s different tier sizes.

Constructing the Benchmark

The key constraint for this kind of measurement is that each branch must be at a distinct instruction address. A single if inside a loop is one branch at one address no matter how many iterations execute. To create N distinct branch addresses, you need N separate conditional instructions in the instruction stream.

One common construction in C uses a code-generation step or macro expansion to emit N separate conditionals:

// Simplified illustration: the conditional that a real benchmark
// replicates at N distinct addresses, by hand or via codegen
int bench_n_branches(const int *data, int n, int threshold) {
    int sum = 0;
    for (int i = 0; i < n; i++) {
        if (data[i] > threshold) sum += data[i];
    }
    // Return sum so the compiler cannot discard the loop as dead code
    return sum;
}

The subtlety is that the loop above contains a single conditional at one address, not N branches, and a compiler may remove even that one by emitting a conditional move or vectorizing the loop. Getting N distinct branch addresses requires either a computed-goto dispatch table, a hand-unrolled sequence of N separate conditionals generated by a preprocessor or script, or a benchmark harness that calls through an array of function pointers where each function contains a distinct branch.
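As a rough sketch of the preprocessor route, the macros below expand one conditional into 64 separate copies, each at its own address. The macro and function names are hypothetical, and the empty inline-asm statement is a GCC/Clang idiom assumed here to discourage the compiler from converting the branch into a conditional move. This is not the code from Lemire's benchmark.

```c
/* One conditional per expansion. The empty asm statement (GCC/Clang
 * syntax, an assumption) must execute conditionally, which discourages
 * the compiler from replacing the branch with a conditional move. */
#define BRANCH(i) \
    if (data[(i)] > threshold) { sum += data[(i)]; __asm__ volatile("" : "+r"(sum)); }
#define BRANCH4(i)  BRANCH(i) BRANCH((i) + 1) BRANCH((i) + 2) BRANCH((i) + 3)
#define BRANCH16(i) BRANCH4(i) BRANCH4((i) + 4) BRANCH4((i) + 8) BRANCH4((i) + 12)
#define BRANCH64(i) BRANCH16(i) BRANCH16((i) + 16) BRANCH16((i) + 32) BRANCH16((i) + 48)

/* Sums the elements above threshold using 64 distinct branch sites.
 * data must hold at least 64 ints. */
long sum_above_64_sites(const int *data, int threshold) {
    long sum = 0;
    BRANCH64(0)
    return sum;
}
```

Scaling to thousands of sites this way means more macro layers or a generator script, and it is worth inspecting the disassembly to confirm that the conditional jumps actually survived optimization.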

Lemire’s benchmark infrastructure handles this correctly. The measurement produces a throughput curve where you can read off the BTB tier boundaries directly from where the curve steps down.
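Reading a step off the curve can also be done mechanically. The helper below is a hypothetical sketch, not part of Lemire's tooling: given per-branch cost samples ordered by increasing N, it reports the first sample where the cost jumps by more than a chosen ratio over the previous one. The ratio is a tuning assumption.

```c
#include <stddef.h>

/* Return the index of the first sample whose cost exceeds the previous
 * sample's cost by more than `ratio`, or -1 if no such step exists. */
int first_step_index(const double *cost, size_t count, double ratio) {
    for (size_t i = 1; i < count; i++) {
        if (cost[i - 1] > 0.0 && cost[i] > cost[i - 1] * ratio) {
            return (int)i;
        }
    }
    return -1;
}
```

Measurement noise can produce false steps, so in practice each sample would be a minimum or median over several runs.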

What the Numbers Show Across Architectures

On an Intel Raptor Lake system, throughput stays roughly flat for N up to about 512, shows a moderate step in the 512 to 4,096 range as L1 BTB entries spill to L2, and then hits a sharp cliff once N exceeds roughly 12,000 to 15,000 entries where the full BTB overflows.

AMD Zen 4 shows the same two-tier structure with different breakpoints. The L1 capacity of around 1,024 entries means the first step appears later than on Intel, but the L2 capacity of around 6,500 entries means the hard cliff comes earlier.

Arm’s Cortex-X4, found in Snapdragon 8 Gen 3 devices, has an L1 BTB of roughly 1,024 entries based on community measurements, with an L2 extending into the several-thousand range. The specifics vary by silicon revision and have been only partially characterized, mostly through manual microbenchmarking.

The takeaway is that these limits agree across vendors to within an order of magnitude: L1 BTBs hold hundreds to low thousands of entries, L2 BTBs hold thousands to low tens of thousands. Code that exceeds those limits in its hot working set will pay a misprediction tax regardless of how predictable the individual branch outcomes are.

Where This Shows Up in Practice

Most application code never approaches these limits. A typical hot loop has a handful of branches, well within any BTB’s capacity. The situations where BTB pressure becomes a real issue tend to cluster in a few places.

Large switch statements can generate many distinct branch instructions. When the compiler lowers a switch with hundreds of cases to a tree of compare-and-branch instructions, each comparison is a separate conditional branch at its own address. When it lowers the switch to a jump table instead, which compilers typically prefer for dense case values, the dispatch collapses to a bounds check plus one indirect jump, trading many conditional branches for a single indirect branch with many possible targets. Which lowering wins depends on how predictable the dispatch pattern is.

Aggressive function inlining increases the total instruction count in the hot working set. More inlined code means more branch instructions brought into a single contiguous region of hot code, and if that region has more branches than fit in the BTB, you can end up with worse performance than the outlined version. This is one reason profile-guided optimization frequently outperforms static inlining heuristics: PGO can observe which branches actually run at runtime and make inlining decisions that keep the hot BTB working set compact.

Interpreter dispatch loops are a classic case. CPython’s ceval loop, V8’s Ignition bytecode interpreter, and similar systems route execution through a large dispatch structure, and the set of branch instructions involved can be substantial. This is part of the motivation for specializing interpreters: if common bytecodes get fast paths that bypass the main dispatch, the hot working set shrinks and BTB pressure drops. CPython’s specializing adaptive interpreter, introduced in Python 3.11, applies this idea by replacing generic bytecodes with specialized fast-path variants at runtime.
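For readers who have not looked inside one, the shape of a switch-based dispatch loop is sketched below. The opcodes and the function name are hypothetical, and this toy is not taken from CPython or V8. A production interpreter has hundreds of handlers, each with its own internal branches, all reached through the same dispatch point.

```c
#include <stddef.h>

enum { OP_PUSH, OP_ADD, OP_MUL, OP_HALT };  /* hypothetical opcodes */

/* Runs a tiny stack program and returns the top of stack at OP_HALT.
 * No bounds checking: the program is assumed to be well formed. */
int run_toy_vm(const int *code) {
    int stack[16];
    int sp = 0;
    size_t pc = 0;
    for (;;) {
        switch (code[pc++]) {  /* single shared dispatch point */
        case OP_PUSH: stack[sp++] = code[pc++]; break;
        case OP_ADD:  sp--; stack[sp - 1] += stack[sp]; break;
        case OP_MUL:  sp--; stack[sp - 1] *= stack[sp]; break;
        case OP_HALT: return stack[sp - 1];
        default:      return -1;  /* unknown opcode */
        }
    }
}
```

Running the program PUSH 2, PUSH 3, ADD, PUSH 4, MUL, HALT through this loop evaluates (2 + 3) * 4.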

JIT-compiled code has its own version of the problem. Code emitted at runtime lands at addresses chosen by the JIT allocator, and if the JIT emits more branches than the BTB can track, emitted code that was warm can become cold from the predictor’s perspective as new code evicts old BTB entries. Tiered JIT compilation, where only the hottest traces receive full optimization, limits the total BTB footprint of emitted code.

Indirect Branches and the Spectre Complication

Conditional branches are only part of the story. Indirect branches (function pointers, virtual dispatch, computed gotos) require predicting the target address, not just the direction. Most architectures handle these with either a dedicated Indirect Branch Target Predictor or by integrating indirect prediction into the BTB using tagged entries.

After the Spectre disclosure in January 2018, many x86 systems shipped software mitigations that affect indirect branch prediction cost. Retpolines, the common mitigation for Spectre variant 2, replace each indirect branch with a return-based sequence that deliberately keeps the indirect branch predictor from being used, so the branch pays a fixed cost instead of benefiting from prediction. On processors without hardware mitigations (IBRS, eIBRS), indirect branch overhead increased significantly and the BTB interaction changed. Processors since Ice Lake have improved hardware mitigations that recover most of this performance, but the Spectre era is a reminder that prediction capacity and prediction latency are separate concerns, both affected by microarchitecture choices.

Practical Takeaways

If you suspect branch misprediction is affecting performance in hot code, measuring it is straightforward on Linux:

perf stat -e branches,branch-misses ./your_program

A misprediction rate above a few percent on a hot loop warrants investigation. If the branch outcomes are predictable by design (the data is sorted, the flags rarely flip), a high misprediction rate suggests predictor capacity pressure rather than inherent unpredictability, and the number of distinct branch sites in the hot path is the next thing to check.

Reducing BTB pressure in hot paths means keeping the working set of branch instructions small. That points toward fewer, simpler conditionals rather than many small checks; toward selective rather than maximal inlining; and toward data-oriented designs where branch-heavy polymorphism gets replaced with lookup tables or sorted arrays that the predictor can learn with a smaller footprint.
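As one small illustration of trading branches for a lookup table, the sketch below classifies a byte in two ways: with a chain of comparisons, and with a 256-entry table built once from that same chain. The names are hypothetical and the character ranges assume ASCII.

```c
#include <stdint.h>

enum { CLS_OTHER = 0, CLS_SPACE = 1, CLS_DIGIT = 2, CLS_ALPHA = 3 };

/* Branchy version: several conditional branch sites per call. */
int classify_branchy(uint8_t c) {
    if (c == ' ' || c == '\t' || c == '\n' || c == '\r') return CLS_SPACE;
    if (c >= '0' && c <= '9') return CLS_DIGIT;
    if ((c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')) return CLS_ALPHA;
    return CLS_OTHER;
}

static uint8_t class_table[256];

/* Build the table once, reusing the branchy version as the reference. */
void init_class_table(void) {
    for (int c = 0; c < 256; c++) {
        class_table[c] = (uint8_t)classify_branchy((uint8_t)c);
    }
}

/* Table version: one load and no data-dependent branch at the call site. */
int classify_table(uint8_t c) {
    return class_table[c];
}
```

The table costs 256 bytes of data cache instead of predictor entries, so whether it wins depends on which resource is scarcer in the hot path.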

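The C++20 [[likely]] and [[unlikely]] attributes give the compiler directional hints that can affect code layout and branching decisions, keeping the common-case path compact. GCC and Clang provide __builtin_expect for the same purpose in C. These hints do not program the hardware predictor; they only influence how the compiler arranges the code.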
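A minimal sketch of the C form follows. The LIKELY and UNLIKELY wrapper names are a common convention rather than anything standard, and the function is a hypothetical example in which negative inputs are assumed to be rare.

```c
#include <stddef.h>

/* Conventional wrapper macros around the GCC/Clang builtin. */
#define LIKELY(x)   __builtin_expect(!!(x), 1)
#define UNLIKELY(x) __builtin_expect(!!(x), 0)

/* Sums the values; returns -1 on the first negative value,
 * which is assumed to be a rare error case. */
long sum_or_fail(const int *values, size_t count) {
    long total = 0;
    for (size_t i = 0; i < count; i++) {
        if (UNLIKELY(values[i] < 0)) return -1;  /* cold error path */
        total += values[i];
    }
    return total;
}
```

The hint lets the compiler move the error return out of the straight-line path, so the common case falls through without a taken branch.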

Understanding that the BTB is a finite shared resource, with specific and measurable capacity limits, is the kind of hardware-level detail that connects benchmark numbers to code structure in a concrete way. Lemire’s approach of measuring the cliff directly, rather than reasoning about it abstractly, is a useful addition to the standard performance profiling toolkit.