· 6 min read ·

The Branch Prediction Problem Every Interpreter Has, and What Rust's `become` Keyword Does About It

Source: lobsters

Matt Keeter recently published a detailed walkthrough of building a tail-call interpreter in nightly Rust, using the language’s still-experimental become keyword. The post is worth reading on its own terms, but the technique it demonstrates sits in a longer story about interpreter performance, branch prediction, and one of the most contentious ongoing debates in Rust’s language design. There is quite a bit of context that makes the tradeoffs clearer.

Why Dispatch Is the Problem

When you write a bytecode interpreter, the naive structure is a loop with a match (or switch in C) on the current opcode:

fn interpret(ops: &[Op]) -> i64 {
    let mut stack = Vec::new();
    let mut ip = 0;
    loop {
        match ops[ip] {
            Op::Push(v) => { stack.push(v); ip += 1; }
            Op::Add => {
                let b = stack.pop().unwrap();
                let a = stack.pop().unwrap();
                stack.push(a + b);
                ip += 1;
            }
            Op::Halt => return stack.pop().unwrap_or(0),
        }
    }
}

This is readable and correct. The problem is that every opcode execution flows through a single indirect branch: the jump that dispatches from the top of the loop to the handler for the current opcode. Modern CPUs use branch predictors to speculate ahead, but a single dispatch branch that can jump to dozens of different targets is hard to predict. The predictor has to learn a single distribution over all opcode transitions, rather than per-opcode transition distributions.

In a real interpreter running realistic programs, this one branch accounts for a disproportionate share of branch mispredictions. The effect is well-documented and was one of the primary motivations for CPython’s use of computed goto dispatch, enabled by default on GCC and Clang since Python 3.6.

What Computed Goto Actually Does

CPython’s computed goto technique exploits a GCC extension: &&label takes the address of a label as a void*, and goto *ptr jumps to an arbitrary address. This lets CPython store handler addresses in an array and end each handler with a direct jump to the next one, rather than looping back to a central dispatch branch:

static void *dispatch_table[] = { &&TARGET_PUSH, &&TARGET_ADD, &&TARGET_HALT };

TARGET_PUSH:
    // push logic
    goto *dispatch_table[*++ip];

TARGET_ADD:
    // add logic
    goto *dispatch_table[*++ip];

Now each handler ends with its own indirect branch. The branch predictor can learn per-handler distributions separately: after LOAD_FAST, the next instruction is often LOAD_FAST or BINARY_OP; after POP_JUMP_IF_FALSE, the distribution is different. This is the key insight. More branch targets means each individual predictor entry is less loaded, and the CPU can specialize per-opcode.

The catch is that &&label is not standard C. It is a GNU extension, unavailable in MSVC and not expressible in Rust at all. Rust has no goto, no label addresses, and no equivalent mechanism in stable code.

Tail Calls as a Portable Equivalent

The tail-call dispatch technique achieves the same per-handler branch isolation without needing label addresses. Instead of one function with multiple labeled sections, you write each handler as a separate function that ends by calling the next handler in tail position:

#![feature(explicit_tail_calls)]

type Handler = fn(&[u8], usize, &mut Vec<i64>) -> i64;

static HANDLERS: &[Handler] = &[op_push, op_add, op_halt];

fn op_push(ops: &[u8], ip: usize, stack: &mut Vec<i64>) -> i64 {
    stack.push(ops[ip] as i64);
    let next_op = ops[ip + 1] as usize;
    become HANDLERS[next_op](ops, ip + 1, stack)
}

fn op_add(ops: &[u8], ip: usize, stack: &mut Vec<i64>) -> i64 {
    let b = stack.pop().unwrap();
    let a = stack.pop().unwrap();
    stack.push(a + b);
    let next_op = ops[ip + 1] as usize;
    become HANDLERS[next_op](ops, ip + 1, stack)
}

fn op_halt(_ops: &[u8], _ip: usize, stack: &mut Vec<i64>) -> i64 {
    stack.pop().unwrap_or(0)
}

The become keyword is what makes this work. Without it, writing return HANDLERS[next_op](ops, ip + 1, stack) might or might not be tail-call optimized by LLVM; the compiler might insert a stack frame for each call. With become, the compiler is required to emit LLVM’s musttail attribute. If the call cannot be compiled as a genuine tail call, the compilation fails. You get a guarantee, not a hope.

After optimization, each handler ends with a single indirect branch to the next one. The call stack does not grow. The branch predictor sees per-handler branches. The assembly is essentially the same structure as computed goto, but derived from ordinary function calls.

The become Keyword: History and Status

Rust has wanted some form of guaranteed tail calls for a long time. The discussion goes back to at least 2015. RFC 3407, filed in late 2022, formalized the become keyword proposal. The feature was unstable-gated as #![feature(explicit_tail_calls)] and has been available on nightly since mid-2023.

The design is intentionally narrow. Rust does not add general tail-call optimization to all recursive calls; it adds an explicit syntax that programmers must opt into. This reflects a deliberate tradeoff: implicit TCO is hard to reason about and can change behavior unexpectedly when code is refactored. Explicit become makes the optimization part of the contract.

One constraint follows directly from the LLVM musttail semantics: the caller and callee must have the same calling convention, and the types must match exactly. This is why all handlers in a tail-call interpreter must share the same function signature. You cannot tail-call from a handler into a function with different parameters. In practice this usually means threading all VM state through a single struct or accepting a fixed set of arguments, which is a common pattern in interpreter implementations anyway.

What This Costs

The tradeoffs are real. The most immediate one is that #![feature(explicit_tail_calls)] is nightly-only. There is no timeline for stabilization, and the feature has been sitting in an uncertain state while the lang team discusses edge cases around unwinding, argument evaluation order, and interaction with async. For a production interpreter, this is a meaningful constraint.

Beyond the nightly requirement, the architecture change is non-trivial. A switch-dispatch interpreter is one function with a loop; a tail-call interpreter is N+1 functions with homogeneous signatures. If your opcode handlers need different data, you have to unify those needs into a single state type passed through each call. This is usually fine, but it restructures the code significantly and can complicate things like debugging and profiling, where the call tree becomes a flat chain of tail calls rather than a visible call stack.

There is also the question of whether the performance improvement is large enough to justify the complexity. Computed goto’s advantage over switch dispatch is well-measured in CPython, roughly 15-20% on typical workloads. Tail-call dispatch should produce comparable assembly, so similar gains are plausible. But the exact numbers depend heavily on the interpreter’s opcode mix, the typical program’s instruction stream, and the target microarchitecture. The improvement is real for instruction-intensive workloads and negligible for interpreters that spend most of their time in a small number of expensive opcodes.

Broader Context

This technique is not unique to Rust. The Wasm3 interpreter uses tail-call dispatch in C through a similar pattern, relying on Clang’s TCO behavior with carefully structured code. The V8 JavaScript engine uses a combination of dispatch approaches depending on the tier. What makes the Rust case interesting is the combination of musttail guarantees, type safety, and the potential to bring this into a language that otherwise discourages unsafe low-level tricks.

For people building interpreters in Rust today, the practical message is: if you are on nightly and care about interpreter throughput, the technique is worth benchmarking against your switch-dispatch baseline. If you are on stable, the traditional loop-and-match approach remains the right default, possibly paired with careful attention to hot-path inlining and opcode representation.

The longer-term question is whether explicit_tail_calls stabilizes and what form it takes. The RFC process has been slow, but the feature has real users and a clear use case. Interpreter authors have been waiting for this kind of guarantee in Rust for a decade. Keeter’s post is a concrete demonstration that the mechanism works as intended, which is useful evidence for the stabilization discussion.

Was this interesting?