Tail-Call Dispatch: What Rust's `become` Keyword Enables for Interpreter Authors
Source: lobsters
Every bytecode interpreter has a tight inner loop. Fetch the next opcode, execute a handler, repeat. The cost of that dispatch step sets a hard ceiling on what the interpreter can achieve per unit of CPU time. Engineering choices made early compound across every instruction executed for the lifetime of the runtime, and the choice of dispatch technique is the most consequential of them.
C programmers have had a well-known answer to the dispatch performance problem for decades: computed-goto. It is a GCC/Clang extension that Rust has no equivalent for. Until recently, that left Rust interpreter authors with a genuine performance deficit relative to C regardless of how well they wrote everything else. Matt Keeter’s recent post about building a tail-call interpreter using nightly Rust is a concrete demonstration of how the become keyword closes that gap.
Dispatch Techniques and Why the Differences Matter
The simplest bytecode interpreter uses a central switch statement:
fn run(tape: &[u8], state: &mut State) {
let mut ip = 0;
loop {
match tape[ip] {
OP_ADD => { /* execute */ ip += 1; }
OP_SUB => { /* execute */ ip += 1; }
OP_HALT => break,
_ => unreachable!(),
}
}
}
This is portable, readable, and measurably slower than the alternatives for hot loops. The problem is the branch predictor. Every dispatch goes through the same indirect jump instruction with the same branch history entry. The CPU sees a stream of unpredictable targets and mispredicts regularly, costing 15 to 20 cycles per dispatch on modern x86-64. For a simple interpreter benchmark running millions of opcodes per second, those mispredictions accumulate substantially.
C’s answer has been computed-goto since at least CPython 3.1. Using a GCC/Clang extension (&&label takes the address of a local label), each opcode handler ends with its own dispatch jump:
static void* dispatch_table[] = { &&label_ADD, &&label_SUB, &&label_HALT };
#define DISPATCH goto *dispatch_table[bytecode[ip++]]
label_ADD: /* execute */ DISPATCH;
label_SUB: /* execute */ DISPATCH;
label_HALT: return;
Each DISPATCH is a separate indirect branch instruction. The branch predictor builds independent history for each opcode site. A sequence of ADD instructions has a consistent, predictable dispatch pattern; a tight loop body gets its own prediction table entries. CPython uses this on GCC and Clang builds, Lua’s VM uses it, and wasm3 generalizes it into tail-call form via [[clang::musttail]]. The benchmark improvement over switch dispatch is roughly 15 to 30 percent on interpreter-heavy workloads.
Rust has no goto, no label addresses, and no equivalent extension. This is not an oversight; Rust’s structured control flow and drop semantics make unstructured jumps semantically problematic. The result was that Rust interpreter authors either accepted switch dispatch performance, relied on LLVM’s heuristic tail-call optimization without guarantees, or implemented a trampoline.
The Trampoline Alternative
A trampoline replaces direct dispatch with a loop over returned function pointers or enum variants:
enum Step {
Continue(fn(&mut State) -> Step),
Done(Value),
}
fn run(state: &mut State) -> Value {
let mut step = initial_step(state);
loop {
match step {
Step::Continue(f) => step = f(state),
Step::Done(v) => return v,
}
}
}
This is safe, portable, and avoids growing the stack. It also carries overhead: the enum discriminant adds a branch per iteration, the function pointer call is indirect, and the structure inverts the natural flow of opcode handlers. Instead of each handler calling the next, each handler returns a description of what to call next. The inversion complicates code that needs to pass state between steps and makes the handler implementations harder to reason about in isolation. The trampoline is a reasonable solution for moderate-performance interpreters, but it gives up the per-site prediction behavior that makes computed-goto fast.
Tail-Call Dispatch with become
The insight behind tail-call dispatch is that computed-goto’s performance comes not from the goto itself but from having one indirect branch per opcode site. A tail call achieves the same thing: each opcode handler is a function, each one ends by calling the next handler or the central dispatcher, and because it is a true tail call, no stack frame accumulates. The call chain transforms into a series of jumps through the frame of the outermost call.
The obstacle in Rust was that LLVM’s tail call optimization is heuristic. A call in tail position might become a tail call in a release build but will not at -O0, or under certain optimization level combinations, or when the compiler inserts drop glue after the call site. There was no way to assert “this must be a tail call or fail to compile.”
RFC 3407, tracked as rust-lang/rust#112788, introduced the become keyword to fix this. Enabled on nightly with #![feature(explicit_tail_calls)], become is a guaranteed tail call. If the call cannot be emitted in tail position because pending drops exist or the call is not structurally in tail position, compilation fails with a diagnostic. There is no silent fallback to a normal call.
#![feature(explicit_tail_calls)]
fn op_add(ip: usize, tape: &[u8], state: &mut State) -> Value {
let a = state.regs[tape[ip + 1] as usize];
let b = state.regs[tape[ip + 2] as usize];
state.regs[tape[ip + 3] as usize] = a + b;
become dispatch(ip + 4, tape, state)
}
fn dispatch(ip: usize, tape: &[u8], state: &mut State) -> Value {
match tape[ip] {
OP_ADD => become op_add(ip + 1, tape, state),
OP_HALT => state.regs[0],
_ => unreachable!(),
}
}
LLVM receives a musttail annotation on these calls and generates one indirect branch per opcode site, exactly what computed-goto produces. The branch predictor sees the same distribution of call sites and can build per-site history. When the compiler inlines dispatch into each opcode handler, the indirect branch may further reduce to a direct conditional branch.
The guarantee that become provides over relying on LLVM heuristics is most valuable in two specific scenarios. First, in debug builds at -O0, LLVM does not apply TCO. Without become, a deeply recursive interpreter that passes tests in release crashes in debug mode with a stack overflow; with become, the musttail annotation forces tail call semantics regardless of optimization level. Second, when a refactor inadvertently introduces a local variable with Drop between the become call and the end of the function, the compiler fails with a clear error rather than silently reverting to a call that grows the stack.
Comparison with Other Systems Languages
Clang 13 introduced [[clang::musttail]] for C and C++, which is the direct C analog. wasm3, widely considered the fastest pure-interpreter WebAssembly runtime, uses it throughout. Each opcode is a C function; each ends with a return through [[clang::musttail]]; the compiler generates the tail-call dispatch chain. The wasm3 approach benchmarks at roughly 10 to 20 times slower than native code, substantially faster than switch-based Wasm interpreters.
Zig has @call(.always_tail, f, args), a builtin that guarantees a tail call or fails to compile. It is semantically equivalent to become and serves the same use case. Zig’s own bytecode VM uses it in its dispatch loop.
OCaml and Scheme guarantee tail calls at the language level for calls in syntactic tail position. No annotation is needed; the language specification mandates the behavior. This is stronger than what Rust provides, but it is also a constraint that C and Rust avoid by default because tail calls interact with stack-based debugging tools, profilers that depend on frame unwinding, and certain sanitizer modes in ways that require careful handling.
Go has no TCO guarantee and no mechanism to request one. The Go team’s position has been that goroutines address the stack growth problem for most cases, and tail call optimization is not a priority for the language.
| Language | Guaranteed TCO | Mechanism |
|---|---|---|
| C (GCC/Clang) | No | [[clang::musttail]] (Clang 13+); heuristic on GCC |
| Rust stable | No | Relies on LLVM heuristic; no annotation |
| Rust nightly | Yes, explicit | become (RFC 3407), fails to compile if impossible |
| Zig | Yes, explicit | @call(.always_tail, f, args) |
| OCaml | Yes, implicit | Mandated by language for tail position calls |
| Scheme | Yes, implicit | Mandated by R7RS/R6RS |
| Go | No | No mechanism; goroutines address stack growth instead |
Why Rust’s Drop Semantics Make This Hard
In C, function calls carry no ownership obligations. The programmer manages cleanup manually, and there is nothing stopping a tail call from reusing the current stack frame. In Rust, every value with a Drop implementation that is live at a call site creates an obligation that must execute before the call returns. A tail call cannot honor that obligation: the current frame is gone by the time the callee finishes.
The compiler enforces this at become sites. If any live variable would require drop glue after the call, the error is precise: the compiler names the variable and explains why it cannot be dropped before the tail call. This means tail-call dispatch patterns in Rust naturally pass all state through parameters rather than holding owned values on the stack. In an interpreter, the canonical shape is a mutable reference to a shared State struct, which carries no drop obligation at the call site and passes through cleanly.
async fn is the one context where become cannot help. Async functions compile to state machines that persist across .await points; the frame structure is fundamentally incompatible with tail call frame reuse. Async interpreter implementations should use an explicit loop or a trampoline.
The Stabilization Path
RFC 3407 was accepted, and the become implementation has been on nightly for some time. The blockers for stabilization are implementation hardening, not semantic disagreements. The MIR pass that lowers become to LLVM’s musttail needs to be solid across all optimization levels and target architectures, and there are edge cases involving generators and certain cross-crate ABI scenarios that need resolution first.
Matt Keeter’s post contributes to the body of practical experience with become that is accumulating as more authors try it on nightly. His Fidget library evaluates implicit surface functions over large grids, needing a fast, portable interpreter tier that works on targets where the Cranelift JIT is unavailable, including WebAssembly targets. The tail-call dispatch technique fits that requirement well: each expression opcode is a function, each ends with become dispatch(...), and the result is a portable interpreter that avoids both the stack growth problem and the per-dispatch overhead of a central switch loop.
Practical use cases like Fidget are how nightly features build confidence. When enough real projects report correct behavior under workloads that stress the feature, the path to stable becomes more direct. The dispatch performance gap between switch loops and computed-goto that C has exploited since CPython adopted it in 2009 is now closeable in safe Rust, at the cost of a nightly toolchain for the moment.