Breaking the call_indirect Wall: V8's Speculative Inlining for WebAssembly
Source: v8
Back in June 2025, the V8 team published a post describing two optimizations that had just shipped in Chrome M137: speculative call_indirect inlining and deoptimization support for WebAssembly. Looking back at it now, the engineering behind these features represents a meaningful shift in how V8 approaches WebAssembly performance, and the architectural problems the team had to solve are worth understanding in depth.
The call_indirect Problem
WebAssembly functions call each other through two mechanisms. Direct calls use call $func, where the compiler knows the target at compile time and can, in principle, inline the callee. Indirect calls use call_indirect, which goes through a function table indexed at runtime. You do not know at compile time which function sits at index 42.
This is semantically equivalent to a C function pointer call, a C++ virtual dispatch, or an interface method call in Java. The pattern appears constantly in real WebAssembly programs because most high-level languages compile to it. C++ vtables, Rust trait objects, Go interfaces, and any dynamic dispatch mechanism in the source language map to some form of call_indirect in the compiled output.
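To make the runtime nature of the lookup concrete, here is a minimal C++ model of what a call_indirect has to do before it can transfer control. The types TableEntry, SigId, and Trap are hypothetical stand-ins invented for this sketch, not engine data structures; the three checks (bounds, null slot, signature) follow the behaviour the WebAssembly specification requires.

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Hypothetical model of a Wasm function table; not V8's real data structures.
using SigId = std::uint32_t;  // stand-in for a function type index
using WasmFn = std::int32_t (*)(std::int32_t, std::int32_t);

struct TableEntry {
    SigId sig;   // signature the function was declared with
    WasmFn fn;   // nullptr models an uninitialized (null) table slot
};

struct Trap : std::runtime_error {
    using std::runtime_error::runtime_error;
};

// Mirrors the three runtime checks call_indirect performs before calling.
std::int32_t call_indirect(const std::vector<TableEntry>& table, SigId expected_sig,
                           std::uint32_t index, std::int32_t a, std::int32_t b) {
    if (index >= table.size()) throw Trap("undefined element");  // bounds check
    const TableEntry& entry = table[index];
    if (entry.fn == nullptr) throw Trap("uninitialized element");  // null check
    if (entry.sig != expected_sig) throw Trap("indirect call type mismatch");  // signature check
    return entry.fn(a, b);  // only at this point is the target known
}

// Two example callees sharing one signature.
std::int32_t add(std::int32_t a, std::int32_t b) { return a + b; }
std::int32_t mul(std::int32_t a, std::int32_t b) { return a * b; }
```

Nothing in call_indirect's arguments tells a compiler whether index will select add or mul, which is exactly why an optimizer without profile data has to treat the callee as unknown.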
Consider a simple C++ virtual dispatch:
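#include <vector>

struct Frame {};  // placeholder; the real frame type is not shown

class Renderer {
public:
    virtual ~Renderer() = default;
    virtual void draw(Frame& f) = 0;
};

// Calls draw() once per frame through the vtable.
void render_loop(Renderer* r, std::vector<Frame>& frames) {
    for (auto& f : frames) {
        r->draw(f);  // virtual call, becomes call_indirect in Wasm
    }
}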
When compiled to WebAssembly via Emscripten, that r->draw(f) becomes roughly:
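;; Call whatever function sits at the table index loaded from the vtable,
;; expecting signature (i32 i32) -> (): the `this` pointer and the frame pointer
(call_indirect (type $draw_sig)
  (local.get $this)                     ;; argument 0: Renderer* r
  (local.get $frame_ptr)                ;; argument 1: Frame& f
  (i32.load (local.get $vtable_idx)))   ;; table index, read from the vtable slot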
The optimizer sees this and, without profile data, can do almost nothing. It cannot inline the callee, propagate constants across the call boundary, or remove stores that look redundant, because the unknown callee might read them. Every call_indirect is a wall the optimizer cannot see through. The function body on the other side is opaque.
Speculative Inlining: The Profile-Guided Answer
The key insight behind speculative inlining is that most call sites, even dynamic ones, are monomorphic in practice. A vtable slot that dispatches to the same implementation in 99% of executions is more common than not, especially in generated code from statically typed languages. V8 collects call target feedback during execution in its baseline tier (Liftoff for WebAssembly), then uses that feedback when promoting hot functions to the optimizing compiler (Turboshaft).
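As a rough illustration of what "call target feedback" can mean, the sketch below keeps a small per-call-site record of which function indices were observed and how often. Everything here is an assumption made for the example: the names CallSiteFeedback and SiteKind, the cap of four tracked targets, and the layout are hypothetical and do not describe V8's actual feedback format.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical per-call-site feedback record; names and limits are illustrative.
enum class SiteKind { Uninitialized, Monomorphic, Polymorphic, Megamorphic };

struct CallSiteFeedback {
    struct Target {
        std::uint32_t func_index;  // which function the table lookup produced
        std::uint64_t count;       // how many times it was seen
    };
    static constexpr std::size_t kMaxTargets = 4;  // assumed cap before giving up
    std::vector<Target> targets;
    bool megamorphic = false;

    // Called by baseline code on every indirect call through this site.
    void record(std::uint32_t func_index) {
        if (megamorphic) return;
        for (Target& t : targets) {
            if (t.func_index == func_index) { ++t.count; return; }
        }
        if (targets.size() == kMaxTargets) {  // too many distinct callees
            megamorphic = true;
            targets.clear();
            return;
        }
        targets.push_back({func_index, 1});
    }

    // Read by the optimizing compiler when deciding how to specialize.
    SiteKind kind() const {
        if (megamorphic) return SiteKind::Megamorphic;
        if (targets.empty()) return SiteKind::Uninitialized;
        return targets.size() == 1 ? SiteKind::Monomorphic : SiteKind::Polymorphic;
    }
};
```

A site that only ever records one function index stays Monomorphic, which is the case where a single guard plus one inlined body pays off most.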
The structure of the optimized code for a speculative inline looks roughly like:
if (table[i] == expected_function) {
    // inlined body of expected_function
} else {
    // slow path: actual call_indirect + potential deopt
}
By inlining the expected callee, the optimizer can now see through the call boundary. Constant propagation, alias analysis, dead code elimination, and loop optimizations all become possible across what was previously an opaque barrier. For the render_loop example above, if the loop always calls the same Renderer subclass’s draw method, V8 can now optimize the loop body as a whole, opening the door to loop-level optimizations that the call_indirect barrier was blocking entirely.
For polymorphic call sites where two or three distinct callees appear with meaningful frequency, V8 generates a chain of such guards. Beyond a threshold, it falls back to the generic indirect call path.
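The guard chain can be pictured in ordinary C++ as a series of pointer comparisons, each followed by the callee's body pasted in place, with the generic call as the last resort. This is a hand-written sketch of the shape of such code, not compiler output; the functions twice, next, and negate and the Stats counter are hypothetical and exist only to make the paths observable.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

using Fn = std::int32_t (*)(std::int32_t);

// Two callees the (hypothetical) profile saw at this call site.
std::int32_t twice(std::int32_t x) { return 2 * x; }
std::int32_t next(std::int32_t x) { return x + 1; }
// A callee the profile never saw.
std::int32_t negate(std::int32_t x) { return -x; }

// Counts which path ran, purely for illustration.
struct Stats {
    int inlined = 0;
    int generic = 0;
};

// Shape of the optimized code for one call site after speculating on
// {twice, next}: one guard per expected target, then the generic call.
std::int32_t call_site(const std::vector<Fn>& table, std::uint32_t index,
                       std::int32_t arg, Stats& stats) {
    Fn target = table.at(index);  // at() throws when out of range, standing in for a trap
    if (target == twice) {        // guard 1
        ++stats.inlined;
        return 2 * arg;           // inlined body of twice
    }
    if (target == next) {         // guard 2
        ++stats.inlined;
        return arg + 1;           // inlined body of next
    }
    ++stats.generic;
    return target(arg);           // fallback: ordinary indirect call
}
```

Each extra guard adds a compare-and-branch to every call that misses the earlier guards, which is one reason a chain like this only makes sense for a small number of targets.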
Deoptimization: The Infrastructure Challenge
Adding deoptimization to WebAssembly is harder than it sounds, and this is where the real engineering work is concentrated. In JavaScript, deopts have been part of V8’s architecture since the Crankshaft era, refined through TurboFan, and carried into Turboshaft and Maglev. The mechanism works by mapping every point in optimized code back to a corresponding position in the unoptimized baseline, then reconstructing that state (local variables, stack values, program counter) when a deopt fires.
WebAssembly requires similar infrastructure. When the speculative guard fails because the actual callee is not the expected one, the engine needs to abandon the optimized frame and resume execution through the Liftoff baseline tier. The optimized Wasm code must carry enough metadata to reconstruct a Liftoff-compatible frame: every local variable value, the operand stack contents, and the current bytecode position.
This is complicated by the transformations Turboshaft applies. Registers may hold values that do not correspond one-to-one with Wasm locals. Intermediate computations may exist only in registers with no Wasm-level counterpart. Operations may have been hoisted out of loops or sunk into branches, breaking the correspondence with original instruction positions. The deopt mechanism must undo all of this, a process called materialization.
V8 generates deopt tables alongside optimized code, with one entry per possible deopt point. Each entry describes how to reconstruct each Wasm local and stack slot from whatever state the optimized code holds at that moment. On a deopt, the runtime walks this table and constructs the Liftoff frame in memory before resuming there.
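A simplified way to picture one deopt table entry is as a list of recipes, one per Wasm local and value-stack slot, each saying where the optimized code currently keeps that value. The sketch below is an assumption-heavy model: DeoptEntry, ValueRecipe, OptimizedState, and BaselineFrame are hypothetical types invented here, and V8's real encoding is considerably more compact and detailed.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical, simplified deopt metadata; V8's real encoding differs.
enum class Where { Register, StackSlot, Constant };

struct ValueRecipe {
    Where where;
    std::int64_t payload;  // register number, stack slot index, or the constant itself
};

struct DeoptEntry {
    std::uint32_t bytecode_offset;      // where baseline code should resume
    std::vector<ValueRecipe> locals;    // one recipe per Wasm local
    std::vector<ValueRecipe> operands;  // one recipe per value-stack slot
};

// Snapshot of the optimized frame at the moment the guard failed.
struct OptimizedState {
    std::vector<std::int64_t> registers;
    std::vector<std::int64_t> stack_slots;
};

// The frame layout the baseline tier expects (again, a stand-in).
struct BaselineFrame {
    std::uint32_t bytecode_offset;
    std::vector<std::int64_t> locals;
    std::vector<std::int64_t> operands;
};

// Recovers a single value according to its recipe.
std::int64_t recover_value(const ValueRecipe& r, const OptimizedState& s) {
    switch (r.where) {
        case Where::Register:  return s.registers.at(static_cast<std::size_t>(r.payload));
        case Where::StackSlot: return s.stack_slots.at(static_cast<std::size_t>(r.payload));
        case Where::Constant:  return r.payload;  // value was folded away; recreate it
    }
    return 0;  // unreachable for valid recipes
}

// Walks one deopt entry and rebuilds the baseline frame from optimized state.
BaselineFrame materialize(const DeoptEntry& entry, const OptimizedState& state) {
    BaselineFrame frame{entry.bytecode_offset, {}, {}};
    for (const ValueRecipe& r : entry.locals) frame.locals.push_back(recover_value(r, state));
    for (const ValueRecipe& r : entry.operands) frame.operands.push_back(recover_value(r, state));
    return frame;
}
```

The Constant case shows why the table is needed at all: a local that the optimizer folded into a constant no longer lives anywhere in the optimized frame, so the recipe has to carry the value itself.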
WebAssembly’s abstract machine specification, with its explicit locals, typed value stack, and structured control flow, gives this process a well-defined reconstruction target. The same property that makes WebAssembly predictable and portable is what makes deoptimization tractable: the spec defines exactly what state needs to be restored.
The Same Problem, Solved Before
V8 has been doing speculative inlining for JavaScript since before WebAssembly existed. The Java HotSpot JVM has had speculative devirtualization since the early 2000s, using inline caches at call sites to track which implementation appears most frequently and inlining it speculatively. The design pattern is well understood across the industry: collect type feedback at call sites, specialize the hot path, guard the specialization, deoptimize on guard failure.
The challenge is that each runtime has its own abstract machine, and deoptimization infrastructure is tightly coupled to that abstract machine’s representation. You cannot reuse JavaScript’s deopt tables for WebAssembly directly. The stack frame layouts differ, the local variable models differ, and the relationship between optimized code and source bytecode differs in each case. V8’s team had to build a parallel deopt system for Wasm that mirrors the JavaScript one in structure but targets Liftoff’s frame format rather than the interpreter’s.
SpiderMonkey, Firefox’s engine, has pursued somewhat different approaches to WebAssembly optimization. Mozilla’s Ion compiler for JavaScript has its own speculative optimization machinery, and the Wasm optimization tier has taken a different path at times. Each engine makes independent bets about where the profitable optimization work is, which is useful for the broader ecosystem even when the approaches diverge.
Performance Implications in Practice
The impact of speculative call_indirect inlining depends heavily on workload shape. Programs compiled from C++ with heavy virtual dispatch, Rust programs using trait objects at hot call sites, or any language runtime compiled to WebAssembly that uses function tables for dynamic dispatch all stand to benefit. The improvement is not just call overhead elimination; inlining unlocks subsequent optimizations that were invisible before.
Emscripten-compiled C++ programs are a clear example. A C++ class hierarchy where virtual methods are called in a tight loop will generate call_indirect instructions at each dispatch site. If the same derived class handles those calls most of the time, speculative inlining collapses the dispatch and lets the compiler treat the loop body as a unit. Redundant loads can be eliminated, constants can flow from the caller into the inlined body, and loop optimizations get a chance at code they could not previously see.
Server-side WebAssembly workloads, such as business logic or plugin sandboxes, tend to have more polymorphic call sites where the callee varies more frequently. For those cases, the speculative approach is less universally effective. The fallback path behavior matters, and V8’s implementation needs to handle multi-target call sites without making the common case pay for the uncommon case’s complexity.
What Changes About WebAssembly’s Performance Model
WebAssembly’s original pitch included predictable performance. The bytecode had a defined cost model, and you could reason about it without understanding a JIT compiler’s internal state. Adding deoptimization support means some WebAssembly code will now exhibit classic JIT behavior: fast after warm-up, potentially slower during warm-up, with occasional deopt events followed by re-optimization once the function becomes hot again.
For most workloads this is the right trade. A speculation that holds 99% of the time generates much better average-case throughput than no speculation at all, and the deopt path fires rarely by design. The practical result is faster programs, not less predictable ones in any meaningful sense.
But it does mark a shift in what WebAssembly runtimes are allowed to be. The line between WebAssembly and JavaScript, from a compiler architecture perspective, is blurrier than it was when the WebAssembly MVP shipped in 2017. Both are now subjects of the same speculative optimization machinery, with shared infrastructure for feedback collection, specialization, and recovery from wrong predictions.
For anyone reasoning about WebAssembly performance in production, the practical implication is this: the first few seconds of execution may produce different performance characteristics than the steady state, and monomorphic call sites will see the largest gains once the feedback has stabilized. Profiling Wasm workloads after warm-up gives a more accurate picture of the ceiling. That is a familiar story if you have spent time tuning JVM or V8 JavaScript workloads; it is newer territory for WebAssembly.
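One way to act on this in a benchmark is to run the workload a number of times before starting the clock. The helper below is a minimal sketch of that idea for a C++ workload that is itself compiled to WebAssembly; bench_after_warmup and BenchResult are hypothetical names, and the warm-up count is something to tune per workload, since no fixed number guarantees that tier-up has finished.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <functional>
#include <vector>

// Result of a warm-up-aware measurement (hypothetical helper type).
struct BenchResult {
    std::size_t total_calls;         // warm-up runs plus measured runs
    std::vector<double> samples_ms;  // durations of the measured runs only
    double median_ms;                // median of samples_ms
};

// Runs `work` warmup_runs times untimed, then times measured_runs further runs.
BenchResult bench_after_warmup(const std::function<void()>& work,
                               std::size_t warmup_runs, std::size_t measured_runs) {
    assert(measured_runs > 0);
    BenchResult result{0, {}, 0.0};
    for (std::size_t i = 0; i < warmup_runs; ++i) {
        work();  // gives the engine a chance to collect feedback and tier up
        ++result.total_calls;
    }
    for (std::size_t i = 0; i < measured_runs; ++i) {
        auto start = std::chrono::steady_clock::now();
        work();
        auto stop = std::chrono::steady_clock::now();
        result.samples_ms.push_back(
            std::chrono::duration<double, std::milli>(stop - start).count());
        ++result.total_calls;
    }
    std::vector<double> sorted = result.samples_ms;
    std::sort(sorted.begin(), sorted.end());
    result.median_ms = sorted[sorted.size() / 2];  // upper median for even counts
    return result;
}
```

Reporting the median rather than the mean keeps a single slow run, for example one that happens to include a deopt, from dominating the result.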