
How V8 Taught WebAssembly to Guess and Recover

Source: v8

Back in June 2025, the V8 team published a post about two new WebAssembly optimizations that shipped in Chrome M137: speculative call_indirect inlining, and deoptimization support for Wasm. The announcement is technically dense but undersells the harder half of the work. The inlining is the payoff. The deoptimization infrastructure is the foundation that made it possible, and Wasm had none of it before.

Let me explain what that means and why it matters.

What call_indirect Actually Does

WebAssembly does not have function pointers in the C sense. Instead, it has tables: typed arrays of function references. The call_indirect instruction takes a runtime integer index, looks up the function at that position in the table, checks that the function’s type signature matches the expected one, and calls it. If the bounds check fails, if the slot is null, or if the type check fails, the instruction traps.
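The checks described above can be modelled in a few lines of Python. This is a toy sketch of the instruction’s semantics, not V8 code: the names Trap, FuncRef, and call_indirect, and the tuple encoding of signatures, are made up for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Optional


class Trap(Exception):
    """Stands in for a WebAssembly trap."""


@dataclass(frozen=True)
class FuncRef:
    sig: tuple                # hypothetical signature encoding: (params, results)
    body: Callable[..., int]


def call_indirect(table: list[Optional[FuncRef]], index: int,
                  expected_sig: tuple, *args: int) -> int:
    # 1. Bounds check on the runtime index.
    if not 0 <= index < len(table):
        raise Trap("out of bounds table access")
    # 2. Null check on the table slot.
    entry = table[index]
    if entry is None:
        raise Trap("null table entry")
    # 3. Signature check against the type expected at the call site.
    if entry.sig != expected_sig:
        raise Trap("signature mismatch")
    # 4. The actual indirect call.
    return entry.body(*args)


I32_TO_I32 = (("i32",), ("i32",))
I32_I32_TO_I32 = (("i32", "i32"), ("i32",))

table = [
    FuncRef(I32_TO_I32, lambda x: x + 1),       # slot 0
    None,                                       # slot 1: empty
    FuncRef(I32_I32_TO_I32, lambda a, b: a * b) # slot 2
]
```

Calling call_indirect(table, 0, I32_TO_I32, 41) goes through all three checks and then runs the callee. Index 3, index 1, and index 2 with the wrong signature each raise Trap for a different reason.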

This is how Emscripten and wasi-sdk implement virtually everything that requires dynamic dispatch in C and C++. Virtual method calls compile to vtable lookups, which compile to call_indirect. Function pointers compile to call_indirect. Callbacks passed through the C ABI compile to call_indirect. Any non-trivial Wasm binary produced from a real C++ codebase is saturated with them.

The full dispatch sequence in machine code, before any optimization, looks roughly like this:

; Load table base
mov  rax, [wasm_instance + table_offset]
; Bounds check on the index
cmp  index, [rax + table_length_offset]
jae  trap_out_of_bounds
; Load the table entry
mov  rbx, [rax + index * 8 + elements_offset]
; Null check
test rbx, rbx
jz   trap_null_dereference
; Type signature check
mov  rcx, [rbx + canonical_sig_offset]
cmp  rcx, expected_sig_id
jne  trap_type_mismatch
; Actual indirect call
call [rbx + code_entry_offset]

Five memory loads. Three conditional traps. An indirect branch through a pointer. For code that calls the same virtual method in a tight loop, this is a significant cost, and the indirect branch is particularly hostile to the CPU’s branch predictor and the instruction cache.

The Speculative Answer

The observation that makes speculative inlining possible is straightforward: in real-world Wasm workloads, many call_indirect sites are monomorphic. A given virtual call site in a hot path may consistently dispatch to the same concrete function at runtime, even though the language spec permits any valid entry in the table.

V8’s Liftoff compiler, the baseline tier that handles every Wasm function on first execution, collects feedback at call_indirect sites. It records which function index was observed at each site. When Turboshaft, the optimizing compiler, compiles a hot function, it reads this feedback and classifies each call_indirect site:

  • Uninitialized: no calls seen yet, emit the generic sequence.
  • Monomorphic: one callee observed, emit a speculative inline.
  • Polymorphic: a small number of distinct callees observed, emit an inline cache with one guard per callee.
  • Megamorphic: too many callees to track, fall back to the generic sequence.
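The four-way classification can be sketched as a function over the recorded feedback. This is an illustration only: the cap MAX_POLYMORPHIC_TARGETS and the names SiteState and classify are assumptions made for the example, not V8’s actual threshold or data structures.

```python
from collections import Counter
from enum import Enum, auto

# Assumed cap on how many callees a polymorphic site tracks (illustrative value).
MAX_POLYMORPHIC_TARGETS = 4


class SiteState(Enum):
    UNINITIALIZED = auto()
    MONOMORPHIC = auto()
    POLYMORPHIC = auto()
    MEGAMORPHIC = auto()


# What the optimizing tier would emit for each state, per the list above.
STRATEGY = {
    SiteState.UNINITIALIZED: "generic sequence",
    SiteState.MONOMORPHIC: "speculative inline",
    SiteState.POLYMORPHIC: "inline cache, one guard per callee",
    SiteState.MEGAMORPHIC: "generic sequence",
}


def classify(observed: Counter) -> SiteState:
    """Classify a call site from a counter of observed callee indices."""
    distinct = len(observed)
    if distinct == 0:
        return SiteState.UNINITIALIZED
    if distinct == 1:
        return SiteState.MONOMORPHIC
    if distinct <= MAX_POLYMORPHIC_TARGETS:
        return SiteState.POLYMORPHIC
    return SiteState.MEGAMORPHIC


# Feedback as a baseline tier might record it: callee index -> hit count.
site_feedback = Counter([7, 7, 7, 7])
```

Here classify(site_feedback) yields SiteState.MONOMORPHIC. Adding a second observed index moves the site to POLYMORPHIC, and exceeding the assumed cap moves it to MEGAMORPHIC.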

For the monomorphic case, Turboshaft replaces the entire dispatch sequence with:

; Compare runtime index against the profiled value
cmp  index, known_callee_index
jne  deopt_stub
; [inlined body of the callee follows]

Every check collapses into a single integer comparison. The bounds check is subsumed by it: an index equal to the profiled one was in bounds when it was observed, and Wasm tables can grow but never shrink. The null check is unnecessary because the known function is not null. The type check is unnecessary because the callee’s type was verified at compile time. The indirect call disappears entirely because the callee’s body is now inline.

That last point compounds with everything downstream. An inlined callee exposes its operations to the caller’s optimizer. Load elimination, constant propagation, and vectorization can now work across what was previously an opaque call boundary. The win is often larger than the eliminated dispatch sequence suggests.

When the profiled callee is too large to inline within the budget, Turboshaft can still devirtualize without inlining: it replaces the indirect branch with a guarded direct call. That is cheaper than the full sequence, even though the callee body remains separate.
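As a rough illustration of the monomorphic shape, the sketch below specializes a loop for one profiled index. Everything in it is hypothetical (TABLE, KNOWN_CALLEE_INDEX, optimized_sum). Where the real optimized code would branch to a deopt stub on a failed guard, this toy version simply falls back to generic dispatch and counts the misses.

```python
from typing import Callable


def negate(x: int) -> int:
    return -x


def square(x: int) -> int:
    return x * x


TABLE: list[Callable[[int], int]] = [negate, square]
KNOWN_CALLEE_INDEX = 1  # hypothetical: profiling only ever saw index 1 (square)


def generic_dispatch(index: int, x: int) -> int:
    # Unoptimized path: bounds check, then an indirect call through the table.
    if not 0 <= index < len(TABLE):
        raise IndexError("out of bounds table access")
    return TABLE[index](x)


def optimized_sum(indices: list[int], xs: list[int]) -> tuple[int, int]:
    """Sum TABLE[i](x) over the inputs, specialized for the profiled index."""
    total = 0
    guard_failures = 0
    for index, x in zip(indices, xs):
        if index == KNOWN_CALLEE_INDEX:
            # Guard passed: this is the inlined body of square, no call at all.
            total += x * x
        else:
            # Guard failed: stand-in for leaving the optimized code.
            guard_failures += 1
            total += generic_dispatch(index, x)
    return total, guard_failures
```

With indices [1, 1, 1] and inputs [1, 2, 3] the guard never fails and the loop body is plain arithmetic. A single index 0 in the input takes the fallback path once.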

Deoptimization: The Part That Didn’t Exist

Speculative inlining is only sound if the speculation can fail safely. When the runtime index at a monomorphic site turns out to be different from the profiled value, the code has taken a branch to a deopt stub. That stub must reconstruct correct program state and continue execution on a path that will actually handle the unexpected case.

For JavaScript, this is old news. V8 has supported deoptimization for JS since the early Crankshaft days, refined it through TurboFan, and carries it forward in Turboshaft. When a JS speculation fails, the deoptimizer reconstructs an Ignition interpreter frame and resumes there. The infrastructure is mature, battle-tested, and present in every V8 release.

For WebAssembly, it did not exist. Wasm was treated as AOT-like: once Turboshaft compiled a function, it ran to completion with no fallback path. Speculative optimizations that could fail were impossible, not because of any fundamental limit in the compiler, but because there was nowhere to deopt to.

Building that infrastructure is the bulk of the engineering work in M137.

What Wasm deopt required

Every speculative check in Turboshaft Wasm code now carries a deopt entry. The deopt entry encodes: the Wasm bytecode offset corresponding to the failing instruction, a complete mapping of all live Wasm values from their current locations (machine registers, stack slots, or compile-time constants) to their positions on the Wasm value stack, and enough outer-frame information to reconstruct inlined frames if the failing site is inside an inlined callee.

When a deopt fires, the deoptimizer uses this metadata to reconstruct a valid Liftoff frame. Liftoff, the baseline compiler, serves as the deopt target here. In JavaScript, the target is the Ignition bytecode interpreter. In Wasm, Liftoff code is the closest equivalent: it is always compiled first, it is kept available, and it handles all valid Wasm inputs correctly.

Reconstructing a Liftoff frame from a Turboshaft frame is not trivial. Turboshaft may have reordered operations, eliminated redundant loads, and allocated values to different registers than Liftoff would use. The deopt metadata must be precise enough to reverse all of that. And if the failing site is inside an inlined callee, a single optimized frame stands in for several logical ones, so the deoptimizer must reconstruct multiple frames: the caller’s Liftoff frame and the callee’s.
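To make the shape of that metadata concrete, here is a toy model of a deopt entry and of frame reconstruction. The class and field names (DeoptEntry, FrameDescription, BaselineFrame, materialize) are invented for this sketch and do not correspond to V8’s internal types.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class InRegister:
    name: str


@dataclass(frozen=True)
class InStackSlot:
    slot: int


@dataclass(frozen=True)
class Constant:
    value: int


Location = Union[InRegister, InStackSlot, Constant]


@dataclass(frozen=True)
class FrameDescription:
    function_index: int
    bytecode_offset: int           # where the baseline code should resume
    values: tuple[Location, ...]   # live values, in baseline value-stack order


@dataclass(frozen=True)
class DeoptEntry:
    # Outermost caller first, innermost inlined callee last.
    frames: tuple[FrameDescription, ...]


@dataclass
class BaselineFrame:
    function_index: int
    bytecode_offset: int
    value_stack: list[int]


def read(loc: Location, registers: dict[str, int], stack: list[int]) -> int:
    # Fetch one live value from wherever the optimized code kept it.
    if isinstance(loc, InRegister):
        return registers[loc.name]
    if isinstance(loc, InStackSlot):
        return stack[loc.slot]
    return loc.value


def materialize(entry: DeoptEntry, registers: dict[str, int],
                stack: list[int]) -> list[BaselineFrame]:
    # One optimized frame expands into one baseline frame per logical function.
    return [
        BaselineFrame(f.function_index, f.bytecode_offset,
                      [read(loc, registers, stack) for loc in f.values])
        for f in entry.frames
    ]


entry = DeoptEntry(frames=(
    FrameDescription(3, 120, (InRegister("rax"), Constant(5))),   # caller
    FrameDescription(9, 14, (InStackSlot(1), InRegister("rbx"))), # inlined callee
))
frames = materialize(entry, {"rax": 10, "rbx": 20}, [99, 7])
```

The single optimized frame becomes two baseline frames here, one for the caller and one for the inlined callee, each with its value stack filled in from registers, stack slots, and constants.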

After the deopt, the function runs in Liftoff until it becomes hot again. On re-entry into Turboshaft compilation, the feedback now includes the index that triggered the deopt. Turboshaft can make a better decision: go polymorphic if there are two predictable callees, or give up speculation for that site if the pattern is truly unpredictable.

The code size cost

Deopt metadata has a size: every speculation point carries attached records. The V8 team limits the cost by placing deopt points only at call_indirect sites and similar speculative checks, rather than throughout the function as the JavaScript pipeline does. The estimated code-size overhead is in the low single-digit percentage range. That is acceptable, but it explains why deopt points are not sprayed indiscriminately through Wasm code.

How Other Runtimes Compare

The landscape is worth surveying briefly.

Wasmtime, used in server-side and edge deployments, uses Cranelift as its compiler backend. It can devirtualize call_indirect statically through whole-module analysis, or through an offline profile-guided optimization cycle where you profile a workload, then recompile. It does not do in-session adaptive speculation. That is the right tradeoff for a server runtime where you can control the compilation lifecycle, but a browser JIT must handle arbitrary Wasm modules without a preparation step.

SpiderMonkey in Firefox has a tiered Wasm pipeline with its own baseline compiler and an Ion-based optimizing tier on top. It has had inline caching for call dispatch for some time, but aggressive speculative call_indirect inlining with full deopt support has lagged behind what V8 shipped in M137. The JS deopt path in SpiderMonkey is mature; the Wasm equivalent is less so.

GraalWasm, built on the Truffle framework, has had speculative inlining with deoptimization for call_indirect for longer than V8. Truffle’s entire execution model is built around adaptive specialization, and every call_indirect site gets a polymorphic inline cache by default. The deopt infrastructure is shared with GraalVM’s JavaScript and other guest languages. The tradeoff is that GraalWasm’s peak throughput on compute-heavy Wasm is lower than V8’s Turboshaft, because Turboshaft generates tighter machine code.

WAMR, targeting embedded and IoT deployments, does not do adaptive speculation at all. The constraints are different: code size and startup latency matter more than peak throughput in those environments.

Table Mutation and Future Work

One complication that the V8 post touches on: Wasm tables are mutable. JavaScript code can call WebAssembly.Table.prototype.set() to replace a function at any index. If Turboshaft has speculatively inlined function X at a site because it always saw index 7 pointing to X, and then JavaScript replaces table[7] with function Y, the speculation is now wrong for every future call.

V8 must handle this by either invalidating and recompiling affected Turboshaft code when a table mutation occurs, or by restricting speculation to tables that have not been mutated after initialization. In practice, most real Wasm modules never mutate their function tables at runtime. The mutation capability exists primarily for dynamic linking use cases; statically linked Emscripten binaries initialize their tables once and leave them alone.
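The first of those two options, invalidating dependent code on mutation, can be sketched as a simple dependency registry. This is a hypothetical model for illustration (Table, OptimizedCode, and depend_on are invented names) and makes no claim about which strategy V8 actually uses.

```python
from typing import Callable, Optional


class OptimizedCode:
    """Stand-in for a compiled function that speculated on a table slot."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.valid = True


class Table:
    def __init__(self, entries: list[Optional[Callable[..., int]]]) -> None:
        self._entries = list(entries)
        # Slot index -> optimized code that assumed this slot's current value.
        self._dependents: dict[int, list[OptimizedCode]] = {}

    def get(self, index: int) -> Optional[Callable[..., int]]:
        return self._entries[index]

    def depend_on(self, index: int, code: OptimizedCode) -> None:
        # Called when the optimizer inlines whatever currently sits in `index`.
        self._dependents.setdefault(index, []).append(code)

    def set(self, index: int, func: Optional[Callable[..., int]]) -> None:
        # Mutating a slot invalidates every piece of code that speculated on it.
        self._entries[index] = func
        for code in self._dependents.pop(index, []):
            code.valid = False


table = Table([None] * 8)
table.set(7, lambda x: x + 1)          # function X lives at index 7
hot_loop = OptimizedCode("hot_loop")
table.depend_on(7, hot_loop)           # hot_loop inlined table[7]
```

Writing to an unrelated slot leaves hot_loop valid. Replacing table[7] flips hot_loop.valid to False, at which point a real engine would have to discard or recompile that code.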

Looking forward, there is call_ref, introduced by the typed function references proposal that WebAssembly GC builds on, which calls through a typed function reference rather than a table index. Function references carry more identity information than integer indices and are even more amenable to speculative inlining. The deopt infrastructure built for M137 directly enables that work. The call_indirect speculation is not the end state; it is the foundation.

What This Changes in Practice

The workloads that benefit most from M137 are OOP-heavy C++ compiled to Wasm: game engines, media codecs with pluggable implementations, serialization libraries, anything that uses virtual dispatch heavily. Numeric compute kernels with few or no indirect calls see no change.

For the affected workloads, the V8 team reported speedups in the 10 to 40 percent range on hot paths where inlining fires consistently. That range is wide because it depends entirely on what fraction of execution time was spent in call_indirect dispatch before. For a rendering engine where every frame calls dozens of virtual methods in tight loops, the upper end of that range is plausible.
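The dependence on dispatch fraction is the familiar Amdahl’s law relationship. The helper below is a back-of-the-envelope sketch with a hypothetical name and illustrative inputs, not measured data, and it ignores the extra downstream gains that inlining can unlock.

```python
def overall_speedup(dispatch_fraction: float, dispatch_speedup: float) -> float:
    """Amdahl's law: whole-program speedup when only the dispatch part gets faster.

    dispatch_fraction: share of run time spent in call_indirect dispatch (0..1).
    dispatch_speedup:  how many times faster that share becomes (> 0).
    """
    if not 0.0 <= dispatch_fraction <= 1.0:
        raise ValueError("dispatch_fraction must be between 0 and 1")
    if dispatch_speedup <= 0.0:
        raise ValueError("dispatch_speedup must be positive")
    remaining = 1.0 - dispatch_fraction
    return 1.0 / (remaining + dispatch_fraction / dispatch_speedup)
```

Assuming, purely for illustration, that dispatch becomes ten times cheaper: a workload that spent 10 percent of its time in dispatch speeds up by a factor of about 1.10, while one that spent 30 percent there speeds up by about 1.37.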

The mechanism is not magic. It is the same adaptive speculation that JavaScript JITs have used for fifteen years, applied to a language where it was previously absent. The work is in the deoptimization infrastructure that makes speculation recoverable, and that infrastructure is what took the engineering effort. The performance numbers are the visible outcome of building something that was missing.
