· 6 min read ·

WebAssembly Isn't a Pure Stack Machine, and That's a Feature

Source: eli-bendersky

There’s a recurring debate in the compiler nerd corners of the internet about whether WebAssembly really qualifies as a stack machine. The latest round was kicked off by a post titled “Wasm is not quite a stack machine,” which Eli Bendersky responded to thoughtfully. His take is that the disagreement is largely semantic, since there’s no formal definition of what makes a stack machine pure. I think he’s right, but the more interesting question is why WebAssembly ended up looking the way it does, and what the design tells us about the gap between abstract VM models and the realities of running compiled code at near-native speed.

What WebAssembly actually looks like

A WebAssembly function is a sequence of instructions that operate on an implicit operand stack, plus a fixed set of typed locals declared in the function header. Locals include the function’s parameters and any additional slots the producer asked for. You read a local with local.get (which pushes its value onto the stack) and write one with local.set (which pops). There’s also local.tee, which writes the top of the stack to a local without popping, useful when you want to keep a value around for the next instruction.

Here’s the example Eli cites, in WebAssembly text format:

(func (export "add_to_byte") (param $addr i32) (param $delta i32)
  (i32.store8
    (local.get $addr)
    (i32.add
      (i32.load8_u (local.get $addr))
      (local.get $delta))))

The text format is folded into S-expressions for readability, but the underlying bytecode is flat: push addr, push addr, load8, push delta, add, store8. Each operation consumes its inputs from the stack and pushes its result, exactly the way a stack machine is supposed to work. What makes it not quite pure is that there are no dup, swap, over, rot, or pick instructions. If you need a value twice, you store it in a local and read it twice. The WebAssembly specification’s instruction list confirms this: parametric instructions are limited to drop and select, and that’s it for stack manipulation.

Compare this with Forth, where dup, swap, over, rot, pick, roll, tuck, nip, and a dozen more are the bread and butter. Eli’s earlier post on implementing Forth in Go and C shows exactly what programming in that style feels like, and the canonical +! definition with its stack-shape comments illustrates the cost: you cannot read Forth without mentally simulating the stack on every line.

Why the design choice matters

WebAssembly was never meant to be hand-written. It’s a compiler target, and the producers are LLVM, the Go compiler, Rust’s rustc, AssemblyScript, and so on. Those compilers already work in terms of SSA values and virtual registers. A pure stack representation would force them to invent stack-shuffling sequences that the engine then has to undo to produce good native code. By providing locals as a register file, the format meets compilers halfway: the compiler emits something close to its internal IR, and the engine has an easier time mapping locals to physical registers.

Andreas Rossberg and the original WebAssembly authors discuss this in the PLDI 2017 paper “Bringing the Web up to Speed with WebAssembly”. They point out that the stack-based encoding is chosen mainly for compactness; the validation algorithm and the type system are defined over the stack, but the execution model is not required to materialize the stack at runtime. Production engines like V8’s Liftoff and TurboFan, SpiderMonkey’s tiered compiler, and standalone runtimes like Wasmtime all translate the stack form into register-allocated machine code. The stack is a serialization format, not a runtime data structure.

This is why the missing dup and swap are not a hardship. If WebAssembly let you duplicate the top of the stack, the producer would happily emit it whenever its IR had two uses of a value. The engine would then have to recognize that pattern and recover the underlying SSA edge to allocate a register. Forcing the producer to spell out the dataflow through a named local makes the dependency explicit and removes a layer of analysis from the engine. The cost is two extra bytes per duplicated use, which the binary format mitigates with LEB128 encoding for local indices.

How other VMs handle the same problem

The JVM is a useful comparison. It’s also stack-based, also has locals, and also descends from a tradition that wanted bytecode to be small and verifiable. But it does include dup, dup_x1, dup2, swap, and a handful of variants. The reason is historical: javac was written to emit straightforward stack code without a register allocator, and these instructions let the compiler avoid touching the local table for short-lived intermediate values. Modern HotSpot still has to undo this in its C1 and C2 tiers. Cliff Click’s writing on HotSpot’s IR describes the lengths the JIT goes to in order to recover SSA from stack-shaped bytecode.

CPython’s bytecode is stack-based too and includes DUP_TOP (renamed COPY in 3.11) and SWAP. CPython doesn’t have a JIT in the mainline interpreter, so the stack is the runtime data structure, and these manipulation ops are cheap. Once you start JITing, as the Faster CPython work and projects like Cinder have, stack manipulation becomes the same kind of obstacle it is for the JVM.

The pattern is consistent. VMs that expect to be interpreted on a real stack benefit from rich stack instructions. VMs that expect to be JITed benefit from explicit dataflow through locals. WebAssembly committed to the latter from day one, which is why it can hit 80 to 95 percent of native performance on benchmarks like PolyBench/C, as reported in the original PLDI paper and corroborated by later studies such as “Not So Fast” at USENIX ATC 2019.

The validation angle

The other reason to keep stack manipulation minimal is verification. WebAssembly modules must pass a single-pass type-checking algorithm before execution. The validator simulates an abstract stack of types, pushing and popping as it walks the instruction stream, and rejects modules where the types don’t line up. Every additional stack instruction is another case in the validator and another opportunity for spec ambiguity. The JVM’s verifier famously had to be redesigned with stack maps in Java 6 because the original inference-based verifier was too slow and too subtle. WebAssembly’s designers had that history in front of them and chose a smaller instruction set partly to keep verification simple and fast.

The upcoming WebAssembly GC proposal and the exception handling proposal both extend the type system substantially, and the working group has been careful to keep the validation rules local and decidable. Adding dup would not break this, but adding pick n or roll n would, since the type at depth n depends on the entire prefix of the stack. The Forth-style stack vocabulary is fundamentally at odds with the kind of structural type checking WebAssembly relies on.

Where I land

Calling WebAssembly a stack machine is fine as shorthand. It uses a stack for operand passing, its instructions are encoded in postfix order, and its validation algorithm is defined over a typed stack. Calling it a pure stack machine is a stretch, but purity was never the goal. The format is a compact, verifiable, compiler-friendly serialization of dataflow, and the locals are there because real producers and real engines both want them. The absence of dup and swap is not an oversight; it’s the same instinct that led LLVM to forbid implicit stack effects in its IR. If you want a pure stack machine, Forth is still right there, and Eli’s implementation post is a good place to start. If you want a portable compile target that engines can turn into fast native code, the hybrid WebAssembly chose is the better trade.

Was this interesting?