· 6 min read ·

WebAssembly Isn't a Stack Machine, and That's the Whole Point

Source: eli-bendersky

There’s a recurring argument in language-design circles about whether WebAssembly really qualifies as a stack machine. The latest round was kicked off by a post titled “Wasm is not quite a stack machine,” and Eli Bendersky followed up with his own thoughts pointing out that the disagreement is largely semantic: there’s no canonical definition of stack machine, and WASM clearly uses a stack as its primary interaction model even if it bolts an infinite register file onto the side.

I think the more interesting question isn’t whether WASM clears some purity bar. It’s why the designers chose this hybrid shape in the first place, and what it costs them. The short answer: WASM was never trying to be Forth. It was trying to be a compact, fast-to-validate compiler target for languages that look nothing like Forth, and the locals are load-bearing for that goal.

What the spec actually says

The WebAssembly Core Specification describes execution in terms of a stack of values and a stack of control frames. Instructions like i32.add pop two values and push one. So far this is textbook stack-machine behavior, the same shape you’d see in the JVM or CPython’s bytecode interpreter.

Where it diverges from a pure stack machine is the local variable space. Every function declares a fixed number of locals (including parameters), and you access them via local.get $i, local.set $i, and local.tee $i. These are indexed slots, not stack positions. There is no dup, no swap, no rot, no pick. If you want to use a value twice, you either compute it twice, or you stash it in a local and read it back.

This is the heart of the complaint. A purist would say the stack should be sufficient, and the existence of an unbounded indexed register file undermines the claim. A pragmatist would shrug and ask why it matters.

Why no dup and swap

The absence of generic stack-manipulation instructions isn’t an oversight. It comes out of two constraints that shaped WASM from the beginning: validation and single-pass compilation.

WASM modules must be statically validated before execution. The validator walks the instruction stream once and checks that the operand stack has the expected types at every point. This is what lets browsers compile WASM to native code quickly and safely. The validation algorithm is published in the spec as a reference, and it’s intentionally simple: a small state machine tracking a type stack and a control stack.

With only typed pops and pushes from instructions whose signatures are fixed, the validator’s job stays trivial. Add dup and you’ve added an instruction whose output type depends on whatever’s currently on top of the stack. That’s still tractable, but it complicates every downstream tool that does type inference, including streaming compilers that want to commit to a register allocation before they’ve seen the rest of the function. Add pick n and you’re now reasoning about arbitrary stack depths during validation.

Locals sidestep this entirely. Each local has a declared type, fixed for the lifetime of the function. The compiler in the engine can map each local to a machine register or stack slot once, up front, and never revisit the decision. Browser engines like V8’s Liftoff baseline compiler depend on this property to emit code in a single pass with predictable register pressure.

The compiler-target argument

The other half of the answer is that WASM is not handwritten. The design rationale is explicit about this: the source languages are C, C++, Rust, Go, and the like, all of which have SSA-based intermediate representations in their compilers. None of them naturally emit Forth-style stack juggling. They emit something like LLVM IR, where every value has a name and every use refers to that name.

Lowering SSA to a stack machine with locals is mechanical. Each SSA value becomes either a stack push or a write to a local, depending on whether it has one use or many. Values with a single use can stay on the stack; values with multiple uses get spilled to a local and read back as needed. That’s basically what Binaryen and LLVM’s WASM backend do, and the stack ends up being a peephole optimization rather than the programmer’s working memory.

Lowering SSA to a pure stack machine with dup and swap is harder. You’d need to schedule values so that they appear on the stack in the right order at the right depth, and that ordering problem gets ugly fast for any nontrivial expression. The classic paper on this is Koopman’s “A Preliminary Exploration of Optimized Stack Code Generation”, which catalogs the heuristics required.

Locals are the escape hatch that makes the lowering tractable. They’re also why WASM code generated by LLVM tends to look like an SSA dump with a thin stack veneer rather than idiomatic Forth.

How it compares to the JVM and CPython

The JVM made the same compromise three decades earlier. The JVM specification defines an operand stack per frame, plus a local variable array addressed by iload, istore, aload, and friends. There are stack manipulation instructions in the JVM, dup, dup_x1, swap, pop2, but they’re used sparingly by javac and primarily exist to handle expression evaluation patterns like leaving a copy of an object reference on the stack for a chained method call.

CPython’s bytecode is closer to a pure stack machine. It has DUP_TOP, ROT_TWO, ROT_THREE and friends, and the compiler uses them freely because Python’s evaluation model is dynamically typed and the interpreter doesn’t try to do single-pass validation in the same way a WASM engine does. In Python 3.11 a lot of this was cleaned up with the adaptive interpreter, but the underlying ISA is still recognizably stack-flavored.

WASM picked a point further along the spectrum toward register machines. Locals give it the SSA-friendliness of a register IR while keeping the encoding compact and the validator simple. The cost is that hand-written WASM reads like a slightly awkward assembly language rather than a meditation on concatenative programming.

What you give up

The Forth example Bendersky cites is instructive. The +! word is six tokens and would be roughly six WASM instructions, but the WASM version names its addresses and values explicitly:

(func (export "add_to_addr") (param $addr i32) (param $delta i32)
  (i32.store
    (local.get $addr)
    (i32.add
      (i32.load (local.get $addr))
      (local.get $delta))))

The WASM version is longer in tokens but trivially easier to read, because every value has a name. Forth’s terseness comes from the implicit data flow through the stack, and that terseness is exactly what makes nontrivial Forth code hard to follow without stack-effect comments. WASM trades the brevity for legibility, and more importantly, for a structure that compiles cleanly from upstream IRs.

The interesting wrinkle is that WASM’s text format (.wat) lets you write expressions in folded S-expression style, which hides the stack entirely. The flat instruction sequence underneath still uses the stack, but no human ever has to track depth manually. That’s a tell: even WASM’s designers think the stack is an implementation detail, not a programming model.

The honest framing

Calling WASM a stack machine is fine as shorthand. It’s accurate enough that the reference interpreter is genuinely stack-based, and the binary encoding really is a postfix instruction stream. But the design pressure that produced WASM came from compiler engineering, not from the lineage that produced Forth and PostScript. The stack is there because it gives a compact, easy-to-validate encoding; the locals are there because real source languages need them; the absence of dup and swap is there because nobody writing a C compiler wants to emit them.

The debate over purity is a fun parlor game, but the practical takeaway is that WASM is a deliberate hybrid, and the hybrid is the whole reason it ships fast bytecode in every browser. A pure stack machine would have been more elegant and harder to use. A pure register machine would have been bulkier on the wire. The middle path is what you get when the goal is a portable target for existing toolchains, and the existing toolchains all speak SSA.

Was this interesting?