What Actually Makes Wren Fast: Inside a Scripting VM Built for Embedding

Source: lobsters

Lua has dominated embedded scripting for decades. Game engines, config systems, plugin architectures, and network appliances all reach for Lua when they need a lightweight, fast, embeddable language. Wren, created by Bob Nystrom (who also wrote the excellent Crafting Interpreters), positions itself as a modern alternative in that same space. The performance page on wren.io shows benchmark results that hold up favorably against Python, Ruby, and standard PUC-Rio Lua. But the numbers are less interesting than the design decisions that produce them.

The Value Representation Problem

Every dynamic language runtime faces the same foundational question: how do you represent a value that might be a number, a string, a boolean, a null, or an object, without knowing its type at compile time?

The naive answer is a tagged union: a struct with a type enum alongside a union field. This is simple and works, but it means every value occupies at least 16 bytes on a 64-bit system (8 for the tag and padding, 8 for the union), and every operation requires branching on the tag before doing any real work.
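To make the size and branching cost concrete, here is a minimal sketch of the tagged-union layout described above. The type and function names are hypothetical and chosen only for illustration, and the 16-byte size assumes a typical 64-bit ABI where the union is 8-byte aligned.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical type tags for illustration; not taken from any real runtime. */
typedef enum { TAG_NULL, TAG_BOOL, TAG_NUM, TAG_OBJ } ValueTag;

typedef struct {
    ValueTag tag;          /* usually 4 bytes, followed by 4 bytes of padding */
    union {
        bool   boolean;
        double number;
        void  *object;
    } as;                  /* 8 bytes, 8-byte aligned on typical 64-bit ABIs */
} TaggedValue;

/* Every operation has to inspect the tags before touching the payload. */
static bool tagged_add(TaggedValue a, TaggedValue b, TaggedValue *out) {
    if (a.tag != TAG_NUM || b.tag != TAG_NUM) return false;
    out->tag = TAG_NUM;
    out->as.number = a.as.number + b.as.number;
    return true;
}
```

An array of such values spends half its memory on tags and padding, which is the overhead the next technique removes.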

Wren uses NaN boxing instead. This technique exploits the structure of IEEE 754 double-precision floating-point numbers. A double has one sign bit, 11 exponent bits, and 52 mantissa bits. When all 11 exponent bits are set to 1 and the mantissa is non-zero, the value is a NaN (not a number). That leaves roughly 2^52 distinct NaN bit patterns for each value of the sign bit, and hardware arithmetic only ever produces a handful of them. The rest are unused payload space waiting to be repurposed.

On 64-bit systems, user-space pointers only occupy 48 bits of address space (under current AMD64 conventions). A pointer fits cleanly inside the mantissa of a NaN. So Wren packs every non-double value, whether it’s a boolean, null, or heap pointer, into a NaN payload. When a value is a valid double (and not a NaN), it’s just a number. When it’s a NaN, the payload encodes both the type and the data.
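The packing can be sketched in a few lines of C. This is an illustrative layout under stated assumptions, not a copy of Wren's source: the macro and function names, the singleton tag values, and the choice of the sign bit to mark pointers are all assumptions made for the example. It also assumes pointers fit in the low 48 bits, as the paragraph above describes.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

typedef uint64_t Value;   /* every value is one 64-bit word */

#define SIGN_BIT ((uint64_t)1 << 63)
/* All 11 exponent bits, the quiet-NaN bit, and one extra mantissa bit so
   that NaNs produced by ordinary arithmetic do not match this pattern. */
#define QNAN     ((uint64_t)0x7ffc000000000000)

/* Singletons live in the low bits of the NaN payload (assumed tag values). */
#define VAL_NULL  (QNAN | 1)
#define VAL_FALSE (QNAN | 2)
#define VAL_TRUE  (QNAN | 3)

/* memcpy reinterprets the bits without violating aliasing rules. */
static Value  num_to_value(double d) { Value v; memcpy(&v, &d, sizeof v); return v; }
static double value_to_num(Value v)  { double d; memcpy(&d, &v, sizeof d); return d; }

/* A value is a number unless all of the QNAN bits are set. */
static bool is_num(Value v) { return (v & QNAN) != QNAN; }

/* Heap pointers are marked by QNAN plus the sign bit. */
static bool is_obj(Value v) { return (v & (QNAN | SIGN_BIT)) == (QNAN | SIGN_BIT); }

static Value obj_to_value(void *p) { return SIGN_BIT | QNAN | (uint64_t)(uintptr_t)p; }
static void *value_to_obj(Value v) { return (void *)(uintptr_t)(v & ~(SIGN_BIT | QNAN)); }

/* Numeric fast path: one mask-and-compare per operand, then plain double math. */
static bool boxed_add(Value a, Value b, Value *out) {
    if (!is_num(a) || !is_num(b)) return false;
    *out = num_to_value(value_to_num(a) + value_to_num(b));
    return true;
}
```

Numbers need no decoding at all in this scheme: the stored bits are the double itself, so only non-number values pay for masking.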

The result is that every Wren value is exactly 8 bytes. No boxing, no heap allocation for small values, no separate type word. Checking whether a value is a number takes a single mask-and-compare on the 64-bit word, after which arithmetic runs directly on the double. LuaJIT uses the same technique, while PUC-Rio Lua uses a tagged union. Most of LuaJIT’s lead over standard Lua on numeric benchmarks comes from its JIT compiler and hand-tuned interpreter rather than from value representation alone, but the compact representation is part of the picture. Wren gets that part of the advantage from day one, in the reference implementation, without needing a JIT compiler.

Compiling Without an AST

Most language implementations follow a familiar pipeline: lex tokens, build an abstract syntax tree, walk the AST to generate bytecode or IR, optionally run optimization passes, then execute. Wren skips the AST entirely.

The compiler is a single-pass, recursive-descent compiler that emits bytecode directly while parsing. There is no intermediate tree representation. This keeps the compiler small and fast to invoke, which matters when you’re embedding a language in a host application and may be loading scripts frequently.
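A toy version of the idea, for arithmetic expressions only, might look like the sketch below. The opcodes, the `MiniCompiler` struct, and the grammar are hypothetical and far smaller than anything Wren does. The point is only that each parsing function emits bytecode as soon as it recognizes a construct, so no tree is ever built.

```c
#include <ctype.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical opcodes for illustration; not Wren's instruction set. */
enum { OP_CONST, OP_ADD, OP_MUL, OP_RETURN };

typedef struct {
    const char *src;      /* next unread character of the source */
    uint8_t code[256];    /* bytecode emitted so far */
    size_t count;
    int ok;               /* cleared on the first syntax error */
} MiniCompiler;

static void emit(MiniCompiler *c, uint8_t byte) {
    if (c->count < sizeof c->code) c->code[c->count++] = byte;
    else c->ok = 0;
}

static void skip_spaces(MiniCompiler *c) {
    while (*c->src == ' ') c->src++;
}

static void expression(MiniCompiler *c);

/* primary := NUMBER | '(' expression ')' */
static void primary(MiniCompiler *c) {
    skip_spaces(c);
    if (*c->src == '(') {
        c->src++;
        expression(c);
        skip_spaces(c);
        if (*c->src == ')') c->src++; else c->ok = 0;
    } else if (isdigit((unsigned char)*c->src)) {
        int n = 0;
        while (isdigit((unsigned char)*c->src)) {
            n = n * 10 + (*c->src++ - '0');
            if (n > 255) { c->ok = 0; return; }  /* one-byte operand in this sketch */
        }
        emit(c, OP_CONST);
        emit(c, (uint8_t)n);
    } else {
        c->ok = 0;
    }
}

/* term := primary ('*' primary)* */
static void term(MiniCompiler *c) {
    primary(c);
    for (skip_spaces(c); *c->src == '*'; skip_spaces(c)) {
        c->src++;
        primary(c);
        emit(c, OP_MUL);  /* emitted as soon as the right operand is parsed */
    }
}

/* expression := term ('+' term)* */
static void expression(MiniCompiler *c) {
    term(c);
    for (skip_spaces(c); *c->src == '+'; skip_spaces(c)) {
        c->src++;
        term(c);
        emit(c, OP_ADD);
    }
}

/* Returns 1 on success; the bytecode is left in c->code[0..c->count). */
static int mini_compile(MiniCompiler *c, const char *source) {
    c->src = source;
    c->count = 0;
    c->ok = 1;
    expression(c);
    skip_spaces(c);
    if (*c->src != '\0') c->ok = 0;
    emit(c, OP_RETURN);
    return c->ok;
}
```

Compiling `1 + 2 * 3` with this sketch emits the constants in source order followed by `OP_MUL`, `OP_ADD`, and `OP_RETURN`. Operator precedence falls out of the call structure rather than from a tree walk.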

The tradeoff is that certain optimizations become impossible or much harder. Dead code elimination, constant folding across expressions, and any other analysis that needs to see the whole program before emitting code are off the table or awkward to bolt on. Wren accepts this constraint. The performance comes from the runtime, not the compiler, and every optimization pass left out is code that does not complicate the implementation or increase binary size.

The Wren site describes the VM implementation as under 4,000 semicolons of C. That is a meaningful constraint. It fits in your head, it compiles in seconds, and it can be audited. If you are embedding a scripting language in a game or a tool, you often care about that kind of surface area.

The VM: Stack-Based, Bytecode-Driven

Wren’s VM is a classic stack-based bytecode interpreter. Instructions operate on a value stack, pushing and popping 8-byte NaN-boxed values. The bytecode instruction set is compact and covers the typical operations: load/store locals and upvalues, arithmetic, comparisons, method calls, jumps, and a handful of others.
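The core of such an interpreter is a fetch-decode-execute loop over a byte array. The sketch below uses a made-up four-instruction set and plain doubles instead of NaN-boxed values to stay short. The opcode names and the `mini_vm_run` function are assumptions for illustration, and the loop trusts its input where a real VM would validate bytecode and stack depth.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical opcodes for illustration; not Wren's instruction set. */
enum { VM_CONST, VM_ADD, VM_MUL, VM_RETURN };

#define VM_STACK_MAX 64

/* Runs bytecode until VM_RETURN and returns the value on top of the stack.
   No bounds checking: this sketch assumes well-formed bytecode. */
static double mini_vm_run(const uint8_t *code, const double *constants) {
    double stack[VM_STACK_MAX];
    double *sp = stack;          /* points one past the top of the stack */
    const uint8_t *ip = code;    /* instruction pointer */

    for (;;) {
        switch (*ip++) {
        case VM_CONST:           /* operand byte indexes the constant table */
            *sp++ = constants[*ip++];
            break;
        case VM_ADD: {
            double b = *--sp;
            double a = *--sp;
            *sp++ = a + b;
            break;
        }
        case VM_MUL: {
            double b = *--sp;
            double a = *--sp;
            *sp++ = a * b;
            break;
        }
        case VM_RETURN:
            return *--sp;
        default:                 /* unknown opcode: bail out */
            return 0.0;
        }
    }
}
```

Operands never name registers or stack slots explicitly. Each instruction implicitly consumes the top of the stack, which is what keeps the encoding compact.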

Method dispatch is where class-based languages live or die on performance benchmarks. When you write myObject.doThing(), the VM needs to look up the method on the object’s class. Wren avoids a hash table lookup here. The compiler assigns each distinct method signature a small integer symbol, each class stores a flat array of method entries indexed by that symbol, and a method call at runtime is a single array index. This is faster than a hash table probe because there is no hashing at call time, no collision handling, and no pointer chasing into a separate hash structure.
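A stripped-down sketch of array-indexed dispatch follows. It assumes, as Wren's performance page describes, that each signature is mapped to a small integer when the call site is compiled. The struct layouts, the fixed table size, and all names here are hypothetical simplifications.

```c
#include <stddef.h>
#include <string.h>

#define MAX_SYMBOLS 32

typedef double (*MethodFn)(double receiver);

/* Shared table mapping each distinct method signature to a small integer. */
typedef struct {
    const char *names[MAX_SYMBOLS];
    int count;
} SymbolTable;

/* Compile-time step: intern a signature and return its symbol index.
   The linear search is fine here because it never runs during a call. */
static int symbol_ensure(SymbolTable *t, const char *signature) {
    for (int i = 0; i < t->count; i++) {
        if (strcmp(t->names[i], signature) == 0) return i;
    }
    if (t->count == MAX_SYMBOLS) return -1;
    t->names[t->count] = signature;
    return t->count++;
}

typedef struct {
    MethodFn methods[MAX_SYMBOLS];  /* indexed by symbol; NULL if undefined */
} MiniClass;

/* Runtime step: dispatch is one array index plus a NULL check by the caller. */
static MethodFn find_method(const MiniClass *cls, int symbol) {
    if (symbol < 0 || symbol >= MAX_SYMBOLS) return NULL;
    return cls->methods[symbol];
}

/* Example method body used to populate a class in the usage below. */
static double method_twice(double receiver) { return receiver * 2.0; }
```

The string comparison happens once per call site at compile time. The bytecode then carries only the integer, so the hot path never touches the signature text. The cost is memory: every class needs a slot for every symbol it might receive, which a real implementation has to manage more carefully than this fixed-size array does.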

The benchmarks on the Wren performance page cover microbenchmarks: Fibonacci, loop counting, method call overhead, binary trees. These are the standard suite you see across language benchmark comparisons. Wren competes well with standard Lua on these tasks, runs clearly ahead of CPython on numeric-heavy workloads, and trails LuaJIT by a significant margin (which is expected; LuaJIT is one of the most impressive pieces of software in the embedded scripting space and traces hot loops to native code).

What the Benchmarks Don’t Tell You

Microbenchmarks measure what they measure. Fibonacci exercises numeric arithmetic and recursive call overhead. The binary trees benchmark exercises allocation and garbage collection. These are real costs, but they don’t capture what an embedded scripting language typically does: calling back into host C functions, marshaling values across the language boundary, and running moderately complex logic that mixes string handling, object traversal, and occasional math.

Wren’s C API is clean and low-overhead. You can call Wren functions from C and C functions from Wren with minimal friction. The embedding story is one of Wren’s genuine strengths: you do not need to manage a separate state machine or fight with a complicated FFI. The embedding guide shows how little ceremony is involved in getting a WrenVM instance running, registering foreign methods, and invoking scripts.

For a game engine scenario where scripts run every frame and call into C subsystems for physics queries or audio triggers, the boundary crossing cost matters more than raw Fibonacci throughput. Wren’s design keeps that boundary thin.

Comparing the Landscape

If you’re choosing an embedded scripting language in 2026, the realistic options are Lua, LuaJIT, Wren, and, depending on constraints, AngelScript, Squirrel, or mruby.

LuaJIT is the performance king for numeric and loop-heavy workloads, but it is a substantial commitment. The codebase is famously complex, active maintenance has been limited, and the JIT does not target all platforms (ARM64 support was incomplete for years). PUC-Rio Lua is simpler and more portable but slower.

Wren offers class-based OOP instead of Lua’s prototype-via-tables model, which some people find more natural, especially when the host application is also object-oriented. The fiber-based concurrency model lets scripts yield and resume without OS threads, which is useful for coroutine-style game logic. The syntax is clean and modern without being heavy.

mruby (embeddable Ruby) has a much larger standard library and a more expressive language, but it comes with more weight and typically runs slower on the microbenchmarks that matter for tight scripting loops.

The Point of the Performance Page

Nystrom built Wren to prove that a clean, class-based scripting language designed for embedding did not have to sacrifice performance to get there. The performance page exists as a statement: you are not choosing between good design and good speed.

The techniques are not novel. NaN boxing predates Wren, single-pass compilation is older still, and bytecode VMs are well-understood. What Wren demonstrates is that careful selection and combination of known techniques, applied with discipline and a strict constraint on codebase size, produces something competitive with the incumbent. No JIT, no heroics, no magic: just a well-engineered small VM.

If you’ve been reaching for Lua by default and have not looked at Wren, the performance page is a reasonable place to start. The source is on GitHub and reading it alongside Crafting Interpreters gives you a complete picture of both the what and the why.
