Building a fast interpreter for a dynamic language is one of those problems where the obvious approach works until you care about performance, and then you have to care about everything at once.
The Zef language implementation notes walk through the decisions behind their interpreter in useful detail, and they make a good lens for understanding why this problem is harder than it looks. Most of the techniques that make dynamic language interpreters fast have been known for decades, but assembling them correctly requires understanding how they interact.
Value Representation: The Foundation
Before you dispatch a single instruction, you need to decide how values are stored in memory. This decision ripples through everything else.
The naive approach is a tagged union: a type tag plus a union of possible payloads. This works, but it wastes space (typically 16 bytes per value on 64-bit systems due to alignment) and incurs a branch on nearly every operation.
The two dominant alternatives are NaN boxing and pointer tagging.
NaN boxing exploits the IEEE 754 double-precision float format. A 64-bit double has 11 exponent bits; when all of them are set to 1 and the mantissa is nonzero, you have a NaN. There are 2^52 - 1 possible NaN bit patterns, giving you 51 bits of payload. On 64-bit systems, pointers only use 48 bits on most platforms, so you can store any heap pointer inside a NaN payload. Floats are stored as-is. Integers, booleans, null, and undefined get their own reserved NaN patterns.
This is the technique used by LuaJIT, JavaScriptCore, and historically SpiderMonkey. The payoff is that every value fits in one 64-bit word, comparisons are cheap, and you can pass values in registers.
Pointer tagging takes the opposite approach: steal the low bits of a pointer. Because most heap objects are at least 4 or 8-byte aligned, the low 2 or 3 bits of any pointer are always zero. You can encode small integers, booleans, and other immediates directly into those bits. This is what Ruby uses, what many Smalltalk implementations have used since the 1970s, and what CPython leans on for object identity.
The tradeoff is mostly about what your language’s most common value types are. If floating-point is central, NaN boxing avoids heap allocation for every float. If integers dominate, pointer tagging for small ints is very cache-friendly.
Bytecode: Register-Based vs. Stack-Based
Once you know how values are represented, you need an instruction set.
Stack-based VMs (the JVM, CPython through 3.10, the original Lua) push operands onto a value stack and pop them to produce results. They tend to produce more instructions per operation but each instruction is smaller and simpler to decode.
Register-based VMs (Lua 5+, Dalvik before ART, the Erlang BEAM) embed source and destination register indices in each instruction. They produce fewer instructions at the cost of wider encodings. A 2005 study by Shi et al. showed register-based VMs executing 47% fewer instructions than equivalent stack-based VMs, though the effect on actual throughput depends heavily on the workload.
Lua’s switch from stack-based in version 4 to register-based in version 5 is the canonical case study. Lua 5 instructions are 32 bits wide: 6 bits for opcode, 8 bits for register A, then either 18 bits for a signed integer (for jump instructions) or two 9-bit fields B and C for source operands. This encodes most common operations in a single word with no indirection.
-- Lua 5.4 instruction format (32 bits)
-- ADD A B C: R[A] = R[B] + R[C]
-- LOADI A sBx: R[A] = sBx (signed integer literal)
-- JMP sJ: PC += sJ + 1
The wider instruction format pays off because you eliminate a lot of push/pop traffic. The main cost is that your bytecode emitter needs to do register allocation, which is harder than simply generating stack instructions.
Dispatch: The Hot Loop
The interpreter’s main loop is where most time is spent. You have a decoded instruction and need to jump to the handler for that opcode. How you do this matters more than most people expect.
A switch statement is the portable baseline:
while (1) {
uint32_t instr = *pc++;
switch (OPCODE(instr)) {
case OP_ADD: /* ... */ break;
case OP_LOAD: /* ... */ break;
}
}
The problem is that the compiler typically compiles this into a single indirect jump through a table, then loops back to the top. Every instruction executes two jumps: one to the handler, one back to the switch. The CPU’s branch predictor has to predict the same indirect jump every time, which it cannot do well because the target changes with every instruction.
Computed goto (a GCC/Clang extension) fixes this by placing a separate indirect jump at the end of each handler:
static const void* dispatch_table[] = {
&&op_add, &&op_load, /* ... */
};
#define DISPATCH() goto *dispatch_table[OPCODE(*pc++)]
DISPATCH();
op_add:
/* handle ADD */
DISPATCH();
op_load:
/* handle LOAD */
DISPATCH();
Now each jump is at a unique program counter location, so the branch predictor can learn to predict “after ADD usually comes LOAD” based on the actual trace of the program. CPython added computed goto dispatch in Python 3.2 and measured a 15-20% speedup on the benchmarks available at the time. Lua’s VM has used this technique for years.
The even faster variant is direct threading, where the bytecode itself contains function pointers rather than opcode integers. You lose compactness but gain a fully predictable dispatch at every instruction. This is the technique used by Forth systems and some Smalltalk implementations.
One important caveat: computed goto is not portable C. MSVC does not support it, and WebAssembly targets often cannot use it either. If portability matters, you are stuck with switch, and you have to hope the compiler does something smart.
Inline Caching
Dynamic dispatch, property lookup, and method resolution are the real performance killers in dynamic languages. In JavaScript, obj.foo might mean a property on the object, a property on the prototype chain, a getter, or a missing property that returns undefined. Doing a full lookup on every access is catastrophic for throughput.
Inline caching (IC) is the standard fix. At each call site in the bytecode, you cache the result of the last lookup. When the same shape (or “hidden class”) of object appears again, you skip the lookup entirely. This is the technique that made Self and early V8 fast, and it is documented in detail in the original Self paper by Chambers and Ungar (1989).
A monomorphic inline cache handles one shape. A polymorphic cache handles a small fixed number. Megamorphic call sites get no caching at all. The distribution of shapes at any given call site tends to be heavily skewed in real programs, so even a monomorphic cache helps enormously.
CPython 3.11 introduced adaptive specializing interpretation, which rewrites bytecode instructions in-place based on observed types. LOAD_ATTR becomes LOAD_ATTR_MODULE or LOAD_ATTR_INSTANCE_VALUE depending on what it sees. This delivered a reported 10-60% speedup across the CPython benchmark suite without a full JIT compiler.
String Interning
Dynamic languages spend a surprising amount of time on string comparisons, because property names are strings. If you can guarantee that equal strings are the same pointer (interning), then property name comparison becomes a pointer comparison, which is a single machine instruction.
Most production interpreters intern all strings that look like identifiers. Lua interns all strings by default using a global hash table. V8 uses a more selective approach, interning strings used as property keys but not necessarily all strings.
The tradeoff is that interning has a cost: every string creation must check the intern table, and the table itself takes memory. For short strings used repeatedly as identifiers, this is overwhelmingly worth it. For long strings or strings that are rarely compared by identity, less so.
How the Pieces Interact
These techniques are individually well-understood. The challenge is that they interact in non-obvious ways. NaN boxing constrains your garbage collector design, because the collector needs to distinguish heap pointers from NaN-encoded immediates without a separate type tag. Register-based bytecode makes your parser and code generator more complex. Computed goto is not portable C. Inline caches require a notion of object shape, which has to be designed in from the beginning rather than bolted on later.
What makes reading implementation notes like Zef’s valuable is seeing how someone weighed these tradeoffs for a specific language design and target environment. There is no universally correct combination. Lua’s choices are right for Lua. V8’s choices are right for V8. A language with different semantics or different performance priorities will land differently.
If you want to go deeper, Crafting Interpreters by Robert Nystrom covers a complete bytecode VM from scratch in C, including a mark-sweep GC. The CPython Developer Guide has extensive documentation on the adaptive interpreter added in 3.11. And Mike Pall’s mailing list posts on LuaJIT internals remain one of the best compressed explanations of what makes a tight interpreter loop actually tight.