Dispatch Loops, Value Representation, and Inline Caches: Engineering a Fast Dynamic Language
Source: lobsters
A recent article from the Zef language project walks through the implementation choices behind a performant dynamic language interpreter. It covers the same terrain that CPython, LuaJIT, and V8 have navigated over the years, and it is worth expanding on the three decisions that matter most: dispatch, value representation, and inline caches. Each one has a well-understood fastest option on modern hardware, and each constrains the others in ways that are not obvious until you are deep in the implementation.
The Dispatch Loop
Every interpreter has a core loop that reads an opcode, figures out what code to execute, and runs it. The textbook version is a switch statement:
while (1) {
uint8_t op = *ip++;
switch (op) {
case OP_ADD: /* ... */ break;
case OP_LOAD: /* ... */ break;
}
}
Compilers generate a jump table for this, which is reasonable. The problem is that after each handler runs, control returns to the top of the loop, then jumps again to the handler for the next opcode. That is two indirect jumps per instruction: one at the switch, one at the implicit branch back to while (1). Modern branch predictors handle direct branches well but struggle with indirect ones, especially when the opcode sequence is not predictable.
Direct threading, using computed goto (a GCC and Clang extension), eliminates the second jump by having each handler jump directly to the next:
static void* dispatch_table[] = { &&op_add, &&op_load, /* ... */ };
#define DISPATCH() goto *dispatch_table[*ip++]
DISPATCH();
op_add:
/* ... */
DISPATCH();
op_load:
/* ... */
DISPATCH();
Each DISPATCH() jumps directly to the next instruction’s handler, so the CPU only has to predict one indirect jump per instruction rather than two. CPython 3.11 adopted this approach alongside an adaptive specializing interpreter where frequently executed instructions are overwritten in place with type-specialized versions. The result was a 25% speedup on the pyperformance benchmark suite without touching a JIT tier.
Lua has used computed goto in its VM for years. The lvm.c source for Lua 5.4 falls back to a switch when computed goto is unavailable, but the goto path is always preferred when the compiler supports it. LuaJIT goes further: its interpreter is written in hand-tuned assembly via DynASM, and the interpreter alone is fast enough to beat many JIT-compiled implementations on common workloads. The dispatch loop is where LuaJIT’s performance story begins, not where it ends.
Value Representation
A dynamic language needs to track the type of every value at runtime. The question is where that type information lives and how cheaply the interpreter can check it.
The straightforward approach is a tagged struct: a type tag alongside a union of possible values. This is easy to implement and works on every platform, but it means values rarely fit in a single register, and every operation that moves values costs more memory traffic. CPython uses an even heavier version of this: every value is a heap-allocated PyObject with a type pointer at its head, reference count fields, and a union for the actual data. Every integer addition involves reference counting, pointer chasing, and type dispatch through a C struct of function pointers.
NaN boxing is the alternative used by JavaScriptCore, Wren, and historically by SpiderMonkey. It exploits a property of IEEE 754 doubles: a 64-bit float is a NaN whenever its 11 exponent bits are all set and the mantissa is nonzero. That leaves 52 mantissa bits plus the sign bit free when the value is a NaN. A valid floating-point number is stored as-is. A pointer, integer, boolean, or null is stored as a NaN with a type tag in the high mantissa bits and the payload in the low bits:
63 52 51 48 47 0
[ exp: 1s ][ tag ][ payload ]
type pointer or small integer value
Wren’s value.h is a compact, readable implementation of this, clocking in under 200 lines of C. Type checks become a bitmask and a comparison, all in registers, with no heap dereferences. On a 64-bit platform with plenty of pointer alignment guarantees, you can pack a surprising amount of the interpreter’s hot data into this representation.
Pointer tagging is simpler and works well when you do not need 64-bit floats as first-class values. On any platform where heap allocations are 8-byte aligned, the bottom three bits of a pointer are always zero. Ruby’s MRI runtime tags small integers by setting the low bit, distinguishing fixnums from object pointers with a single AND instruction. Arithmetic on small integers costs nothing in allocation, which matters because integer arithmetic is common in real programs.
The choice between these schemes is not just about integer performance. NaN boxing gives you doubles for free but makes true 64-bit integer representation awkward, since you only have 50 or so bits for the payload. Pointer tagging is simple but forces doubles to the heap. Tagged structs avoid both problems but cost register width and memory bandwidth per operation. You pick the scheme that fits your language’s numeric semantics first, then optimize around the constraints it creates.
Inline Caches
In a dynamic language, obj.method() is not a fixed function call. The interpreter has to determine the type of obj, look up what method means for that type, and then call it. A hash table lookup on every execution is the simple implementation; it is also roughly an order of magnitude slower than what inline caches can achieve.
An inline cache embeds a small slot in the bytecode stream at each call site. The first time a property access executes, the interpreter resolves the lookup normally and records the result: the object’s shape (or hidden class, or map, depending on the VM’s terminology) and the offset of the property within objects of that shape. On subsequent executions, it checks whether the current object’s shape matches the cached shape. If it does, the property is at the cached offset and the hash table is never consulted:
LOAD_ATTR [cache: shape=0xA3F1, offset=12]
-> fast path: obj->shape == 0xA3F1, return obj->slots[12]
-> slow path: hash lookup, update cache
V8’s object model is built around hidden classes and shape transitions. When you add a property to a JavaScript object, V8 creates a new hidden class that extends the previous one, maintaining a transition chain. Objects with the same properties in the same insertion order share a hidden class. The V8 team has written extensively about this, and it remains the foundation of V8’s property access performance after more than a decade of iteration.
Inline caches degrade gracefully under polymorphism. A monomorphic cache holds one shape and is the fastest case. On a second shape, it becomes polymorphic and checks a small linear list. After a platform-specific threshold, usually four to eight distinct shapes, it goes megamorphic and consults a global stub cache or falls back to a full hash lookup. The assumption is that most call sites are monomorphic or polymorphic in practice, which holds for most real programs outside of highly generic library code.
PyPy uses a similar system under the name maps, and CPython 3.11 introduced inline caches for attribute access, binary operations, and function calls as part of the specializing interpreter. The implementation details are in the CPython internals documentation. One thing that documentation makes clear is that inline caches and the dispatch loop are coupled: CPython’s specializing interpreter replaces opcodes in place with specialized versions, which means the dispatch table and the cache are the same structure.
One subtlety worth noting: inline caches interact with value representation. NaN-boxed values make type dispatch into a bitmask check, which is faster than dereferencing a type pointer. But for inline caches to work well, your object model also needs stable shape identifiers, which imposes constraints on how objects can be mutated. Languages that allow completely arbitrary property addition at any point have to track shape transitions carefully to avoid cache invalidation cascades on every write.
How the Pieces Fit
These three areas are usually presented independently, but the coupling between them shapes the whole interpreter design. Value representation determines what type checks look like, which affects how inline caches are written. Dispatch determines the per-instruction overhead budget, which determines how many cache check operations are affordable before the overhead dominates the actual work.
The Zef implementation article covers this terrain from the perspective of building a new language, where you have the freedom to make these choices without compatibility constraints. That freedom is instructive. When you are not forced to maintain Python’s object model or JavaScript’s numeric semantics, the fastest combination emerges quickly: NaN boxing or pointer tagging for values, computed goto for dispatch, and monomorphic inline caches for property access. The hardware constraints have not changed, so the answers have not changed much either.
What varies between implementations is the boundary conditions: how much object mutability you allow, whether you need full 64-bit integers, whether you are targeting a JIT tier above the interpreter or treating the interpreter as the final execution tier. CPython 3.13 targets free threading, which forces atomic reference counts and complicates every assumption about object mutation. LuaJIT targets a specific subset of Lua semantics tight enough to let the interpreter itself be competitive with JITs. V8 accepts the full complexity of JavaScript and compensates with aggressive speculative optimization above the interpreter.
Those constraints are what make interpreter engineering a genuinely interesting design space rather than a solved problem. The techniques are known; the difficulty is in applying them under the specific constraints of your language’s semantics, your target platforms, and the engineering resources you have available.