
Adding a JIT to an Interpreter That Was Never Designed for One

Source: lobsters

The programming language runtimes that matter most in production are, by and large, decades-old C codebases. CPython has been C since 1991. Ruby’s reference interpreter, MRI, is C. Lua 5.x is C. PHP’s Zend engine is C. These implementations work, they have enormous ecosystems, and rewriting them from scratch to support JIT compilation is, practically speaking, not on the table. Adding a JIT to an existing C interpreter, without starting over, is the problem that most language performance work eventually runs into.

Laurie Tratt’s recent post covers exactly this territory, drawing on his work on Yk, a meta-tracing JIT framework built at King’s College London. The article is worth reading in full, but it prompted me to think through the full landscape of approaches and why the newest ones look meaningfully different from what came before.

The spectrum of approaches

When people talk about adding JIT compilation to an existing interpreter, they’re usually talking about one of four distinct strategies, each with wildly different cost and benefit profiles.

Copy-and-patch is the newest and most conservative. CPython 3.13 ships it as an experimental, off-by-default build option. The technique was introduced by Haoran Xu and Fredrik Kjolstad in 2021; Brandt Bucher adapted it for CPython, and PEP 744 documents that work. The idea is to use the C compiler itself as the JIT compiler at build time. Each bytecode operation is compiled by Clang to a native code template with “holes” where runtime values go. At JIT time, you copy the template and patch the holes:

Build time:
  each bytecode handler → clang → native template with relocation metadata

Runtime (per hot instruction):
  memcpy(exec_buffer, template, template_size)
  patch holes with runtime constants and addresses
  flush instruction cache

No IR, no register allocation, no optimization passes at runtime. The design is simple by intent: the C compiler already handles all of that when building the interpreter itself. The measured gains in CPython 3.13 are modest: PEP 744 describes the JIT as roughly on par with the existing specializing interpreter, with individual benchmarks moving by a few percent. That sounds unimpressive, but the infrastructure is now in place to layer type specialization on top and push those numbers higher in future releases.
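The copy and patch steps are small enough to sketch in portable C. This is a minimal illustration, not CPython’s implementation: the `Template` struct, the `emit_template` function, and the fixed 8-byte hole size are all assumptions made for the example, and the template bytes are treated as opaque data rather than executed.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical template descriptor: the bytes the build-time compiler
 * produced, plus the offsets of the 8-byte holes to fill at JIT time. */
typedef struct {
    const uint8_t *code;    /* template machine code (opaque bytes here) */
    size_t         size;    /* template length in bytes */
    const size_t  *holes;   /* byte offsets of 8-byte holes in the template */
    size_t         n_holes;
} Template;

/* Copy one template into dst and patch each hole with values[i].
 * Returns the number of bytes written, or 0 if the template does not
 * fit in cap bytes or a hole would run past the end of the template. */
size_t emit_template(uint8_t *dst, size_t cap,
                     const Template *t, const uint64_t *values)
{
    if (t->size > cap)
        return 0;
    for (size_t i = 0; i < t->n_holes; i++) {
        if (t->holes[i] > t->size ||
            t->size - t->holes[i] < sizeof(uint64_t))
            return 0;
    }
    memcpy(dst, t->code, t->size);                 /* copy */
    for (size_t i = 0; i < t->n_holes; i++)        /* patch */
        memcpy(dst + t->holes[i], &values[i], sizeof(uint64_t));
    return t->size;
}
```

A real JIT would additionally place `dst` in executable memory and flush the instruction cache before jumping to it; both steps are platform-specific and left out here.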

Full reimplementation is the other end of the spectrum. LuaJIT didn’t retrofit anything into Lua 5.x; Mike Pall wrote an entirely new Lua implementation with a hand-optimized assembly interpreter and a custom trace compiler. LuaJIT 2.1 is often an order of magnitude or more faster than PUC-Rio Lua on numeric benchmarks, and the engineering cost is proportionally large. LuaJIT represents something close to a person-decade of expert work, needs a hand-written backend for every architecture it supports (x86, x64, ARM, ARM64, PPC, and MIPS), and still depends heavily on its original author. Reproducing that effort for CPython or Ruby MRI is not a realistic option.

Meta-tracing sits between those extremes, but requires rewriting the interpreter in a different language. PyPy’s approach, developed by Armin Rigo and collaborators starting around 2007, involves writing the Python interpreter in RPython, a restricted and statically analyzable subset of Python. The RPython toolchain then generates a tracing JIT automatically. The generated JIT observes the interpreter’s own execution loop, records a “trace” of the low-level operations performed for each Python bytecode, and compiles those traces to native code. The result is a JIT that sees through the interpreter’s dispatch mechanism and generates code as if the Python program had been compiled directly, as Bolz et al. described in the original meta-tracing paper.

PyPy achieves 5-20x speedups over CPython on loop-heavy numeric code. The cost is that the Python interpreter had to be rewritten in RPython, which represents years of engineering work and a permanently forked codebase. That forked maintenance burden is what Yk is specifically designed to eliminate.
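To make “a trace of the low-level operations” concrete, here is a toy interpreter in C that records one iteration of a guest loop. It is a hand-written stand-in for what the RPython toolchain derives automatically, and every name in it (`Insn`, `Trace`, `run`, the opcode and trace-operation enums) is invented for the example. The point to notice is what the trace leaves out: the fetch, the `switch`, and the program-counter updates all disappear, and only the useful work plus a guard remain.

```c
#include <stddef.h>

enum { OP_ADD, OP_DEC, OP_JNZ, OP_HALT };            /* guest bytecodes */
enum { T_ADD_ACC, T_DEC_CTR, T_GUARD_CTR_NONZERO };  /* trace operations */

typedef struct { int op; int arg; } Insn;
typedef struct { int ops[32]; size_t len; int recording; } Trace;

/* Append one operation to the trace while recording is switched on. */
static void record(Trace *t, int op)
{
    if (t->recording && t->len < 32)
        t->ops[t->len++] = op;
}

/* Interpret prog with loop counter ctr (assumed >= 1) and return the
 * accumulator. Recording starts at the first taken back-edge and stops
 * the next time that back-edge is taken, i.e. after one full iteration. */
long run(const Insn *prog, long ctr, Trace *t)
{
    long acc = 0;
    size_t pc = 0;
    for (;;) {
        Insn i = prog[pc];
        switch (i.op) {
        case OP_ADD: acc += i.arg; record(t, T_ADD_ACC); pc++; break;
        case OP_DEC: ctr--;        record(t, T_DEC_CTR); pc++; break;
        case OP_JNZ:
            if (ctr != 0) {
                /* The branch direction becomes a guard in the trace. */
                record(t, T_GUARD_CTR_NONZERO);
                if (t->recording)
                    t->recording = 0;      /* loop closed: stop */
                else if (t->len == 0)
                    t->recording = 1;      /* first back-edge: start */
                pc = (size_t)i.arg;
            } else {
                pc++;
            }
            break;
        default:
            return acc;                    /* OP_HALT */
        }
    }
}
```

Running the program `ADD 3; DEC; JNZ 0; HALT` with a counter of 3 leaves three entries in the trace: add to the accumulator, decrement the counter, guard that the counter is non-zero.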

What Yk does differently

Yk’s core idea is meta-tracing for C interpreters, without requiring a rewrite. The key enabler is hardware tracing via Intel Processor Trace (Intel PT) and, on ARM systems, CoreSight. These hardware features record the exact control flow taken through a running program at low overhead, typically under 5% runtime cost. Yk uses this to reconstruct what the interpreter executed, without requiring the interpreter to call a tracing function at each step.

The minimal changes required to an existing C interpreter look roughly like this (the macro names are schematic):

// Mark the back-edge of the interpreter's main dispatch loop:
YK_MT_TRACE_LOOPBACK(frame, pc);

// Mark the start of each bytecode handler:
YK_MT_NEW_BASIC_BLOCK();

An LLVM compilation pass instruments the C code at build time; Yk’s runtime handles hardware trace recording and trace compilation at runtime. The interpreter author doesn’t need to understand tracing internals to benefit from the JIT.

Yk was retrofitted into Lua 5.4 (producing YkLua), with early benchmarks showing 2-8x speedups over plain Lua and parity with LuaJIT on some numeric benchmarks. Those numbers are preliminary, but they’re in a range that makes the approach credible as more than a research demonstration. The OOPSLA 2023 paper describes the underlying system in detail.

Why Intel PT changes the calculus

Previous attempts at meta-tracing systems for C interpreters hit a fundamental instrumentation overhead problem. To trace what an interpreter is doing, you need to observe every branch it takes. Inserting software instrumentation at each branch point adds a cost to every branch executed, which can easily overwhelm any JIT benefit during the warmup phase and often makes the instrumented interpreter slower than the baseline.

Intel PT moves the observation into hardware. The CPU writes compact branch records to a ring buffer in memory as it executes, with overhead low enough to leave on permanently in profiling contexts. Yk reconstructs full traces from these records by replaying them against the known control-flow graph of the compiled interpreter binary. This eliminates the overhead problem that previously made software-only C interpreter tracing impractical for production use.
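The replay step can be sketched as a walk over the control-flow graph, driven by one bit per conditional branch. This is a simplified model, not Yk’s decoder: the `Block` layout and the `replay` function are invented for the example, and real Intel PT streams carry more than taken/not-taken bits (indirect branch targets, for instance, arrive in separate packets).

```c
#include <stddef.h>

/* Hypothetical CFG node holding successor block indices. A block whose
 * two successors are equal ends in an unconditional edge and consumes
 * no bit from the recorded stream. A successor of -1 marks an exit. */
typedef struct { int taken; int fallthrough; } Block;

/* Replay n_bits taken/not-taken bits against cfg, starting at block
 * start, and write the visited block indices to out (at most cap).
 * Returns the number of blocks written. */
size_t replay(const Block *cfg, int start,
              const unsigned char *bits, size_t n_bits,
              int *out, size_t cap)
{
    size_t n = 0, b = 0;
    int cur = start;
    while (cur >= 0 && n < cap) {
        out[n++] = cur;
        const Block *blk = &cfg[cur];
        if (blk->taken == blk->fallthrough) {
            cur = blk->taken;              /* unconditional: no bit used */
        } else {
            if (b == n_bits)
                break;                     /* ran out of recorded branches */
            cur = bits[b++] ? blk->taken : blk->fallthrough;
        }
    }
    return n;
}
```

For a four-block graph (entry, loop test, loop body, exit) and the bits 1, 1, 0, the replay visits entry, test, body, test, body, test, exit. The hardware only had to record three bits to describe that whole path, which is where the low overhead comes from.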

Intel PT is an Intel-specific feature. ARM’s CoreSight trace hardware provides a comparable capability on systems that expose it, while AMD’s branch-record features capture only a short window of recent branches rather than a continuous trace. A JIT framework that depends on hardware tracing is not as portable as a purely software approach. For server-side deployments where the hardware is known and controlled, this is an acceptable constraint. For embedded or heterogeneous environments, it limits applicability.

The portability concern is also partly a timing concern. As Intel, AMD, and ARM hardware converges on broadly available control-flow tracing, the constraint shrinks. The question is whether a JIT framework designed around hardware tracing ends up with a narrower window than its alternatives, or whether the hardware simply catches up before it matters.

The unsolved problems

Even with hardware tracing, several hard problems remain. Guard failures are the central one. Every branch in a recorded trace becomes a “guard” in the compiled output. If runtime behavior violates the guard (a type changes, a value falls outside an expected range), the JIT must fall back to the interpreter at exactly the right program state. Getting that deoptimization correct, and doing it without corrupting interpreter state, is where subtle bugs accumulate in practice. The correctness requirements for deoptimization are significant even after the hardware tracing infrastructure is in place.
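A guard and its side exit can be sketched in a few lines of C. This is an illustration under stated assumptions, not how Yk emits code: `InterpState`, `run_trace`, and the range check on the accumulator are all invented, and `step` is assumed to be positive. What matters is the discipline at the exit: the compiled loop works on local copies, and when a guard fails it writes back the state as it was at the start of the failing iteration, along with the program counter the interpreter should resume from.

```c
#include <stddef.h>

/* Hypothetical interpreter state that a side exit must restore exactly. */
typedef struct { size_t pc; long acc; long ctr; } InterpState;

enum { TRACE_DONE = 0, TRACE_GUARD_FAILED = 1 };

/* Stand-in for a compiled trace of the loop "acc += step; ctr--" that
 * was recorded while acc stayed at or below limit. Assumes step > 0. */
int run_trace(InterpState *s, long step, long limit,
              size_t loop_pc, size_t exit_pc)
{
    long acc = s->acc, ctr = s->ctr;       /* locals stand in for registers */
    while (ctr != 0) {
        if (acc > limit - step) {          /* guard: acc + step <= limit */
            s->acc = acc;                  /* state before this iteration */
            s->ctr = ctr;
            s->pc  = loop_pc;              /* interpreter redoes the iteration */
            return TRACE_GUARD_FAILED;
        }
        acc += step;
        ctr--;
    }
    s->acc = acc;                          /* normal loop exit */
    s->ctr = ctr;
    s->pc  = exit_pc;
    return TRACE_DONE;
}
```

The bug class the paragraph above describes shows up when the write-back is wrong by even one step, for example storing `acc` after the addition but `ctr` before the decrement. The interpreter then resumes from a state that never existed in the original program.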

Warmup cost is the other persistent issue. Tracing-based JITs require a loop to run many times before it’s identified as hot, traced, and compiled. For long-running server processes this is a non-issue, but for CLI tools or short-lived scripts the JIT may never pay for itself. This is not a problem specific to Yk; it applies equally to PyPy and LuaJIT. Copy-and-patch sidesteps it somewhat by having very low compilation cost, but at the price of lower code quality.
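Whether a JIT “pays for itself” is simple arithmetic once you pick a cost model. The function below is a back-of-the-envelope sketch with invented names and a deliberately crude model: the loop runs interpreted for `threshold` iterations, compilation costs a fixed amount, and every later iteration runs at the compiled cost. All costs share one arbitrary time unit.

```c
/* Number of loop iterations after which JIT-compiling the loop breaks
 * even with staying in the interpreter, or -1 if the compiled code is
 * not faster per iteration.
 *
 * Interpreter only:  N * interp_cost
 * With JIT:          threshold * interp_cost + compile_cost
 *                    + (N - threshold) * jit_cost
 * Setting these equal gives
 *   N = threshold + compile_cost / (interp_cost - jit_cost). */
long break_even_iterations(long threshold, long compile_cost,
                           long interp_cost, long jit_cost)
{
    long saving = interp_cost - jit_cost;
    if (saving <= 0)
        return -1;
    /* Round the division up: a partial iteration still has to run. */
    return threshold + (compile_cost + saving - 1) / saving;
}
```

With a threshold of 1000 iterations, a compile cost of 50000 units, and per-iteration costs of 100 interpreted versus 20 compiled, break-even arrives at iteration 1625. A script whose hottest loop runs a few hundred times never gets there, which is the CLI-tool problem in numbers.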

The third problem is compilation tiering. Copy-and-patch compiles immediately at near-zero cost, accepting lower code quality. Meta-tracing produces better code but takes longer to compile. Production JITs like V8 solve this with multiple tiers: a fast baseline compiler running alongside a slower optimizing compiler, each handling different parts of the hot-code distribution. Yk currently operates as a single-tier system, and adding tiers would require substantially more implementation work. The gap between “JIT that works” and “JIT that has the right tier structure for production” is non-trivial.
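The control logic of a tiered system is the easy part, and a sketch shows how little of it there is. The names and thresholds below are made up for illustration and do not come from V8 or Yk; the expensive work hides behind each promotion, where a real system has to compile the loop and transfer live state between tiers.

```c
typedef enum { TIER_INTERP, TIER_BASELINE, TIER_OPTIMIZED } Tier;

typedef struct { unsigned long count; Tier tier; } LoopProfile;

/* Hypothetical promotion thresholds, chosen only for the example. */
enum { BASELINE_THRESHOLD = 16, OPTIMIZE_THRESHOLD = 1024 };

/* Called on every back-edge of a loop. Bumps the loop's counter,
 * promotes it one tier at a time when a threshold is crossed, and
 * returns the tier the next iteration should run in. */
Tier on_back_edge(LoopProfile *p)
{
    p->count++;
    if (p->tier == TIER_INTERP && p->count >= BASELINE_THRESHOLD)
        p->tier = TIER_BASELINE;           /* cheap compile, modest code */
    else if (p->tier == TIER_BASELINE && p->count >= OPTIMIZE_THRESHOLD)
        p->tier = TIER_OPTIMIZED;          /* costly compile, better code */
    return p->tier;
}
```

A low first threshold lets short-lived loops benefit from the cheap tier, while the high second threshold reserves expensive compilation for loops that have already proven they will keep running.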

Where this leaves interpreter maintainers

The practical picture for a language runtime maintainer is roughly this:

  • Copy-and-patch is achievable with moderate effort and gives modest near-term gains. CPython’s experience shows it’s a viable path to incremental improvement without large architectural changes. The ceiling is lower than other approaches, but so is the cost.
  • Yk is the most promising approach for substantial gains without a rewrite, particularly for interpreters running on Intel or ARM server hardware. YkLua’s benchmark results are genuinely encouraging as a proof of concept, and the minimal annotation requirement is a real advantage over PyPy’s approach.
  • A framework-based reimplementation, whether meta-tracing via RPython (as PyPy does) or partial evaluation via GraalVM’s Truffle framework, requires rewriting the interpreter from scratch and is only viable for languages willing to maintain a parallel implementation with a dedicated engineering team.

What Tratt’s article contributes to this picture is a practitioner’s account of what it actually takes to retrofit Yk into a real interpreter. The minimal annotation claim deserves scrutiny when applied to a large, complex codebase like MRI Ruby or PHP’s Zend engine, where interpreter state is spread across dozens of C structs and the dispatch loop has been modified over decades by hundreds of contributors. The next real test for this class of approach is not benchmark results on a demonstration VM, but whether the machinery holds up when the C code is messy, historical, and nobody fully understands all of it anymore. That gap between a proof-of-concept and a production-ready result is where most promising language implementation ideas eventually slow down.
