
Tracing Through C: The Design Space Behind Retrofitted JIT Compilers

Source: lobsters

Most discussions of JIT compilation focus on code generation: how to turn an intermediate representation into fast machine code, how to allocate registers, how to schedule instructions for the target microarchitecture. Code generation is largely a solved problem; LLVM, Cranelift, and similar backends handle it well enough that rolling your own is rarely justified. The difficult parts of JIT compilation lie elsewhere, and Laurie Tratt’s recent work on retrofitting JIT compilers into C interpreters is a useful lens for understanding what those parts are.

The context is Tratt’s ongoing Yk JIT project, which takes an unusual approach: rather than asking interpreter developers to rewrite their code in a JIT-friendly language or build a parallel compiler infrastructure, Yk traces through the C interpreter’s own execution at the LLVM IR level. The interpreter developer annotates the dispatch loop with a small set of macros and declares which variables constitute the interpreter state; Yk handles the rest. To understand why that is interesting, it helps to look at what the existing approaches to this problem actually require.

What Makes a C Interpreter Hard to JIT

An interpreter written in C is just a C program. It has a main dispatch loop, typically implemented with a switch statement or computed goto over bytecode opcodes. It maintains execution state in whatever data structures the original developers found convenient, manages memory through its own allocator and possibly its own garbage collector, and handles exceptions with setjmp/longjmp or equivalent. None of this was designed with JIT compilation in mind.
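
As a point of reference, the shape of such a dispatch loop can be sketched in a few lines. This is a toy stack machine written in Python for brevity, not C; the opcode names, encoding, and the `run` function are all invented for illustration and do not correspond to any real VM.

```python
# Toy stack-based bytecode interpreter. Opcodes and encoding are hypothetical.
PUSH, ADD, JUMP_IF_LT, HALT, LOAD, STORE = range(6)

def run(code, nlocals=2):
    stack, local_vars, pc = [], [0] * nlocals, 0
    while True:                          # the dispatch loop
        op = code[pc]
        if op == PUSH:
            stack.append(code[pc + 1])
            pc += 2
        elif op == LOAD:
            stack.append(local_vars[code[pc + 1]])
            pc += 2
        elif op == STORE:
            local_vars[code[pc + 1]] = stack.pop()
            pc += 2
        elif op == ADD:
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
            pc += 1
        elif op == JUMP_IF_LT:           # pops b then a; jumps if a < b
            b, a = stack.pop(), stack.pop()
            pc = code[pc + 1] if a < b else pc + 2
        elif op == HALT:
            return local_vars

# Guest program: local 0 is i, local 1 is total; loop while i < 5.
PROGRAM = [
    LOAD, 1, LOAD, 0, ADD, STORE, 1,     # total += i
    LOAD, 0, PUSH, 1, ADD, STORE, 0,     # i += 1
    LOAD, 0, PUSH, 5, JUMP_IF_LT, 0,     # if i < 5: jump back to pc 0
    HALT,
]
```

Here `run(PROGRAM)` returns `[5, 10]`. The guest loop's back-edge is the `JUMP_IF_LT` at the end, but from the host's point of view there is only one loop, the `while True`, which is exactly why the guest-level structure is invisible without extra help.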

A JIT compiler needs three things from the interpreter that an ordinary C interpreter does not provide by design.

The first is hot-path identification. A tracing JIT needs to detect loop back-edges in the guest program and count how frequently they execute. In a bytecode interpreter, these back-edges exist at the bytecode level, but the JIT observes them through the C dispatch loop. Identifying where one guest-language loop iteration ends and the next begins requires either explicit annotation or binary analysis, and those two approaches have very different engineering costs.
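
The explicit-annotation route can be sketched as a counter that the interpreter calls on every taken guest-level jump. The class name, the threshold, and the "backward jump means loop" heuristic below are assumptions made for illustration, not any particular JIT's policy.

```python
from collections import Counter

HOT_THRESHOLD = 3  # arbitrary value for the example; real JITs tune this

class HotLoopDetector:
    """Counts guest-level back-edges, keyed by the loop head's bytecode pc."""

    def __init__(self, threshold=HOT_THRESHOLD):
        self.threshold = threshold
        self.counts = Counter()

    def on_jump(self, from_pc, to_pc):
        """Call on every taken jump; returns True once, when to_pc turns hot."""
        if to_pc > from_pc:              # forward jump: not a loop back-edge
            return False
        self.counts[to_pc] += 1
        return self.counts[to_pc] == self.threshold
```

With the default threshold, five consecutive calls to `on_jump(18, 0)` return `False, False, True, False, False`: the third execution of the back-edge is the point at which a tracing JIT would start recording.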

The second is semantic reproduction. Once a hot path is identified, the JIT generates code that behaves identically to the interpreter on that path but faster. For a trace JIT this means capturing a linear sequence of operations for one loop iteration, optimizing it, and emitting native code with guards that check the assumptions the optimizations relied on.

The third, and hardest, is deoptimization. Guards fail, type assumptions turn out to be wrong, and uncommon branches get taken. When any of this happens, the JIT must hand control back to the interpreter at a precise point in the guest program with a fully valid interpreter state. If the JIT has reordered stores, eliminated allocations, or kept values in registers rather than on the interpreter stack, reconstructing that state requires careful bookkeeping threaded throughout the entire compilation process.
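
The second and third requirements can be sketched together. Below, `run_trace` stands in for the native code of one hot guest loop (`total += i; i += 1` while `i < limit`), holding values in local variables that play the role of registers. The guard condition, the `guard_map` layout, and the function names are hypothetical; the point is only that each guard carries enough metadata to rebuild the interpreter's view of the world.

```python
def reconstruct_state(deopt_map, registers):
    """Rebuild interpreter state from the values the trace held in 'registers'."""
    return {
        "pc": deopt_map["pc"],
        "locals": [registers[name] for name in deopt_map["locals"]],
        "stack": [],
    }

def run_trace(i, total, limit):
    # Per-guard metadata recorded at compile time (hypothetical layout):
    # which bytecode pc to resume at, and which register feeds each local slot.
    guard_map = {"pc": 0, "locals": ["r_i", "r_total"]}
    r_i, r_total = i, total              # live in "registers", not the frame
    while r_i < limit:
        # Guard: the trace was specialised on total staying small. The bound
        # of 100 is arbitrary and stands in for, say, a small-integer check.
        if r_total > 100:
            regs = {"r_i": r_i, "r_total": r_total}
            return ("deopt", reconstruct_state(guard_map, regs))
        r_total += r_i                   # specialised add, no bytecode dispatch
        r_i += 1
    # Normal loop exit: hand back state positioned after the loop.
    return ("done", {"pc": 20, "locals": [r_i, r_total], "stack": []})
```

Calling `run_trace(0, 0, 5)` completes with locals `[5, 10]`, while `run_trace(0, 95, 50)` trips the guard and returns a `"deopt"` result whose state has `pc` 0 and locals `[4, 101]`, which is what the interpreter would need in order to carry on from the loop head. The interpreter frame is never written during the fast path, so the mapping is the only thing that makes the fallback correct.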

All three of these requirements push against what a production C interpreter was built to be.

Copy-and-Patch: Deliberately Avoiding the Hard Parts

CPython 3.13 introduced an experimental JIT based on the copy-and-patch technique from Xu and Kjolstad’s 2021 OOPSLA paper. The design is notable for what it avoids. At CPython build time, LLVM compiles each bytecode instruction into a machine code stencil: a template with holes for operands, addresses, and constants. At runtime, the JIT assembles traces by copying these stencils in sequence and patching the holes with concrete values from the specific execution context. There is no LLVM at runtime, no register allocation, and no cross-instruction optimization.
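
The copy-then-patch step itself is small enough to sketch. The stencil bytes below are made up and are not machine code for any architecture, and locating holes by searching for a marker is a simplification chosen for the example; a real implementation would record hole offsets when the stencils are built.

```python
import struct

HOLE = b"\xde\xad\xbe\xef"   # placeholder marking a 4-byte operand hole

# "Build time": one pre-made template per operation (bytes are invented).
STENCILS = {
    "PUSH_CONST": b"\x01" + HOLE,
    "ADD":        b"\x02",
    "STORE":      b"\x03" + HOLE,
}

def emit(trace):
    """'Run time': copy each stencil and patch its hole with the operand."""
    out = bytearray()
    for op, operand in trace:
        stencil = STENCILS[op]
        if HOLE in stencil:
            # Patch the hole with a little-endian 32-bit operand.
            stencil = stencil.replace(HOLE, struct.pack("<I", operand))
        out += stencil
    return bytes(out)

code = emit([("PUSH_CONST", 7), ("PUSH_CONST", 35), ("ADD", None), ("STORE", 1)])
```

Nothing in `emit` looks at more than one operation at a time, which is both why the approach is cheap and why it cannot optimize across instruction boundaries.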

Deoptimization is straightforward under this model: each piece of generated code corresponds directly to a single bytecode instruction, and the interpreter state is always current. The JIT never reorders memory operations across instruction boundaries, so alias analysis is not a concern. The cost is optimization scope; early benchmarks on the pyperformance suite showed roughly 2 to 5 percent improvement overall, though tight numeric loops showed more. On a pure Python Newton’s method loop, for instance, the speedup is noticeably larger because the JIT eliminates repeated bytecode dispatch overhead on a path where the types are stable.

Work on CPython 3.15 is extending the stencil system with register allocation and better type inference, which will allow some cross-instruction optimization while keeping the basic architecture intact. The bet here is that a maintainable, safe system that can be improved incrementally is worth more than a more aggressive JIT that requires deep expertise to modify. Given that CPython has hundreds of contributors and ships to hundreds of millions of users, that bet is defensible.

LuaJIT: The Limit of What Retrofit Can Mean

LuaJIT is the most frequently cited success story for JIT-compiling a dynamic language interpreter. On numeric workloads it runs 10 to 50 times faster than the stock Lua 5.x interpreter, and on some benchmarks it approaches C. Mike Pall built a tracing JIT with a custom SSA IR, a full set of optimization passes including loop-invariant code motion and allocation sinking, and a register-allocating native code backend for x86/x64, ARM, and other targets. Lua 5.1 semantics are preserved, but the interpreter is not the stock one: LuaJIT 2 ships its own interpreter, hand-written in assembly with its own bytecode format. Hot loop back-edges redirect to compiled code, and guard failures return to interpreter execution.

Calling this a retrofit is accurate only in a limited sense: Pall kept the language, its semantics, and the Lua C API, but not the interpreter implementation. The level of control required was substantial. The interpreter’s stack layout, object representation, and garbage collector interface were all made compatible with what the JIT needed because Pall was making that determination, not inheriting it from a fixed existing codebase. The interpreter and JIT share a unified frame format designed to support both execution paths. That is a different problem from adding a JIT to CPython or Ruby’s YARV, where the frame format is fixed by years of production use, a large contributor community, and a stable C extension API that cannot change without breaking the ecosystem.

LuaJIT’s performance is remarkable, but its architecture is also the reason it has remained frozen at Lua 5.1 semantics while the rest of the Lua ecosystem moved forward. The JIT and interpreter are so deeply intertwined that keeping them in sync with a moving language target is a significant ongoing burden. That is another real cost of the approach.

Yk: Tracing Through C

The Yk JIT project takes the most ambitious approach: trace through the C interpreter’s actual execution at the LLVM IR level, optimize the captured trace with LLVM’s optimization pipeline, and emit native code with guards and deoptimization support. The interpreter developer marks loop heads with macros and declares which C variables constitute the interpreter state that must survive deoptimization. The tracing and compilation machinery is generic across interpreters; the same framework can be applied to any annotated C interpreter.

The aliasing problem here is genuine. A C interpreter makes extensive use of pointer arithmetic, pointer casts, and manual memory management. LLVM’s alias analysis works best when it has type annotations and restrict qualifiers to reason from; a trace through compiled C interpreter code provides less of that information. Where alias analysis is uncertain, the optimizer cannot safely hoist loads out of loops, eliminate redundant stores, or reorder operations that might alias in memory. This is one reason the state declarations matter: by explicitly identifying which memory locations constitute interpreter state, the developer gives the system enough information to treat those locations conservatively while being more aggressive elsewhere.
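
A small example shows why uncertainty about aliasing blocks load hoisting. The two functions below are hypothetical and written in Python with lists standing in for C buffers; the second performs the "load once, outside the loop" transformation that an optimizer would like to apply.

```python
def scale_into(values, out, scale):
    """Reference semantics: re-read scale[0] on every iteration."""
    for i, v in enumerate(values):
        out[i] = v * scale[0]
    return out

def scale_into_hoisted(values, out, scale):
    """Hoisted load of scale[0]; only valid if out and scale never alias."""
    s = scale[0]
    for i, v in enumerate(values):
        out[i] = v * s
    return out

# Distinct buffers: both versions agree.
separate_ref = scale_into([3, 2, 1], [0, 0, 0], [2])
separate_opt = scale_into_hoisted([3, 2, 1], [0, 0, 0], [2])

# Same buffer passed as both out and scale: the first store changes scale[0].
buf_ref = [2, 0, 0]
aliased_ref = scale_into([3, 2, 1], buf_ref, buf_ref)
buf_opt = [2, 0, 0]
aliased_opt = scale_into_hoisted([3, 2, 1], buf_opt, buf_opt)
```

With separate buffers both calls produce `[6, 4, 2]`. With the aliased buffer the reference version produces `[6, 12, 6]` while the hoisted version still produces `[6, 4, 2]`, so an optimizer that cannot rule out the aliased case has to keep the load inside the loop.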

Deoptimization in Yk uses shadow stacks, a parallel representation of the interpreter frame maintained alongside the JIT-compiled code and consulted when a guard fires. When execution falls back to the C interpreter, the shadow stack provides the state reconstruction data needed to resume at the correct bytecode position with a valid interpreter state. This is structurally similar to how V8’s TurboFan handles deoptimization through its frame state infrastructure, but operating at the level of C execution rather than a purpose-built bytecode IR designed for this use.

The PyPy Comparison

PyPy’s RPython meta-tracing approach offers a useful contrast. The interpreter is rewritten in RPython, a restricted subset of Python; the RPython toolchain then generates a tracing JIT that traces the interpreter itself as it executes a hot guest-program loop, with type specialization, escape analysis, and allocation removal applied to the resulting trace. PyPy regularly runs 5 to 10 times faster than CPython on numeric workloads, and the JIT quality is high because it was generated from a representation the toolchain fully understands.

The cost is abandoning the original C codebase; for CPython, that codebase represents decades of optimization, extension module compatibility, and contributor knowledge. Like Yk, the RPython approach traces the interpreter rather than the guest program directly, but it does so over RPython-level operations that the toolchain fully understands, not over the compiled C implementation, which means the optimizer reasons about typed, high-level operations rather than pointer-heavy C. That is a significant advantage for optimization quality, and it is the approach Bolz et al. described in the foundational PyPy tracing JIT paper at ICOOOLPS 2009. For most production language implementations, though, rewriting the interpreter from scratch to gain that advantage is not a practical path.

What the Design Space Reveals

The choices made by these four projects map onto a clean tradeoff surface. Copy-and-patch maximizes engineering safety and maintainability at the cost of optimization depth. LuaJIT maximizes optimization depth at the cost of requiring deep interpreter control and accepting a frozen language target. RPython maximizes semantic-level JIT quality at the cost of requiring a full interpreter rewrite. Yk attempts to maximize retrofittability while maintaining access to LLVM-quality optimization, and pays for that in the complexity of alias analysis and deoptimization at the C level.

The difficulty of retrofitting a JIT is not primarily about compiler technology. It is about the mismatch between what a production C interpreter was designed to be (a correct, efficient, maintainable program) and what a JIT compiler needs it to be (an observable, interruptible, deoptimizable state machine). Closing that gap without rewriting the interpreter from scratch is the actual research problem. The fact that it remains an active research problem in 2026 reflects just how deep that mismatch goes.

For interpreter developers who want JIT performance without committing to a full rewrite, the practical landscape is narrowing toward two options: copy-and-patch for a safe, incremental improvement with a bounded ceiling, or something like Yk if the optimization ceiling matters enough to justify the annotation overhead and the additional runtime complexity. Neither is free, and the design of the original interpreter will constrain both paths more than most developers expect before they try.
