From Copy-and-Patch to LLVM IR: The JIT Retrofitting Spectrum

Many of the most widely deployed dynamic-language runtimes are written in C: CPython, CRuby, PHP’s Zend engine, the reference Lua interpreter. They were written as interpreters, either tree-walking or bytecode-dispatching, without consideration for eventual JIT compilation. They work, they are fast enough for most purposes, and they have huge ecosystems. Nobody wants to rewrite them.

As compute-heavy workloads in Python, Ruby, and PHP become more common, “fast enough” starts to feel insufficient. Adding a JIT to a C interpreter looks straightforward from the outside: code generation libraries exist, LLVM has JIT infrastructure, and hand-rolled assemblers like DynASM have been around for decades. Code generation is well-served by existing tools; the engineering complexity concentrates in managing the transition between interpreted and compiled execution, specifically the ability to bail out of compiled code and resume in the interpreter when a speculative assumption fails. This is the deoptimization problem, and it is the reason every JIT retrofit strategy ends up making significant compromises.

Laurie Tratt’s recent post on this topic is worth reading alongside the implementations that have shipped. His yk project at King’s College London is the most technically ambitious attempt to solve this problem generically, but understanding what copy-and-patch, YJIT, and PyPy each reveal about the design space first gives the yk work its full context.

Copy-and-Patch

CPython 3.13 shipped an experimental copy-and-patch JIT, disabled by default and enabled at build time, developed by Brandt Bucher and based on a technique from a 2021 OOPSLA paper by Haoran Xu and Fredrik Kjolstad. The mechanism: at build time, Clang compiles each bytecode handler into a machine code template with placeholder holes for operand values, jump addresses, and other runtime data. At JIT time, the runtime copies the relevant templates end-to-end into an executable buffer and patches the holes with actual values.
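The copy and patch steps can be sketched in a few lines of C. This is a minimal illustration, not CPython’s implementation: the Template struct, the emit and read_hole helpers, and the template bytes are all hypothetical, and the bytes are placeholders that are inspected but never executed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical template format: opaque code bytes plus the offset of one
 * 8-byte hole. Real templates hold machine code emitted by Clang; the bytes
 * below are placeholders and are never executed. */
typedef struct {
    const uint8_t *code;
    size_t len;
    size_t hole_offset; /* where the 64-bit operand is patched in */
} Template;

static const uint8_t LOAD_CONST_CODE[] = {0xA0, 0, 0, 0, 0, 0, 0, 0, 0, 0xA1};
static const uint8_t ADD_CONST_CODE[]  = {0xB0, 0xB1, 0, 0, 0, 0, 0, 0, 0, 0};

static const Template LOAD_CONST = {LOAD_CONST_CODE, sizeof LOAD_CONST_CODE, 1};
static const Template ADD_CONST  = {ADD_CONST_CODE, sizeof ADD_CONST_CODE, 2};

/* Copy one template to position `pos` in the buffer and patch its hole.
 * Returns the new write position, or 0 if the buffer is too small. */
static size_t emit(uint8_t *buf, size_t cap, size_t pos,
                   const Template *t, uint64_t operand)
{
    assert(t->hole_offset + sizeof operand <= t->len);
    if (pos + t->len > cap)
        return 0;
    memcpy(buf + pos, t->code, t->len);                           /* copy  */
    memcpy(buf + pos + t->hole_offset, &operand, sizeof operand); /* patch */
    return pos + t->len;
}

/* Read back a patched operand, for inspection only. */
static uint64_t read_hole(const uint8_t *buf, size_t template_start,
                          const Template *t)
{
    uint64_t v;
    memcpy(&v, buf + template_start + t->hole_offset, sizeof v);
    return v;
}
```

Emitting LOAD_CONST with operand 7 followed by ADD_CONST with operand 35 yields a 20-byte buffer whose two holes contain 7 and 35. A real JIT would additionally need executable memory and templates with relocation records for each hole, both of which this sketch leaves out.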

The approach sidesteps instruction selection and register allocation by delegating both to the build-time compiler. The JIT does a memcpy and a handful of pointer writes. There is no cross-instruction optimization; each template was compiled independently, and values flow through memory between templates rather than staying in registers. The result is essentially a dispatch-free interpreter: bytecode dispatch overhead disappears, but the optimization quality that makes a JIT worth writing does not materialize. CPython 3.13 shows roughly 5% improvement on the pyperformance suite, with tighter numeric loops doing somewhat better.

CPython’s developers position copy-and-patch as a foundation. The tier-2 optimizer handling type specialization and simple value propagation above it is where more meaningful improvements are targeted for 3.14 and 3.15. Whether that tier-2 layer can eventually close the gap with something like PyPy is an open question; for now, the approach delivers incremental wins at low engineering cost.

YJIT

YJIT, Shopify’s JIT for CRuby (merged in Ruby 3.1), uses basic block versioning. Rather than treating bytecodes individually, YJIT compiles one basic block at a time with type specialization based on the type context at block entry. Multiple versions of the same block can exist for different input types, and these blocks link lazily into a connected graph of native code.
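The core data structure behind basic block versioning is a lookup from (block, type context) to a compiled version. The sketch below is an assumption-laden illustration in C rather than YJIT’s actual code: the Ctx, Version, and get_version names are hypothetical, the type context tracks only two stack slots, and "compiling" is reduced to allocating a table entry.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { T_UNKNOWN, T_FIXNUM, T_FLOAT, T_STRING } Type;

/* Hypothetical type context: known types of two stack slots at block entry. */
typedef struct { Type slot[2]; } Ctx;

typedef struct {
    int block_id;
    Ctx ctx;
    int code_id; /* stand-in for a pointer to generated machine code */
} Version;

#define MAX_VERSIONS 32
static Version versions[MAX_VERSIONS];
static size_t n_versions = 0;

static int ctx_equal(const Ctx *a, const Ctx *b)
{
    return a->slot[0] == b->slot[0] && a->slot[1] == b->slot[1];
}

/* Return the version of `block_id` specialized for `ctx`, creating it on
 * first request. Returns NULL when the table is full; a real JIT would
 * fall back to a generic version or to the interpreter instead. */
static const Version *get_version(int block_id, Ctx ctx)
{
    assert(block_id >= 0);
    for (size_t i = 0; i < n_versions; i++)
        if (versions[i].block_id == block_id &&
            ctx_equal(&versions[i].ctx, &ctx))
            return &versions[i];
    if (n_versions == MAX_VERSIONS)
        return NULL;
    Version *v = &versions[n_versions];
    v->block_id = block_id;
    v->ctx = ctx;
    v->code_id = (int)n_versions; /* lazy "compilation": one new entry */
    n_versions++;
    return v;
}
```

Requesting block 7 with two fixnum slots and then with two float slots produces two distinct versions, while a second fixnum request reuses the first. The cap on versions matters in practice, since unbounded specialization would trade dispatch overhead for code bloat.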

The gains are meaningfully larger than copy-and-patch. Type guards at block boundaries enable constant folding, unboxing of numeric types, and inline caches for method dispatch. YJIT shows roughly 15-20% improvement on web workload benchmarks and 2-3x on compute-heavy code. The VMIL 2021 paper covers the technical design in detail.

What this required from CRuby was deep familiarity with its execution context and control frame structures, rb_execution_context_t and rb_control_frame_t. Shopify’s engineers chose a “frame always valid” strategy: generated code continuously updates the interpreter frame struct as it executes, rather than reconstructing it only when a guard fails. This adds overhead to compiled code but trivializes deoptimization: a guard failure simply jumps to the interpreter dispatch loop, which finds its frame in a valid state.
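A small C model shows why keeping the frame valid makes side exits cheap. Everything here is hypothetical: the Frame and Value types are drastically simplified stand-ins for CRuby’s structures, and compiled_int_add is an ordinary C function playing the role of generated code for an integer-specialized add.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef enum { TAG_INT, TAG_OTHER } Tag;
typedef struct { Tag tag; int64_t i; } Value;

/* Hypothetical, simplified frame; a real control frame has more fields. */
typedef struct {
    int pc;          /* index of the next bytecode to execute */
    int sp;          /* number of live stack slots */
    Value stack[8];
} Frame;

/* Stand-in for code compiled for "ADD at pc", specialized for two integers.
 * Returns true if it ran to completion. Returns false if a guard failed, in
 * which case the caller resumes interpreting from f->pc. */
static bool compiled_int_add(Frame *f)
{
    assert(f->sp >= 2);
    Value *a = &f->stack[f->sp - 2];
    Value *b = &f->stack[f->sp - 1];

    /* Guard: nothing has been modified yet, so the frame still describes
     * the state just before the ADD. The side exit is a plain return. */
    if (a->tag != TAG_INT || b->tag != TAG_INT)
        return false;

    a->i += b->i; /* result goes straight into the frame (overflow ignored) */
    f->sp -= 1;   /* frame is brought up to date after the instruction */
    f->pc += 1;
    return true;
}
```

On the failing path no reconstruction step exists at all: the interpreter picks up the frame exactly as the compiled code left it. The cost is visible on the success path, where sp and pc are written back after every instruction even if no guard ever fails.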

For Ruby 3.2, YJIT was ported from C to Rust. It does not build on a general-purpose code generator: the Rust implementation has its own lightweight backend with assemblers for x86-64 and ARM64. Cranelift, a fast, portable code generator designed originally for WebAssembly compilation, is a candidate backend for language JITs that need portability without LLVM’s compilation latency, but YJIT does not use it.

PyPy and Meta-Tracing

PyPy takes a structurally different approach. Rather than retrofitting a JIT into a C interpreter, PyPy’s RPython framework requires the interpreter to be written in RPython, a restricted subset of Python. The meta-tracing JIT traces the RPython interpreter loop itself, recording traces that span multiple guest-language bytecodes. The interpreter author annotates the dispatch loop with jit_merge_point and can_enter_jit hints; the meta-tracing machinery handles the rest.

This produces high-quality optimization: allocation removal can keep short-lived objects in registers rather than on the heap, constant folding operates across instruction boundaries, and type specialization is driven by dynamically observed types. PyPy runs roughly 4-5x faster than CPython on pyperformance and dramatically faster on tight numeric loops.

The requirement is rewriting the interpreter in RPython. For Python, PyPy has been doing this for two decades, and the investment has paid off in performance terms. For a language whose canonical implementation is an existing, widely deployed C codebase, this is not a viable path.

The Black Box Problem

Once GCC or Clang has compiled the interpreter source, the semantic structure is gone. A “read guest-language local variable at slot 3” in the C source becomes a load from some memory address in the compiled binary. A “check whether value is an integer” becomes a comparison against some bit pattern. A JIT operating at the machine code level can record and replay machine code traces, but it cannot perform meaningful guest-language-level optimization because it cannot distinguish interpreter bookkeeping from guest-language semantics.

YJIT solved this by re-encoding the knowledge manually. Shopify’s engineers understood CRuby’s frame layout and built that understanding into the code generator. This is effective for a well-resourced project targeting one specific interpreter. Writing a YJIT-style JIT for the PHP Zend engine or for Lua’s C implementation would require starting the same deep familiarization process from scratch for a different codebase.

yk and LLVM IR Tracing

Tratt’s yk project attacks the black box problem at its source by operating at the LLVM IR level rather than machine code. The interpreter is compiled with ykllvm, a modified LLVM fork that instruments the IR to support trace recording and deoptimization. The interpreter author adds a small number of calls to the yk C API in their dispatch loop. A minimal annotated interpreter looks roughly like this:

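#include <yk.h>

YkMT *mt = yk_mt_new(NULL);         /* meta-tracer instance */
YkLocation loc = yk_location_new(); /* identifies a point where a hot loop may start */

while (running) {
    /* Control point: hot counting, trace recording, and entry into compiled code. */
    yk_mt_control_point(mt, &loc);
    switch (*interp->pc) {
        case OP_ADD: /* ... */
            break;
        case OP_LOAD: /* ... */
            break;
    }
}

yk_location_drop(&loc);
yk_mt_drop(mt);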

The yk_mt_control_point call at the top of the dispatch loop is where all the tracing machinery lives. On cold iterations it is a near-zero-overhead check. On hot paths, yk records the LLVM IR instructions that executed during that iteration. Once enough trace data is collected, the recorded IR passes through LLVM’s full optimization pipeline: constant folding, dead code elimination, load-store elimination, loop-invariant code motion. Because optimization operates on LLVM IR, before lowering to machine code discards the remaining semantic structure, these passes can do meaningful work at the guest-language level; a repeatedly read local variable becomes loop-invariant at the IR level because the IR still expresses what that load corresponds to.
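The cold-versus-hot behaviour of a control point can be modelled as a small state machine per location. The C sketch below is a generic illustration of hot-location counting in a tracing JIT, not yk’s implementation: the Location, Action, and control_point names and the threshold of 3 are all assumptions made for the example.

```c
#include <assert.h>
#include <stdint.h>

#define HOT_THRESHOLD 3 /* hypothetical; real systems use larger, tunable values */

typedef enum { LOC_COUNTING, LOC_TRACING, LOC_COMPILED } LocState;

typedef struct {
    LocState state;
    uint32_t count;
} Location;

typedef enum {
    ACT_INTERPRET,              /* cold: keep interpreting */
    ACT_START_TRACE,            /* just became hot: begin recording */
    ACT_STOP_TRACE_AND_COMPILE, /* loop closed: optimize and compile the trace */
    ACT_RUN_COMPILED            /* compiled code exists: jump into it */
} Action;

/* Decide what to do each time the dispatch loop reaches this location. */
static Action control_point(Location *loc)
{
    switch (loc->state) {
    case LOC_COUNTING:
        if (++loc->count < HOT_THRESHOLD)
            return ACT_INTERPRET;
        loc->state = LOC_TRACING;
        return ACT_START_TRACE;
    case LOC_TRACING:
        /* Back at the same location: one full loop iteration was recorded. */
        loc->state = LOC_COMPILED;
        return ACT_STOP_TRACE_AND_COMPILE;
    case LOC_COMPILED:
        return ACT_RUN_COMPILED;
    }
    assert(0 && "unreachable");
    return ACT_INTERPRET;
}
```

With this threshold, the first two visits to a location interpret, the third starts a trace, the fourth closes the loop and compiles, and every later visit runs compiled code. In the counting state the work is one increment and one comparison, which is what keeps the cold path cheap.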

Deoptimization uses LLVM’s stackmap intrinsic, extended by ykllvm with interpreter-specific metadata. At each guard point in the compiled trace, the JIT records a mapping from its register state back to the interpreter’s frame layout. Guard failure drives a precise reconstruction of interpreter state. This is conceptually the same mechanism HotSpot JVM uses for its deoptimization tables, applied to a C interpreter. Standard LLVM stackmap support was not designed for this use case, so ykllvm had to extend it accordingly.
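The essence of stackmap-driven deoptimization is a per-guard table that says where each live value currently sits and where the interpreter expects to find it. The following C sketch models that idea only; it does not use LLVM’s stackmap format or ykllvm’s extensions, and the GuardMap, LiveVar, MachineState, and deoptimize names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_REGS 4
#define NUM_SLOTS 8

/* Simulated register file of the compiled trace. */
typedef struct { int64_t regs[NUM_REGS]; } MachineState;

/* Simplified interpreter frame. */
typedef struct { int pc; int64_t locals[NUM_SLOTS]; } Frame;

/* One live value at a guard: which register holds it, which slot wants it. */
typedef struct { int reg; int slot; } LiveVar;

/* Hypothetical per-guard record, emitted when the trace is compiled. */
typedef struct {
    int resume_pc;          /* bytecode at which the interpreter resumes */
    size_t n_live;
    LiveVar live[NUM_REGS];
} GuardMap;

/* On guard failure: copy live registers into frame slots and set the pc. */
static void deoptimize(const GuardMap *g, const MachineState *m, Frame *f)
{
    for (size_t i = 0; i < g->n_live; i++) {
        const LiveVar *v = &g->live[i];
        assert(v->reg >= 0 && v->reg < NUM_REGS);
        assert(v->slot >= 0 && v->slot < NUM_SLOTS);
        f->locals[v->slot] = m->regs[v->reg];
    }
    f->pc = g->resume_pc;
}
```

Unlike the frame-always-valid strategy, compiled code here pays nothing on the fast path; the frame is written only when a guard actually fails. The price is that every guard needs an accurate map, which is the metadata the compiler toolchain has to produce.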

The yk project has demonstrated this approach on toy interpreters and several research-scale systems. On compute-heavy micro-benchmarks, results approach LuaJIT-class speedups on small interpreters. Demonstrating this on a production-scale interpreter is the next validation step.

The Spectrum

These approaches occupy clearly different positions in the same trade-off space.

Copy-and-patch requires no knowledge of interpreter state and gets modest optimization. YJIT requires deep knowledge of one specific interpreter’s internals, hand-coded into the JIT, and gets good optimization for that interpreter. yk requires compiling the interpreter with ykllvm and adding a handful of API calls to the dispatch loop, aiming for high-quality optimization for any interpreter compiled that way, with the engineering cost front-loaded in the toolchain rather than in per-interpreter work.

LuaJIT sits outside this spectrum. Mike Pall built it from scratch with tracing in mind; he had full control over the bytecode design, value representation (NaN-boxing for tagged values), and the custom code generator. The 10-50x speedups over PUC Lua reflect what is achievable when the interpreter is designed for JIT from the start. Everything in the retrofitting space is working to close that gap from a position Pall never had to deal with.

Whether yk’s approach scales to a production interpreter and delivers LuaJIT-adjacent optimization quality remains the open question. Tratt’s post frames the problem precisely; the results will come from the implementation work his group is doing now.