The copy-and-patch JIT that shipped experimentally in Python 3.13 was never going to be a silver bullet. It was, by design, a minimal viable just-in-time compiler: a first step on a long road that Mark Shannon’s Faster CPython team had been walking since Python 3.11. The recent report that the JIT is back on track for Python 3.15 is worth unpacking not just as good news, but as a window into the real constraints and trade-offs behind CPython’s performance roadmap.
How Copy-and-Patch Works
This is not a traditional JIT in the V8 or HotSpot sense. There is no intermediate representation, no register allocator, no loop-level optimization passes. The name describes the mechanism precisely: at CPython build time, a clang-based toolchain compiles machine code templates for each micro-op (uop), leaving deliberate holes where runtime addresses and constants will go. When the JIT decides to compile a trace, it copies those templates end-to-end into an executable memory region and patches the holes with actual values.
The uop pipeline builds on the specializing adaptive interpreter introduced in Python 3.11. After a bytecode instruction executes enough times, CPython replaces it with a type-specialized variant: LOAD_ATTR becomes LOAD_ATTR_INSTANCE_VALUE, BINARY_OP becomes BINARY_OP_ADD_INT, and so on. The JIT operates one level below this, decomposing bytecodes into finer-grained uops and then compiling hot traces through those uops.
A trace is a linear sequence of uops representing a frequently-taken path, usually through a loop body. Guards embedded in the trace verify that the runtime types and conditions still match what was observed when the trace was recorded. If a guard fails, execution deoptimizes back to the interpreter, which continues from that point. This is the same basic model LuaJIT uses, and it is well-suited to dynamic languages because it defers specialization decisions until the types are actually known.
The build toolchain requirement is one of copy-and-patch’s least-discussed complications. Building CPython with JIT support requires clang, even on Linux systems where GCC is the typical compiler. The templates are generated by compiling small C files that define each uop’s machine code, then post-processing the resulting object files to extract the template bytes and their relocation metadata. This is codified in PEP 744, which specified the JIT for 3.13.
Why It Underperformed in 3.13
The 3.13 release demonstrated that the JIT worked correctly but did not justify enabling it by default. Benchmarks on pyperformance showed results ranging from essentially flat to a few percent faster, with some regressions on specific workloads. Several compounding issues explain this.
Guard overhead is the most fundamental. A trace that checks the type of a loop variable at every iteration pays for those checks in the compiled version just as it would in the interpreter. Without the ability to hoist or eliminate redundant guards, the JIT spends a meaningful fraction of its cycles on checks that would be provably unnecessary in a smarter compiler. The overhead is especially visible in loops with predictable, stable types, which is precisely where you would expect a JIT to help most.
Trace formation heuristics were conservative. The algorithm for deciding where a trace starts and ends has historically favored shorter traces, which limits how much work the compiled code can do before returning to the interpreter. Loops with small bodies but frequent branches may never produce a trace long enough to amortize the compilation cost. Short traces also give the uop optimizer less context to work with, since it reasons over the sequence of uops in a trace rather than across trace boundaries.
The uop optimizer itself was limited in what it could prove. It had a basic type lattice: it could distinguish int, float, str, and a few other types, but it could not track narrower properties such as non-negative integers or small integers. Without that granularity, many guard eliminations that are possible in principle required proving facts the optimizer could not express.
There were also correctness issues in edge cases. CPython’s reference counting model interacts with JIT-compiled traces in non-obvious ways, and the garbage collector’s assumptions about object lifetimes add another layer of complexity. Some code patterns required conservative workarounds that effectively prevented those paths from being JIT-compiled at all.
What Changes for 3.15
The report signals that specific blockers have been addressed. Based on the development trajectory through this period, the improvements span the uop optimizer, trace formation, and correctness.
The optimizer has been gaining a richer type lattice. Tracking not just whether a value is an int but whether it is a small int or a known-positive int enables classes of guard elimination that were previously impossible. This is a step toward something like abstract interpretation over the uop stream, where the optimizer can propagate value information forward through a trace and prune checks that would always pass.
Trace formation has been revised to allow longer traces and to be more selective about which backward edges anchor new traces. A longer trace gives the optimizer a bigger window and gives the compiled code more useful work to do per entry. The side effect is more compilation time and more memory for the trace cache, so the heuristics involve genuine trade-offs rather than just turning one dial up.
The correctness issues have been getting steady attention. Some of the conservative workarounds in 3.13 are being replaced with precise handling, which recovers JIT coverage for code patterns that were previously falling back to the interpreter more than necessary.
Comparing With PyPy and GraalPy
PyPy has had a mature tracing JIT for over a decade. On compute-heavy benchmarks, it routinely runs Python code five to ten times faster than CPython. GraalPy, running Python on GraalVM’s compiler infrastructure, can exceed that on some workloads because it has access to decades of research in optimizing compiler design. Copy-and-patch cannot close that gap without growing into something substantially more complex.
CPython’s JIT is not trying to beat PyPy on benchmark throughput. Its constraints are different. It must be maintainable by CPython’s core team without requiring JIT compiler expertise for every patch. It must not significantly increase startup time, memory usage, or binary size for users who never invoke a hot loop. It must be correct across CPython’s full semantics, including the parts that alternative implementations sometimes handle differently. Within those constraints, a modest reliable speedup on pure Python loops is a genuine improvement.
For most Python code in production, the bottleneck is not the interpreter dispatch loop. Code that spends its time in NumPy, PyTorch, or other C extensions will see no benefit from a JIT that operates only on Python-level uops. The JIT matters for tight pure Python loops: numerical code that has not been offloaded to a native library, parsers, bytecode interpreters implemented in Python, and business logic with dense iteration over data structures.
The Longer View
Python’s performance trajectory over the past several releases has been real and compounding. The specializing adaptive interpreter in 3.11 delivered around 25% speedup across pyperformance. The work since then keeps building on that base. Getting the JIT to a stable, always-on state in 3.15 is not the destination; it is the precondition for the next phase.
Register allocation, type-specialized code generation, and loop transformations are all harder problems than copy-and-patch, but they become tractable once the simpler foundation is solid. The uop optimizer’s growing type lattice is already pointing in that direction. Each release narrows the gap between what the optimizer can prove and what the code actually does, and that gap is where most of the remaining performance lives.
For anyone building pure Python systems where interpreter overhead is a measurable cost, 3.15 is worth watching closely. The JIT being back on track means the experiment is no longer in question; what remains is refining the output quality until the speedups justify enabling it unconditionally.