The Register Allocation Fix That Puts CPython's JIT Back in the Game
Source: hackernews
For two releases, CPython’s JIT compiler has been technically present and largely irrelevant. Python 3.13 shipped it behind an experimental flag (--enable-experimental-jit), and pyperformance benchmarks showed it oscillating around baseline, sometimes faster, sometimes slower, never convincingly worth the complexity. Python 3.14 made incremental progress, but the picture was still muddled. A recent post by Ken Jin, one of the CPython core developers most deeply involved in this work, makes the case that 3.15 is different, and the technical reason why tells you a lot about the constraints the CPython team has been working within.
How Python Gets from Code to Native Execution
To understand why register allocation matters here, it helps to trace the full path a Python function takes through the runtime. Since Python 3.11, CPython has had what the developers call a “specializing adaptive interpreter,” which is Tier 1 of a multi-tier execution pipeline. When a piece of code runs frequently enough, the interpreter starts replacing generic bytecode instructions with specialized variants based on observed types. A generic BINARY_OP in a tight loop that always adds integers becomes BINARY_OP_ADD_INT, which skips type dispatch and hits the fast path directly. Polymorphic call sites stay generic; monomorphic ones specialize.
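Specialization can be observed from Python itself. The following is a minimal sketch, assuming CPython 3.11 or later, where dis.get_instructions accepts an adaptive flag; the function accumulate and the helper opnames are illustrative names, and the exact specialized opcodes that appear vary between CPython versions.

```python
import dis
import sys


def accumulate(n):
    """A hot loop whose += always sees integers."""
    total = 0
    for i in range(n):
        total += i
    return total


def opnames(func):
    """Return opcode names, showing specialized forms where the runtime supports it."""
    if sys.version_info >= (3, 11):
        # adaptive=True asks dis to show the instructions as currently specialized.
        instructions = dis.get_instructions(func, adaptive=True)
    else:
        # Older interpreters have no specializing tier to inspect.
        instructions = dis.get_instructions(func)
    return [ins.opname for ins in instructions]


before = opnames(accumulate)      # not yet warmed up
for _ in range(100):              # run often enough for the interpreter to specialize
    accumulate(1_000)
after = opnames(accumulate)

# Opcodes that only appear after warm-up are the specialized variants.
print(sorted(set(after) - set(before)))
```

On a specializing interpreter the printed list typically includes integer-specific variants of the generic instructions, though which ones is an implementation detail.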
Tier 2 builds on top of this. Instead of specializing individual instructions in place, it extracts hot traces of execution, translates them into a lower-level intermediate representation called micro-ops (uops), and runs a series of optimization passes over them. This is where guards are inserted: a Tier 2 trace can assume a particular variable is always an integer, but it needs a guard at the start to verify this assumption and bail out to the interpreter if it fails.
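The guard-and-bail-out pattern can be modeled in a few lines. This is a toy sketch, not CPython's implementation: Deopt, guard_int, traced_add, generic_add, and run are all hypothetical names invented for illustration.

```python
class Deopt(Exception):
    """Raised when a guard fails and the specialized trace must be abandoned."""


def guard_int(value):
    # Exact type check: the trace was recorded assuming plain ints.
    if type(value) is not int:
        raise Deopt(f"expected int, got {type(value).__name__}")
    return value


def traced_add(a, b):
    """Specialized path: valid only while both guards hold."""
    guard_int(a)
    guard_int(b)
    return int.__add__(a, b)


def generic_add(a, b):
    """Fallback path standing in for the general interpreter."""
    return a + b


def run(a, b, stats):
    """Try the specialized trace first; fall back when a guard fails."""
    try:
        result = traced_add(a, b)
        stats["trace"] += 1
    except Deopt:
        result = generic_add(a, b)
        stats["deopt"] += 1
    return result


stats = {"trace": 0, "deopt": 0}
print(run(2, 3, stats))      # guards pass, specialized path is used
print(run(1.5, 2, stats))    # guard fails, generic path is used
print(stats)
```

The point of the structure is that the specialized path never has to handle the unexpected case itself; it only has to detect it and hand control back.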
The JIT then takes Tier 2 uop traces and compiles them to native machine code. This is where Python 3.15’s story gets interesting.
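Whether a given interpreter was built with the JIT at all can be checked at runtime. The sketch below assumes the sys._jit introspection namespace documented for CPython 3.14 and later; on older versions or other implementations the attribute is absent, so the hypothetical helper jit_status returns None.

```python
import sys


def jit_status():
    """Report JIT availability, or None if this interpreter does not expose sys._jit."""
    jit = getattr(sys, "_jit", None)  # assumption: present on CPython 3.14+
    if jit is None:
        return None
    return {
        "available": jit.is_available(),  # was the JIT compiled into this build?
        "enabled": jit.is_enabled(),      # is it switched on for this process?
    }


print(jit_status())
```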
Copy-and-Patch: Speed at a Cost
CPython’s JIT uses a technique called copy-and-patch compilation, described in a 2021 OOPSLA paper by Haoran Xu and Fredrik Kjolstad and adapted for CPython by Brandt Bucher under PEP 744. Rather than building a runtime code generator that synthesizes machine instructions from scratch, you compile template stubs for each uop at CPython’s build time using LLVM/Clang. Each stub is a small blob of machine code with “holes”: positions where runtime-specific values (addresses, constants, offsets) need to be filled in.
At JIT compile time, the process is just copying the relevant stubs together and patching those holes with real values. No register allocation, no instruction selection, no liveness analysis, no IR manipulation at runtime. The JIT compiles a trace in microseconds rather than milliseconds, and the implementation is simple enough that CPython developers without deep compiler expertise can maintain and extend it.
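The copy-then-patch step is simple enough to model directly. The sketch below is a toy, not CPython's stencil format: Stencil, LOAD_CONST, RETURN, and emit are hypothetical names, and the two byte sequences are the x86-64 encodings for loading a 64-bit immediate into rax and for ret, chosen only to make the holes concrete. The resulting bytes are never executed.

```python
import struct
from dataclasses import dataclass


@dataclass(frozen=True)
class Stencil:
    code: bytes    # pre-compiled machine code with zero-filled holes
    holes: tuple   # (offset, operand_name) pairs marking 8-byte holes


# Hypothetical stencils, standing in for what a build-time compiler would produce.
LOAD_CONST = Stencil(code=b"\x48\xb8" + bytes(8), holes=((2, "value"),))
RETURN = Stencil(code=b"\xc3", holes=())


def emit(trace):
    """Copy each stencil into the output buffer, then patch its holes."""
    out = bytearray()
    for stencil, operands in trace:
        base = len(out)
        out += stencil.code                      # the "copy" step
        for offset, name in stencil.holes:       # the "patch" step
            struct.pack_into("<Q", out, base + offset, operands[name])
    return bytes(out)


code = emit([(LOAD_CONST, {"value": 42}), (RETURN, {})])
print(code.hex())
```

Nothing in emit inspects or reorders instructions, which is why this style of code generation is fast and also why, on its own, it cannot make register decisions across stencils.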
The trade-off is significant. Because stubs are pre-compiled without knowledge of their surrounding context, they cannot assume which CPU registers hold which values when they run. They have to be self-contained, which in practice meant they interacted with the CPython evaluation stack: pushing results to memory, reading operands from memory, and treating the stack as the canonical value store between operations.
On modern CPUs with 16 or more general-purpose registers, this is a serious problem. A tight compute loop that could keep all its working values in registers instead spills them to the evaluation stack constantly. The JIT was generating code that did more memory traffic than the interpreter on exactly the kinds of loops where you would most want it to help.
What Register Allocation Actually Fixes
The fix in Python 3.15 is register allocation applied to the uops IR before native code generation. The compiler now performs a liveness analysis over the uop trace, determines where each value is born and where it dies, and assigns CPU registers to values that can be kept there for part or all of the trace. The copy-and-patch templates were updated to work with register-allocated operands.
This removes the core architectural reason the JIT was slower than the interpreter. A value computed at the start of a trace and used ten uops later can now live in a register for the duration, rather than being written to the evaluation stack and read back each time. For numeric compute, this is the difference between code bottlenecked on loads and stores and code that can approach ALU throughput.
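The liveness-then-assignment idea can be sketched with a classic linear scan over a straight-line trace. This is a textbook technique used here for illustration, not a description of the algorithm CPython actually uses; live_intervals, allocate, the value names, and the register names r0 to r2 are all hypothetical.

```python
def live_intervals(trace):
    """Map each value to (born, dies) indices over a linear trace.

    trace is a list of (dest, sources) pairs, one per operation.
    """
    intervals = {}
    for index, (dest, sources) in enumerate(trace):
        for name in sources:
            born, _ = intervals.get(name, (index, index))
            intervals[name] = (born, index)          # extend to the latest use
        if dest is not None and dest not in intervals:
            intervals[dest] = (index, index)         # value is born here
    return intervals


def allocate(intervals, registers):
    """Linear scan: give each value a register, or 'stack' if none is free."""
    free = list(registers)
    active = []                                      # (dies, name) for register-held values
    assignment = {}
    for name, (born, dies) in sorted(intervals.items(), key=lambda kv: kv[1][0]):
        for end, old in list(active):
            if end < born:                           # old value is dead: recycle its register
                active.remove((end, old))
                free.append(assignment[old])
        if free:
            assignment[name] = free.pop(0)
            active.append((dies, name))
        else:
            assignment[name] = "stack"               # spill to memory
    return assignment


# A four-operation trace: two loads, an add, and a use of the sum.
trace = [("lhs", ()), ("rhs", ()), ("sum", ("lhs", "rhs")), ("result", ("sum",))]
intervals = live_intervals(trace)
print(allocate(intervals, ["r0", "r1", "r2"]))
print(allocate(intervals, ["r0", "r1"]))
```

With three registers nothing spills and result reuses the register that lhs vacated; with two, sum has to go to the stack because both operands are still live when it is born.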
A simplified example illustrates the issue. Consider a loop that accumulates an integer sum:
    total = 0
    for i in range(1_000_000):
        total += i
Without register allocation, the JIT trace for total += i looks roughly like: load total from the frame’s value stack (a memory read), load i from its stack slot (another memory read), add them, store the result back to the stack (a memory write). With register allocation, total can live in rax and i in rbx for the entire hot loop, and in the ideal case the arithmetic at the core of the body reduces to a single add instruction, setting aside the overflow checks and object handling that Python integers still require. The difference in generated code quality is not marginal.
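The arithmetic behind that comparison can be made explicit with a toy model that counts memory accesses for the loop body only. It is a sketch under simplifying assumptions: run_stack_based and run_register_based are hypothetical names, Python dictionaries stand in for stack slots, local variables stand in for registers, and boxing, reference counting, and loop machinery are ignored. The counts are properties of the model, not measurements of CPython.

```python
def run_stack_based(n):
    """Every operand is read from, and every result written to, a memory slot."""
    slots = {"total": 0, "i": 0}
    traffic = 0
    for i in range(n):
        slots["i"] = i               # loop machinery, not counted
        a = slots["total"]           # memory read
        traffic += 1
        b = slots["i"]               # memory read
        traffic += 1
        slots["total"] = a + b       # memory write
        traffic += 1
    return slots["total"], traffic


def run_register_based(n):
    """total and i stay in 'registers' (locals) for the whole loop."""
    slots = {"total": 0}
    traffic = 0
    total_reg = slots["total"]       # one read on entry
    traffic += 1
    for i_reg in range(n):
        total_reg += i_reg           # register-only arithmetic
    slots["total"] = total_reg       # one write-back on exit
    traffic += 1
    return slots["total"], traffic


print(run_stack_based(1_000))
print(run_register_based(1_000))
```

In this model the stack-based version performs three memory accesses per iteration, while the register-based version performs two in total regardless of the iteration count.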
This is not a novel insight. Register allocation has been standard in optimizing compilers since the 1970s, and every mature JIT from V8’s TurboFan to LuaJIT’s trace compiler implements it. The question was whether CPython’s copy-and-patch architecture could support it without abandoning the simplicity that made the approach tractable. The answer, based on Ken Jin’s post, is yes.
Context from Other Runtimes
The comparison with PyPy is instructive. PyPy’s tracing JIT has been in development for well over a decade and implements not just register allocation but escape analysis, inline caching, constant folding across trace boundaries, and loop unrolling. On compute-heavy benchmarks, PyPy is often 3 to 8 times faster than CPython. CPython’s JIT is not trying to reach that level of optimization depth. The explicit goal has been a JIT that is maintainable by the core team, ships disabled-by-default without adding startup latency, and produces measurable wins on pyperformance.
V8’s architecture offers a different frame of reference. V8 uses a four-tier model: the Ignition interpreter, Sparkplug as a fast baseline JIT, Maglev as a mid-tier optimizing compiler, and TurboFan at the top. Each successive tier takes longer to compile but produces better code; the system promotes hot functions up the tier ladder as it becomes clear they are worth further investment. CPython’s JIT is building something closer to a two-tier system, which means the single JIT tier carries more responsibility for producing code that beats the interpreter.
There is also GraalPy, which uses GraalVM’s Truffle framework and partial evaluation to achieve extraordinary peak performance, but at the cost of significant startup overhead and a JVM dependency that makes deployment complicated. CPython’s design priorities are almost the inverse: simplicity, portability, and predictable startup times, with performance as an incremental improvement rather than the primary design constraint.
What Comes Next
Register allocation being solved does not mean the JIT will suddenly produce PyPy-level output. CPython’s JIT still lacks speculative inlining of function calls, type inference propagation across call boundaries, and escape analysis for short-lived objects. These are all possible within the copy-and-patch framework in principle, but each requires careful design work that intersects with CPython’s garbage collector, its object model, and its global interpreter lock semantics.
What it does mean is that the JIT now has a foundation from which optimizations can compound. The pre-3.15 situation was structurally broken: adding more uop-level optimizations did not help much if the code generator was discarding the results by spilling everything to the stack. The optimizer and the code generator were working against each other, and no amount of cleverness in the IR could compensate for poor register usage in the backend. That conflict is now resolved.
Ken Jin’s post frames 3.15 as the release where the JIT stops being a liability on benchmarks and starts being a genuine contributor to Python’s performance story. Given the trajectory from 3.13’s mixed results through 3.14’s tentative improvements to 3.15’s register-aware backend, that framing is accurate. The JIT is not done, but it is no longer blocked on its own architecture, and that is the meaningful change.