The Register Allocation Gap That Kept CPython's JIT From Mattering
Source: hackernews
Python’s JIT compiler has existed in a frustrating middle ground for two release cycles. It shipped in 3.13, it continued in 3.14, and in both cases the performance results were underwhelming enough that the feature remained off by default. The recent post from Ken Jin on the CPython JIT’s trajectory signals that 3.15 is genuinely different, and the reason comes down to a fundamental compiler problem that the initial implementation deliberately deferred: register allocation.
How CPython Got a JIT at All
The path to CPython having a JIT is longer than most people realize. The broader Python ecosystem has tried this multiple times: PyPy has had a tracing JIT since around 2009, Meta’s Cinder fork experimented with method JIT compilation, Google’s Unladen Swallow attempted to build an LLVM-based JIT into CPython, and Microsoft’s Pyjion project attempted to plug the .NET CoreCLR JIT into CPython. None of these made it into mainline CPython, for reasons that ranged from maintenance burden to the fundamental difficulty of JIT-compiling a language as dynamic as Python.
What changed with 3.13 was the compilation technique. Brandt Bucher and the CPython team adopted “copy-and-patch” JIT compilation, a method that sidesteps the hardest parts of building a traditional JIT. Instead of generating machine code at runtime from scratch, the compiler uses Clang/LLVM at build time to compile small code templates called stencils. Each stencil is a precompiled fragment of machine code with holes left in it for values that will only be known at runtime: addresses, constants, operand locations. When the JIT activates a trace, it copies the relevant stencils into executable memory and patches those holes. The result is a JIT with no runtime dependency on LLVM, minimal code complexity, and near-instant compilation latency.
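The copy-then-patch step can be illustrated with a toy model. This is a minimal sketch, not CPython's implementation: the `Stencil` class, the opcode bytes, and the hole layout are all invented for illustration, and the output is not real machine code.

```python
import struct
from dataclasses import dataclass


@dataclass(frozen=True)
class Stencil:
    """A precompiled fragment with holes (hypothetical layout)."""
    code: bytes            # template bytes; holes are zero-filled placeholders
    holes: dict[str, int]  # hole name -> byte offset of an 8-byte slot


def emit(trace, stencils):
    """Copy each stencil into one buffer, then patch its holes."""
    buf = bytearray()
    for name, values in trace:
        stencil = stencils[name]
        base = len(buf)
        buf += stencil.code  # the "copy" step
        for hole, offset in stencil.holes.items():
            # The "patch" step: write the runtime value into the hole.
            struct.pack_into("<Q", buf, base + offset, values[hole])
    return bytes(buf)


# Two made-up stencils: one marker byte followed by an 8-byte hole.
STENCILS = {
    "LOAD_CONST": Stencil(b"\xA1" + bytes(8), {"value": 1}),
    "JUMP": Stencil(b"\xB2" + bytes(8), {"target": 1}),
}

code = emit(
    [("LOAD_CONST", {"value": 42}), ("JUMP", {"target": 0x1000})],
    STENCILS,
)
```

Nothing in `emit` resembles instruction selection or code generation: all of that happened when the templates were built, which is why the runtime side can stay this small.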
This technique comes from academic work on copy-and-patch compilation by Haoran Xu and Fredrik Kjolstad, and it’s the same approach used for .NET 6’s template JIT. For CPython, it was a smart choice: it kept the implementation small enough to land in mainline and gave the team a foundation to iterate on.
What the Tier 2 Optimizer Adds
The JIT doesn’t compile raw Python bytecode. It sits on top of CPython’s tier 2 optimizer, which itself sits on top of the specializing adaptive interpreter added in 3.11. The three-tier structure matters:
- Tier 1 is the specializing interpreter. When a code object runs hot, it replaces generic opcodes with specialized variants. LOAD_ATTR becomes LOAD_ATTR_INSTANCE_VALUE when the interpreter observes that it’s always loading an instance attribute from the same slot.
- Tier 2 takes those specialized traces and applies longer-range optimizations: dead code elimination, guard reduction, and partial evaluation of type information. The output is a sequence of micro-ops (uops).
- The JIT compiles tier 2 uop traces into machine code using copy-and-patch.
The tier 2 optimizer was the prerequisite that made the JIT feasible. Without it, the JIT would be compiling nearly the same operations as the interpreter, just slightly faster due to skipping dispatch overhead. With type information propagated through a trace, the JIT can generate code that skips redundant type checks and operates on concrete representations.
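To make guard reduction concrete, here is a minimal sketch of removing redundant type checks from a linear trace. The op names (`GUARD_TYPE`, `ADD_INT`, `STORE`) and the tuple encoding are hypothetical stand-ins, not CPython's actual uops or optimizer.

```python
def eliminate_redundant_guards(trace):
    """Drop type guards whose outcome is already known earlier in the trace."""
    known = {}  # variable name -> type already proven by an earlier guard
    out = []
    for op in trace:
        kind = op[0]
        if kind == "GUARD_TYPE":
            _, var, typ = op
            if known.get(var) == typ:
                continue  # already proven on this path: skip the guard
            known[var] = typ
        elif kind == "STORE":
            _, var = op
            known.pop(var, None)  # variable overwritten: forget its type
        out.append(op)
    return out


trace = [
    ("GUARD_TYPE", "a", "int"),
    ("GUARD_TYPE", "b", "int"),
    ("ADD_INT", "a", "b"),
    ("GUARD_TYPE", "a", "int"),  # redundant: a is already known to be int
    ("ADD_INT", "a", "a"),
    ("STORE", "a"),
    ("GUARD_TYPE", "a", "int"),  # needed again after the store
]

optimized = eliminate_redundant_guards(trace)
```

Because a trace is a straight line, a single forward pass with a dictionary is enough here; there are no merge points where facts from different paths would need to be reconciled.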
Why 3.13 and 3.14 Underdelivered
CPython 3.13’s JIT landed with honest disclaimers: it was experimental, off by default, and not yet producing meaningful speedups on the pyperformance benchmark suite. Measurements at the time showed regressions of roughly 1-5% on some workloads. Python 3.14 improved the situation but still left the JIT as an opt-in feature with no compelling performance story for the typical user.
The underlying reason is register allocation, or rather the absence of it.
In a typical JIT compiler, one of the most important tasks is deciding which values live in CPU registers at any given point. Registers are fast; memory accesses are slow. A JIT that forces a value to a stack slot every time it’s computed, then loads it back immediately for the next operation, is generating correct but inefficient code. The performance difference between good and poor register allocation on modern hardware is substantial, often more significant than the difference between interpreted and compiled execution of the same operations.
The initial copy-and-patch JIT deferred this problem. Each stencil is essentially self-contained: it loads its inputs, does its work, and stores its outputs. The handoff between stencils goes through a fixed interface that uses memory rather than registers. This made the implementation tractable and correct, but it meant the generated code was spending significant time shuttling values between registers and memory at stencil boundaries, work that a smarter register allocator would eliminate by keeping live values in registers across the boundary.
Coupled with that, the JIT traces needed to be long enough for the overhead of entering and exiting JIT-compiled code to pay off. Short traces with high guard failure rates meant the JIT was paying the compilation and entry cost without spending enough time in the fast path to amortize it.
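The amortization argument reduces to simple arithmetic. The sketch below uses a hypothetical helper and made-up costs purely to show the shape of the trade-off; none of the numbers are measurements.

```python
import math


def break_even_iterations(fixed_cost_ns, interp_ns_per_iter, jit_ns_per_iter):
    """Iterations needed before JIT-compiled code repays its fixed cost.

    fixed_cost_ns covers compilation plus entering and exiting the trace.
    Returns None when the compiled code is not faster per iteration.
    """
    saving = interp_ns_per_iter - jit_ns_per_iter
    if saving <= 0:
        return None  # never pays off
    return math.ceil(fixed_cost_ns / saving)


# Illustrative numbers only.
long_hot_loop = break_even_iterations(50_000, 100, 80)
no_per_iter_gain = break_even_iterations(50_000, 100, 100)
```

With these inputs the first case needs 2500 iterations on the fast path before it breaks even. A trace that fails a guard and exits after a few dozen iterations never gets there, which is the problem described above.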
What 3.15 Is Changing
Ken Jin’s post describes concrete progress on register allocation within the copy-and-patch framework. The approach involves redesigning how stencils hand off values to each other, allowing the JIT to keep values in specific registers across stencil boundaries rather than forcing round-trips through memory. This is not a full general-purpose register allocator in the traditional sense; it’s designed around the specific structure of uop traces and the constraints of the stencil model.
The distinction matters. A traditional register allocator works on control flow graphs with arbitrary structure. A trace-based allocator works on a linear sequence with guarded exits, which is much simpler. Values are live from when they’re produced until their last use within the trace, and the exit points are explicit. This means you can implement a reasonably effective allocator with much less complexity than a full SSA-based allocator.
The practical consequence is that the JIT can now generate code where a value computed by one uop stays in a register through several subsequent uops that use it, instead of being spilled and reloaded at each stencil boundary. For tight numeric loops, the impact of this should be substantial.
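A rough sketch of why a linear trace makes allocation simple: one pass finds each value's last use, and a second pass hands out registers and recycles them as values die. This is an illustration under stated assumptions (every value is defined once, the trace encoding is invented), not the algorithm CPython uses.

```python
def allocate(trace, num_regs):
    """Assign registers over a linear trace, spilling when none are free.

    trace: list of (dest, sources) pairs; dest is None for ops with no result.
    Returns a dict mapping each defined value to a register index or "spill".
    """
    # Pass 1: record the last index at which each value is read.
    last_use = {}
    for i, (_, sources) in enumerate(trace):
        for src in sources:
            last_use[src] = i

    # Pass 2: walk the trace once, recycling registers of dead values.
    free = list(range(num_regs))
    location = {}
    for i, (dest, sources) in enumerate(trace):
        for src in dict.fromkeys(sources):  # dedupe, keep order
            if last_use[src] == i and isinstance(location.get(src), int):
                free.append(location[src])  # value dies here: reuse its register
        if dest is not None:
            if free:
                reg = min(free)
                free.remove(reg)
                location[dest] = reg
            else:
                location[dest] = "spill"  # falls back to a memory slot
    return location


trace = [
    ("a", []),          # a = load
    ("b", []),          # b = load
    ("c", ["a", "b"]),  # c = a + b   (a and b die here)
    ("d", ["c"]),       # d = c * c   (c dies here)
    (None, ["d"]),      # store d
]

two_regs = allocate(trace, 2)
one_reg = allocate(trace, 1)
```

With two registers, every value in this trace stays in a register; with one, `b` has to go through memory. A real allocator also has to account for guarded exits, where live values must be where the interpreter expects them.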
The Break-Even Problem
One thing that often gets lost in coverage of Python’s JIT is what “break-even” actually means in this context. The JIT is competing against the specializing interpreter, not against naive interpretation. The specializing interpreter is already quite fast for the operations it’s been tuned for. Benchmark suites like pyperformance contain a mix of workloads, many of which are I/O-bound or involve substantial Python-level string manipulation where the JIT’s numeric advantages don’t apply.
This means the JIT needs to be measurably faster on the compute-intensive subset of benchmarks, and no slower on everything else, before it makes sense to enable by default. Getting a 15% speedup on nbody while showing a 2% regression on django_template would not be a compelling tradeoff for enabling the JIT universally.
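One way to see why an aggregate number is not enough: a geometric mean over per-benchmark speedups can look healthy while most individual benchmarks regress. The ratios below are invented to mirror the example above, and every benchmark name other than nbody and django_template is a placeholder.

```python
from statistics import geometric_mean

# Speedup ratio per benchmark: above 1.0 is faster with the JIT, below is slower.
# All values are made up for illustration.
speedups = {
    "nbody": 1.15,
    "django_template": 0.98,
    "placeholder_io_bench": 0.98,
    "placeholder_string_bench": 0.98,
}

overall = geometric_mean(speedups.values())
regressions = sorted(name for name, ratio in speedups.items() if ratio < 1.0)
```

Here `overall` comes out above 1.0 even though three of the four benchmarks got slower, so a default-on decision needs the per-benchmark view as well as the summary.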
Register allocation directly addresses the JIT’s weak performance on compute-heavy loops. When Python code is doing arithmetic, iterating tightly, or manipulating arrays via buffer protocol, the bottleneck shifts from interpreter dispatch overhead to actual computation. This is exactly where keeping values in registers matters most, and exactly where the previous stencil handoff strategy was leaking performance.
Context From Other Language Runtimes
It’s worth noting how long this kind of iteration takes even in well-resourced projects. V8’s TurboFan JIT replaced Crankshaft after years of development and still went through multiple phases of rework. The JVM’s JIT compilers, C1 and C2, have been actively developed for over two decades. LuaJIT, widely cited as one of the most impressive JIT implementations for a dynamic language, took Mike Pall many years of full-time work to reach its current state.
CPython is doing this in mainline, with contributors working part-time on the JIT alongside other CPython development, and with the constraint of keeping the non-JIT path completely unaffected in correctness and performance. The fact that a working JIT shipped in 3.13 at all, using a novel compilation technique, is genuinely impressive engineering. The 3.15 work on register allocation is the next logical step in a planned progression, not a surprise pivot.
The original copy-and-patch paper explicitly anticipated that the initial implementation would leave performance on the table and that more sophisticated value management would come in later iterations. That is what is happening now.
What This Means in Practice
For most Python developers, the immediate practical consequence is nothing. The JIT will still be opt-in in 3.15, and even once it’s on by default, you’re unlikely to change how you write Python based on JIT behavior. The performance improvements, when they arrive, will show up as faster execution of the code you’re already writing.
For the subset of Python users running compute-heavy workloads, especially those who have been using NumPy or Cython as a workaround for Python’s speed in tight loops, a functioning JIT that handles numeric uop traces efficiently is potentially meaningful. Not a replacement for NumPy, but a reduction in the penalty for Python-level iteration logic surrounding NumPy calls.
The longer arc here is that CPython is building the infrastructure for a JIT that can be genuinely improved over time. Register allocation is one layer. Better type inference feeding into the tier 2 optimizer is another. More effective guard elimination is a third. Each improvement compounds on the others. Getting register allocation right in 3.15 is not the end of the JIT story; it’s closer to the beginning of the part where it starts mattering.