The JIT Pipeline Python 3.15 Completes Was Four Releases in the Making
Source: hackernews
When Ken Jin posted that Python 3.15’s JIT is on track, the conversation focused on register allocation as the critical addition. That framing is correct, but it is not the whole story. Register allocation completes a pipeline that was laid down across four prior releases, each one addressing a specific prerequisite the next step depended on. Looking at the 3.15 result in isolation misses why it took this long and why it could not have happened earlier.
3.11: Where Type Information Comes From
Before CPython could have a useful JIT, it needed runtime type information. PEP 659, the Specializing Adaptive Interpreter shipped in Python 3.11, provided this. The mechanism works through inline caches embedded directly in the bytecode stream. Instructions like LOAD_ATTR have cache slots allocated alongside them in the bytecode array. As execution repeats, the interpreter records what it observes at each instruction site: the type of the object, the attribute offset that was resolved, and the class version tag that was current.
After enough repetitions (typically eight), a generic instruction specializes. BINARY_OP with two integer operands becomes BINARY_OP_ADD_INT. The specialization rewrites the instruction itself in the bytecode array, so subsequent executions go straight to the fast path. The specialized instruction still carries a cheap guard on its operands; if the guard fails, execution falls back to the generic form. The expensive work of deciding which operation applies and what to cache moves from every execution to the point of specialization, and inline caches carry the results forward.
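The warm-up and in-place rewrite can be sketched as a toy model. This is not CPython's data structure: the BinaryOpSite class, its counter field, and the WARMUP constant are hypothetical names chosen for illustration. The specialized path keeps a cheap guard and falls back to the generic form when the guard fails.

```python
# Toy model of one adaptive instruction site; not CPython's real layout.
WARMUP = 8  # assumed threshold, taken from the "typically eight" figure


class BinaryOpSite:
    """One BINARY_OP site with a hypothetical inline counter."""

    def __init__(self):
        self.opname = "BINARY_OP"  # generic form
        self.counter = 0           # stands in for an inline cache counter

    def execute(self, a, b):
        if self.opname == "BINARY_OP_ADD_INT":
            # Specialized fast path: cheap guard, then the int-only add.
            if type(a) is int and type(b) is int:
                return a + b
            # Guard failed: revert to the generic form and start over.
            self.opname = "BINARY_OP"
            self.counter = 0
            return a + b
        # Generic path: do the work and record what was observed.
        result = a + b
        if type(a) is int and type(b) is int:
            self.counter += 1
            if self.counter >= WARMUP:
                self.opname = "BINARY_OP_ADD_INT"  # rewrite in place
        else:
            self.counter = 0
        return result


site = BinaryOpSite()
for i in range(WARMUP):
    site.execute(i, 1)
specialized_name = site.opname   # "BINARY_OP_ADD_INT" after warm-up
site.execute(1.5, 2)             # a float operand fails the guard
reverted_name = site.opname      # back to "BINARY_OP"
print(specialized_name, reverted_name)
```

On a real interpreter, dis.dis(func, adaptive=True) (available since Python 3.11) shows which instructions have specialized after a function has run enough times.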
This is the type information CPython’s JIT depends on. When the JIT records a trace, it reads the specialization state of the instructions in that trace to understand what types have been observed at each operation site. The 25% average speedup Python 3.11 delivered over 3.10 arrived before any JIT existed, with specialization as a major contributor alongside other interpreter work such as cheaper frames. The specializing interpreter is not a JIT warm-up heuristic; it is the observation layer that makes JIT compilation viable at all.
3.12 and 3.13: Building the Tier-2 Layer
The JIT does not compile Python bytecode directly. It compiles micro-ops, or uops, produced by the tier-2 optimizer from hot traces. PEP 744 shipped the JIT backend in Python 3.13, but the tier-2 optimizer was developed alongside it and is what makes the backend viable.
The tier-2 optimizer takes a recorded trace, typically a loop body, and runs analysis passes over the resulting uop sequence: constant folding, type propagation from the specialization caches, guard elision, and dead code removal. A guard is a runtime check that verifies a type assumption before proceeding; failing a guard means deoptimizing back to the interpreter. The optimizer’s job is to eliminate as many guards as it can by proving that type conditions already established earlier in the trace make later checks redundant.
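A minimal sketch of guard elision with simple type propagation, assuming a made-up trace format: the uop names GUARD_TYPE, ADD_INT, and STORE are hypothetical, and the real optimizer tracks far more state than a single dictionary.

```python
# Simplified guard-elision pass over a hypothetical uop trace.
def elide_redundant_guards(trace):
    """Drop type guards that an earlier uop in the trace already proves."""
    known = {}  # value name -> type established so far
    out = []
    for op, *args in trace:
        if op == "GUARD_TYPE":
            value, typ = args
            if known.get(value) == typ:
                continue              # redundant: already established
            known[value] = typ
        elif op == "ADD_INT":
            known[args[0]] = "int"    # type propagation: the result is an int
        elif op == "STORE":
            known.pop(args[0], None)  # a write invalidates what was known
        out.append((op, *args))
    return out


trace = [
    ("GUARD_TYPE", "x", "int"),
    ("GUARD_TYPE", "y", "int"),
    ("ADD_INT", "t", "x", "y"),
    ("GUARD_TYPE", "t", "int"),  # redundant: ADD_INT produces an int
    ("GUARD_TYPE", "x", "int"),  # redundant: x was already checked
    ("ADD_INT", "u", "t", "x"),
    ("STORE", "x"),
    ("GUARD_TYPE", "x", "int"),  # kept: x was overwritten
]
optimized = elide_redundant_guards(trace)
print(len(trace), "->", len(optimized))  # 8 -> 6
```

Two of the four guards on x and t disappear; the guard after the store survives because nothing earlier in the trace still proves it.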
The quality of the JIT backend’s output is bounded by the quality of the optimizer’s output. A trace full of redundant type guards generates cluttered native code regardless of how good the code generator is. The tier-2 optimizer’s role is to ensure that the uop sequence reaching the backend is as clean as possible, with type information propagated forward and unnecessary branches removed.
3.13: The Copy-and-Patch Backend
The JIT backend that landed in Python 3.13 uses copy-and-patch, a technique introduced by Haoran Xu and Fredrik Kjolstad in their 2021 paper “Copy-and-Patch Compilation”. The key property is that LLVM runs at CPython’s build time, not at Python runtime. When you compile CPython, Clang compiles C implementations of each uop into machine code stencils: byte arrays with holes where runtime-specific values like addresses and jump targets will go. Those stencils are embedded in CPython’s binary and are present before any Python code executes.
At runtime, JIT compilation means copying stencils into executable memory and patching the holes. The compilation cost per trace is measured in microseconds. This is how CPython avoids the failure mode that destroyed Unladen Swallow: Google’s 2009-2011 attempt linked LLVM into CPython’s runtime and ran the full optimization pipeline per function, accumulating startup and memory overhead that Python’s typical workloads could not amortize. Copy-and-patch keeps LLVM out of the runtime entirely.
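The copy and patch steps can be sketched in a few lines. Everything here is an assumption made for illustration: the STENCILS table, the hole layout, and the byte sequences are invented, and the buffer is only built, never executed. Real stencils come out of Clang when CPython is built.

```python
import struct

# Hypothetical stencils: opaque code bytes plus (offset, name) holes.
HOLE = b"\x00" * 8  # placeholder for a 64-bit value filled in at JIT time

STENCILS = {
    "LOAD_CONST": {"code": b"\x48\xb8" + HOLE,
                   "holes": [(2, "operand")]},
    "JUMP": {"code": b"\x48\xba" + HOLE + b"\xff\xe2",
             "holes": [(2, "target")]},
}


def emit(trace):
    """Copy each stencil into one buffer, then patch its holes."""
    buf = bytearray()
    for name, values in trace:
        stencil = STENCILS[name]
        base = len(buf)
        buf += stencil["code"]                # copy
        for offset, key in stencil["holes"]:  # patch
            struct.pack_into("<Q", buf, base + offset, values[key])
    return bytes(buf)


code = emit([
    ("LOAD_CONST", {"operand": 0x1122334455667788}),
    ("JUMP", {"target": 0xDEADBEEF}),
])
print(len(code), code.hex())
```

No compiler runs inside emit: the work per uop is one byte copy and a handful of fixed-offset writes, which is why per-trace compilation cost stays so low.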
The 3.13 JIT was experimental and off by default. On most pyperformance benchmarks, it ran slower than the interpreter. That outcome was predictable given one specific structural problem.
The Structural Problem
Each stencil was a self-contained unit. It read its inputs from the Python evaluation stack, a C array in the frame object, and wrote its outputs back to the same stack. No stencil could pass a live value to the next stencil through a CPU register, because Clang compiled each stencil without knowledge of what adjacent stencils would do.
The result: in a loop body built from several arithmetic uops, every value crossing a uop boundary made a round trip through memory. The JIT eliminated interpreter dispatch overhead and simultaneously introduced memory round-trip overhead. On many benchmarks, these cancelled out.
The tier-2 optimizer computed liveness information: it knew which values flowed directly from one uop to the next, which were consumed immediately, which persisted across multiple steps. That information existed in the IR and went unused. The backend had no mechanism to translate liveness knowledge into register assignments across stencil boundaries, so the analysis was discarded at emit time.
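The liveness information in question can be illustrated with a small sketch. The uop representation below, a pair of the value defined and the values read, is hypothetical and much simpler than the optimizer’s real IR.

```python
def live_ranges(uops):
    """Map each value to (index where defined, index of last use)."""
    ranges = {}
    for i, (dest, srcs) in enumerate(uops):
        for src in srcs:
            if src in ranges:
                start, _ = ranges[src]
                ranges[src] = (start, i)  # extend to this use
        ranges[dest] = (i, i)
    return ranges


# Each uop: (value it defines, values it reads). Names are made up.
uops = [
    ("a", []),          # 0: load a
    ("b", []),          # 1: load b
    ("c", ["a", "b"]),  # 2: c = a + b
    ("d", ["c", "a"]),  # 3: d = c * a
    ("e", ["d"]),       # 4: e = -d
]
ranges = live_ranges(uops)

# Values consumed by the very next uop, versus values that persist longer.
immediate = [v for v, (start, end) in ranges.items() if end == start + 1]
persistent = [v for v, (start, end) in ranges.items() if end > start + 1]
print(immediate, persistent)  # ['b', 'c', 'd'] ['a']
```

In the 3.13 backend, b, c, and d would each be written to the evaluation stack by one stencil and read back by the next, even though nothing else ever looks at them.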
3.14: Cleaning Up the Traces
Python 3.14 improved the tier-2 optimizer’s type propagation, extending it through more operation sequences and across loop back-edges. More guards were eliminated before traces reached the JIT. This did not fix the register spill problem, but it reduced the clutter around it: cleaner traces meant fewer stencils per loop body, fewer boundary crossings, and fewer redundant memory round-trips from guard-related operations.
This incremental improvement mattered because it validated the optimizer pipeline and produced the cleaner uop sequences that register allocation in 3.15 would depend on. Register allocation applied to guard-heavy, cluttered traces delivers less benefit than the same allocator applied to clean traces. The 3.14 investment was not wasted; it established the conditions for 3.15’s work to have maximum impact.
3.15: Closing the Last Gap
Python 3.15 adds a linear-scan register allocator that processes the uop sequence before stencil emission. Linear-scan register allocation is a standard compiler technique: scan uops in order, track which values are live at each point using liveness intervals, assign physical registers to live values, and spill to the stack when register pressure exceeds the number of available registers. Its running time is essentially linear in the trace length, keeping compilation latency compatible with copy-and-patch’s fast warm-up model.
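A compact sketch of the textbook linear-scan algorithm, under stated assumptions: intervals are inclusive uop indices, registers are plain integers, and a value last read by the uop that defines a new value may hand over its register. This is the generic technique, not CPython’s allocator, and the list operations here favor readability over strict linear time.

```python
def linear_scan(intervals, num_regs):
    """Assign registers to live intervals, spilling when none is free.

    intervals: {value: (start, end)} with inclusive uop indices.
    Returns {value: register index, or "spill"}.
    """
    free = list(range(num_regs))
    active = []      # (end, value) pairs currently holding a register
    assignment = {}
    for value, (start, end) in sorted(intervals.items(),
                                      key=lambda kv: kv[1][0]):
        # Expire intervals whose last use is at or before this definition.
        for item in [a for a in active if a[0] <= start]:
            active.remove(item)
            free.append(assignment[item[1]])
        if free:
            assignment[value] = free.pop(0)
            active.append((end, value))
            continue
        # No register free: spill whichever live interval ends last.
        active.sort()
        last_end, victim = active[-1]
        if last_end > end:
            assignment[value] = assignment[victim]
            assignment[victim] = "spill"
            active[-1] = (end, value)
        else:
            assignment[value] = "spill"
    return assignment


# Hypothetical live intervals for a five-uop trace.
intervals = {"a": (0, 3), "b": (1, 2), "c": (2, 3), "d": (3, 4), "e": (4, 4)}
two_regs = linear_scan(intervals, 2)
one_reg = linear_scan(intervals, 1)
print(two_regs)
print(one_reg)
```

With two registers nothing spills; with one, the long-lived value a is the one pushed to memory, which is the usual outcome of the "spill the interval that ends last" heuristic.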
Implementing this required extending the stencil format. The existing hole mechanism handled runtime constants and addresses; 3.15 adds hole kinds for register operand encodings. At CPython build time, the stencil compiler emits additional holes wherever an instruction operand register appears in the encoding. At JIT time, the allocator’s register assignments get patched into those holes alongside the usual constant and address patches.
Early benchmark results from the development work show 5 to 15 percent improvements over the non-JIT baseline on typical pyperformance workloads, and 20 to 30 percent on compute-heavy benchmarks like nbody and spectral_norm. The compound effect of 3.14’s cleaner traces and 3.15’s register allocator is what the 3.13 architecture was designed to eventually deliver.
What the Pipeline Still Does Not Do
Even with values staying in registers across loop uops, the addition of two Python integers still calls _PyLong_Add on heap-allocated PyLong objects. Unboxing integers to raw machine words requires escape analysis to prove a value never needs to be observed as a Python object; that work is ongoing and extends past 3.15.
PyPy performs this unboxing and runs integer loops 5 to 10 times faster than CPython. CPython’s constraint is maintaining full compatibility with the C extension ecosystem: a large share of widely used packages ship C extension components that depend on CPython’s object layout in ways that PyPy’s alternative object model cannot satisfy without compatibility shims. CPython’s JIT is not competing with PyPy on numeric benchmarks; it is trying to make CPython meaningfully faster than CPython’s own interpreter on general-purpose Python code, while preserving that compatibility guarantee.
For that narrower goal, 3.15 is the first release where the answer to “should I run the JIT?” is probably yes rather than no for most workloads. The pipeline from specializing interpreter to tier-2 optimizer to copy-and-patch backend to register allocator is, for the first time, complete enough to run end-to-end and produce net positive results. Ken Jin’s post is not announcing a breakthrough; it is reporting that a carefully sequenced construction project finally has all its pieces in place.