Before CPython's JIT Could Matter, It Needed a Tier-2 Optimizer

When Python 3.13 shipped with a JIT compiler, the reception was enthusiastic; the benchmarks that followed were less so. The JIT, when enabled, ran at roughly the same speed as the interpreter, sometimes marginally slower. For something that required Clang at build time and years of development work, that was a deflating result, and it demanded an explanation.

The article making the rounds this week argues that Python 3.15 is where the JIT finally gets the infrastructure it needed. To understand why that matters, it helps to look at how CPython’s execution model is actually layered, because the JIT is not a standalone component. It is the top tier of a three-tier system, and the tiers below it have to work before the top tier can deliver anything.

Three Tiers of Execution

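With 3.13, CPython gained what can be described as a three-tier execution model. The core team's own documentation speaks of tier 1 and tier 2 and treats the JIT as a backend for tier 2, but counting the code generator as a tier of its own makes the dependencies easier to see.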

Tier one is the Specializing Adaptive Interpreter, introduced in Python 3.11. As bytecodes execute repeatedly with consistent types, the interpreter replaces generic instructions with specialized variants: LOAD_ATTR becomes LOAD_ATTR_SLOT, BINARY_OP becomes BINARY_OP_ADD_INT. Each specialization is guarded; if the type assumption breaks, execution falls back to the generic form. This alone delivered the bulk of the speedups in 3.11 and 3.12.
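Tier one can be observed from Python itself with the standard dis module, which accepts adaptive=True from Python 3.11 onward. The following is a minimal sketch, not a benchmark: the helper name specialized_opnames and the warm-up count are assumptions made for illustration, and the exact opcode names differ between CPython versions.

```python
import dis
import sys


def add_ints(a, b):
    return a + b


def specialized_opnames(func, warmup=1000):
    """Call func with small ints until it is warm, then list its opcode names."""
    for _ in range(warmup):
        func(1, 2)
    if sys.version_info >= (3, 11):
        # adaptive=True reports specialized instructions if the interpreter installed any
        instructions = dis.get_instructions(func, adaptive=True)
    else:
        # Older interpreters have no specializing tier and no adaptive flag
        instructions = dis.get_instructions(func)
    return [ins.opname for ins in instructions]
```

Calling print(specialized_opnames(add_ints)) on a recent CPython will typically show a specialized form such as BINARY_OP_ADD_INT where a cold function shows BINARY_OP. Treat the names as version-dependent.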

Tier two is the micro-op optimizer, also called the tier-2 optimizer. When a sequence of specialized bytecodes executes enough times, CPython can translate it into a trace of lower-level micro-ops (usually abbreviated to uops) and run optimization passes over that trace before handing it off for compilation. This is where type inference, guard elimination, and dead code removal happen. Mark Shannon has been the primary architect of this layer.

Tier three is the JIT code generator: Brandt Bucher’s copy-and-patch compiler, which takes the optimized micro-op sequence from tier two and emits native machine code by copying pre-compiled stencils and patching in concrete values.

The problem in 3.13 was that tier two was not complete. The framework existed, but the optimization passes that make it useful were not there. The JIT was receiving micro-op sequences that had not been optimized and generating code from them directly, which meant it was essentially concatenating pre-compiled code fragments with no cross-fragment optimization. Tier three cannot compensate for what tier two fails to do.

What Unoptimized Traces Cost

Copy-and-patch works by pre-compiling machine code stencils at build time using LLVM, then at runtime copying those stencils into a buffer and patching the holes with runtime values: addresses, constants, type pointers. Each stencil corresponds to one micro-op. The generated code is a concatenation of stencils, with no whole-trace compilation step.
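The copy-then-patch step can be sketched in a few lines of Python. Everything here is hypothetical: the Stencil class, the stencil names, and the three byte sequences are hand-written for illustration and are not CPython's generated stencils. The sketch only builds the byte buffer and does not execute it.

```python
import struct
from dataclasses import dataclass


@dataclass(frozen=True)
class Stencil:
    code: bytes   # machine code with zeroed 8-byte holes
    holes: tuple  # (offset, operand_name) pairs, one per hole


# Hypothetical stencil table; real stencils are produced by LLVM at build time.
STENCILS = {
    # movabs rax, <imm64>
    "LOAD_IMM": Stencil(b"\x48\xb8" + bytes(8), ((2, "value"),)),
    # movabs rcx, <imm64>; add rax, rcx
    "ADD_IMM": Stencil(b"\x48\xb9" + bytes(8) + b"\x48\x01\xc8", ((2, "value"),)),
    # ret
    "RETURN": Stencil(b"\xc3", ()),
}


def emit(trace):
    """Copy each stencil into one buffer, then patch its holes in place."""
    buf = bytearray()
    for name, operands in trace:
        stencil = STENCILS[name]
        base = len(buf)
        buf += stencil.code                    # copy
        for offset, operand in stencil.holes:  # patch
            struct.pack_into("<Q", buf, base + offset, operands[operand])
    return bytes(buf)


code = emit([
    ("LOAD_IMM", {"value": 40}),
    ("ADD_IMM", {"value": 2}),
    ("RETURN", {}),
])
print(code.hex())
```

Note what emit never does: it never looks at two stencils together. Each one is pasted and patched on its own, which is why anything that spans stencil boundaries has to be decided before this step.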

Without a register allocator operating across stencil boundaries, every stencil writes its outputs back to memory and the next stencil reloads them. The generated code for a tight numeric loop looks roughly like this (illustrative x86-64, not actual CPython output):

; what copy-and-patch produces without cross-stencil register allocation
mov  rax, QWORD PTR [rsp+8]   ; reload left operand
mov  rcx, QWORD PTR [rsp+16]  ; reload right operand
add  rax, rcx
mov  QWORD PTR [rsp+8], rax   ; spill result for next stencil

A compiler with a register allocator keeps the values alive:

; what a register allocator enables
add  rax, rcx

On a loop running millions of iterations, the difference in load/store pressure is measurable. The spills are not expensive individually, but they consume instruction throughput and occupy memory ports that could be doing arithmetic.

Without type inference propagating across the trace, the situation is similarly wasteful. A guard at position three in the trace may establish that a value is a Python integer, but the stencil at position seven will check again independently. In a tight loop, the same type check may run four or five times per iteration where once would suffice. Beyond redundant checks, the absence of type information prevents unboxed arithmetic: if the JIT cannot prove that both operands of an addition are small integers, it cannot skip the object header and do the math directly on the raw values.

What 3.15 Fixes

The 3.15 work, as described in the article, centers on completing the tier-two optimizer with two capabilities.

First, abstract type inference over the micro-op sequence. Before code generation, an analysis pass walks the trace forward and propagates what is known about each value: is it an integer, a float, a known constant, a reference that is guaranteed to stay alive? This information feeds into guard elimination and eventually into code generation decisions. A guard whose precondition is already established by an earlier guard in the same trace can be removed. A value known to be an integer can be handled with integer-specific code paths.
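A toy version of that forward pass fits in a few lines. This is a sketch under stated assumptions, not CPython's optimizer: the op names (LOAD_UNKNOWN, GUARD_INT, ADD_INT) and the tuple encoding of the trace are invented for illustration, and the only type tracked is "int".

```python
def eliminate_redundant_guards(trace):
    """Walk the trace forward, tracking known types and dropping proven guards."""
    known = {}      # variable name -> inferred type name
    optimized = []
    for op, *args in trace:
        if op == "GUARD_INT":
            (var,) = args
            if known.get(var) == "int":
                continue              # already established earlier in the trace
            known[var] = "int"        # the guard itself establishes the fact
        elif op == "ADD_INT":
            dst, _lhs, _rhs = args
            known[dst] = "int"        # int + int produces an int
        elif op == "LOAD_UNKNOWN":
            (dst,) = args
            known.pop(dst, None)      # nothing is known about a fresh value
        optimized.append((op, *args))
    return optimized


trace = [
    ("LOAD_UNKNOWN", "x"),
    ("LOAD_UNKNOWN", "y"),
    ("GUARD_INT", "x"),
    ("GUARD_INT", "y"),
    ("ADD_INT", "t", "x", "y"),
    ("GUARD_INT", "x"),           # redundant: x was guarded above
    ("GUARD_INT", "t"),           # redundant: t is the result of ADD_INT
    ("ADD_INT", "u", "t", "x"),
]

optimized = eliminate_redundant_guards(trace)
for op in optimized:
    print(op)
```

The eight-op trace shrinks to six, and both surviving guards sit at the top. A real pass tracks a richer lattice (constants, floats, reference lifetimes) and must invalidate facts whenever an op could change them, but the forward-propagation shape is the same.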

Second, a register allocator that operates across stencil boundaries. This requires extending the stencil system, since the stencils are pre-compiled with fixed register encodings. The solution involves either generating multiple stencil variants for different register assignments or adding a relocation scheme flexible enough to rewrite register encodings at patch time. This is the architecturally expensive part of the work, which is why it was not in 3.13.
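The stencil-variant option can be illustrated with a deliberately small model. It assumes a linear chain of dependent adds in which each op consumes the previous result, and the variant table and its names are hypothetical. It is not a register allocator, only a sketch of how choosing among precompiled variants changes the memory traffic.

```python
from itertools import product

# Hypothetical variant table: one precompiled body per
# (where the input lives, where the output goes).
VARIANTS = {
    (src, dst): f"add[{src}->{dst}]"
    for src, dst in product(("mem", "reg"), repeat=2)
}


def memory_ops(variant_key):
    """Count the loads and stores implied by one variant."""
    src, dst = variant_key
    return (src == "mem") + (dst == "mem")


def select_variants(n_ops, cross_stencil):
    """Pick one variant per op in a chain of n_ops dependent adds.

    Values entering and leaving the trace always live in memory. With
    cross_stencil=False every intermediate value also goes through memory.
    """
    chosen = []
    for i in range(n_ops):
        first, last = i == 0, i == n_ops - 1
        src = "mem" if (first or not cross_stencil) else "reg"
        dst = "mem" if (last or not cross_stencil) else "reg"
        chosen.append((src, dst))
    return chosen


for cross in (False, True):
    picks = select_variants(4, cross_stencil=cross)
    print([VARIANTS[p] for p in picks], sum(memory_ops(p) for p in picks))
```

For a chain of four adds the memory operations drop from eight to two. The cost is visible in VARIANTS: each micro-op needs several precompiled bodies instead of one, and the table grows quickly once more than one register is in play.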

The Faster CPython team’s design documents have flagged register allocation as the main prerequisite for meaningful JIT speedups since before 3.13 shipped. The 3.13 JIT was always intended as a foundation, not a finished product. Bucher’s stated goal was to get the copy-and-patch infrastructure into the tree and iterate. That was the right call; the alternative was a much larger and riskier initial change.

Where This Leaves the Performance Trajectory

The realistic comparison here is not PyPy. PyPy’s tracing JIT has been production-grade for over a decade and includes escape analysis, method inlining, and backend optimizations that CPython is not attempting in 3.15. PyPy can be two to ten times faster than CPython on CPU-bound benchmarks that stay in Python. Closing that gap is not the near-term goal.

The near-term goal is a positive result on the pyperformance benchmark suite with the JIT enabled, something consistent enough to justify enabling it by default in a future release. The 3.13 JIT did not achieve that; the geometric mean across pyperformance was roughly flat or slightly negative. With register allocation and type inference completing the tier-two optimizer, the expectation is a measurable positive result, likely in the range of ten to twenty percent on benchmarks that are CPU-bound and spend significant time in hot Python loops.
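For readers unfamiliar with the metric, a geometric mean over per-benchmark speedup ratios can come out flat even when individual benchmarks move a lot. The numbers and benchmark names below are invented purely to show the arithmetic and are not pyperformance results.

```python
from statistics import geometric_mean

# Made-up ratios (interpreter time / JIT time); above 1.0 means the JIT build is faster.
ratios = {
    "bench_a": 1.30,
    "bench_b": 0.95,
    "bench_c": 1.00,
    "bench_d": 0.80,
}

overall = geometric_mean(list(ratios.values()))
print(f"geometric mean speedup: {overall:.3f}")
```

One benchmark thirty percent faster and one twenty percent slower nearly cancel, which is how a suite-wide number can read "roughly flat" while individual workloads see real changes in either direction.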

Code that is I/O bound or that spends most of its time in C extensions will see little change regardless. The JIT is a last-mile optimization for the Python-executing portion of the runtime, not a transformation of the whole system.

For Python developers, the practical signal from 3.15 is whether the JIT moves from being an experimental build flag to something that ships enabled. If the register allocator and type inference land and perform as expected, that transition becomes justifiable. After two releases of a JIT that technically existed but did not help, that would be a meaningful change in what CPython actually delivers.
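Whether a given interpreter has the JIT built in, and whether it is switched on, can be checked at runtime. The sketch below relies on the sys._jit namespace added in Python 3.14. It is a private, underscore-prefixed API that may change, so the helper (jit_status is a name chosen here) falls back gracefully when the namespace is absent.

```python
import sys


def jit_status():
    """Describe the JIT state of the running interpreter in one short phrase."""
    jit = getattr(sys, "_jit", None)
    if jit is None:
        return "unknown (no sys._jit on this interpreter)"
    if not jit.is_available():
        return "not built in"
    return "enabled" if jit.is_enabled() else "built in but disabled"


print(jit_status())
```

On interpreters older than 3.14 this reports "unknown", since the namespace does not exist there.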