
CPython's JIT Was Designed to Ship Slow, and Three Releases Later That Looks Correct

Source: hackernews

Python 3.15’s JIT reaching a point where it can genuinely outperform the interpreter it sits on top of deserves more context than “the register allocator finally landed.” The more interesting question is why CPython’s team shipped a JIT that was slower than the interpreter for two consecutive releases, and whether that was a mistake or a deliberate architectural decision with a specific payoff timeline.

Ken Jin’s post on the JIT being back on track describes the technical progress. The design philosophy that produced a JIT shipping as experimental in 3.13, remaining so in 3.14, and approaching production quality in 3.15 is worth examining on its own terms, because it reflects a specific lesson the CPython team drew from prior failed attempts.

The Constraint That Shaped Everything

The decision that shaped the entire architecture of CPython’s JIT was the choice not to use LLVM at runtime.

Unladen Swallow, Google’s 2009-2011 attempt to speed up CPython, used LLVM as a JIT backend. It compiled Python bytecode to LLVM IR and ran LLVM’s optimization pipeline at runtime. The theory was plausible: LLVM is a mature, well-tested optimizing compiler capable of dead code elimination, register allocation, and loop vectorization. Why write all of that from scratch?

The theory ran into Python’s dynamic semantics. LLVM’s optimization passes extract value from knowing that an integer is an integer, a pointer is a pointer, and function calls have bounded side effects. Python guarantees none of these things. The passes ran and found little to optimize. Beyond performance, LLVM itself was a substantial runtime dependency: it needed to be linked into CPython, it had its own startup and memory overhead, and managing a running LLVM instance alongside CPython’s garbage collector created subtle interactions that were hard to contain. The project was abandoned in 2011 having never merged into CPython.

PEP 744, which describes the JIT that shipped in CPython 3.13, was designed around that failure mode. The copy-and-patch approach, borrowed from a 2021 OOPSLA paper by Haoran Xu and Fredrik Kjolstad, uses LLVM only at build time. Offline, when CPython is compiled, LLVM compiles each micro-operation into a machine code stencil: a template with holes where runtime-specific values like addresses, constants, and branch targets will go. Those stencils are embedded in the CPython binary. At runtime, JIT compilation copies stencils into executable memory and patches the holes. No LLVM at runtime, no startup cost, no dependency to manage.
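The copy-and-patch step can be pictured with a toy model. The sketch below is only an illustration of the idea: the stencil bytes are made up, the names `Stencil`, `STENCILS`, and `compile_trace` are hypothetical, and the output is an ordinary byte string rather than executable memory.

```python
import struct

HOLE = b"\x00" * 8  # placeholder for a 64-bit value known only at runtime


class Stencil:
    """A pre-built code template plus the offsets of its holes."""

    def __init__(self, code: bytes, hole_offsets: list):
        self.code = code
        self.hole_offsets = hole_offsets


def make_stencil(prefix: bytes, suffix: bytes) -> Stencil:
    # One hole sits between the (made-up) prefix and suffix bytes.
    return Stencil(prefix + HOLE + suffix, [len(prefix)])


# In the real JIT these templates come from LLVM at build time.
# Here the bytes are arbitrary and the uop names are illustrative.
STENCILS = {
    "LOAD_CONST": make_stencil(b"\xAA\x01", b"\xC0"),
    "ADD_CONST": make_stencil(b"\xBB\x02", b"\xC1"),
}


def compile_trace(trace):
    """Copy each stencil into a buffer, then patch its holes."""
    buf = bytearray()
    for name, value in trace:
        stencil = STENCILS[name]
        base = len(buf)
        buf += stencil.code  # copy
        for offset in stencil.hole_offsets:
            struct.pack_into("<Q", buf, base + offset, value)  # patch
    return bytes(buf)


code = compile_trace([("LOAD_CONST", 7), ("ADD_CONST", 35)])
```

Nothing in `compile_trace` resembles an optimizer: the runtime work is a memory copy and a handful of integer writes, which is where the fast compile time comes from.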

This is the correct response to Unladen Swallow, and the tradeoff it introduces is precisely what kept the 3.13 JIT from delivering benchmark improvements.

What Stencil Boundaries Cost

When LLVM compiles a stencil, it sees exactly one micro-operation in isolation. It can optimize within the stencil, choosing good instruction sequences and constant-folding values known at build time, but it cannot optimize across stencil boundaries, because those boundaries represent decisions made at JIT-compile time. In particular, it cannot know which registers will be live when execution enters the stencil, because that depends on what the preceding stencil did.

The consequence is that every value crossing a stencil boundary must travel through memory. A stencil computes a result, writes it to a stack slot, and falls through. The next stencil loads from that stack slot. Store-load forwarding on modern CPUs handles this without full cache latency, but the stores and loads still consume execution ports and expand the instruction footprint of the emitted code.

Consider a simple accumulation loop:

def sum_range(n):
    total = 0
    for i in range(n):
        total += i
    return total

A traditional optimizing JIT with global register allocation compiles the inner loop body to something close to four instructions: add i to total, increment i, compare against n, and branch. Both total and i stay in the register file across the entire loop body, so no loads or stores are needed.

The 3.13 JIT, stitching stencils without inter-stencil register awareness, emits code where total and i are loaded from stack slots at the start of each uop and stored back at the end. Each micro-operation that references a value adds a load; each micro-operation that produces a value adds a store. The loop body expands considerably. The CPU’s out-of-order execution engine and store-load forwarding buffer hide some of the cost, but not all of it.

This is not a bug in the implementation. It is the direct cost of the design constraint: isolated stencils compiled offline, enabling fast runtime JIT compilation without LLVM, at the expense of inter-operation register sharing.
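A rough way to see the size of that cost is to count memory operations for a single `total += i` under both models. The following is a toy accounting model, not a description of the emitted machine code: the uop names are illustrative, and it assumes each value is consumed exactly once.

```python
# Hypothetical uop trace for one `total += i`.
# Each entry: (name, values read, values written)
TRACE = [
    ("LOAD_TOTAL", [], ["t"]),
    ("LOAD_I", [], ["i"]),
    ("ADD", ["t", "i"], ["sum"]),
    ("STORE_TOTAL", ["sum"], []),
]


def memory_ops_without_sharing(trace):
    """Every input is loaded from a stack slot; every output is stored to one."""
    loads = sum(len(reads) for _, reads, _ in trace)
    stores = sum(len(writes) for _, _, writes in trace)
    return loads, stores


def memory_ops_with_sharing(trace, num_registers):
    """Outputs stay in a register across the boundary while one is free."""
    in_register = set()
    loads = stores = 0
    for _, reads, writes in trace:
        for value in reads:
            if value in in_register:
                in_register.discard(value)  # consumed; register is free again
            else:
                loads += 1  # value was spilled, reload it
        for value in writes:
            if len(in_register) < num_registers:
                in_register.add(value)
            else:
                stores += 1  # no free register: spill to a stack slot
    return loads, stores


baseline = memory_ops_without_sharing(TRACE)
shared = memory_ops_with_sharing(TRACE, num_registers=2)
```

In this model the isolated-stencil version performs three loads and three stores, while two shared registers are enough to remove all of them.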

Why the Tier-2 Optimizer Matters Here

The JIT does not operate directly on Python bytecode. It sits at the end of a three-stage pipeline: the Specializing Adaptive Interpreter from Python 3.11 handles tier-1 type specialization, the micro-op optimizer handles tier-2 trace analysis and optimization, and the JIT backend generates native code from the optimized tier-2 trace.

The tier-2 optimizer runs type propagation, guard elimination, dead code removal, and constant folding over the micro-op trace before the JIT sees it. Its output quality directly determines JIT code quality: a trace full of redundant type guards means more stencils stitched together, more boundary crossings, more store-load traffic, regardless of whether inter-stencil register allocation exists. The tier-2 optimizer can remove an entire class of that overhead by proving type stability and eliminating the guards that check it.
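Guard elimination is the easiest of these passes to picture. Here is a minimal sketch, assuming a simplified trace of `(op, variable, type)` tuples; the op names `GUARD_TYPE`, `ADD_INT`, and `STORE_UNKNOWN` are hypothetical and CPython's real optimizer is considerably more involved.

```python
def eliminate_redundant_guards(trace):
    """Drop a type guard when an earlier uop already proved that type."""
    known_types = {}  # variable name -> type proven so far
    optimized = []
    for op, var, type_name in trace:
        if op == "GUARD_TYPE":
            if known_types.get(var) == type_name:
                continue  # already proven: the guard is redundant
            known_types[var] = type_name
        elif op == "ADD_INT":
            known_types[var] = "int"  # the result type is known
        elif op == "STORE_UNKNOWN":
            known_types.pop(var, None)  # type can no longer be assumed
        optimized.append((op, var, type_name))
    return optimized


trace = [
    ("GUARD_TYPE", "total", "int"),
    ("GUARD_TYPE", "i", "int"),
    ("ADD_INT", "total", "int"),
    ("GUARD_TYPE", "total", "int"),  # redundant
    ("GUARD_TYPE", "i", "int"),      # redundant
    ("ADD_INT", "total", "int"),
]
optimized = eliminate_redundant_guards(trace)
```

Every guard removed here is one fewer stencil in the stitched output, and therefore one fewer boundary for values to cross.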

Python 3.14 improved the tier-2 optimizer substantially, extending type inference across more operation sequences and eliminating more guards before traces reach the JIT. Part of what makes 3.15’s register allocation work worthwhile is that the optimizer now produces cleaner traces for it to work with. Landing register allocation on top of the 3.13 optimizer, which produced noisier traces, would have delivered smaller gains.

Register Allocation for a Stencil JIT

Adding register allocation to a stencil-based JIT is structurally different from how register allocation works in a conventional JIT. In TurboFan, GraalVM, or HotSpot’s C2, the compiler constructs a full intermediate representation for the compiled region, computes live intervals for all values, and runs a global allocator that assigns physical registers to those intervals. Every value that is live across multiple operations stays in a register for its entire lifetime.

For CPython’s stencil model, the tier-2 optimizer already holds the full trace as a data structure, which means liveness information across uop boundaries is computable. The approach Ken Jin describes involves propagating that liveness information into the JIT backend, so that when two adjacent stencils share a live value, the backend can arrange for that value to stay in an agreed-upon register rather than going through a stack slot. The stencils themselves may need variants compiled for different register assignments, which increases the stencil table size but eliminates the systematic memory traffic at boundaries.
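Because the trace is linear, the liveness information in question can be computed with a single backward pass. The sketch below shows that standard pass on a hypothetical four-uop trace. It is not the CPython implementation, and the uop names and tuple layout are assumptions made for illustration.

```python
# Each entry: (name, values read, values written)
TRACE = [
    ("LOAD_TOTAL", [], ["t"]),
    ("LOAD_I", [], ["i"]),
    ("ADD", ["t", "i"], ["sum"]),
    ("STORE_TOTAL", ["sum"], []),
]


def live_across_boundaries(trace):
    """For each boundary between uop k and uop k+1, return the set of
    values that are defined before the boundary and read after it."""
    live = set()
    result = [None] * (len(trace) - 1)
    for k in range(len(trace) - 1, 0, -1):
        _, reads, writes = trace[k]
        live -= set(writes)  # defined here, so not live before this uop
        live |= set(reads)   # read here, so live before this uop
        result[k - 1] = set(live)
    return result


boundaries = live_across_boundaries(TRACE)
max_pressure = max(len(values) for values in boundaries)
```

The values in each boundary set are the candidates for staying in a register instead of a stack slot, and `max_pressure` gives a lower bound on how many registers the stencil variants would need to cover this trace without spilling.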

This is not full global register allocation as a traditional compiler understands it, but for the workload CPython’s JIT targets, which is hot inner loops with a small number of live values, it addresses the dominant source of overhead.

Why Shipping Experimental Was Correct

Shipping the 3.13 JIT as experimental-and-off-by-default, and keeping it that way through 3.14, draws criticism from observers who expected the feature to be production-ready sooner. That criticism misreads what the experimental releases accomplished.

The 3.13 JIT validated the entire stencil compilation model in production conditions: trace recording, guard semantics, deoptimization paths back to the interpreter, interaction with the garbage collector and the C extension object model, and behavior across the platforms the JIT supports, including x86-64 and AArch64. Code that ran correctly without the JIT continued to run correctly on builds configured with --enable-experimental-jit. Users who opted in could evaluate behavior without production risk. The correctness foundation was laid completely before performance optimization began.
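For readers who want to check what their own interpreter is doing, CPython 3.14 added a `sys._jit` namespace for introspection. The helper below is a small sketch: `jit_status` is a name invented here, and because `sys._jit` is a CPython implementation detail that is absent on older versions and other implementations, the code probes for it defensively.

```python
import sys


def jit_status():
    """Report whether this interpreter has a JIT and whether it is on."""
    jit = getattr(sys, "_jit", None)  # CPython 3.14+ only
    if jit is None:
        return "unknown (no sys._jit on this interpreter)"
    if not jit.is_available():
        return "not built in"
    if jit.is_enabled():
        return "enabled"
    return "built in but disabled"


print(jit_status())
```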

The alternative, building full register allocation into the initial implementation before shipping, would have meant shipping a more complex system that had not been validated in production. The history of Unladen Swallow is precisely a story of complexity accumulating before validation, producing a codebase that was hard to debug and harder to merge.

Brandt Bucher and the Faster CPython team made a specific engineering judgment: ship the correctness story first, defer the performance story, and address them sequentially on a validated base. That judgment is visible in the pyperformance results for 3.13, where a build configured with --enable-experimental-jit runs slower than one without it, and it is also visible in the fact that 3.15 does not need to revisit the correctness story at all. The work in 3.15 is purely additive performance improvement on top of code that is already known to be correct.

What 3.15 Will and Will Not Fix

Register allocation will have its largest impact on tight numeric loops: integer accumulations, range iterations, short inner loops in data processing code. These are the cases where systematic store-load traffic across stencil boundaries has the highest proportional cost. The pyperformance suite benchmarks like richards, spectral_norm, and scimark should see positive deltas for the first time since the JIT became a build option.

What register allocation does not fix: Python’s boxing overhead. Even when total and i stay in registers across the loop body, total += i in unspecialized Python involves type checks, unboxing both operands to C integers, performing the addition, checking for overflow, and boxing the result back into a Python object. Eliminating that overhead requires the tier-2 optimizer to prove type stability strongly enough to specialize the trace, which is ongoing work that extends past 3.15.
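The sequence of steps hidden inside one `total += i` can be spelled out explicitly. This is a toy model only: `BoxedInt` is a hypothetical stand-in for a heap-allocated integer object, and the 64-bit overflow check mirrors the description above rather than CPython's actual integer representation.

```python
INT64_MIN, INT64_MAX = -(2**63), 2**63 - 1


class BoxedInt:
    """Stand-in for a heap-allocated integer object (hypothetical)."""

    __slots__ = ("value",)

    def __init__(self, value):
        self.value = value


def boxed_add(a, b):
    # 1. Type checks (the guards a specialized trace tries to eliminate).
    if type(a) is not BoxedInt or type(b) is not BoxedInt:
        raise TypeError("slow path: generic binary add")
    # 2. Unbox both operands to raw integers.
    x, y = a.value, b.value
    # 3. The addition itself: the only step a C compiler would emit.
    result = x + y
    # 4. Overflow check.
    if not INT64_MIN <= result <= INT64_MAX:
        raise OverflowError("slow path: arbitrary-precision add")
    # 5. Box the result, allocating a new object.
    return BoxedInt(result)


total = boxed_add(BoxedInt(2), BoxedInt(3))
```

Register allocation only affects where the references `a` and `b` live between operations. Steps 1, 2, 4, and 5 remain until the optimizer can prove they are unnecessary.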

PyPy’s tracing JIT handles boxing by representing small integers as unboxed machine integers in JIT-compiled loops by default, using escape analysis to determine when allocation can be deferred or eliminated. CPython’s JIT will remain behind PyPy on pure numeric benchmarks for some time. The constraint CPython operates under is the C extension ecosystem: the tens of thousands of packages with C extension components built against CPython’s internal object representation cannot be ported to PyPy’s object model without a compatibility shim that itself has overhead. CPython’s JIT does not need to beat PyPy; it needs to make CPython meaningfully faster than its own interpreter on code that does not touch C extensions, and 3.15 is where that becomes plausible for the first time.

The deliberate conservatism of the 3.13 and 3.14 releases looks, in retrospect, like the right sequencing. The register allocation work landing in 3.15 builds on two releases of production validation rather than on an optimistic first-draft design. That is a slower path to positive benchmarks, but a more sustainable one.
