
Getting to Break-Even: The Real Engineering Story Behind CPython's JIT

Source: hackernews

When Python 3.13 shipped with an experimental JIT compiler behind a build flag, benchmarks on pyperformance came back near-neutral, sometimes mildly negative. The internet treated this as a disappointment. It was not. It was the predictable outcome of a multi-year architectural choice working out as designed, with one large piece still missing.

That missing piece arrives in Python 3.15, and Ken Jin’s write-up explains what changed. The short answer is register allocation. The long answer involves understanding what CPython’s JIT actually does and why the benchmarks in 3.13 were telling the truth about a real constraint, not revealing a design failure.

The Three Tiers You Are Actually Benchmarking Against

Before looking at the JIT itself, it helps to understand what it competes against. CPython has run a specializing adaptive interpreter since Python 3.11 (PEP 659). This first tier counts how often each instruction runs and, once the code is warm (after roughly eight executions), replaces generic opcodes with type-specialized variants. A generic BINARY_OP becomes BINARY_OP_ADD_INT once the interpreter has seen integer inputs enough times. A LOAD_ATTR on an instance becomes LOAD_ATTR_INSTANCE_VALUE with an inline cache that bypasses dictionary lookup entirely.
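The specialization described above can be observed from Python itself. The following is a minimal sketch, assuming CPython 3.11 or newer, where `dis.get_instructions` accepts `adaptive=True`; the function `add`, the helper `opnames_after_warmup`, and the warm-up count are illustrative names and values, and the specialized opcode names you see will vary between releases.

```python
import dis
import sys


def add(a, b):
    return a + b


def opnames_after_warmup(func, args, calls=1000):
    """Call func repeatedly, then list the opcode names now in its bytecode."""
    for _ in range(calls):
        func(*args)
    if sys.version_info >= (3, 11):
        # adaptive=True asks dis to report specialized (quickened) instructions.
        instructions = dis.get_instructions(func, adaptive=True)
    else:
        # Older interpreters have no adaptive bytecode to report.
        instructions = dis.get_instructions(func)
    return [ins.opname for ins in instructions]


if __name__ == "__main__":
    print(opnames_after_warmup(add, (1, 2)))
```

On a recent CPython the addition usually appears under a specialized name instead of the generic BINARY_OP, though tracing or profiling tools can suppress specialization.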

Python 3.11 measured roughly 25% faster than 3.10 on pyperformance, and this specialization was a major contributor to the speedup attributed to the Faster CPython project. It ran without any JIT, purely by rewriting instructions in place as types became predictable.

The Tier 2 optimizer, added in 3.13 alongside the JIT, decomposes these specialized bytecodes into micro-operations (uops). BINARY_OP_ADD_INT becomes _GUARD_BOTH_INT followed by _BINARY_OP_ADD_INT. The optimizer traces these uop sequences across basic block boundaries, runs type propagation and guard elision passes, and feeds the resulting superblocks to the JIT backend.
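To make the decomposition and guard elision concrete, here is a toy model in Python. It is a sketch under stated assumptions, not CPython's optimizer: the `EXPANSION` table, the `expand` and `elide_guards` helpers, and the symbolic-value bookkeeping are all hypothetical, and only the BINARY_OP_ADD_INT expansion mirrors the uop names mentioned above.

```python
from itertools import count

# Hypothetical expansion table: specialized bytecode -> uops.
# Only the BINARY_OP_ADD_INT row uses names taken from the article.
EXPANSION = {
    "BINARY_OP_ADD_INT": ["_GUARD_BOTH_INT", "_BINARY_OP_ADD_INT"],
    "LOAD_FAST": ["_LOAD_FAST"],
    "STORE_FAST": ["_STORE_FAST"],
}

# Bytecode for: t = a + b; u = t + a
TRACE = [
    ("LOAD_FAST", "a"), ("LOAD_FAST", "b"), ("BINARY_OP_ADD_INT", None),
    ("STORE_FAST", "t"),
    ("LOAD_FAST", "t"), ("LOAD_FAST", "a"), ("BINARY_OP_ADD_INT", None),
    ("STORE_FAST", "u"),
]


def expand(bytecode):
    """Decompose each specialized instruction into its uops."""
    return [(uop, arg) for opname, arg in bytecode for uop in EXPANSION[opname]]


def elide_guards(uops):
    """Drop _GUARD_BOTH_INT when both operands are already proven to be ints."""
    fresh = count()
    known_int = set()   # symbolic values proven to be int
    local_value = {}    # local variable name -> symbolic value
    stack = []          # abstract value stack of symbolic values
    out = []
    for uop, arg in uops:
        if uop == "_LOAD_FAST":
            if arg not in local_value:
                local_value[arg] = next(fresh)  # unknown type so far
            stack.append(local_value[arg])
        elif uop == "_STORE_FAST":
            local_value[arg] = stack.pop()
        elif uop == "_GUARD_BOTH_INT":
            left, right = stack[-2], stack[-1]
            if left in known_int and right in known_int:
                continue  # redundant guard: do not emit it
            known_int.update((left, right))  # guard passed, so both are ints
        elif uop == "_BINARY_OP_ADD_INT":
            stack.pop()
            stack.pop()
            result = next(fresh)
            known_int.add(result)  # int + int produces an int
            stack.append(result)
        out.append((uop, arg))
    return out


if __name__ == "__main__":
    for uop, arg in elide_guards(expand(TRACE)):
        print(uop, arg if arg is not None else "")
```

In this example the first guard stays because nothing is known about `a` and `b`, while the second disappears: `t` was produced by an integer add and `a` has already been checked.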

The JIT therefore competes against an interpreter that is already type-specialized, already running guard-elided instruction variants, and already operating with inline caches. That is a much harder baseline to beat than naive CPython.

Copy-and-Patch: Fast Compilation, Constrained Output

CPython’s JIT uses the copy-and-patch technique introduced in a 2021 OOPSLA paper by Haoran Xu and Fredrik Kjolstad, who demonstrated it on a compiler for a high-level C-like language and a compiler for WebAssembly. The mechanism has two phases.

At CPython’s build time, Clang compiles C implementations of each uop into native machine code templates called stencils. These stencils contain “holes,” which are placeholders for values known only at runtime: object addresses, jump targets, cache offsets, immediate constants. LLVM is involved only at build time; it is not shipped as a runtime dependency.

At runtime, JIT compilation means copying a sequence of stencil bytes into an executable memory buffer and patching the holes with concrete values. No IR construction, no optimization passes, no register allocation at runtime. Compilation takes microseconds per trace.
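The copy and patch steps can be sketched in a few lines of Python. This is a toy illustration, not CPython's implementation: the `Hole` and `Stencil` types, the `STENCILS` table, and the byte values are made up, standing in for the machine-code templates that Clang produces at build time.

```python
import struct
from dataclasses import dataclass


@dataclass(frozen=True)
class Hole:
    offset: int  # byte offset of the placeholder inside the stencil body
    name: str    # which runtime value fills it


@dataclass(frozen=True)
class Stencil:
    body: bytes
    holes: tuple


# Made-up bytes standing in for precompiled machine code.
# Eight zero bytes mark a 64-bit hole to be patched at runtime.
STENCILS = {
    "_LOAD_CONST": Stencil(b"\xA1" + bytes(8) + b"\xA2", (Hole(1, "operand"),)),
    "_JUMP": Stencil(b"\xB1" + bytes(8), (Hole(1, "target"),)),
}


def compile_trace(trace):
    """Concatenate stencil bodies and patch each hole with its runtime value."""
    code = bytearray()
    for uop, values in trace:
        stencil = STENCILS[uop]
        base = len(code)
        code += stencil.body  # copy
        for hole in stencil.holes:  # patch
            struct.pack_into("<Q", code, base + hole.offset, values[hole.name])
    return bytes(code)


if __name__ == "__main__":
    blob = compile_trace([
        ("_LOAD_CONST", {"operand": 0x1122334455667788}),
        ("_JUMP", {"target": 16}),
    ])
    print(blob.hex())
```

Note what the loop does not do: it never inspects the neighbouring stencil, which is why each template must agree on a fixed convention for where its inputs and outputs live.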

The payoff is clear: you can JIT extremely short-lived traces without paying the warmup cost that would make a traditional compiler like LLVM unprofitable. The cost is equally clear: each stencil was compiled in isolation. The code generator has no visibility across uop boundaries at runtime.

The Register Problem

This is where 3.13’s benchmarks told the truth. Each uop stencil was written to load its inputs from memory (the Python frame’s value stack or local variable array) and write its outputs back to memory. Even when the Tier 2 optimizer could prove that the output of one uop fed directly into the input of the next, the emitted native code still bounced that value through a stack slot.

On x86-64, a register-to-register move has no meaningful cost. An L1 cache hit costs roughly four to five cycles of load-to-use latency, and a store followed by a reload of the same slot puts that latency on the critical path between two dependent uops. An L1 miss costs considerably more. In a loop where the JIT was eliminating interpreter dispatch overhead on one side, it was reintroducing memory round-trips on the other. The two costs largely cancelled each other out, which is exactly what the near-neutral benchmarks showed.

This was not a surprise to CPython’s developers. The copy-and-patch architecture, as originally implemented, had no mechanism to keep values in registers across uop boundaries. You can think of it as a code generator missing its final lowering pass: the liveness analysis that tells the backend which values are live at each program point and which registers are therefore available for allocation.

The optimizer was also insufficiently aggressive at guard elision across longer traces. Guards that remained in hot paths added conditional branches, increasing branch predictor pressure and code size. Even with type specialization already handled by Tier 1, uop sequences retained guards the 3.13 optimizer could not yet prove redundant.

The free-threaded build introduced in 3.13 as PEP 703 compounded the issue. Thread-safe reference count handling in JIT-emitted code added extra branching absent from the single-threaded path, compressing the already-thin margin the JIT had over the adaptive interpreter.

What 3.15 Adds

Python 3.15 adds a register allocator to the JIT backend. The optimizer now performs liveness analysis across uop boundaries and communicates which values remain live through the compiled sequence. The code generator can then keep live values in CPU registers instead of spilling them to memory between uops.
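A toy cost model shows why this matters. The sketch below counts value-stack loads and stores for a short uop trace when the top of the stack may live in a fixed number of registers. It is an assumption-laden illustration, not CPython's allocator: the `EFFECTS` table, the `stack_traffic` helper, and the top-of-stack caching policy are hypothetical, and traffic to the local variable array is deliberately ignored.

```python
# Hypothetical stack effects per uop: (values popped, values pushed).
EFFECTS = {
    "_LOAD_FAST": (0, 1),
    "_BINARY_OP_ADD_INT": (2, 1),
    "_STORE_FAST": (1, 0),
}

# uops for: t = a + b + c (guards omitted for brevity)
TRACE = [
    "_LOAD_FAST", "_LOAD_FAST", "_BINARY_OP_ADD_INT",
    "_LOAD_FAST", "_BINARY_OP_ADD_INT", "_STORE_FAST",
]


def stack_traffic(trace, registers):
    """Return (loads, stores) on the value stack for a given register budget.

    The top `registers` stack entries are assumed to stay in registers;
    anything deeper lives in memory.
    """
    in_regs = 0  # top-of-stack entries currently held in registers
    in_mem = 0   # deeper entries that live in memory
    loads = stores = 0
    for uop in trace:
        pops, pushes = EFFECTS[uop]
        for _ in range(pops):
            if in_regs:
                in_regs -= 1  # operand is already in a register
            else:
                in_mem -= 1
                loads += 1    # operand must be read back from memory
        for _ in range(pushes):
            if in_regs < registers:
                in_regs += 1  # result stays in a register
            else:
                in_mem += 1   # no free register: one value is written out
                stores += 1
    stores += in_regs  # values still live at trace exit are written back
    return loads, stores


if __name__ == "__main__":
    for budget in (0, 1, 2):
        print(budget, stack_traffic(TRACE, budget))
```

With a budget of zero the model reproduces the 3.13 behaviour of bouncing every intermediate through a stack slot; with two registers this particular trace touches the value stack not at all.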

This completes the standard three-phase compiler sequence: build an IR, optimize the IR, lower to machine code with register allocation. CPython built phases one and two across 3.13 and 3.14. Phase three arrives now.

Alongside the register allocator, 3.15 includes stronger abstract interpretation passes that propagate type information across longer traces, fire guard elision more frequently, and reduce the number of conditional branches remaining in hot paths. Guard checks the Tier 2 optimizer could not previously prove redundant become eliminable with finer-grained type tracking across loop iterations, not just within a single trace. JIT code allocation has also been revised for better spatial locality, grouping related traces and aligning to cache-line boundaries to reduce instruction cache pressure.

The combination targets a realistic 10 to 30 percent improvement over 3.14 on pyperformance with the JIT enabled, with larger gains on CPU-bound pure-Python microbenchmarks. The JIT is also planned to ship enabled by default in 3.15, rather than as an opt-in build flag.

The Strategic Context: Why Not Just Use PyPy

PyPy’s meta-tracing JIT has delivered 4 to 10x speedups on CPU-bound pure Python for well over a decade. It achieves this by tracing the interpreter itself, compiling full loop iterations including all type-specialized paths, and running an optimizing backend with escape analysis, integer unboxing, and aggressive inlining. It is a significantly more powerful JIT.

It also requires a separate runtime with partial C extension compatibility, several seconds of warmup on non-trivial programs, and an ecosystem that has never fully aligned with the CPython mainstream. For workloads dominated by NumPy, C extensions, or short-lived scripts, PyPy’s peak throughput delivers little practical benefit.

CPython’s bet is different. Copy-and-patch with microsecond compilation latency can profitably JIT traces too short for LLVM to even consider. Full C extension compatibility is non-negotiable because the Python ecosystem is built on it. The ceiling is lower than PyPy’s ceiling, but the addressable surface is broader.

The V8 comparison is instructive in a different direction. V8’s multi-tier pipeline (Ignition, Sparkplug, Maglev, TurboFan) is conceptually similar to CPython’s layered approach, but TurboFan is a full SSA-based optimizing compiler with speculative optimization and deoptimization bailouts, backed by years of dedicated engineering. CPython’s Tier 3 is deliberately conservative by comparison, and that conservatism is a feature rather than a limitation, because it preserves the existing runtime model without requiring that level of deoptimization infrastructure.

What the Register Allocator Unlocks Going Forward

The register allocation fix matters not just for its direct speedup but for what it makes worthwhile afterward. Once the JIT baseline is reliably positive, each optimizer improvement compounds on a foundation that is not already spending its gains on memory traffic. Longer superblocks, inter-iteration type inference, and limited escape analysis on short-lived stack objects all have higher payoff when the code generator can hold live values in registers throughout a trace.

The trajectory of the CPython benchmark dashboard since 3.11 has been steadily upward, driven by specialization and inlining improvements at each release. The 3.15 JIT work is not a separate track; it is the point where the Tier 3 backend catches up to the quality of analysis that Tier 2 has been accumulating. Once that alignment exists, further optimizer investment in abstract interpretation and guard elision has a direct path to machine code that will use it.

CPython’s JIT was always going to hit the register allocation wall after copy-and-patch shipped. The three-tier architecture was designed to be built incrementally, with experimental infrastructure landing early and refinement following as real-world constraints became measurable. Python 3.15 is when that refinement arrives at the right place, and the benchmarks should start reflecting it clearly.
