
The Missing Piece in CPython's JIT: Register Allocation and the Road to Python 3.15

Source: lobsters

The JIT that shipped with CPython 3.13 was an engineering achievement and a performance disappointment at the same time. Benchmarks showed it roughly neutral on real workloads, sometimes slightly negative. A year later, Ken Jin’s post on the state of CPython’s JIT explains why that happened and what has changed heading into 3.15. The short version: the underlying architecture was sound, but a missing piece of compiler infrastructure was turning good design into mediocre machine code. That piece is a register allocator, and its arrival changes the picture considerably.

Three Tiers of Execution

Understanding what went wrong requires understanding what CPython actually built over the last four years. The interpreter today is a three-tier system, each tier built on top of the previous one.

Tier 1 is the specializing adaptive interpreter that landed in Python 3.11 under PEP 659. The “Faster CPython” project, funded by Microsoft and led by Mark Shannon and Guido van Rossum, instrumented the interpreter with per-instruction counters. When an instruction goes hot, the interpreter replaces it with a specialized version that assumes the types it has seen. A generic LOAD_ATTR becomes LOAD_ATTR_MODULE for a known module attribute, swapping the generic attribute lookup for a cheap guard and a direct load on every execution. This alone delivered roughly 25% improvement on the pyperformance benchmark suite in 3.11.
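To make the specialize-then-guard cycle concrete, here is a toy model in plain Python. It is a sketch, not CPython’s implementation: the AdaptiveAdd class, the WARMUP_THRESHOLD value, and the deopt counter are all invented for illustration, and the real interpreter rewrites bytecode in place rather than flipping a flag.

```python
# Toy model of adaptive specialization. Not CPython's real mechanism.
WARMUP_THRESHOLD = 8  # hypothetical value; CPython's real counters differ


class AdaptiveAdd:
    """One 'instruction' that starts generic and may specialize for ints."""

    def __init__(self):
        self.counter = 0
        self.specialized = False
        self.deopts = 0

    def _generic(self, a, b):
        # Full dynamic dispatch: works for any operand types.
        return a + b

    def execute(self, a, b):
        if self.specialized:
            # Cheap guard: the fast path is only valid for exact ints.
            if type(a) is int and type(b) is int:
                return int.__add__(a, b)
            # Guard failed: drop back to the generic form and re-warm.
            self.specialized = False
            self.counter = 0
            self.deopts += 1
            return self._generic(a, b)
        self.counter += 1
        if self.counter >= WARMUP_THRESHOLD and type(a) is int and type(b) is int:
            self.specialized = True
        return self._generic(a, b)


op = AdaptiveAdd()
for i in range(10):
    op.execute(i, 1)          # warms up, then runs the int fast path
op.execute("a", "b")          # guard fails, instruction de-specializes
```

The point of the sketch is that specialization does not remove the check; it replaces an open-ended dispatch with one cheap, predictable guard.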

Tier 2 introduced a lower-level intermediate representation: a sequence of micro-operations called uops, which decompose bytecode instructions into smaller primitives. The bytecode for a statement like x = a + b becomes a run of uops along the lines of _LOAD_FAST, _BINARY_OP_ADD_INT, _STORE_FAST, with type guards split out as uops of their own. The tier 2 optimizer runs passes over this uop trace: type propagation, dead-store elimination, guard removal. Crucially, uop traces cross basic-block boundaries, enabling the optimizer to inline across call sites and eliminate branches that the bytecode representation would have kept opaque.
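Guard removal is the easiest of these passes to picture. The sketch below runs a single forward pass over a linear trace and drops guards that an earlier guard has already proven. The uop names (GUARD_INT, LOAD_LOCAL, and so on) and the tuple encoding are made up for this example and do not match CPython’s real uop set; the pass is also deliberately conservative and forgets everything about a local once it is overwritten.

```python
# Toy uop trace and one optimization pass. Uop names are illustrative only.
def remove_redundant_guards(trace):
    """Drop an int guard on a local when an earlier guard already proved it.

    `trace` is a list of (uop_name, operand) tuples. The pass tracks which
    locals are known to hold ints and forgets a local when it is overwritten.
    """
    known_int = set()
    optimized = []
    for name, operand in trace:
        if name == "GUARD_INT":
            if operand in known_int:
                continue  # already proven earlier in the trace
            known_int.add(operand)
        elif name == "STORE_LOCAL":
            known_int.discard(operand)  # type of this local is no longer known
        optimized.append((name, operand))
    return optimized


trace = [
    ("GUARD_INT", "x"),
    ("LOAD_LOCAL", "x"),
    ("GUARD_INT", "x"),    # redundant: x has not changed since the first guard
    ("LOAD_LOCAL", "x"),
    ("ADD_INT", None),
    ("STORE_LOCAL", "x"),
    ("GUARD_INT", "x"),    # kept: x was overwritten, so the proof is gone
]
optimized_trace = remove_redundant_guards(trace)
```

A real optimizer would also propagate the result type of ADD_INT and remove the final guard too; the toy version leaves that out to stay short.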

Tier 3 is the JIT itself. When a uop sequence has been optimized, it gets compiled to native machine code using the copy-and-patch technique described by Xu and Kjolstad at OOPSLA 2021. Brandt Bucher adapted this for CPython: at build time, LLVM compiles template stubs for each uop to machine code, leaving “holes” for runtime values like immediate constants and target addresses. At runtime, JIT compilation amounts to copying the relevant stubs into an executable buffer with memcpy and patching the holes with actual values. No LLVM in the running process, no heavyweight compiler framework as a dependency. JIT compilation takes microseconds.
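The copy and the patch can be shown in a few lines. This is a minimal sketch under loud assumptions: the Stencil class, the STENCILS table, and the byte values are placeholders invented here, not real machine code or CPython’s stencil format, and the output is an ordinary bytes object rather than executable memory.

```python
import struct
from dataclasses import dataclass


@dataclass(frozen=True)
class Stencil:
    code: bytes       # pre-built bytes with a zeroed 8-byte hole
    hole_offset: int  # byte offset of that hole inside `code`


# Hypothetical stencils; the byte values are placeholders, not real instructions.
STENCILS = {
    "LOAD_CONST": Stencil(code=b"\xAA\xAA" + bytes(8), hole_offset=2),
    "ADD_CONST": Stencil(code=b"\xBB" + bytes(8) + b"\xCC", hole_offset=1),
}


def compile_trace(trace):
    """Copy each stencil into one buffer, then patch its hole with a value."""
    buf = bytearray()
    for name, value in trace:
        stencil = STENCILS[name]
        start = len(buf)
        buf += stencil.code  # the "copy" step
        # The "patch" step: write the runtime value as little-endian u64.
        struct.pack_into("<Q", buf, start + stencil.hole_offset, value)
    return bytes(buf)


code = compile_trace([("LOAD_CONST", 42), ("ADD_CONST", 1)])
```

Nothing in compile_trace inspects or rewrites the template bytes apart from the holes, which is why this style of compilation is so cheap and also why it cannot, by itself, decide where values should live.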

The Register Problem

The copy-and-patch approach is quick to implement and quick at compile time, but it has a structural weakness: without a register allocator, the JIT has no way to keep values in CPU registers across uop boundaries. Each uop stub was written assuming it loads its inputs from memory (the Python frame’s value stack or local variable slots) and stores its outputs back to memory. Even after the tier 2 optimizer has proven that a value flows directly from one uop to the next, the emitted machine code still bounces it through a stack slot.
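A rough way to see the cost is to count value-stack accesses for a short uop sequence under two models: one where every pop is a load and every push is a store, and one where the top few stack entries are held in registers. The functions below are a simplified accounting model written for this article, with invented uop names; they ignore reads and writes of the local variable slots themselves and say nothing about how CPython actually assigns registers.

```python
# Each toy uop is (name, pops, pushes). Only value-stack traffic is counted.
def value_stack_traffic_naive(uops):
    """Every popped input is a memory load; every pushed output is a store."""
    return sum(pops + pushes for _name, pops, pushes in uops)


def value_stack_traffic_cached(uops, num_regs=2):
    """Keep up to `num_regs` top-of-stack values in registers.

    A push beyond capacity spills the oldest cached value (one store);
    a pop of a value that is not cached reloads it (one load).
    """
    cached = 0   # how many top-of-stack items currently sit in registers
    traffic = 0
    for _name, pops, pushes in uops:
        for _ in range(pops):
            if cached:
                cached -= 1
            else:
                traffic += 1  # value lived in memory: one load
        for _ in range(pushes):
            if cached < num_regs:
                cached += 1
            else:
                traffic += 1  # registers full: one spill store
    return traffic


# Roughly `x = a + b` as a run of stack uops.
uops = [
    ("LOAD_LOCAL", 0, 1),
    ("LOAD_LOCAL", 0, 1),
    ("ADD_INT", 2, 1),
    ("STORE_LOCAL", 1, 0),
]
naive = value_stack_traffic_naive(uops)
cached = value_stack_traffic_cached(uops, num_regs=2)
```

With two registers, the intermediate values in this sequence never touch the value stack at all; with zero registers, the cached model degenerates to the naive one.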

On modern x86-64 hardware, a register-to-register move is effectively free; a round-trip through the L1 cache costs several cycles, and a cache miss costs far more. In a tight loop over integers, this memory traffic overwhelms the gains from eliminating the interpreter dispatch overhead that the JIT was supposed to remove in the first place. The optimizer knew the types, removed the guards, proved the control flow; the code generator then spent those savings on unnecessary loads and stores.

This is not an unusual situation in compiler development. Liveness analysis and register allocation are taught as late-stage problems in compiler courses precisely because they sit after all the interesting semantic transformations. You build the IR, you optimize the IR, and then you discover that without register allocation the emitted code is slower than expected. CPython hit this wall in exactly the way the textbooks describe.

The 3.15 work adds a proper liveness analysis pass and a register allocator to the JIT backend. The optimizer can now communicate to the code generator which values are live across uop boundaries, and the code generator can keep them in registers. This turns the tier 2 optimizer’s type and liveness information into actual machine-code savings rather than metadata that goes unused at emit time.
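For readers who have not met these passes before, the following is a textbook-style sketch of liveness plus a linear-scan flavored allocator over straight-line code. It is not the allocator in CPython’s JIT, whose design is not described here: the op encoding, the register names, and the “spill” marker are all assumptions made for the example.

```python
def live_ranges(ops):
    """Return {value: (def_index, last_use_index)} for straight-line code.

    Each op is (dest, sources); dest is None for ops that only consume.
    """
    ranges = {}
    for i, (dest, sources) in enumerate(ops):
        for src in sources:
            start, _ = ranges[src]
            ranges[src] = (start, i)   # extend the range to this use
        if dest is not None:
            ranges[dest] = (i, i)
    return ranges


def allocate(ops, registers):
    """Assign registers in program order, reusing one once its value is dead."""
    ranges = live_ranges(ops)
    free = list(registers)
    active = {}       # value -> register currently holding it
    assignment = {}   # value -> register name, or "spill"
    for i, (dest, _sources) in enumerate(ops):
        # Values whose last use is at or before this op release their register.
        for value in list(active):
            if ranges[value][1] <= i:
                free.append(active.pop(value))
        if dest is None:
            continue
        if free:
            active[dest] = assignment[dest] = free.pop(0)
        else:
            assignment[dest] = "spill"  # no register left: value lives in memory
    return assignment


ops = [
    ("a", ()),           # a = load
    ("b", ()),           # b = load
    ("c", ("a", "b")),   # c = a + b
    ("d", ()),           # d = load
    ("e", ("c", "d")),   # e = c * d
    (None, ("e",)),      # store e
]
two_regs = allocate(ops, ["r0", "r1"])
one_reg = allocate(ops, ["r0"])
```

With two registers every value in this sequence fits; with one, b and d are marked as spills, which corresponds to the memory round-trips described above.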

Copy-and-Patch vs. the Alternatives

CPython’s approach is worth contrasting with what PyPy and V8 do, because the tradeoffs are instructive.

PyPy uses a meta-tracing JIT built in RPython. It traces loop iterations at runtime, builds a linear trace of the hot path including all the interpreter’s own operations, and compiles that trace through a full optimizing compiler backend. On compute-bound Python code, PyPy regularly runs 3 to 10 times faster than CPython. The cost is warmup time, memory overhead, and extraordinary engineering complexity; the PyPy JIT is a complete compiler toolchain embedded in the runtime.

V8 uses a tiered pipeline of its own: the Ignition bytecode interpreter, the Sparkplug baseline compiler, the Maglev mid-tier JIT, and TurboFan as the full optimizing compiler. V8 has been refining this pipeline for over a decade, with teams of engineers at Google continuously improving speculative optimization, deoptimization bailouts, and code cache management.

CPython’s copy-and-patch JIT is closer in spirit to a method JIT than a tracing JIT. It compiles superblocks derived from the tier 2 uop traces, not full loop iterations in the PyPy sense. The peak code quality is lower than what PyPy or a mature TurboFan-style compiler produces, but the implementation is small enough to live in CPython’s own codebase, requires no runtime compiler dependency, and compiles code fast enough that compilation itself costs microseconds, which keeps warmup overhead low.

The bet the CPython team has made is that for a language used primarily for glue code, data processing scripts, and web backends, a lightweight JIT with fast warmup and low overhead is more broadly valuable than a high-peak-performance JIT with expensive warmup. That bet looks defensible for many workloads, but it does mean that pure compute benchmarks will continue to show a significant gap between CPython and PyPy for the foreseeable future.

What 3.15 Targets

Ken Jin’s post frames the 3.15 goal around enabling the JIT by default. Python 3.13 shipped the JIT as an opt-in build flag (--enable-experimental-jit); the intent for 3.15 is for it to be on by default and to deliver a net-positive result on real workloads without users having to think about it.
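Until the JIT is on by default, it is worth checking what a given interpreter is doing. CPython 3.14 added a sys._jit namespace with is_available() and is_enabled(); the helper below is a small sketch that falls back gracefully on older versions and on other implementations. The jit_status name and the shape of the returned dict are choices made for this example, and sys._jit is an underscore-prefixed, implementation-specific API that may change.

```python
import sys


def jit_status():
    """Report JIT state via sys._jit when present (added in CPython 3.14)."""
    jit = getattr(sys, "_jit", None)
    if jit is None:
        # Older CPython or another implementation: no introspection available.
        return {"available": False, "enabled": False}
    return {
        "available": bool(jit.is_available()),
        "enabled": bool(jit.is_enabled()),
    }


status = jit_status()
print(f"Python {sys.version_info.major}.{sys.version_info.minor}: {status}")
```

On builds that include the JIT but leave it off, setting the PYTHON_JIT=1 environment variable before starting the interpreter is the documented way to turn it on.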

The register allocator is the primary enabler, but not the only change. The tier 2 optimizer is gaining additional passes, and the uop IR is being extended with more precise type information to enable finer-grained specialization. The combination is intended to push pyperformance numbers meaningfully above what 3.14 shows without the JIT.

The pyperformance suite is a mixed workload that includes a lot of I/O-bound and call-heavy benchmarks where the JIT has limited impact; raw gains on compute-heavy code will be larger. The team has discussed targeting a further 10 to 30 percent improvement over 3.14 on pyperformance once the JIT is on by default, with much larger gains on CPU-bound microbenchmarks.
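The gap between suite-wide and microbenchmark gains follows from Amdahl’s law: if the JIT only accelerates part of a program’s runtime, the overall speedup is capped by the part it cannot touch. The function below is a generic sketch of that arithmetic. The fractions and the 2x figure in the usage lines are hypothetical inputs chosen to show the shape of the effect, not measurements of CPython.

```python
def overall_speedup(jit_fraction, jit_speedup):
    """Amdahl's law: only `jit_fraction` of runtime gets `jit_speedup` faster."""
    if not 0.0 <= jit_fraction <= 1.0:
        raise ValueError("jit_fraction must be between 0 and 1")
    if jit_speedup <= 0:
        raise ValueError("jit_speedup must be positive")
    return 1.0 / ((1.0 - jit_fraction) + jit_fraction / jit_speedup)


# Hypothetical inputs, for illustration only.
io_heavy = overall_speedup(jit_fraction=0.3, jit_speedup=2.0)   # mostly I/O and calls
cpu_bound = overall_speedup(jit_fraction=0.9, jit_speedup=2.0)  # tight numeric loop
print(f"I/O-heavy: {io_heavy:.2f}x, CPU-bound: {cpu_bound:.2f}x")
```

Under these made-up inputs the same 2x improvement in JIT-compiled code shows up as under 1.2x on the I/O-heavy program and over 1.8x on the CPU-bound one, which is the pattern the paragraph above describes.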

The Incremental Approach in Practice

What CPython is doing with its JIT is building compiler infrastructure incrementally inside a 35-year-old codebase that cannot afford to break the world while it does so. The copy-and-patch tier went in as experimental infrastructure in 3.13, ran into real-world constraints through 3.14, and is being fixed properly for 3.15. That is not a failure of planning; it is how large, conservative open-source projects build infrastructure that is hard to reverse once shipped.

The CPython benchmark dashboard at speed.python.org shows where the numbers actually land for each commit. The trajectory since 3.11 has been steady upward. The register allocator landing for 3.15 will be one point on that chart, and whether it closes the gap with PyPy in any meaningful way is a separate question. The honest answer is: not fully, not yet, and probably not without a more aggressive optimizer. For a default-on JIT that requires no GraalVM installation or PyPy warmup period, though, steady improvement on a well-understood architecture is a reasonable thing to ship.
