From Stencils to Registers: What CPython's JIT Needed to Matter

Source: hackernews

CPython’s JIT compiler has been in the codebase since 3.13, but for two major releases it has been more of a structural commitment than a practical speed improvement. The latest update from Ken Jin, one of the core contributors to the JIT work, explains why 3.15 is different: register allocation is landing, and that changes the quality of the generated code in ways the earlier releases could not.

What the Copy-and-Patch JIT Actually Does

The design choice at the core of CPython’s JIT is called copy-and-patch, a technique described in a 2021 OOPSLA paper by Haoran Xu and Fredrik Kjolstad. The idea is to avoid writing a full compiler backend from scratch. Instead, each micro-op handler is compiled by LLVM at CPython build time into a machine code template called a stencil. The stencil is a pre-compiled native code blob with “holes” left for values that vary at runtime: pointer addresses, constants, jump targets. When the JIT runs during normal Python execution, it copies the relevant stencils into an executable memory buffer and fills in those holes with actual values. The resulting native code is cached and called directly.
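To make the copy and patch steps concrete, here is a minimal sketch in Python. It is only an illustration: the byte strings are made-up stand-ins for machine code, and `HOLE`, `make_stencil`, and `copy_and_patch` are hypothetical names, not anything from CPython. A real implementation locates holes from relocation records emitted by the compiler and writes into executable memory.

```python
import struct

# Hypothetical 8-byte filler occupying the hole until it is patched.
HOLE = b"\xEF\xBE\xAD\xDE\xEF\xBE\xAD\xDE"


def make_stencil(prefix: bytes, suffix: bytes) -> tuple[bytes, int]:
    """Build a toy stencil and return (template bytes, offset of its hole)."""
    return prefix + HOLE + suffix, len(prefix)


def copy_and_patch(stencils, values) -> bytes:
    """Copy each stencil into one buffer, then write a 64-bit value into its hole."""
    buf = bytearray()
    for (template, hole_offset), value in zip(stencils, values):
        start = len(buf)
        buf += template  # copy: append the pre-built template unchanged
        # patch: overwrite the hole with a little-endian 64-bit value
        struct.pack_into("<Q", buf, start + hole_offset, value)
    return bytes(buf)


# Two made-up stencils: one with a constant hole, one with a jump-target hole.
load_const = make_stencil(b"\x01\x02", b"\x03")
jump = make_stencil(b"\x04", b"")

code = copy_and_patch([load_const, jump], [0x1122334455667788, 0x10])
print(code.hex())
```

The per-stencil work at JIT time is one buffer append and one fixed-width write, which is why this style of compilation is cheap compared with running a full code generator.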

This approach has real advantages. Architecture-specific work is handled by LLVM at build time, so the JIT backend is portable across x86-64 and ARM64 without a bespoke code generator for each platform. JIT compilation itself is fast, since stitching together pre-compiled stencils is just memory copies and pointer writes. The implementation stays manageable for a small team.

The limitation becomes apparent when you look at what happens at the boundary between stencils. Because each stencil is compiled independently, a value produced by one stencil and consumed by the next cannot simply remain in a register. It must be written to memory at the end of the first stencil and read back at the start of the second. Every uop boundary becomes an implicit spill and reload, even when the value is immediately consumed by the following operation. This is not a subtle inefficiency; in a tight numeric loop, it can mean the majority of memory traffic in the generated code is serving these inter-stencil handoffs rather than doing real computation.
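A rough way to see how much of this traffic is pure handoff is to simulate an in-memory operand stack over a short trace. The sketch below is a toy model under stated assumptions: the uop names are invented for illustration, every push is counted as a store and every pop as a load, and `handoff_loads` is a hypothetical helper rather than anything in CPython.

```python
def handoff_loads(trace):
    """Simulate an in-memory operand stack for a list of (name, pops, pushes).

    Returns (stores, loads, immediate), where `immediate` counts loads of a
    value that the directly preceding uop stored.
    """
    stack = []  # each entry is the index of the uop that pushed the value
    stores = loads = immediate = 0
    for index, (_name, pops, pushes) in enumerate(trace):
        for _ in range(pops):
            producer = stack.pop()
            loads += 1
            if producer == index - 1:
                immediate += 1  # stored by one stencil, reloaded by the next
        for _ in range(pushes):
            stack.append(index)
            stores += 1
    return stores, loads, immediate


# Toy trace for `x * y + z`; the uop names are illustrative only.
trace = [
    ("LOAD_X", 0, 1),
    ("LOAD_Y", 0, 1),
    ("MULTIPLY", 2, 1),
    ("LOAD_Z", 0, 1),
    ("ADD", 2, 1),
    ("STORE_RESULT", 1, 0),
]

stores, loads, immediate = handoff_loads(trace)
print(stores, loads, immediate)  # 5 5 3
```

In this model, three of the five loads fetch a value that the previous uop had just written, which is exactly the spill-and-reload pattern described above.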

What Register Allocation Changes

Register allocation, at its core, is the process of deciding which values live in which registers at each point in a program. In a traditional optimizing compiler, this happens as a global pass over the intermediate representation before code generation. Values that are live across multiple operations stay in registers; values that exceed register file capacity get temporarily spilled to the stack.

For the stencil JIT, adding register allocation means running this analysis over the entire uop trace before generating the stencil sequence. Given a sequence of micro-ops, the allocator determines which values cross stencil boundaries and assigns them to physical registers rather than memory slots. The stencils themselves need to support this: instead of always reading from and writing to a fixed interpreter frame struct, they carry additional holes for register operands. The build-time stencil compilation produces parameterized templates that can be patched with the correct register assignments at JIT compile time.

The gain is proportional to how many values flow between uops in the trace. In arithmetic-heavy code, virtually every computed value is immediately consumed by the next operation. Without register allocation, each of those intermediate values touches memory unnecessarily. With it, the values stay in registers and the redundant load/store instructions disappear entirely.
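The effect can be sketched with a toy model that keeps the top few operand-stack entries in registers and spills only when it runs out. This is an assumption-laden illustration, not CPython's allocator: `memory_ops` is a hypothetical helper, the trace is a list of (pops, pushes) pairs, and only operand-stack traffic is counted.

```python
def memory_ops(trace, num_regs):
    """Count (loads, stores) for a trace of (pops, pushes) pairs.

    Up to `num_regs` of the topmost stack values live in registers. When a
    push finds every register occupied, the deepest register-resident value
    is spilled to memory (one store). Popping a spilled value costs one load.
    """
    stack = []  # bottom first; each entry is "reg" or "mem"
    loads = stores = 0
    for pops, pushes in trace:
        for _ in range(pops):
            if stack.pop() == "mem":
                loads += 1
        for _ in range(pushes):
            if num_regs == 0:
                stack.append("mem")  # no registers: every value goes to memory
                stores += 1
                continue
            if stack.count("reg") == num_regs:
                # Spilled entries always sit below register entries, so the
                # first "reg" found is the deepest one.
                stack[stack.index("reg")] = "mem"
                stores += 1
            stack.append("reg")
    return loads, stores


# Toy trace for `(a * b) + (c * d)`: four loads, two multiplies, an add, a store.
trace = [(0, 1), (0, 1), (2, 1), (0, 1), (0, 1), (2, 1), (2, 1), (1, 0)]

for k in (0, 2, 3):
    print(k, memory_ops(trace, k))
# 0 (7, 7)
# 2 (1, 1)
# 3 (0, 0)
```

With no registers, every intermediate value makes a round trip through memory. With three, this particular trace never touches memory for its operands, and with two it spills exactly once, when the expression briefly needs three live values.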

Why Three Releases In Is the Right Time

When CPython 3.13 shipped the JIT as an experimental opt-in feature, the honest framing from the core team was that correctness and stability mattered more than speed. The stencil approach, without register allocation, intentionally left performance on the table in exchange for an implementation simple enough to validate and iterate on. Benchmarks from the 3.13 JIT showed roughly neutral performance versus the non-JIT interpreter on the pyperformance suite, with some microbenchmarks showing modest gains and others showing slight regressions from compilation overhead.

Python 3.14 improved the Tier 2 optimizer, the component that sits between the adaptive specializing interpreter and the JIT backend. Better guard elimination and improved type propagation through the uop intermediate representation reduced redundant runtime type checks in the generated code. That work laid the groundwork for what 3.15 can do: once you know the types flowing through a trace, you can make smarter register allocation decisions and, eventually, skip allocating PyObject* heap wrappers for primitive values entirely.

The trajectory mirrors how other JIT implementations evolved. V8’s compiler pipeline went through a similar arc: it started with a simple baseline code generator, added the Crankshaft optimizing compiler, and later replaced it with TurboFan, which brought stronger type specialization, escape analysis, and loop optimizations. PyPy’s RPython meta-JIT had register allocation from early on, but paid for that maturity with years of development work that a small team maintaining the reference implementation could not afford to replicate upfront.

CPython’s JIT is not trying to match PyPy on CPU-bound numeric benchmarks in the near term. PyPy’s tracing JIT, with mature register allocation, unboxing of primitive types, and escape analysis, is still several times faster than CPython on those workloads. The goal is incremental improvement on a stable foundation.

What the Performance Numbers Show

Early benchmarks from the 3.15 development work, as described in Ken Jin’s post, show 20-40% improvements on tight numeric loops compared to the non-JIT interpreter. That range of improvement suggests the register allocation gap was a genuine architectural bottleneck, not a marginal inefficiency. When you eliminate a whole category of unnecessary memory traffic from a hot loop, the gains scale with how arithmetic-intensive the loop is.

The pyperformance benchmark suite will show more modest gains. It is deliberately diverse, including I/O-bound benchmarks, library-heavy benchmarks, and startup-dominated cases where the JIT has no opportunity to warm up. Gains in the 5-15% range on that suite would represent meaningful progress.

What Comes After Register Allocation

The two capabilities that would yield the next large performance improvements are unboxing and cross-call inlining.

Unboxing means keeping Python integers and floats as raw machine integers and floating-point values in registers rather than as PyObject* pointers to heap-allocated objects that must be dereferenced to access the actual numeric value. Most of CPython’s overhead on arithmetic-heavy code comes from this pointer indirection and the reference counting traffic around it. Type inference, which 3.14 advanced considerably through abstract interpretation over the uop IR, is a prerequisite: you can only unbox a value if you can prove it will always be an integer, never an arbitrary Python object. Register allocation is also a prerequisite, since unboxed values need somewhere to live across uop boundaries that is not a PyObject* slot in the frame struct.
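The dependency on type inference can be illustrated with a very small abstract interpreter. The sketch below is a toy under stated assumptions: the trace format, the operation names, and the helpers `infer_types` and `unboxable` are all hypothetical, and it ignores the fact that Python integers are arbitrary precision, so a real implementation would also need overflow checks before treating an `int` as a machine integer.

```python
def infer_types(trace, input_types):
    """Propagate abstract types through a toy trace of (dest, op, args) steps."""
    types = dict(input_types)
    for dest, op, args in trace:
        arg_types = {types[a] for a in args}
        if op in ("add", "mul") and arg_types == {"int"}:
            types[dest] = "int"
        elif op in ("add", "mul") and arg_types <= {"int", "float"}:
            types[dest] = "float"  # mixed int/float arithmetic yields a float
        else:
            types[dest] = "object"  # unknown operation: assume nothing
    return types


def unboxable(types):
    """Names whose inferred type is numeric, so a raw value could stand in."""
    return {name for name, t in types.items() if t in ("int", "float")}


# Toy trace: t0 = x * y; t1 = t0 + z; t2 = call(t1)
trace = [
    ("t0", "mul", ("x", "y")),
    ("t1", "add", ("t0", "z")),
    ("t2", "call", ("t1",)),
]
types = infer_types(trace, {"x": "int", "y": "int", "z": "float"})
print(types["t0"], types["t1"], types["t2"])  # int float object
print(sorted(unboxable(types)))  # ['t0', 't1', 'x', 'y', 'z']
```

The result of the opaque call falls back to `object`, so it must stay boxed, while the arithmetic intermediates are candidates for living in registers as raw values.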

Inlining across call boundaries is harder and sits further out on the roadmap. Right now the JIT optimizes within a single trace, but Python function calls create new frames that are not visible to the optimizer. This limits what the type propagation and guard elimination can see and forces values to pass through the full frame calling convention.

The Broader Context

For most Python code running today, the JIT still does not help. Functions called fewer than a few hundred times are never compiled; library-heavy code spends its time in C extensions the JIT cannot touch; startup-dominated scripts see no benefit from a compiler that needs warmup. These constraints are inherent to the tracing JIT model, not specific to CPython’s implementation.

What distinguishes the 3.15 work is that register allocation is a concrete, measurable improvement to generated code quality, not an incremental cleanup. Python has had JIT attempts before, most of which lived only in forks or were eventually abandoned. What makes the current work more durable is that it is building on two years of proven infrastructure in the mainline interpreter, with the copy-and-patch foundation providing a stable base to iterate on. The register allocation work does not require redesigning the stencil system; it extends it in a targeted way.

That incrementalism is probably the only realistic path for a project that must ship a reliable interpreter to millions of users on every release cycle. The 3.15 JIT will not make Python competitive with PyPy on scientific computing benchmarks, but it will make the JIT something closer to what it was always described as: a foundation for future optimization work, rather than a foundation that needed its own foundation fixed first.
