
The Stencil Pipeline: Inside CPython's JIT and the 3.15 Register Allocation Fix

Source: hackernews

CPython’s JIT compiler, introduced experimentally in Python 3.13, works differently from most JITs developers encounter. The post explaining what Python 3.15 changes is worth reading if you follow CPython internals, but the underlying architecture deserves more examination. The 3.15 improvements address a constraint baked into the original design, not a surface-level gap.

The Three-Tier Compilation Pipeline

CPython 3.13 and later runs Python code through three distinct execution tiers.

Tier 1 is the specializing adaptive interpreter. When CPython executes a bytecode instruction many times, it replaces the generic instruction with a specialized variant that assumes specific types. A LOAD_ATTR instruction might become LOAD_ATTR_SLOT once CPython observes that the attribute is always at a fixed slot in the object layout. If the assumption fails, CPython despecializes and falls back.
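Specialization can be observed from Python itself. The sketch below uses the standard dis module, whose adaptive flag (available on Python 3.11 and later) shows the specialized form of instructions that have been rewritten in place. The Point class and get_x function are made up for illustration, and the exact instruction names printed depend on the interpreter version.

```python
import dis


class Point:
    # __slots__ gives the attribute a fixed position in the object layout.
    __slots__ = ("x",)

    def __init__(self, x):
        self.x = x


def get_x(point):
    # A plain attribute load; the interpreter may specialize it after warm-up.
    return point.x


def opnames(func):
    """Return instruction names, using the specialized view when available."""
    try:
        # The `adaptive` flag exists on Python 3.11+.
        instructions = dis.get_instructions(func, adaptive=True)
    except TypeError:
        # Older interpreters: fall back to the generic view.
        instructions = dis.get_instructions(func)
    return [ins.opname for ins in instructions]


before = opnames(get_x)

p = Point(1)
for _ in range(10_000):
    get_x(p)  # warm the function up so tier 1 has a chance to specialize

after = opnames(get_x)

print("before warm-up:", before)
print("after warm-up: ", after)
```

On a recent CPython, the second list is where a specialized variant of LOAD_ATTR would be expected to appear in place of the generic instruction.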

Tier 2 is the optimizer. Once a region has been specialized and runs often enough, CPython translates the specialized bytecode into a sequence of micro-operations, called uops. These uops are lower-level than bytecode and designed to be easy to analyze. The tier 2 optimizer can propagate type information across uops and eliminate redundant guards.
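To make guard elimination concrete, here is a toy pass over a list of uop-like tuples. It is only a sketch of the idea, not CPython's optimizer: the uop names, the tuple layout (name, target, operands) and the RESULT_TYPE table are all assumptions made for this example.

```python
# Hypothetical table: the type each toy uop is known to produce.
RESULT_TYPE = {"ADD_INT": "int", "MUL_INT": "int"}


def eliminate_redundant_guards(trace):
    """Drop type guards whose fact is already proven earlier in the trace."""
    known = {}       # value name -> type established so far
    optimized = []
    for uop in trace:
        name, target, *rest = uop
        if name == "GUARD_TYPE":
            expected = rest[0]
            if known.get(target) == expected:
                continue                 # already proven: skip the guard
            known[target] = expected     # the guard establishes the fact
        else:
            result_type = RESULT_TYPE.get(name)
            if result_type is None:
                known.pop(target, None)  # unknown result: forget old facts
            else:
                known[target] = result_type  # propagate the result type
        optimized.append(uop)
    return optimized


trace = [
    ("GUARD_TYPE", "a", "int"),
    ("GUARD_TYPE", "b", "int"),
    ("ADD_INT", "t", "a", "b"),
    ("GUARD_TYPE", "t", "int"),   # redundant: ADD_INT always yields an int
    ("GUARD_TYPE", "c", "int"),
    ("MUL_INT", "r", "t", "c"),
]

optimized = eliminate_redundant_guards(trace)
for uop in optimized:
    print(uop)
```

The guard on t disappears because the pass already knows what ADD_INT produces, which is the kind of cross-uop type propagation described above.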

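Tier 3 is the JIT, which takes a trace of tier 2 uops and compiles it to native machine code. CPython’s own documentation does not number it separately; it treats the JIT as an alternative backend for tier 2, executing the same uops that a tier 2 interpreter would otherwise run.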

Most discussion of the CPython JIT focuses on tier 3, but tiers 1 and 2 do the semantic work. Type specialization, guard optimization, and trace formation all happen before the JIT sees a single instruction. The JIT’s job is comparatively mechanical: turn an already-optimized uop trace into machine code as fast as possible.

Copy-and-Patch: What It Is and Why CPython Chose It

CPython’s JIT uses a technique from a 2021 OOPSLA paper by Haoran Xu and Fredrik Kjolstad, Copy-and-Patch Compilation. The core idea is to precompile code templates at CPython build time rather than at runtime.

Each uop corresponds to a “stencil”: a block of machine code compiled by Clang/LLVM when CPython itself is built. The stencil has holes where runtime values go, such as the address of a Python object, an inline cache slot, or a branch target. When the JIT compiles a trace, it copies each stencil in sequence and patches the holes with actual values. No code generation occurs at runtime in the traditional sense; only memory copies and integer writes.
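The copy and patch steps can be imitated in a few lines of Python. This is a toy model under loud assumptions: the byte values are placeholder opcodes rather than real machine code, holes are assumed to be 32-bit little-endian integers, and the Stencil class, STENCILS table, and compile_trace function are hypothetical names invented for this sketch.

```python
import struct
from dataclasses import dataclass


@dataclass(frozen=True)
class Stencil:
    body: bytes    # template bytes, fixed at "build time"
    holes: tuple   # byte offsets of 32-bit holes inside the body


# Placeholder opcodes (0x01, 0x02, 0x03), each optionally followed by a hole.
STENCILS = {
    "LOAD": Stencil(b"\x01\x00\x00\x00\x00", (1,)),
    "ADD": Stencil(b"\x02", ()),
    "STORE": Stencil(b"\x03\x00\x00\x00\x00", (1,)),
}


def compile_trace(trace):
    """Copy each stencil into a buffer, then patch its holes with values."""
    code = bytearray()
    for name, values in trace:
        stencil = STENCILS[name]
        if len(values) != len(stencil.holes):
            raise ValueError(f"{name} expects {len(stencil.holes)} value(s)")
        start = len(code)
        code += stencil.body                                   # copy
        for offset, value in zip(stencil.holes, values):
            struct.pack_into("<I", code, start + offset, value)  # patch
    return bytes(code)


trace = [("LOAD", (8,)), ("LOAD", (16,)), ("ADD", ()), ("STORE", (24,))]
machine_code = compile_trace(trace)
print(machine_code.hex(" "))
```

Nothing here analyzes or generates instructions; the only runtime work is appending bytes and writing integers at known offsets, which is why the real technique is so cheap.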

This lets the JIT compile a trace in microseconds. A traditional optimizing JIT like V8’s TurboFan or HotSpot’s C2 compiler generates code from scratch at runtime, which produces excellent output but takes milliseconds per compilation unit. Copy-and-patch trades code quality for compilation speed, keeping warm-up time negligible.

Consider a uop like BINARY_OP_ADD_INT. Its stencil looks conceptually like this:

// Stencil for BINARY_OP_ADD_INT
left  = load [frame + HOLE_left_offset]
right = load [frame + HOLE_right_offset]
// overflow check
result = left + right
store [frame + HOLE_result_offset], result

The HOLE_* values are concrete stack offsets patched in at trace-compile time. LLVM compiled the arithmetic during the CPython build. The loads and stores, though, happen at every stencil boundary, and that turns out to matter.

Why Register Allocation Was the Missing Piece

Every stencil loads its inputs from the CPython frame structure and stores its outputs back to it. Across a sequence of uops, values flow through memory rather than through CPU registers.

A trace computing (a + b) * c maps to two uops:

BINARY_OP_ADD_INT   # loads a and b from frame; stores (a+b) to frame
BINARY_OP_MUL_INT   # loads (a+b) and c from frame; stores result to frame

The intermediate value a + b is written to memory by the first stencil and read back by the second. On modern hardware, an L1 cache hit is fast, but for tight numeric loops this round-trip adds up across many iterations and many uop pairs.
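The round-trip is easy to count in a small simulation. The sketch below models the frame as a dictionary that tallies every access; CountingFrame, add_int, and mul_int are hypothetical stand-ins for the frame structure and the two stencils, not CPython code.

```python
class CountingFrame:
    """A stand-in for the frame that counts every load and store."""

    def __init__(self, **values):
        self.slots = dict(values)
        self.loads = 0
        self.stores = 0

    def load(self, name):
        self.loads += 1
        return self.slots[name]

    def store(self, name, value):
        self.stores += 1
        self.slots[name] = value


def add_int(frame, left, right, out):
    # Like an independent stencil: read inputs from the frame, write back.
    frame.store(out, frame.load(left) + frame.load(right))


def mul_int(frame, left, right, out):
    frame.store(out, frame.load(left) * frame.load(right))


frame = CountingFrame(a=2, b=3, c=4)
add_int(frame, "a", "b", "tmp")       # tmp = a + b, stored to the frame
mul_int(frame, "tmp", "c", "result")  # tmp reloaded from the frame

print("result:", frame.slots["result"])
print("loads:", frame.loads, "stores:", frame.stores)
```

Of the four loads and two stores, one store and one load exist only to hand tmp from the first operation to the second.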

A conventional JIT’s register allocator eliminates this by tracking which values are already in registers across operations. With copy-and-patch, the stencils are opaque pre-compiled blobs; each is independently compiled, and there is no mechanism for stencil A to guarantee it left a value in a specific register for stencil B. The result is that the Python 3.13 JIT, despite generating native code, produced output whose performance was close to the tier 1 interpreter for numeric work. The native-code overhead from spills and reloads partially offset the gains from type specialization.

This was a major reason the 3.13 JIT shipped as experimental and disabled by default. The architecture was sound; the code generation was leaking performance at every uop boundary.

The 3.15 Approach

Python 3.15 introduces register allocation at trace-compile time. During the CPython build, multiple variants of each stencil are compiled, each expecting inputs in different registers and leaving outputs in different registers. When the JIT compiles a trace, a lightweight register allocator selects which variant to use for each uop, threading values through registers across boundaries instead of spilling them through the frame structure.

The allocator runs at trace-compile time and is fast because it does not perform full optimization, only variant selection. The code bodies remain pre-compiled; only the wiring between them is decided at runtime. This preserves the compilation speed advantage of copy-and-patch while closing most of the code-quality gap for chains of arithmetic and comparison uops.
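Variant selection can be sketched as a table lookup. The model below is deliberately simplified and is an assumption throughout: it tracks a single carried register, the variant naming scheme is invented, and VARIANTS and select_variants are hypothetical names. CPython's actual variants and selection logic are more involved.

```python
from itertools import product

UOPS = ("ADD_INT", "MUL_INT")
PLACES = ("frame", "reg")

# "Build time": one pre-compiled variant per (uop, input place, output place).
VARIANTS = {
    (uop, src, dst): f"{uop}__in_{src}__out_{dst}"
    for uop, src, dst in product(UOPS, PLACES, PLACES)
}


def select_variants(trace):
    """Pick a variant per uop so a value used next stays in the register."""
    chosen = []
    carried = None  # name of the value currently held in the register
    for i, (uop, inputs, output) in enumerate(trace):
        # Take an input from the register if the previous uop left it there.
        src = "reg" if carried in inputs else "frame"
        # Leave the output in the register if the next uop consumes it.
        next_inputs = trace[i + 1][1] if i + 1 < len(trace) else ()
        dst = "reg" if output in next_inputs else "frame"
        chosen.append(VARIANTS[(uop, src, dst)])
        carried = output if dst == "reg" else None
    return chosen


# (a + b) * c as two uops: (name, input names, output name)
trace = [
    ("ADD_INT", ("a", "b"), "tmp"),
    ("MUL_INT", ("tmp", "c"), "result"),
]

selected = select_variants(trace)
for name in selected:
    print(name)
```

The selector never inspects or rewrites a stencil body. It only decides which pre-built body to copy, which is why the step stays cheap.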

The goal has precedent. LuaJIT’s trace compiler also keeps JIT-compile cost low while avoiding unnecessary memory traffic, although it gets there differently: it allocates registers while emitting code from its IR instead of choosing among pre-built templates. What the two share is that register decisions are made by a cheap pass at emit time. What is specific to copy-and-patch is pre-generating variant stencils and selecting among them, rather than generating new code from scratch.

LLVM remains a build-time dependency, not a runtime one. The CPython binary ships with pre-compiled stencil variants baked in; the runtime register allocator is a lightweight selector, not a code generator. This keeps the deployment story clean and avoids the startup overhead that a runtime LLVM dependency would impose.

What to Expect From 3.15

Register allocation improvements benefit tight numerical loops, the category of workloads where the JIT compiles a hot trace and executes it many times on scalar values. The gains will be most visible on benchmarks that stress integer or float arithmetic in pure Python.
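A workload of that shape is simple to write down and time. The harness below is a minimal sketch using only the standard library; sum_of_squares is an illustrative function, and the sys._jit lookup is guarded with getattr because that namespace is assumed to exist only on newer interpreters (it was added in Python 3.14). No timing result is implied here.

```python
import sys
import timeit


def jit_status():
    """Describe the JIT state, if this interpreter exposes sys._jit."""
    jit = getattr(sys, "_jit", None)
    if jit is None:
        return "unknown (no sys._jit on this interpreter)"
    if jit.is_enabled():
        return "enabled"
    if jit.is_available():
        return "available but disabled"
    return "not built in"


def sum_of_squares(n):
    # A tight scalar loop: integer multiply and add on every iteration.
    total = 0
    for i in range(n):
        total += i * i
    return total


elapsed = timeit.timeit(lambda: sum_of_squares(10_000), number=20)
print("JIT:", jit_status())
print(f"20 runs of sum_of_squares(10_000): {elapsed:.4f} s")
```

Running the same script under interpreters built with and without the JIT is one way to compare them on this kind of loop, with the usual caveat that microbenchmarks are noisy.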

For code dominated by object allocation, attribute lookup, or I/O, the improvement will be minimal. Those workloads spend most of their time in the Python object system and the C extension layer, where the JIT has little influence regardless of how efficiently it chains uops. The pyperformance suite spans both categories, so headline benchmark numbers will average together genuine gains and workloads where the JIT has nothing to contribute.

The broader trajectory matters more than any single benchmark. Python 3.13’s JIT was a design-complete but performance-incomplete implementation. The tier 2 optimizer could specialize and optimize aggressively, but the native code generator could not take full advantage because values were spilling at every uop boundary. The 3.15 register allocator closes that gap in the code generation stage, which means the tier 2 optimizer’s work shows up more fully in execution time.

For numerically intensive Python code written without Cython, NumPy, or other extension-based acceleration, the 3.15 improvements represent a genuine step toward the JIT paying for itself across a wider range of inputs. The groundwork laid in 3.13 and 3.14 was always pointed in the right direction; the register allocator is what the pipeline needed to deliver on it.
