CPython Builds Its JIT at Compile Time, and That Is the Clever Part
Source: hackernews
The headline about Python 3.15’s JIT being back on track rightly focuses on register allocation as the critical fix. But there is a prior question worth understanding before you can appreciate why the register allocator matters: what does CPython’s JIT actually do at runtime, and what work has already been done by the time your Python code runs?
The answer involves LLVM, but not in the way Unladen Swallow used it.
CPython’s JIT Contains No Compiler
This is the distinguishing fact about copy-and-patch. When you run CPython 3.15, there is no compiler in the process. LLVM is not linked into the runtime. The code generator that translates hot Python traces to native machine code does not parse, does not build an IR, does not run optimization passes. At runtime, compilation is a memory copy followed by address patching.
The compiler was run at CPython’s build time. When you, your distro, or python.org builds CPython from source, Clang compiles C implementations of each micro-operation into native code. The build system then processes these object files to extract per-uop machine code as byte arrays, identifies which byte positions contain relocatable references (the “holes”), and bakes all of that into static tables compiled into the CPython binary itself.
By the time you are running Python code, every stencil already lives in CPython’s data segment. JIT compilation means indexing into those tables, copying the right sequence of bytes into executable memory, and filling in the holes with values known only at runtime: jump targets, object addresses, cache offsets, stack slot indices.
What a Stencil Contains
Each uop has a corresponding stencil. Consider _BINARY_OP_ADD_INT, the uop for integer addition. At build time, Clang compiles a C implementation that looks roughly like this (simplified from the actual Python/executor_cases.c.h):
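    /* Operands sit on the Python value stack. */
    PyObject *left = stack_pointer[-2];
    PyObject *right = stack_pointer[-1];
    /* Guard: both operands must be exact ints, otherwise leave the trace. */
    if (!PyLong_CheckExact(left) || !PyLong_CheckExact(right)) {
        goto deoptimize;  // hole: runtime address of deoptimize target
    }
    PyObject *result = _PyLong_Add((PyLongObject *)left,
                                   (PyLongObject *)right);
    /* Replace the two operands with the single result. */
    stack_pointer[-2] = result;
    stack_pointer--;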
Clang compiles this with optimization enabled (-O3 in CPython’s Tools/jit build scripts), and the template source declares the values that are only known at runtime as externally defined symbols. These become relocatable references in the resulting object file. The CPython build tooling walks each object file’s relocation table, records which byte offsets need patching and what kind of value each requires, and produces a stencil descriptor:
    typedef struct {
        const uint8_t *bytes;   /* machine code bytes, baked into the binary */
        size_t nbytes;
        const Hole *holes;      /* {offset_into_bytes, hole_kind} pairs */
        size_t nholes;
    } Stencil;
At JIT compile time, the runtime copies stencil.bytes into its executable buffer and iterates stencil.holes, writing runtime-known values into the specified offsets. For a typical arithmetic uop, this might mean patching four to eight 64-bit values into a 40-80 byte machine code sequence. The whole operation takes microseconds.
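The copy-then-patch step can be sketched in a few lines of C. This is a minimal illustration, not CPython’s code: the hole kinds, the demo_stencil data, and the emit_stencil helper are all assumptions made up for the example, and the output goes into an ordinary byte buffer rather than executable memory, so nothing here is actually run as machine code.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical hole kinds; the real set lives in CPython's JIT sources. */
typedef enum { HOLE_JUMP_TARGET = 0, HOLE_OPERAND = 1 } HoleKind;

typedef struct {
    size_t offset;   /* byte offset of a 64-bit slot inside the stencil */
    HoleKind kind;   /* which runtime value belongs in that slot */
} Hole;

typedef struct {
    const uint8_t *bytes;
    size_t nbytes;
    const Hole *holes;
    size_t nholes;
} Stencil;

/* Made-up x86-64 stencil with two zeroed 64-bit slots. */
static const uint8_t demo_bytes[] = {
    0x48, 0xB8, 0, 0, 0, 0, 0, 0, 0, 0,   /* movabs rax, <operand hole> */
    0x48, 0xB9, 0, 0, 0, 0, 0, 0, 0, 0,   /* movabs rcx, <jump-target hole> */
    0xFF, 0xE1                            /* jmp rcx */
};
static const Hole demo_holes[] = {
    { 2,  HOLE_OPERAND },
    { 12, HOLE_JUMP_TARGET },
};
static const Stencil demo_stencil = {
    demo_bytes, sizeof demo_bytes,
    demo_holes, sizeof demo_holes / sizeof demo_holes[0],
};

/* Copy the stencil into `out`, then overwrite each hole with the runtime
 * value for its kind. `values` is indexed by HoleKind. The caller must
 * provide at least s->nbytes of space. Returns the bytes written. */
static size_t emit_stencil(uint8_t *out, const Stencil *s,
                           const uint64_t values[2])
{
    memcpy(out, s->bytes, s->nbytes);
    for (size_t i = 0; i < s->nholes; i++) {
        uint64_t v = values[s->holes[i].kind];
        /* memcpy avoids unaligned-store problems on strict platforms. */
        memcpy(out + s->holes[i].offset, &v, sizeof v);
    }
    return s->nbytes;
}
```

Calling emit_stencil(buf, &demo_stencil, values) with a 32-byte buf leaves the 22 template bytes in place and fills the two slots at offsets 2 and 12. The point of the sketch is that there is no branching on instruction semantics anywhere: the emitter only knows offsets and value kinds.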
This is what separates copy-and-patch from what Unladen Swallow attempted. Unladen Swallow ran LLVM at runtime: it compiled Python bytecode to LLVM IR, ran optimization passes, and emitted native code per function. LLVM’s optimization pipeline was built for ahead-of-time compilation; its startup costs and IR construction overhead assume the result will be used heavily. For Python functions that run a few thousand times, those costs did not pay off. Copy-and-patch’s runtime cost is low enough that CPython can begin JIT-compiling a trace after observing it around sixteen times, far below any threshold a traditional optimizing JIT could afford.
The Problem That Stencil Composition Introduces
Stencils are designed to be composed. The JIT assembles a trace by concatenating stencil copies in sequence, patching each one’s jump targets so control flows from one to the next without interpreter dispatch overhead. This composition is the fundamental runtime operation.
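A rough sketch of that composition step, under simplifying assumptions: each stencil here has exactly one 64-bit “continue” slot, and the MiniStencil type, the compose_trace function, and the two demo stencils are hypothetical names invented for the illustration. A real backend may be able to drop the jump entirely when the successor is laid out immediately after.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef struct {
    const uint8_t *bytes;
    size_t nbytes;
    size_t continue_offset;  /* where the 64-bit "next stencil" address goes */
} MiniStencil;

/* Two made-up stencils: filler bytes followed by an empty continue slot. */
static const uint8_t a_bytes[] = { 0x90, 0, 0, 0, 0, 0, 0, 0, 0 };
static const uint8_t b_bytes[] = { 0x90, 0x90, 0, 0, 0, 0, 0, 0, 0, 0 };
static const MiniStencil stencil_a = { a_bytes, sizeof a_bytes, 1 };
static const MiniStencil stencil_b = { b_bytes, sizeof b_bytes, 2 };

/* Lay `n` stencils out back to back starting at `buf`, patching each
 * one's continue slot with the address of its successor. The last
 * stencil's slot receives `exit_target`. Returns the total bytes
 * written, or 0 if `cap` is too small to hold the whole trace. */
static size_t compose_trace(uint8_t *buf, size_t cap,
                            const MiniStencil *const *trace, size_t n,
                            uint64_t exit_target)
{
    size_t total = 0;
    for (size_t i = 0; i < n; i++) {
        total += trace[i]->nbytes;
    }
    if (total > cap) {
        return 0;
    }

    size_t pos = 0;
    for (size_t i = 0; i < n; i++) {
        const MiniStencil *s = trace[i];
        memcpy(buf + pos, s->bytes, s->nbytes);
        size_t next = pos + s->nbytes;
        uint64_t target = (i + 1 < n)
            ? (uint64_t)(uintptr_t)(buf + next)   /* fall into successor */
            : exit_target;                        /* leave the trace */
        memcpy(buf + pos + s->continue_offset, &target, sizeof target);
        pos = next;
    }
    return pos;
}
```

Composing stencil_a and stencil_b writes 19 bytes: the slot inside the first copy holds the address of the second copy, and the slot inside the second holds the exit target.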
The problem is that stencils, as originally designed for CPython 3.13, were self-contained units. Each stencil expected its inputs on the Python value stack, a C array in the frame object accessed through stack_pointer, and wrote its outputs back to the same stack. There was no mechanism for adjacent stencils to share register state.
This meant that even when the tier-2 optimizer had proven a value flows directly from one uop to the next with no intervening code, the emitted machine code still materialized that value in memory between them. The stencils did not communicate register contents; each one started from the stack and ended at the stack.
On x86-64, a stack slot load from L1 cache costs three to four cycles. In a loop body consisting of several arithmetic uops, the load/store pairs between consecutive stencils could easily equal the interpreter dispatch overhead being eliminated. The pyperformance benchmarks showed near-neutral results through the 3.13 cycle because these costs approximately cancelled: dispatch overhead removed, memory round-trip overhead introduced. The tier-2 optimizer had correct analysis; the code generator had no path to use it.
How 3.15 Changes the Stencil Model
The register allocator in Python 3.15 adds a liveness analysis pass that runs over the tier-2 uop sequence before JIT compilation begins. For each position in the sequence, it determines which values are live: which values produced by earlier uops are still needed by later ones and have not yet been consumed.
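For a straight-line trace, liveness of this kind can be computed with a single backward scan. The sketch below is a generic textbook version, not CPython’s implementation: the TraceUop type, the value ids, and compute_liveness are hypothetical, and it assumes at most 32 values so that a live set fits in one bitmask.

```c
#include <stddef.h>
#include <stdint.h>

#define NO_VALUE (-1)

/* A uop in a hypothetical straight-line trace: it defines at most one
 * value and reads at most two. Values are small integer ids (0..31). */
typedef struct {
    int def;
    int use[2];
} TraceUop;

/* v0 = load a; v1 = load b; v2 = v0 + v1; v3 = v2 + v1; store v3 */
static const TraceUop demo_trace[5] = {
    { 0,        { NO_VALUE, NO_VALUE } },
    { 1,        { NO_VALUE, NO_VALUE } },
    { 2,        { 0,        1        } },
    { 3,        { 2,        1        } },
    { NO_VALUE, { 3,        NO_VALUE } },
};

/* Backward liveness scan. live_after[i] receives a bitmask of the values
 * that some uop after position i still reads. `live_out` is the set of
 * values that must survive past the end of the trace. */
static void compute_liveness(const TraceUop *uops, size_t n,
                             uint32_t live_out, uint32_t *live_after)
{
    uint32_t live = live_out;
    for (size_t i = n; i-- > 0; ) {
        live_after[i] = live;
        /* A definition ends the live range when walking backward... */
        if (uops[i].def != NO_VALUE) {
            live &= ~(UINT32_C(1) << uops[i].def);
        }
        /* ...and each use starts (or extends) one. */
        for (int k = 0; k < 2; k++) {
            if (uops[i].use[k] != NO_VALUE) {
                live |= UINT32_C(1) << uops[i].use[k];
            }
        }
    }
}
```

On demo_trace with nothing live at exit, the masks after each uop come out as {v0}, {v0, v1}, {v1, v2}, {v3}, and the empty set. Note that v1 stays live across the first addition because the second addition reads it again, which is exactly the kind of fact a register allocator needs.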
The allocator assigns register locations to live values and communicates those assignments to the stencil emitter. When it emits a stencil for _BINARY_OP_ADD_INT, instead of patching the stencil to load from stack_pointer[-2], it patches it to read from the register where the previous uop left the value. This required extending the stencil format: the holes mechanism already existed for jump targets and addresses; 3.15 adds hole kinds for register operands, so the allocator can specify which physical register appears in the instruction encoding at each operand position.
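To make the idea of a register-operand hole concrete, here is how such a hole could be filled for one x86-64 instruction. This is an illustration of the encoding only, under the assumption that the hole is the ModRM byte of a register-to-register add; the add_template array and patch_add_registers helper are hypothetical and are not taken from CPython.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* x86-64 numbers for some of the low eight general-purpose registers. */
enum { RAX = 0, RCX = 1, RDX = 2, RBX = 3, RSI = 6, RDI = 7 };

/* Template for `add dst, src` with 64-bit register operands:
 * REX.W prefix (0x48), opcode 0x01 (ADD r/m64, r64), then a ModRM byte
 * with mod=11 and both register fields left as zero for the patcher. */
static const uint8_t add_template[3] = { 0x48, 0x01, 0xC0 };
#define ADD_MODRM_OFFSET 2

/* Write the chosen registers into the ModRM byte: mod=11 (register
 * direct), reg=src, rm=dst. Only registers 0-7 are handled; r8-r15
 * would also need the REX.R / REX.B bits, which this sketch omits.
 * Returns 0 on success, -1 if a register number is out of range. */
static int patch_add_registers(uint8_t *code, unsigned dst, unsigned src)
{
    if (dst > 7 || src > 7) {
        return -1;
    }
    code[ADD_MODRM_OFFSET] = (uint8_t)(0xC0 | (src << 3) | dst);
    return 0;
}
```

Copying add_template into a buffer and calling patch_add_registers(code, RAX, RCX) yields the bytes 48 01 C8, the encoding of add rax, rcx. Unlike the 64-bit address holes, a register hole is only a few bits wide and sits inside an instruction byte, which is why it needs its own hole kind.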
The stencils themselves were recompiled to expose these register operand placeholders, and the build tooling was extended to classify the new hole types from the relocation information Clang emits. The format change is backward-compatible; stencils without register holes still exist for uops that must touch memory unconditionally.
The practical result: a sequence of integer arithmetic uops can now keep live values in registers through the full trace, with stack loads at trace entry and stores at trace exit. For a loop running a million iterations, the eliminated per-iteration memory traffic is substantial.
What the Stencil Model Still Cannot Do
The ceiling here is worth being honest about. Each stencil is compiled from C code that operates on PyObject * pointers and maintains CPython’s reference counting invariants. There is no mechanism to unbox a Python integer to a raw machine word within the stencil framework, because the stencil is compiled C that handles general Python objects.
PyPy’s tracing JIT can unbox integers in tight loops because its JIT generates IR that operates on raw values and uses escape analysis to determine when allocation can be elided entirely. On a loop doing millions of pure integer additions, PyPy’s emitted code can do add r12, r13 with no object overhead. CPython’s JIT, even with register allocation, still calls _PyLong_Add on heap-allocated PyLong objects, because the stencil was compiled from C that assumes reference-counted objects.
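The difference between the two models can be shown with a toy comparison. The BoxedInt type below is a deliberately simplified stand-in for a heap-allocated, reference-counted integer; it is an assumption for illustration and does not match the layout or API of CPython’s PyLongObject.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical boxed integer: heap-allocated and reference-counted. */
typedef struct {
    long refcount;
    int64_t value;
} BoxedInt;

static BoxedInt *box_new(int64_t v)
{
    BoxedInt *b = malloc(sizeof *b);
    if (b != NULL) {
        b->refcount = 1;
        b->value = v;
    }
    return b;
}

static void box_decref(BoxedInt *b)
{
    if (b != NULL && --b->refcount == 0) {
        free(b);
    }
}

/* Boxed sum of 0..n-1: every iteration allocates a fresh result object
 * and releases the previous one. Returns -1 if an allocation fails. */
static int64_t sum_boxed(int64_t n)
{
    BoxedInt *acc = box_new(0);
    if (acc == NULL) {
        return -1;
    }
    for (int64_t i = 0; i < n; i++) {
        BoxedInt *next = box_new(acc->value + i);
        box_decref(acc);
        if (next == NULL) {
            return -1;
        }
        acc = next;
    }
    int64_t result = acc->value;
    box_decref(acc);
    return result;
}

/* Unboxed sum of 0..n-1: the accumulator is a plain machine word that a
 * compiler can keep in a register for the whole loop. */
static int64_t sum_unboxed(int64_t n)
{
    int64_t acc = 0;
    for (int64_t i = 0; i < n; i++) {
        acc += i;
    }
    return acc;
}
```

Both functions return the same number, but the boxed loop pays for an allocation, a refcount update, and a free on every iteration. Register allocation across stencils removes loads and stores around that work; it does not remove the work itself.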
For compute-bound numeric code, Numba and JAX already handle this domain with LLVM and XLA backends that work at the array level. CPython’s JIT targets general-purpose Python: web framework code, data transformation pipelines, business logic. For those workloads, reducing interpreter dispatch and memory round-trip overhead within the existing object model is the right lever, even if it is not the same lever that makes PyPy fast on tight numeric loops.
The Gap the Register Allocator Closes
The important framing here is not “CPython’s JIT is now fast” but “CPython’s JIT is now honest.” Before 3.15, the tier-2 optimizer was producing correct and useful analysis: type information, guard elision, liveness data. That analysis existed in the uop IR and went unused at emit time, because the code generator had no mechanism to convert liveness information into register assignments across stencil boundaries.
The register allocator closes that gap. Analysis that was previously discarded now drives instruction encoding decisions. This matters not just for the direct speedup but because it means every subsequent investment in tier-2 analysis quality has a path to emitted code. Longer traces, better guard elision, stronger abstract interpretation across loop back-edges: all of these are easier to justify when the backend can actually use the information they produce.
Ken Jin’s write-up on the 3.15 progress describes consistent positive pyperformance improvements over the JIT-disabled baseline, which is the milestone that makes enabling the JIT by default in 3.15 defensible. The build-time stencil pipeline did not need to change architecturally; it needed one more hole kind and one more analysis pass at the boundary between the tier-2 optimizer and the machine-code emitter. That addition completes the pipeline from uop IR to register-allocated native code, and the results reflect it.