
Register Allocation Was Always the Bottleneck in CPython's JIT

Source: hackernews

Python 3.13 shipped an experimental JIT compiler that was, by most benchmarks, slower than the interpreter it was supposed to accelerate. That outcome was not a failure of the overall design, though it certainly looked like one from the outside. It was a deliberate trade-off: lay the infrastructure, accept the overhead, fix the fundamentals in a later release. Python 3.15 is where those fundamentals get fixed, and the progress report from Ken Jin is worth understanding in detail because the core problem turns out to be one of the most fundamental problems in compiler construction.

The problem is register allocation.

What CPython’s JIT Actually Is

Before getting into what was broken, it helps to understand the architecture. CPython’s execution model has two tiers. The first tier is the specializing adaptive interpreter described in PEP 659 and shipped in Python 3.11: as bytecode executes, inline caches record the types observed at each instruction site, and after enough repetitions, generic instructions like LOAD_ATTR are replaced in place with specialized variants like LOAD_ATTR_INSTANCE_VALUE that skip the generic dispatch and go directly to the fast path. This was a major contributor to the roughly 25% average speedup of Python 3.11 over 3.10.
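A toy model of that quickening step, written in Python rather than CPython’s C, may make the mechanism concrete. Everything here is an assumption made for illustration: the Instr class, the WARMUP threshold and the execute function are invented, and real CPython keeps its counters and caches inline in the bytecode with different thresholds.

```python
from dataclasses import dataclass
from typing import Optional

WARMUP = 3  # hypothetical threshold; CPython's real counters differ


@dataclass
class Instr:
    opname: str                         # e.g. "LOAD_ATTR"
    attr: str                           # attribute name to load
    counter: int = 0                    # executions seen while still generic
    cached_type: Optional[type] = None  # inline cache: type seen at this site


def execute(instr: Instr, obj: object) -> object:
    if instr.opname == "LOAD_ATTR_INSTANCE_VALUE":
        # Specialized form: one cheap guard, then the fast path.
        if type(obj) is instr.cached_type:
            return obj.__dict__[instr.attr]
        # Guard failed: fall back to the generic instruction.
        instr.opname, instr.counter = "LOAD_ATTR", 0

    # Generic form: full attribute lookup plus bookkeeping.
    value = getattr(obj, instr.attr)
    instr.counter += 1
    if instr.counter >= WARMUP and instr.attr in getattr(obj, "__dict__", {}):
        instr.opname = "LOAD_ATTR_INSTANCE_VALUE"  # rewrite in place
        instr.cached_type = type(obj)
    return value
```

The point of the sketch is the shape of the trade: the specialized path pays one type check and then skips the general lookup, and a failed check simply returns the instruction to its generic form.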

The second tier is a trace-based optimizer that kicks in when the tier-1 interpreter detects a hot backward jump, which almost always means a loop. It builds a linear sequence of micro-ops, or uops, representing what the loop body actually does, with the type specialization assumptions encoded as guard checks. The tier-2 optimizer runs passes over this uop trace: constant folding, guard elimination, limited type propagation. Then the JIT takes over.

The JIT itself uses a technique called copy-and-patch, from a 2021 paper by Haoran Xu and Fredrik Kjolstad. The core idea is to do the hard compiler work at CPython build time rather than at Python runtime. At build time, Clang compiles a “template” C function for each uop into an object file. These templates are not fully linked; wherever the template needs a runtime value (an object address, a jump target, a frame field offset), a relocation is left as a hole. The build system extracts the machine code bytes from each template and stores them as stencils: byte arrays paired with metadata describing what goes in each hole.

At runtime, the JIT works by copying a stencil’s machine code into an executable memory region, then patching the holes with the actual values. No LLVM, no IR construction, no optimization passes at runtime. The cost of “compiling” a trace is a handful of memcpy calls and pointer patches. This is why copy-and-patch is attractive: it gives you native code with essentially zero compilation latency.
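The copy and the patch can be sketched in a few lines of Python. This is a minimal model under stated assumptions, not CPython’s emitter: the Stencil and Hole classes and the emit function are hypothetical, every hole is assumed to be a 64-bit little-endian value, and the result is only returned as bytes rather than placed in executable memory. The example stencil uses the x86-64 encoding of movabs rax, imm64 (bytes 48 B8 followed by an 8-byte immediate).

```python
import struct
from dataclasses import dataclass


@dataclass(frozen=True)
class Hole:
    offset: int   # byte offset of the hole inside the stencil body
    symbol: str   # name of the runtime value that belongs there


@dataclass(frozen=True)
class Stencil:
    body: bytes                # machine code produced at build time
    holes: tuple[Hole, ...]    # where the runtime values go


def emit(trace):
    """trace: list of (stencil, {symbol: value}). Returns patched code."""
    code = bytearray()
    for stencil, values in trace:
        base = len(code)
        code += stencil.body                       # the "copy"
        for hole in stencil.holes:                 # the "patch"
            struct.pack_into("<Q", code, base + hole.offset,
                             values[hole.symbol])
    return bytes(code)


# Illustrative stencil: movabs rax, <imm64>, with the immediate left as a hole.
LOAD_CONST = Stencil(b"\x48\xb8" + bytes(8), (Hole(2, "value"),))

code = emit([
    (LOAD_CONST, {"value": 0x1122334455667788}),
    (LOAD_CONST, {"value": 1}),
])
```

Nothing in emit inspects or transforms the machine code; all the instruction selection happened when the stencil body was produced, which is where the low compilation latency comes from.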

The Stack Spill Problem

Here is where the design runs into trouble. Each stencil is compiled independently by Clang at build time. Clang has no idea what stencil came before or after in a trace; it only sees one uop’s worth of code. This means each stencil must treat CPU registers as if entering cold: it cannot assume any particular register holds any particular value from the previous stencil. The stencils communicate through memory.

In CPython, that memory is the evaluation stack: a region in the frame that holds intermediate values between bytecode operations. So when the JIT assembles a trace like _BINARY_OP_ADD_INT followed by _STORE_FAST, the integer addition result does not stay in a register. It gets written to the eval stack at the end of the addition stencil, then reloaded from the eval stack at the start of the store stencil. That is a store followed immediately by a load of the same value.
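To put a rough number on that, here is a back-of-the-envelope count for a short trace. The trace and its pop and push counts are illustrative assumptions, and the model counts only eval-stack traffic at stencil boundaries; it ignores reads and writes of local variables and anything a stencil does internally.

```python
# (uop name, values popped from the eval stack, values pushed onto it)
TRACE = [
    ("_LOAD_FAST", 0, 1),
    ("_LOAD_FAST", 0, 1),
    ("_BINARY_OP_ADD_INT", 2, 1),
    ("_STORE_FAST", 1, 0),
]


def stack_traffic(trace):
    """Count eval-stack loads and stores when every stencil starts cold."""
    loads = sum(pops for _, pops, _ in trace)       # each input is reloaded
    stores = sum(pushes for _, _, pushes in trace)  # each output is written
    return loads, stores


loads, stores = stack_traffic(TRACE)
```

For this four-uop trace the model gives three stores and three loads, all of which a compiler that saw the whole sequence could keep in registers.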

Modern CPUs can hide some of this overhead via store-to-load forwarding, where the load unit detects that the address being loaded was just written and takes the data straight from the store buffer instead of going to the cache. But store-to-load forwarding has latency, consumes load/store buffer entries, and is not free. More importantly, it is less dependable for values that stay live across several intervening uops, and it does not eliminate the instruction count overhead.

Across an entire trace, this pattern repeats at every uop boundary. Every value computed in one stencil and consumed in the next one goes through memory. The JIT was generating native code, but it was generating native code that resembled a very fast interpreter more than it resembled what a compiler would produce for the same computation.

Ken Jin’s post is direct about this: the spill traffic was large enough that the JIT’s native code was losing to the tier-1 interpreter, which at least had the benefit of simpler, more cache-friendly memory access patterns and no compilation overhead to amortize.

What 3.15 Adds

The central fix is a linear-scan register allocator operating over the uop trace. Linear-scan register allocation is a well-understood algorithm: scan uops in order, track which values are live at each point (liveness analysis), assign registers to live values, and spill to memory when register pressure exceeds the available register count. It runs in O(n) time on the trace length, which is what you want when compilation latency matters.
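The algorithm is short enough to sketch in full. What follows is the textbook form of linear scan over a straight-line trace, not CPython’s implementation: the trace format (a destination value plus its source values, each value defined once), the function names and the register names are all assumptions made for the example.

```python
def live_intervals(trace):
    """trace: list of (dest, sources). Returns {value: (start, end)}."""
    intervals = {}
    for i, (dest, sources) in enumerate(trace):
        for src in sources:
            start, _ = intervals[src]
            intervals[src] = (start, i)      # extend to the latest use
        if dest is not None:
            intervals[dest] = (i, i)         # value is born here
    return intervals


def linear_scan(trace, registers):
    """Map each value to a register name, or "spill" under pressure."""
    intervals = live_intervals(trace)
    free = list(registers)
    active = []      # (end, value) pairs currently holding a register
    location = {}
    for value, (start, end) in sorted(intervals.items(),
                                      key=lambda kv: kv[1][0]):
        # Expire intervals that ended before this one starts.
        for old_end, old in sorted(active):
            if old_end >= start:
                break
            active.remove((old_end, old))
            free.append(location[old])
        if free:
            location[value] = free.pop()
            active.append((end, value))
            continue
        # No register free: spill whichever value lives longest.
        far_end, far = max(active)
        if far_end > end:
            location[value] = location[far]  # steal its register
            location[far] = "spill"
            active.remove((far_end, far))
            active.append((end, value))
        else:
            location[value] = "spill"
    return location


TRACE = [
    ("a", []),            # a = load
    ("b", []),            # b = load
    ("c", ["a", "b"]),    # c = a + b
    ("d", ["c", "a"]),    # d = c * a
    (None, ["d"]),        # store d
]
two_regs = linear_scan(TRACE, ["r0", "r1"])
three_regs = linear_scan(TRACE, ["r0", "r1", "r2"])
```

With three registers every value in this trace stays in a register; with two, one value has to spill. This version re-sorts the active list on each step for brevity, whereas a production allocator would keep it ordered.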

The architectural challenge is that the copy-and-patch approach was not designed for this. Stencils have fixed machine code with fixed register usage; Clang chose whatever registers it wanted when compiling each template. Making stencils compatible with a runtime register allocator requires making register references into patchable holes, the same way constant addresses are patchable holes. This means the build system needs to generate stencil variants where register encodings are treated as relocations rather than fixed bytes, and the JIT emitter needs to patch those fields with the allocator’s register assignments alongside the runtime values.

This is more complex than the original copy-and-patch design, but it preserves the essential property: no LLVM at runtime, no IR construction, just fast stencil patching.

The second major improvement is more aggressive type propagation through the tier-2 optimizer. When a _GUARD_TYPE_VERSION check succeeds for an object, the optimizer now propagates the knowledge that the object has a known type version to all downstream uops that reference it. This allows eliminating redundant guards: if x has already been confirmed as an int, later operations on x do not need to re-verify its type. In typical Python loops where the same variables are accessed repeatedly, this can eliminate a substantial fraction of the guard branches. Ken Jin’s early benchmarks showed guard elimination removing 30 to 50 percent of guards in common loop patterns.
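A stripped-down version of that pass shows the idea. This is a sketch under simplifying assumptions, not the tier-2 optimizer: uops are modeled as (name, operand) pairs keyed by variable name, the function name is invented, and the only event assumed to invalidate a proven type is a _STORE_FAST to the same variable. A real optimizer also has to account for any operation that could change an object’s type version.

```python
def eliminate_redundant_guards(trace):
    """trace: list of (uop, operand). Drops guards that are already proven."""
    known = set()    # operands whose type version has been checked
    out = []
    for uop, operand in trace:
        if uop == "_GUARD_TYPE_VERSION":
            if operand in known:
                continue              # verified earlier in the trace
            known.add(operand)
        elif uop == "_STORE_FAST":
            known.discard(operand)    # rebinding invalidates the proof
        out.append((uop, operand))
    return out


TRACE = [
    ("_GUARD_TYPE_VERSION", "x"),
    ("_LOAD_ATTR", "x"),
    ("_GUARD_TYPE_VERSION", "x"),   # redundant: x was checked above
    ("_LOAD_ATTR", "x"),
    ("_STORE_FAST", "x"),
    ("_GUARD_TYPE_VERSION", "x"),   # needed again after the store
]
optimized = eliminate_redundant_guards(TRACE)
```

Because a trace is linear, a single forward pass with a set of known facts is enough; there are no control-flow merges to reconcile.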

Type propagation also enables stronger code generation. Knowing both operands of an addition are integers allows the JIT to emit a direct integer add rather than calling PyNumber_Add, which dispatches through the nb_add slot of the operand’s type. Combined with register allocation keeping both operands in registers, this starts to resemble what you would get from a static compiler.

The 3.15 work also expands trace coverage by supporting function call inlining inside traces. If a JIT-compiled loop calls a small known Python function repeatedly, the callee’s uops can be incorporated directly into the trace rather than deoptimizing back to the interpreter for each call. This matters because Python code tends to decompose into many small functions, and without inlining, loop-heavy code that calls helper functions cannot be fully compiled.

The Break-Even Threshold

One framing from Ken Jin’s post that clarifies the progression is the concept of break-even. The JIT has fixed costs: compilation time (even if small with copy-and-patch), executable memory allocation, and the overhead of guard checks in compiled code. For the JIT to be useful, the time saved by running native code must exceed these fixed costs across the lifetime of the trace.
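The arithmetic behind break-even is simple enough to write down. In this sketch the function name and the cost figures are made up, the units are arbitrary, and per-iteration guard overhead is assumed to be folded into the JIT’s per-iteration cost.

```python
import math


def break_even_iterations(fixed_cost, interp_per_iter, jit_per_iter):
    """Iterations needed to repay the JIT's fixed cost, or None if never."""
    saving = interp_per_iter - jit_per_iter
    if saving <= 0:
        return None    # native code is no faster: the cost is never repaid
    return math.ceil(fixed_cost / saving)


# Hypothetical numbers, chosen only to show the two regimes.
faster_jit = break_even_iterations(1000, interp_per_iter=10, jit_per_iter=8)
slower_jit = break_even_iterations(1000, interp_per_iter=10, jit_per_iter=12)
```

The second case is the important one: when the compiled code is slower per iteration than the interpreter, no number of iterations recovers the fixed cost.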

In Python 3.13, the JIT was below break-even on most pyperformance benchmarks: the spill overhead made native execution slow enough that even amortizing the compilation cost over many loop iterations left the JIT behind the interpreter. Python 3.14 brought the JIT approximately to break-even on some benchmarks through incremental stencil improvements and better optimizer passes, but the structural register spill problem remained. Python 3.15, with register allocation, is the first release where the JIT is expected to be net positive across the pyperformance suite, with early prototype numbers showing 5 to 15 percent improvement on typical benchmarks and 20 to 30 percent on numeric loop-heavy workloads like nbody and spectral_norm.

Context Against Other JITs

Comparing CPython’s trajectory to PyPy is useful not to be discouraging but to calibrate expectations. PyPy’s meta-tracing JIT is 15 years old and includes full register allocation, unboxing of numeric types, aggressive escape analysis, and loop optimization. PyPy is typically 5 to 10 times faster than CPython on CPU-bound pure-Python code. CPython is not trying to match that. The goal is to improve on CPython’s own baseline while keeping full compatibility with C extensions, which PyPy has historically struggled with because C extensions written for CPython’s C API make assumptions about object layout that PyPy’s object model violates.

A closer comparison is LuaJIT, which uses a trace-based JIT with linear-scan register allocation, type inference, and numeric unboxing, and runs tight loops 10 to 50 times faster than the reference Lua interpreter. Lua’s type system is simpler than Python’s, which makes the optimization problem easier, but the architectural approach (trace JIT, copy-and-patch-adjacent stencil emission, register allocation) is what CPython is now converging toward.

The V8 JavaScript engine comparison is less relevant here. V8 uses method-based compilation rather than trace-based, and its full optimization pipeline (through TurboFan) builds a full SSA-based IR and applies global value numbering and cross-function inlining. That is a different engineering trade-off: much higher compilation latency, much higher peak performance potential. CPython’s copy-and-patch approach deliberately trades peak performance ceiling for near-zero warmup cost, which is appropriate for Python’s typical workload distribution: many short-lived scripts, web servers where startup latency matters, scientific code where the bottleneck is often in C extension calls anyway.

What Comes After

The 3.15 JIT being on track is meaningful not because 3.15 will make CPython competitive with PyPy, but because it establishes that the copy-and-patch architecture can actually deliver performance gains once register allocation is in place. The infrastructure built for 3.15 (parameterized stencils, a working register allocator, stronger type propagation) is the foundation for more aggressive optimizations in later releases: numeric type unboxing, escape analysis for frame objects, wider inlining.

The CPython issue tracker and the python-dev mailing list have been tracking these developments openly. For anyone building performance-sensitive Python systems today, none of this changes the calculus yet: PyPy and Cython are still the right tools for CPU-bound work where CPython’s baseline is insufficient. But 3.15 is the first release where running the JIT will not cost you anything, and that is a prerequisite for everything that follows.
