Copy-and-Patch Finally Gets Register Allocation Right in Python 3.15
Source: lobsters
Python’s JIT compiler has been the subject of cautious optimism for a few years now. It shipped experimentally in 3.13, stayed off by default, and for most workloads it was measurably slower than just running the bytecode interpreter. That’s not a knock on the people building it; it’s a predictable consequence of how JIT compilers have to earn their overhead. The progress report on Python 3.15’s JIT, written by CPython core developer Ken Jin, is the first time the numbers actually look encouraging in a non-cherry-picked way.
To understand why this took until 3.15, you need to understand what the copy-and-patch JIT actually is, what it was missing, and why register allocation in particular was the blocking problem.
What Copy-and-Patch Is
Most JIT compilers generate machine code at runtime by emitting instruction sequences, performing register allocation, and doing all the classic compiler backend work dynamically. The resulting code is fast to execute but expensive to produce. Copy-and-patch takes a different approach: at CPython’s build time, a C compiler (Clang, specifically) compiles pre-written “stencils” for each bytecode operation and records where the runtime values need to be patched in. At runtime, the JIT copies those pre-compiled stencils into memory and patches the holes with actual addresses and constants.
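As a rough illustration of the copy step and the patch step, here is a minimal sketch in Python. It is not CPython’s implementation: the `Stencil` class, the stencil names, and the hole layout are all hypothetical, and the bytes are only assembled, never executed.

```python
import struct

class Stencil:
    """Hypothetical stencil: pre-compiled bytes plus offsets of 8-byte holes."""
    def __init__(self, name, code, hole_offsets):
        self.name = name
        self.code = code
        self.hole_offsets = hole_offsets

HOLE = b"\x00" * 8

# 48 B8 imm64 is the x86-64 encoding of `movabs rax, imm64`; the immediate
# is left as a hole to be filled in at runtime.
LOAD_CONST = Stencil("LOAD_CONST", b"\x48\xb8" + HOLE, [2])
# C3 is `ret`; it has no holes.
RET = Stencil("RET", b"\xc3", [])

def copy_and_patch(trace):
    """trace: list of (stencil, values), one value per hole in the stencil."""
    buf = bytearray()
    for stencil, values in trace:
        base = len(buf)
        buf += stencil.code                      # copy the pre-compiled bytes
        for offset, value in zip(stencil.hole_offsets, values):
            # patch: overwrite the hole with a little-endian 64-bit value
            struct.pack_into("<Q", buf, base + offset, value)
    return bytes(buf)

code = copy_and_patch([(LOAD_CONST, [42]), (RET, [])])
print(code.hex())
```

Nothing in `copy_and_patch` makes a decision about instructions or registers. All of that was settled when the stencil bytes were produced, which is why the approach is so cheap at runtime.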
Copy-and-patch comes from a 2021 OOPSLA paper, “Copy-and-Patch Compilation,” by Haoran Xu and Fredrik Kjolstad, which demonstrated the technique on a WebAssembly compiler, among other targets, before CPython adopted it. The advantage is compile latency: generating a JIT trace takes microseconds rather than the milliseconds a full LLVM pipeline would require. The disadvantage is that the quality of the generated code is constrained by whatever the stencils produce.
And that’s where the register allocation problem lives.
The Register Allocation Gap
When CPython’s interpreter executes bytecode, it maintains an “evaluation stack” in memory. Operands are pushed onto and popped from this stack as operations run. The specializing adaptive interpreter introduced in 3.11 made this faster by caching type information and eliminating much of the dispatch overhead, but the fundamental stack-in-memory model remained.
The early copy-and-patch JIT essentially preserved this model. The stencils for each operation would load values from the stack frame, do some work, and store results back. From a correctness standpoint that’s fine. From a performance standpoint it means the CPU is spending cycles on loads and stores that could be eliminated if values were kept in registers between operations.
Register allocation, in this context, means the JIT figures out that a value produced by one stencil is immediately consumed by the next and arranges for that value to stay in a CPU register rather than bouncing through memory. On x86-64, you have sixteen general-purpose registers; a tight numeric loop that bounces values through the stack instead of using those registers is leaving significant performance on the table.
The interpreter can afford to be stack-based because it has other overheads dominating the cost. The JIT, which is supposed to reduce those overheads, ends up being slower if it also has to do all those stack loads and stores that a good compiled loop wouldn’t need.
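To put rough numbers on that, here is a toy model, not CPython code, that counts evaluation-stack loads and stores for a short operation sequence. The `cached_slots` parameter is a hypothetical knob for how many top-of-stack values may live in registers; setting it to zero gives the pure stack-in-memory model.

```python
def memory_traffic(ops, cached_slots=0):
    """ops: list of (pops, pushes) per operation. Returns (loads, stores)
    against the in-memory evaluation stack."""
    loads = stores = 0
    in_regs = 0  # how many top-of-stack values currently sit in registers
    for pops, pushes in ops:
        from_regs = min(pops, in_regs)
        loads += pops - from_regs      # remaining inputs come from memory
        in_regs -= from_regs
        for _ in range(pushes):
            if in_regs < cached_slots:
                in_regs += 1           # result stays in a register
            else:
                stores += 1            # no free register: one value is written out
    return loads, stores

# Roughly `x = a + b * c`: three pushes, a multiply, an add, a store to a local.
OPS = [(0, 1), (0, 1), (0, 1), (2, 1), (2, 1), (1, 0)]

print(memory_traffic(OPS, cached_slots=0))  # (5, 5)
print(memory_traffic(OPS, cached_slots=2))  # (1, 1)
print(memory_traffic(OPS, cached_slots=3))  # (0, 0)
```

The model ignores everything except stack traffic, so it says nothing about real timings. It only shows how quickly the loads and stores disappear once a few values are allowed to stay in registers.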
Why This Took Until 3.15
The copy-and-patch design makes register allocation harder than in a traditional JIT. Because stencils are pre-compiled by Clang, the JIT can’t just decide at runtime to rewrite how a stencil uses registers. The stencils have fixed register usage dictated by the calling convention and Clang’s code generation.
The solution Ken Jin and the Faster CPython team landed on involves extending the stencil system to generate variants for different register assignments and threading register liveness information through the trace. When the JIT builds a trace of hot operations, it can now determine which values are “live” across stencil boundaries and select stencil variants that keep those values in registers rather than spilling them to the stack.
This required changes at multiple levels: the stencil generation tooling at build time, the trace compilation logic at runtime, and the representation of the type information flowing through the tier 2 optimizer. It’s not surprising it took several release cycles to get right.
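One way to picture variant selection is the sketch below. It assumes, purely for illustration, that each stencil is built in several variants named by how many stack values are in registers on entry and on exit; the `_rN_rM` naming, `MAX_REGS`, and the trace format are hypothetical and not taken from CPython’s source.

```python
MAX_REGS = 3  # hypothetical limit on register-resident stack values

def select_variants(trace):
    """trace: list of (name, pops, pushes). Returns one variant name per op,
    of the form NAME_r<in>_r<out>."""
    cached = 0  # values held in registers at the current stencil boundary
    chosen = []
    for name, pops, pushes in trace:
        entry = cached
        # Inputs are consumed from registers first; results are kept in
        # registers up to the limit.
        exit_ = min(max(cached - pops, 0) + pushes, MAX_REGS)
        chosen.append(f"{name}_r{entry}_r{exit_}")
        cached = exit_
    return chosen

TRACE = [
    ("LOAD_FAST", 0, 1),
    ("LOAD_FAST", 0, 1),
    ("BINARY_ADD", 2, 1),
    ("STORE_FAST", 1, 0),
]
for variant in select_variants(TRACE):
    print(variant)
```

The runtime still only copies and patches. The register decisions were made at build time, once per variant, and the trace compiler’s job reduces to picking variants whose exit state matches the next stencil’s entry state.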
The Tier Architecture Context
CPython’s execution model since 3.13 has two effective tiers. Tier 1 is the specializing adaptive interpreter, which counts how many times each bytecode gets executed, infers types from observed values, and replaces generic operations with type-specific “specialized” variants. This is what gave 3.11 and 3.12 their significant speedups.
Tier 2 is the optimizer and JIT. When a loop runs often enough, the tier 2 system traces the hot path, builds an intermediate representation called “micro-ops,” applies optimizations like constant propagation and guard elimination, and hands the result to the copy-and-patch compiler. The JIT then emits native code for that trace.
The quality of what the JIT produces depends heavily on what the tier 2 optimizer was able to determine. If the optimizer could prove that a variable always holds a Python integer of a certain magnitude, the JIT can skip the type check guards entirely. Register allocation operates on top of this: once the optimizer has trimmed the unnecessary operations, the JIT can assign registers to the values that remain.
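Guard elimination is the easiest of these optimizations to sketch. The toy pass below drops a type guard when the same fact has already been established earlier in the trace and nothing has overwritten the variable since. The micro-op names and tuple format are hypothetical and far simpler than CPython’s actual representation.

```python
def eliminate_guards(trace):
    """trace: list of tuples such as ("GUARD_TYPE", var, type),
    ("STORE", var), or any other op. Returns the trace without
    redundant guards."""
    known = {}  # variable name -> type already proven on this path
    out = []
    for op in trace:
        kind = op[0]
        if kind == "GUARD_TYPE":
            _, var, typ = op
            if known.get(var) == typ:
                continue          # already proven: guard is redundant
            known[var] = typ
        elif kind == "STORE":
            _, var = op
            known.pop(var, None)  # new value, so the old proof no longer holds
        out.append(op)
    return out

TRACE = [
    ("GUARD_TYPE", "x", "int"),
    ("GUARD_TYPE", "y", "int"),
    ("ADD_INT", "x", "y"),
    ("GUARD_TYPE", "x", "int"),   # redundant, removed
    ("STORE", "x"),
    ("GUARD_TYPE", "x", "int"),   # needed again after the store
]
for op in eliminate_guards(TRACE):
    print(op)
```

Every guard removed here is one less stencil to copy, and one less stencil boundary for values to cross.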
What the Benchmarks Show
Prior to these improvements, building CPython 3.13 with the --enable-experimental-jit configure option would often yield results a few percent slower than the default interpreter on the pyperformance benchmark suite. There were individual benchmarks where it helped, mostly numeric loops where the guard-elimination benefits were large enough to outweigh the overhead of the extra stack traffic, but the aggregate picture was not good.
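When comparing numbers like these yourself, it is worth confirming that the interpreter you are running actually has the JIT built in and switched on. The sketch below relies on the private `sys._jit` namespace added in 3.14; it is an underscore-prefixed API that may change, and the `jit_status` helper is just an illustrative wrapper that falls back gracefully on older versions.

```python
import sys

def jit_status():
    """Report whether this interpreter has a JIT and whether it is enabled.
    Returns False for both on builds without the sys._jit namespace."""
    jit = getattr(sys, "_jit", None)
    if jit is None:
        return {"available": False, "enabled": False}
    return {
        "available": bool(jit.is_available()),
        "enabled": bool(jit.is_enabled()),
    }

print(sys.version_info[:2], jit_status())
```

On a build that includes the JIT, it can be toggled at startup with the PYTHON_JIT environment variable, which makes it straightforward to run the same benchmark with and without it.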
With the register allocation work in 3.15, the aggregate is now net positive on the benchmarks that exercise hot loops with numeric or container operations. The improvement isn’t uniform; Python’s benchmark suite covers a wide range of workloads, and the JIT only helps code that’s in hot enough loops to get traced. Startup-heavy or I/O-bound workloads see no benefit because those code paths never warm up.
The more meaningful comparison is against other Python JIT efforts. PyPy’s JIT has been doing register allocation for years and achieves genuinely impressive throughput on numeric workloads, often five to ten times faster than CPython on tight loops. CPython’s JIT is not chasing those numbers; the design goals are different. PyPy is an alternative runtime that can diverge from CPython semantics in edge cases; CPython’s JIT has to work within the existing object model and maintain full CPython compatibility. The constraint space is much tighter.
Why Previous Attempts Failed
The current progress is best read against this history. Python has been here before. Unladen Swallow, a Google-sponsored project, tried to add an LLVM-based JIT to CPython around 2009 and 2010. The project eventually stalled because LLVM’s compilation latency was too high for the short-lived objects and small functions typical in Python programs, and the optimizer’s assumptions about value types didn’t survive contact with Python’s dynamic dispatch.
Psyco worked for Python 2, generating type-specialized native code, but didn’t survive the Python 3 transition. Cinder, Meta’s production fork of CPython, has a JIT that powers Instagram’s backend, but it’s a fork with divergent maintenance costs. Pyjion tried a plugin-based JIT using .NET’s runtime and never got traction.
The copy-and-patch approach avoids the Unladen Swallow trap by keeping compile latency extremely low. The focus on CPython’s own tier 2 optimizer rather than an external compiler avoids the semantics mismatch problem. The incremental development model, where the JIT ships disabled and gets turned on only as the numbers improve, avoids the trap of announcing results that don’t hold up in production.
What Comes Next
Register allocation was the structural missing piece, but it’s not the only remaining work. The tier 2 optimizer’s type inference is still conservative in many cases, which means the JIT ends up emitting more guards than necessary. Improving the type analysis upstream of the JIT will directly improve the quality of what gets compiled. There’s also ongoing work on the stencil generation tooling to give Clang more hints about what to optimize for.
The JIT will likely remain experimental and disabled by default in 3.15 as the team builds confidence in the register allocation changes across different platforms and CPU architectures. The copy-and-patch stencils have to be generated and tested on x86-64, ARM64, and the other supported targets, and register assignments that work well on one architecture can behave differently on another.
Getting to break-even with the interpreter was the necessary precondition for the JIT to be taken seriously. 3.15 appears to be the release where that precondition is finally met.