
From V8 to CPython: The JIT Maturity Path Python Is Finally On

Source: hackernews

Python’s JIT compiler has more than a decade of precedent to draw on. JavaScript JIT compilers went through the same developmental stages in the late 2000s and early 2010s, and their history maps closely onto what CPython is doing in 3.15. Ken Jin’s news that Python 3.15’s JIT is back on track follows a pattern V8 and SpiderMonkey already completed: type speculation, then IR-based optimization, then register allocation, each layer depending on the previous one.

V8’s First Run at This Problem

When V8 launched in 2008 it was a full JIT compiler with no interpreter. The team at Google, led by Lars Bak, made an explicit bet that eliminating interpreter overhead was worth the startup cost. At launch V8 compiled JavaScript straight to native code with a single fast, non-optimizing compiler and relied on inline caches for speed. The two-tier arrangement, a “full compiler” that produced adequate code quickly paired with a profiling-based optimizer that recompiled hot functions more aggressively, arrived with Crankshaft in 2010.

V8’s full compiler had limited register allocation. It kept some values in registers within individual expressions but did not perform liveness analysis across statement boundaries. Values that crossed statement boundaries were spilled to a hidden stack slot. This was competitive enough for 2008, but as benchmark suites grew more demanding the limitations became visible.

Crankshaft, V8’s first serious optimizing compiler introduced in 2010, added proper linear scan register allocation over a static single-assignment IR. SSA form makes liveness analysis tractable because each value is defined exactly once, so liveness intervals are explicit in the representation. Linear scan could assign registers without building full interference graphs. The improvement on numeric and array-heavy code was substantial.
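Linear scan is compact enough to sketch. The following is a minimal Python illustration of the textbook algorithm over precomputed live intervals, not V8’s implementation; the Interval type, the linear_scan function, and the register names r0 and r1 are all hypothetical, and the intervals are invented for the example.

```python
from typing import NamedTuple


class Interval(NamedTuple):
    name: str
    start: int  # index of the instruction that defines the value
    end: int    # index of the last instruction that uses it


def linear_scan(intervals, num_regs):
    """Map each value name to a register name or to 'spill'."""
    free = [f"r{i}" for i in range(num_regs)]
    active = []    # intervals that currently hold a register
    location = {}
    for cur in sorted(intervals, key=lambda iv: iv.start):
        # Expire intervals that ended before this one starts.
        for old in [iv for iv in active if iv.end < cur.start]:
            active.remove(old)
            free.append(location[old.name])
        if free:
            location[cur.name] = free.pop(0)
            active.append(cur)
            continue
        # No register free: spill whichever candidate lives longest.
        victim = max(active, key=lambda iv: iv.end, default=None)
        if victim is not None and victim.end > cur.end:
            location[cur.name] = location[victim.name]
            location[victim.name] = "spill"
            active.remove(victim)
            active.append(cur)
        else:
            location[cur.name] = "spill"
    return location


intervals = [
    Interval("a", 0, 6),
    Interval("b", 1, 2),
    Interval("c", 3, 4),
    Interval("d", 3, 5),
]
allocation = linear_scan(intervals, num_regs=2)
print(allocation)
```

With two registers, the long-lived a is the value that gets spilled, while b and c share r1 because their intervals never overlap. The whole allocation is one pass over the sorted intervals, which is why no interference graph is needed.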

Crankshaft had a fundamental limitation though. It was built on the assumption that type information was available for the values it optimized. When JavaScript code exhibited unusual type patterns, Crankshaft bailed out. Some JavaScript patterns were simply never compilable by Crankshaft: certain uses of try/catch, with statements, specific argument handling idioms. These became known as “optimization killers” and caused real pain for library authors trying to write fast JavaScript.

TurboFan and the Lesson About IR Design

V8’s replacement project, TurboFan, became the default optimizing compiler in 2017 alongside the Ignition bytecode interpreter. TurboFan’s key contribution was treating type uncertainty as a first-class concern at the IR level rather than a fallback path. Every optimization is conditioned on type guards; when a guard fails at runtime, deoptimization follows a well-defined path back to Ignition rather than causing a complete recompilation. This meant the compiler could speculate aggressively without breaking semantics when those speculations were wrong.

Ignition itself was also a departure from V8’s original philosophy. Adding a bytecode interpreter to what was originally a pure JIT compiler let V8 handle startup latency and cold code more efficiently, leaving the JIT to focus on hot paths. This is important context: the tiered model is not a compromise. It is the mature architecture.

Mozilla’s SpiderMonkey followed a parallel path. JägerMonkey (2010) was a method-level JIT compiler without register allocation across the whole function. IonMonkey (2013) added an SSA IR and linear scan register allocation, with type information flowing from SpiderMonkey’s type inference system into the compiler. The pattern repeats: type inference first, IR second, register allocation third.

Where CPython Is on This Same Path

CPython’s architecture maps onto the JavaScript trajectory with a different starting point. V8 started as a pure JIT compiler and added an interpreter later; Ignition first shipped in 2016, a full eight years after V8’s initial release. CPython started as an interpreter and is layering JIT compilation on top. Both paths converge on the same destination: a tiered system where an interpreter handles cold code and startup, and an optimizing compiler handles hot paths.

CPython’s current tiers:

Tier 1: The Specializing Adaptive Interpreter introduced in 3.11. This is the analog to V8’s inline caching and type feedback collection. When a BINARY_OP instruction sees integers consistently, it specializes to BINARY_OP_ADD_INT with a guard that deoptimizes back to the generic form if the type changes. This is where CPython collects the type information that higher tiers need, and it was a major contributor to the roughly 25% average speedup that 3.11 showed over 3.10 on the pyperformance benchmark suite.
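The specialize, guard, and deoptimize cycle can be modeled in plain Python. This is a toy sketch of the idea rather than CPython’s mechanism: the AdaptiveAdd class, its WARMUP threshold of 2, and its counters are hypothetical, and the real interpreter rewrites bytecode in place instead of flipping a flag on an object.

```python
class AdaptiveAdd:
    """Toy model of one adaptive BINARY_OP site (hypothetical, simplified)."""

    WARMUP = 2  # assumed threshold; not CPython's actual counter value

    def __init__(self):
        self.int_hits = 0
        self.specialized = False
        self.deopts = 0

    def execute(self, a, b):
        if self.specialized:
            # Guard: the fast path is only valid for exact ints.
            if type(a) is int and type(b) is int:
                return int.__add__(a, b)
            # Guard failed: fall back to the generic form and start over.
            self.specialized = False
            self.int_hits = 0
            self.deopts += 1
        return self._generic(a, b)

    def _generic(self, a, b):
        # The generic path also records type feedback.
        if type(a) is int and type(b) is int:
            self.int_hits += 1
            if self.int_hits >= self.WARMUP:
                self.specialized = True
        else:
            self.int_hits = 0
        return a + b


site = AdaptiveAdd()
results = [site.execute(1, 2), site.execute(3, 4), site.execute(5, 6)]
specialized_after_ints = site.specialized
results.append(site.execute(1.5, 2))  # a float trips the guard
print(results, specialized_after_ints, site.specialized, site.deopts)
```

The third call runs on the specialized path; the fourth fails the guard, is counted as a deoptimization, and still produces the correct result through the generic path.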

Tier 2: The micro-op IR and optimizer. This corresponds to V8’s Crankshaft or SpiderMonkey’s IonMonkey: a lower-level representation where optimization passes run over a trace before it reaches the code generator. In Python 3.13, the tier-2 framework existed but lacked complete type inference passes. Without type information flowing through the uop IR, the optimizer had nothing to work with and the copy-and-patch JIT below it received unoptimized traces.
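To make “type information flowing through the uop IR” concrete, here is a minimal sketch of one such pass: forward propagation of known-int facts along a linear trace, used to drop guards that have already been proven. The uop names (GUARD_INT, ADD_INT, LOAD_UNKNOWN) and the tuple encoding are simplified stand-ins invented for this example, not CPython’s actual micro-ops.

```python
def eliminate_redundant_guards(trace):
    """Drop GUARD_INT uops whose operand is already known to be an int."""
    known_int = set()
    optimized = []
    for op, *args in trace:
        if op == "GUARD_INT":
            (var,) = args
            if var in known_int:
                continue  # type already proven earlier in the trace
            known_int.add(var)
        elif op == "ADD_INT":
            dst, _lhs, _rhs = args
            known_int.add(dst)  # int + int always yields an int
        elif op == "LOAD_UNKNOWN":
            (dst,) = args
            known_int.discard(dst)  # dst now holds a value of unknown type
        optimized.append((op, *args))
    return optimized


trace = [
    ("GUARD_INT", "x"),
    ("GUARD_INT", "y"),
    ("ADD_INT", "t", "x", "y"),
    ("GUARD_INT", "t"),       # redundant: t was produced by ADD_INT
    ("GUARD_INT", "x"),       # redundant: x was guarded above
    ("ADD_INT", "x", "t", "x"),
    ("LOAD_UNKNOWN", "y"),
    ("GUARD_INT", "y"),       # must stay: y was overwritten
]
optimized = eliminate_redundant_guards(trace)
print(len(trace), "->", len(optimized))
```

Two of the five guards disappear. Without a pass like this, every guard survives into code generation, which is the situation the 3.13 optimizer was largely in.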

Tier 3: Brandt Bucher’s copy-and-patch JIT from PEP 744. (CPython’s own documentation treats the JIT as a back end of tier 2 rather than a separate tier; it is numbered separately here to keep the comparison with V8 clear.) It takes the tier-2 uop sequence and emits native code by copying pre-compiled stencils (built with LLVM at CPython’s build time) and patching in runtime addresses and constants. It is analogous to V8’s “full compiler” in its early form, except that it was designed from the start as the downstream consumer of tier-2 optimization rather than as a standalone system.

The 3.13 JIT underperformed because tier 2 was incomplete. The stencil-based code generator received uop sequences with no type annotations and no cross-stencil register assignments, so every stencil emitted loads and stores to the Python value stack in memory. The generated native code was faster than the interpreter in a narrow sense (no dispatch overhead), but it preserved all the memory traffic the interpreter also had.
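The copy-and-patch step itself can be sketched in a few lines of Python. The version below is a toy under stated assumptions: the two stencils are hand-written x86-64 byte templates rather than LLVM output, the STENCILS table, the HOLE marker, and the emit function are hypothetical names, and holes are found by searching for a placeholder, whereas a real implementation would record hole offsets when the stencils are built.

```python
import struct

HOLE = b"\xaa" * 8  # placeholder bytes marking where a 64-bit value goes

# Hypothetical stencils, hand-written for illustration only.
STENCILS = {
    "LOAD_CONST": b"\x48\xb8" + HOLE,  # x86-64: movabs rax, <imm64>
    "RETURN": b"\xc3",                 # x86-64: ret
}


def emit(trace):
    """Concatenate stencils, patching each hole with the uop's operand."""
    code = bytearray()
    for op, operand in trace:
        template = STENCILS[op]
        hole_offset = template.find(HOLE)
        start = len(code)
        code += template  # copy
        if hole_offset != -1:
            # patch: overwrite the placeholder with a little-endian u64
            struct.pack_into("<Q", code, start + hole_offset, operand)
    return bytes(code)


machine_code = emit([("LOAD_CONST", 42), ("RETURN", None)])
print(machine_code.hex())
```

Nothing in emit looks at what neighboring stencils do with registers, which mirrors the 3.13 limitation: each stencil is self-contained, so values pass between stencils through memory unless an earlier tier has assigned registers for them.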

Register Allocation Without Type Inference Is Not Enough

This is the part the JavaScript history makes clear. V8’s full compiler had rudimentary register allocation that produced some benefit. But it was only when Crankshaft combined proper SSA-form type information with linear scan register allocation that the performance gains became substantial. The two are not independent optimizations. Register allocation becomes dramatically more effective when the allocator knows that a value is an unboxed integer, because that means the value itself fits in a register rather than a pointer to a heap-allocated object.

For CPython, the same dependency holds. If the tier-2 optimizer knows from type inference that x in a hot loop is always a Python int, it can:

  • Hoist the type guard out of the loop body (check once, not per iteration)
  • Represent x as an unboxed machine integer across stencil boundaries
  • Keep x in a register with no reference-count updates or heap-object references for intermediate values

Without type inference, the register allocator would assign a register to x, but that register would hold a pointer to a PyObject on the heap, and every arithmetic operation on it would still involve boxing and unboxing. The register allocation saves the memory traffic of the value stack, but not the heap allocation traffic of Python integer objects.

With both, the generated code for a tight integer loop can stay entirely in registers and avoid the heap for intermediate values. That is the difference between a modest speedup and a meaningful one.

The pseudoassembly contrast is instructive. Without type inference and register allocation:

; per-iteration, for a + b where types are unknown
mov  rdi, [rbp + a_slot]     ; load PyObject* a from value stack
mov  rsi, [rbp + b_slot]     ; load PyObject* b from value stack
call PyNumber_Add            ; generic add, handles any type
mov  [rbp + result_slot], rax ; store result back to value stack

With type inference confirming both are int and register allocation keeping them live across iterations:

; a in r12, b in r13, unboxed integers, held across iterations
add  r12, r13                ; pure integer add, no dispatch
jo   deopt                   ; Python ints are unbounded: bail out on overflow (deopt is a hypothetical label)

The function call and the value-stack loads and store disappear, along with the boxed result object that the generic add would allocate. For a loop that runs a million times, that gap is not marginal.

What the JavaScript Trajectory Predicts

JavaScript JIT maturity did not stop at register allocation. After IonMonkey and TurboFan had that working correctly, the subsequent gains came from areas that CPython’s tier-2 optimizer is now positioned to pursue:

Escape analysis: determining that a short-lived object never escapes the current trace, eliminating the heap allocation. V8 and SpiderMonkey both invested heavily in this for object-heavy JavaScript. For Python, this would target temporary tuples, intermediate integer objects, and results from comprehensions inside loops.

Shape-based inline caching: specializing not just on type but on the exact “shape” of an object (its attribute layout). CPython already has version tags on types and specializes at the bytecode level; threading shape information into tier-2 traces would allow more guard elimination.

Cross-call inlining: when a hot function calls another hot function with a known signature, inline the callee into the caller’s trace and optimize across the boundary. This is one of the largest speedup sources in JavaScript JITs for object-oriented code, and it is the same mechanism that lets V8 optimize arr.push(x) without actual function-call overhead.

None of these require abandoning copy-and-patch. They require richer analysis in tier 2 feeding more information to tier 3. The stencil model scales to support them.

One Important Difference

Python has a C extension ecosystem built against CPython’s object model: NumPy, SciPy, lxml, most database drivers. The JIT cannot inline across Python-to-C extension boundaries the way V8 inlines JavaScript-to-JavaScript calls. The optimization domain is pure Python hot paths.

For scientific computing workloads, Numba and JAX already provide LLVM and XLA-based JIT compilation that handles NumPy arrays directly. CPython’s JIT is aimed at general-purpose Python: web framework routing, data transformation pipelines, business logic, the code that runs slowly but is not amenable to a rewrite in Cython.

Browser JavaScript also cannot inline into the DOM’s C++ core or across WASM boundaries, and V8 produces real speedups within its domain despite that constraint. The scope limitation is real but not fatal.

The 3.15 progress is the point where CPython’s JIT infrastructure is complete enough for the interesting optimization work to begin. The JavaScript ecosystem took roughly a decade to go from early JITs to the current TurboFan and Warp backends. CPython started later, is moving faster, and now has a working model of what the destination looks like.
