
Python's JIT Problem Was Never the Code Generator

Source: lobsters

The copy-and-patch JIT that shipped in Python 3.13 was labeled experimental, but that framing undersold the real problem. The news that Python 3.15’s JIT is back on track points to something more specific than a rough debut being smoothed over. The 3.13 JIT worked; it produced correct code. It just did not produce meaningfully faster code for most workloads, and the reason comes down almost entirely to the optimizer sitting behind the code generator, not the code generator itself.

CPython’s Tiered Execution Model

Understanding why requires stepping through CPython’s current execution stack, which has grown substantially more layered since Python 3.11.

Python 3.11 introduced the specializing adaptive interpreter as the first major output of the Faster CPython initiative. When a bytecode instruction executes enough times, the interpreter replaces it in place with a type-specialized variant. LOAD_ATTR becomes LOAD_ATTR_INSTANCE_VALUE when the interpreter observes that the attribute access consistently hits an instance dictionary at a known offset. BINARY_OP becomes BINARY_OP_ADD_INT when both operands are reliably small integers. These rewrites happen in the bytecode stream itself, with no permanent commitment: if types change, the instruction despecializes back. This is what delivered the roughly 25% speedup across pyperformance that made Python 3.11 notable.
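The rewrite-in-place behavior can be sketched as a toy model. This is illustrative only: CPython's real mechanism rewrites bytecode inside the eval loop, and the threshold name here is hypothetical.

```python
SPECIALIZATION_THRESHOLD = 8  # hypothetical warmup count, not CPython's actual value

class BinaryAdd:
    """Toy BINARY_OP that rewrites itself after observing stable operand types."""
    def __init__(self):
        self.counter = 0
        self.specialized = None  # becomes "BINARY_OP_ADD_INT" when warm

    def execute(self, a, b):
        if self.specialized == "BINARY_OP_ADD_INT":
            if type(a) is int and type(b) is int:
                return a + b          # fast path: guard passed
            self.specialized = None   # types changed: despecialize
            self.counter = 0
        # Generic path: count executions, specialize once warm and types agree.
        self.counter += 1
        if self.counter >= SPECIALIZATION_THRESHOLD and type(a) is int and type(b) is int:
            self.specialized = "BINARY_OP_ADD_INT"
        return a + b

op = BinaryAdd()
for i in range(10):
    op.execute(i, 1)
print(op.specialized)   # specialized after warmup
op.execute(1.5, 2.0)
print(op.specialized)   # despecialized after a type change
```

The key property the sketch preserves is that specialization is a reversible, per-instruction bet: nothing is proven, only observed, and a single counterexample undoes it.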

Python 3.12 and 3.13 extended this into a two-tier system. Tier 1 is the specializing interpreter. Tier 2 is the uop optimizer, which decomposes hot bytecodes into micro-operations (uops) and runs optimization passes over sequences of them before handing off to either an interpreter loop or, with the JIT flag enabled, the copy-and-patch code generator.

The copy-and-patch layer is the least architecturally interesting part of this stack. At build time, clang compiles a machine code template for each uop, leaving relocatable slots where runtime addresses and constants will be substituted. At runtime, compiling a trace means copying those templates into executable memory and filling the slots with actual values. The mechanism requires clang at build time even on GCC-default platforms, which PEP 744 notes as a build dependency, but the code generation itself is fast and produces compact output. What the code generator cannot do is eliminate guards; that work belongs entirely to the optimizer.
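The copy-and-patch idea can be sketched with plain bytes. In reality the templates are machine code emitted by clang and the holes are relocations; here they are just placeholder bytes, and the template contents are made up.

```python
HOLE = 0xFF  # placeholder byte marking a patch slot (hypothetical encoding)

# One "template" per uop: opaque bytes with holes where runtime values go.
TEMPLATES = {
    "LOAD_CONST": bytes([0x01, HOLE, HOLE]),  # holes = 16-bit constant
    "STORE_FAST": bytes([0x02, HOLE]),        # hole = local-variable slot
}

def compile_trace(trace):
    """Copy each uop's template and patch its holes with concrete operands."""
    out = bytearray()
    for uop, operands in trace:
        code = bytearray(TEMPLATES[uop])
        fills = iter(operands)
        for i, b in enumerate(code):
            if b == HOLE:
                code[i] = next(fills)
        out += code
    return bytes(out)

buf = compile_trace([("LOAD_CONST", [0x2A, 0x00]), ("STORE_FAST", [0x03])])
print(buf.hex())  # 012a000203
```

Note what is absent: there is no instruction selection and no analysis. Compilation is memcpy plus patching, which is why the generator is fast but can never remove a guard that appears in its input.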

The Guard Elimination Problem

A trace in a tracing JIT is a linear sequence of uops representing a frequently-taken path, typically through a loop body. Every branching decision that was resolved when the trace was recorded becomes a guard in the compiled code: a type check, a bounds check, a None check. On re-entry to the trace, each guard confirms that the conditions still hold. If one fails, execution exits the trace and falls back to the interpreter.

In a tight Python loop over integers, a naive trace might contain something like the following:

GUARD_TYPE(x, int)
GUARD_NOT_OVERFLOWED(x)
LOAD_FAST(x)
LOAD_CONST(1)
BINARY_OP_ADD_INT(x, 1)
STORE_FAST(x)
GUARD_TYPE(x, int)   # after the addition

The second GUARD_TYPE after the addition is redundant: BINARY_OP_ADD_INT can only succeed by producing an int, so a smarter optimizer can eliminate the subsequent check by proving the result type from the operation’s semantics. But in Python 3.13’s uop optimizer, the type lattice was coarse. It could track whether a value was an int, a float, a str, or unknown, but it could not track narrower properties: whether the integer was small enough to avoid heap allocation, non-negative, or bounded in any useful way.
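A minimal forward type-propagation pass makes this concrete. This is a sketch, not CPython's optimizer, and the uop names are simplified; the lattice is deliberately as coarse as the 3.13 one described above, so it can prove "int" but has no notion of "small int".

```python
def prune_guards(trace):
    """One forward pass over (uop, value) pairs, dropping provably redundant guards."""
    known_int = set()   # values proven to be int at this point in the trace
    out = []
    for op, val in trace:
        if op == "GUARD_TYPE_INT":
            if val in known_int:
                continue            # provably redundant: drop the guard
            known_int.add(val)      # a passed guard establishes the fact
        elif op == "BINARY_OP_ADD_INT":
            known_int.add(val)      # an int add that succeeds yields an int
        elif op == "GUARD_SMALL_INT":
            pass                    # lattice can't express "small": must keep it
        out.append((op, val))
    return out

trace = [
    ("GUARD_TYPE_INT", "x"),
    ("GUARD_SMALL_INT", "x"),
    ("BINARY_OP_ADD_INT", "x"),
    ("GUARD_TYPE_INT", "x"),    # redundant: pruned
    ("GUARD_SMALL_INT", "x"),   # survives: "small" is not in the lattice
]
print(prune_guards(trace))
```

The coarse lattice removes the duplicate type guard but is forced to keep every small-int check, which is exactly the residue described next.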

This matters because many guards in CPython’s uop stream require knowing something narrower than just “this is an int.” CPython’s BINARY_OP_ADD_INT fast path avoids memory allocation for small integers that fit within a fixed digit count, but confirming eligibility requires a runtime check. Without the ability to propagate “small int” as a distinct type state through the optimizer, the compiled code had to keep performing that check at every loop iteration. The JIT-compiled trace ended up doing roughly the same work as the specializing interpreter, with the added overhead of trace management and guard setup.

What the 3.15 Work Addresses

The optimizer in Python 3.15 is gaining a richer type lattice. Tracking not just that a value is an int but that it is a small int, a known-positive int, or an int with other narrow properties means the abstract interpretation pass can propagate those states forward through the trace and prune guards that would always pass.

This is the core unlock. Once the optimizer can prove that a value exiting a given operation is a small int, it can eliminate the subsequent small-int check at the next loop iteration head. Eliminate enough of those checks across a loop body and the compiled trace starts to look genuinely different from what the interpreter already achieves with specialization alone. The source article signals that specific blockers in this area have been resolved, suggesting the lattice is now expressive enough to cover the guard patterns that were previously unprovable.
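The same sketch with a slightly richer lattice, ordered UNKNOWN < INT < SMALL_INT, shows the unlock. This is in the spirit of the 3.15 work rather than its actual implementation; in particular, the assumption that a successful fast-path add proves its result stayed small is mine, made for illustration.

```python
UNKNOWN, INT, SMALL_INT = 0, 1, 2   # higher = more precisely known

def prune_guards(trace):
    state = {}                      # value -> most precise proven fact
    out = []
    for op, val in trace:
        fact = state.get(val, UNKNOWN)
        if op == "GUARD_TYPE_INT" and fact >= INT:
            continue                # already proven at least int: drop
        if op == "GUARD_SMALL_INT" and fact >= SMALL_INT:
            continue                # already proven small: drop
        if op == "GUARD_TYPE_INT":
            state[val] = max(fact, INT)
        elif op == "GUARD_SMALL_INT":
            state[val] = max(fact, SMALL_INT)
        elif op == "BINARY_OP_ADD_INT":
            # Assumed for illustration: on success the fast path leaves the
            # result small, so the next iteration's small-int guard is provable.
            state[val] = SMALL_INT
        out.append((op, val))
    return out

loop_body = [("GUARD_TYPE_INT", "x"), ("GUARD_SMALL_INT", "x"),
             ("BINARY_OP_ADD_INT", "x")]
print(prune_guards(loop_body * 2))  # second iteration's guards both vanish
```

Across two unrolled iterations, both guards in the second iteration are eliminated, leaving only the arithmetic, which is the shape a compiled trace needs before it can beat the specializing interpreter.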

Trace formation has also been revised to allow longer traces. Longer traces expose more redundant guards to the analysis pass, since a property proven at one point in the trace can suppress checks later in the same sequence. The cost is more compilation time and more trace cache memory per entry, but those trade-offs become favorable once the speedups are real rather than theoretical.

There were also correctness gaps in 3.13 where some code patterns triggered conservative fallbacks that prevented JIT compilation. Handling those cases precisely recovers JIT coverage for paths that were unnecessarily excluded, which matters for the benchmark numbers independently of the guard elimination improvements.

The Connection Back to Python 3.11

One way to frame this arc: the specializing interpreter in Python 3.11 was an implicit type analysis system. It observed types at runtime and embedded the resulting type knowledge directly into the bytecode stream. The JIT’s uop optimizer is an attempt to make that type reasoning explicit, systematic, and extendable to multi-instruction inference chains.

The 3.11 design specialized individual instructions in isolation. The uop optimizer can reason across sequences of instructions. That is strictly more powerful, but only to the degree that the type lattice can express what operations guarantee about their outputs. In 3.13, the lattice was not expressive enough to cover the cases where CPython’s own fast paths already had implicit type assumptions embedded. In 3.15, it is getting closer.

PyPy’s RPython JIT framework, for reference, has had typed abstract interpretation over its tracing system for over a decade. That is a substantial part of why PyPy sustains five to ten times CPython’s throughput on compute-intensive benchmarks. GraalPy, running on GraalVM’s Truffle framework, can exceed PyPy on some workloads because it has access to decades of research in speculative optimization and partial evaluation. CPython cannot import that work directly, both because of architectural differences and because alternative implementations are permitted to make assumptions about the runtime that CPython must not. But the direction is consistent.

What “Back on Track” Means for Users

For Python users, “back on track” should translate to something concrete: the JIT moving from an opt-in experiment to a default-on feature. Python 3.13 shipped the JIT behind a flag because the performance was not compelling enough to impose the overhead universally. If the 3.15 optimizer improvements deliver sustained gains across pyperformance workloads rather than narrow wins on specific microbenchmarks, the argument for default-on becomes defensible.

The workloads that benefit remain specific: tight loops over typed data in pure Python, parsers, small numeric routines that have not been delegated to NumPy or similar C extensions. Code that spends most of its time inside native libraries will not see improvement from a JIT that operates only on Python-level uops, and that describes a large fraction of production Python. For the fraction that runs pure Python in hot loops, the performance gap versus alternatives like PyPy has historically been the standing argument for switching runtimes. A reliably faster CPython JIT narrows that argument without requiring a runtime change.

Register allocation, type-specialized code generation, and loop-level transformations are harder problems that become tractable once the optimizer can reliably reason about types. The current releases are building the preconditions for those improvements. The type lattice work in 3.15 is the same kind of foundational move the specializing interpreter was in 3.11: not transformative on its own, but necessary for everything that follows.
