
Getting to Break-Even: The Real Engineering Story Behind CPython's JIT Progress

Source: hackernews

The announcement that Python 3.15’s JIT is back on track is easy to misread as straightforward progress. The JIT shipped as experimental in CPython 3.13, stalled during 3.14’s development cycle, and is now showing positive benchmark numbers heading into 3.15. That narrative is accurate, but the interesting part is why getting to positive took this long, and what the architecture of CPython’s JIT makes that problem uniquely hard.

The Interpreter the JIT Has to Beat

Python’s JIT does not compete against a naive bytecode interpreter. It competes against the specializing adaptive interpreter introduced in PEP 659 and shipped in CPython 3.11, the same release that delivered roughly a 25% speedup over 3.10.

The adaptive interpreter works by observing what types flow through each bytecode instruction. After an opcode executes around eight times, CPython replaces it with a specialized variant that assumes the types it has seen. LOAD_ATTR becomes LOAD_ATTR_INSTANCE_VALUE when the attribute is always at a fixed offset in the object’s memory layout. BINARY_OP becomes BINARY_OP_ADD_INT when both operands are consistently plain integers. These specializations are stored inline in the bytecode stream alongside small caches for the type version tags observed, so the cost of dispatch drops substantially for a warm code path.

You can observe this in action using CPython’s dis module with the adaptive flag:

import dis

def add_ints(a, b):
    return a + b

# Call it a few times with integers to let specialization fire
for _ in range(16):
    add_ints(1, 2)

dis.dis(add_ints, adaptive=True)
# Abbreviated output (exact offsets and opcode names vary by version):
# RESUME           0
# LOAD_FAST        0 (a)
# LOAD_FAST        1 (b)
# BINARY_OP_ADD_INT 0 (+)
# RETURN_VALUE

The practical effect is that the adaptive interpreter already does most of what a first-tier JIT does: it eliminates generic type dispatch, encodes type assumptions at each call site, and deoptimizes back to a generic form when those assumptions break. A JIT compiler built on top of this has to clear a high bar just to show improvement, because the worst-case baseline it competes against is not a naive interpreter, it is an interpreter that has already specialized for the observed types.

Copy-and-Patch: A Different Kind of JIT

PEP 744, authored by Brandt Bucher, introduced CPython’s JIT as an experimental feature in 3.13. The design is architecturally distinct from most JIT compilers people are familiar with.

Traditional JIT compilers, such as V8’s TurboFan or PyPy’s RPython-based tracing JIT, build an intermediate representation at runtime, run optimization passes over it, allocate registers, and emit machine code through a full compiler backend. This is powerful but expensive: an LLVM-based JIT backend can take tens to hundreds of milliseconds per compilation unit, and that warm-up cost is visible in short-running processes.

CPython’s JIT uses copy-and-patch, a technique described in a 2021 paper by Haoran Xu and Fredrik Kjolstad. Machine code templates for each “uop” (micro-operation) are compiled ahead of time by Clang during the CPython build, with relocatable placeholders left where runtime values need to go: addresses of Python objects, jump targets, cache offsets. At runtime, when the JIT wants to compile a trace, it copies the template bytes for each uop into an executable memory buffer and patches the placeholders with the concrete values for that specific trace.

No compiler runs at runtime. The cost of JIT compilation per trace is measured in microseconds rather than milliseconds. The trade-off is that the emitted code reflects what Clang did with the C implementation of each uop at build time, without register allocation across uop boundaries or optimization passes that require seeing the full trace as a unit.
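A toy model makes the copy-and-patch mechanics concrete. The bytes below are a real x86-64 movabs rax, imm64 followed by push rax, but everything else here, the placeholder sentinel and the emit function, is invented for illustration; the real templates are produced by Clang from the C implementation of each uop:

```python
import struct

# Toy model of copy-and-patch (an analogy, not CPython's real machinery).
# A "template" is a pre-built byte string with a hole where a runtime
# value belongs; "compiling" is just copy + patch.

PLACEHOLDER = b"\xde\xad\xbe\xef\xde\xad\xbe\xef"  # 8-byte hole marker

# Pretend AOT-compiled uop body: movabs rax, imm64; push rax.
TEMPLATE_LOAD_CONST = b"\x48\xb8" + PLACEHOLDER + b"\x50"

def emit(template: bytes, value: int) -> bytes:
    """Copy the template and patch the hole with a concrete 64-bit value."""
    hole = template.index(PLACEHOLDER)
    patched = bytearray(template)
    patched[hole:hole + 8] = struct.pack("<Q", value)
    return bytes(patched)

# "Compile" one uop by binding it to a specific runtime address.
code = emit(TEMPLATE_LOAD_CONST, 0x7F12_3456_7890)
```

The per-trace cost is a memcpy plus a handful of fixed-offset writes, which is why compilation lands in microseconds.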

Before JIT compilation, CPython’s tier-2 system decomposes specialized bytecodes into uops. BINARY_OP_ADD_INT decomposes into roughly:

_GUARD_BOTH_INT      # bail out if either operand is not a plain int
_BINARY_OP_ADD_INT   # overflow-checked integer add
_RETURN_VALUE        # or continue, depending on context

The optimizer works at this uop level, performing type propagation, dead code elimination, and guard elision. When the optimizer can prove a guard is redundant given what it already knows about a value’s type from earlier in the trace, it removes it. Each eliminated guard is one fewer conditional branch in the hot path, which reduces both code size and branch predictor pressure.
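A minimal sketch of what guard elision looks like at this level, using an invented trace representation (op name plus operand tuple) rather than CPython’s actual data structures:

```python
# Each uop is (name, operands); for _BINARY_OP_ADD_INT the last operand
# names the value it produces. All names here are illustrative.
TRACE = [
    ("_GUARD_BOTH_INT", ("a", "b")),
    ("_BINARY_OP_ADD_INT", ("a", "b", "t0")),
    ("_GUARD_BOTH_INT", ("t0", "b")),   # redundant: t0 and b already proven
    ("_BINARY_OP_ADD_INT", ("t0", "b", "t1")),
]

def elide_guards(trace):
    proven_int = set()
    out = []
    for op, operands in trace:
        if op == "_GUARD_BOTH_INT":
            a, b = operands
            if a in proven_int and b in proven_int:
                continue                  # optimizer already knows both are ints
            proven_int.update((a, b))     # past the guard, both are proven
            out.append((op, operands))
        elif op == "_BINARY_OP_ADD_INT":
            *_, dst = operands
            proven_int.add(dst)           # int + int yields int on the fast path
            out.append((op, operands))
        else:
            out.append((op, operands))
    return out

optimized = elide_guards(TRACE)           # the second guard disappears
```

The real optimizer tracks far more than “is an int,” but the shape of the pass, forward propagation of proven facts that lets later checks be dropped, is the same.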

Why the JIT Stalled

The 3.13 JIT, enabled at build time with --enable-experimental-jit, showed near-neutral or mildly negative performance on the pyperformance benchmark suite. Several factors contributed.

Guard overhead. The optimizer was not aggressive enough at eliminating redundant type checks. Even in traces where the adaptive interpreter had already specialized all the relevant opcodes, the uop sequence retained guards that the optimizer could not yet prove were unnecessary. Executing those guards in native code costs less than the interpreter’s dispatch overhead, but not enough less to make the JIT consistently profitable.

Executable memory pressure. The JIT allocates memory pages for its code output using platform memory APIs. Poorly managed allocation spread JIT code across multiple pages without regard for spatial locality. On a tight benchmark loop, the CPU instruction cache had to hold both JIT code and the surrounding CPython runtime, and the evictions were measurable. Grouping related traces and aligning code to cache-line boundaries matters more than it might seem for the short, frequently-executed traces that are the JIT’s primary target.
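The locality point can be illustrated with a toy bump allocator (invented for this sketch; CPython’s actual allocator differs) that packs many small traces into one region at cache-line alignment instead of scattering them across pages:

```python
PAGE = 4096
CACHE_LINE = 64

def align_up(n: int, a: int) -> int:
    # Round n up to the next multiple of a (a is a power of two).
    return (n + a - 1) & ~(a - 1)

class CodeArena:
    """Bump allocator packing traces into one region (illustrative only)."""

    def __init__(self, size=PAGE * 16):
        self.buf = bytearray(size)   # stand-in for mmap'd executable memory
        self.top = 0

    def place(self, code: bytes) -> int:
        """Copy a trace into the arena at cache-line alignment; return offset."""
        start = align_up(self.top, CACHE_LINE)
        if start + len(code) > len(self.buf):
            raise MemoryError("arena full")
        self.buf[start:start + len(code)] = code
        self.top = start + len(code)
        return start

arena = CodeArena()
a = arena.place(b"\x90" * 10)   # two tiny "traces" (NOP sleds) end up on
b = arena.place(b"\x90" * 10)   # the same page, one cache line apart
```

Two ten-byte traces land 64 bytes apart on the same page rather than 4 KiB apart on separate pages, which is the difference between one instruction-cache line fetch and two.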

Trace selection calibration. The heuristic for deciding which loops to JIT-compile was not well-tuned. The JIT was spending compilation budget on code paths that were hot enough to trigger compilation but not hot enough relative to their emitted code quality to recover the allocation and patching overhead.
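A hypothetical version of such a heuristic, with invented names and thresholds (CPython’s real tier-2 counters are encoded differently), would count loop back-edges, compile at a hotness threshold, and back off exponentially when a compiled trace proves unprofitable:

```python
class TraceSelector:
    """Illustrative trace-selection heuristic; not CPython's implementation."""

    def __init__(self, threshold=16):
        self.base = threshold
        self.counters = {}     # loop id -> back-edge count since last compile
        self.thresholds = {}   # loop id -> current compile threshold

    def on_backedge(self, loop_id, compile_fn):
        """Called on every loop back-edge; returns compiled code or None."""
        t = self.thresholds.setdefault(loop_id, self.base)
        c = self.counters.get(loop_id, 0) + 1
        if c >= t:
            self.counters[loop_id] = 0       # reset the counter and compile
            return compile_fn(loop_id)
        self.counters[loop_id] = c
        return None

    def mark_unprofitable(self, loop_id):
        # Exponential backoff: a loop that burned compilation budget
        # without paying it back becomes much harder to recompile.
        self.thresholds[loop_id] = self.thresholds.get(loop_id, self.base) * 2

selector = TraceSelector(threshold=4)
fired = [selector.on_backedge("loop-A", lambda i: "compiled") for _ in range(8)]
# Compilation fires on the 4th and 8th back-edges.
```

Tuning amounts to picking thresholds so that compilation fires only when the allocation-and-patch cost is likely to be recovered.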

Free-threaded build interaction. CPython 3.13 also introduced the experimental free-threaded build described in PEP 703. The JIT and the free-threaded interpreter had to coexist, and thread-safe reference count handling in JIT-emitted code added branching that the single-threaded path did not need, further compressing the margin the JIT had over the adaptive interpreter.

What Changed for 3.15

Ken Jin’s post on the JIT being back on track describes concrete improvements for 3.15. The optimizer has been substantially strengthened. Abstract interpretation passes now track type information across longer traces and across more of the uop types, which means guard elision fires more frequently. The memory allocation strategy for JIT code has been revised to improve spatial locality. Template generation at build time has also been improved to ensure Clang produces tighter code for the most common uop implementations.

The net result is that pyperformance numbers are showing genuine positive improvement relative to running without the JIT. Tight numeric loops, the best case for any JIT, show the largest gains. Object-attribute-heavy benchmarks, where the adaptive interpreter was already specialized effectively, show more modest ones. The key change is not a single dramatic optimization but a collection of fixes that together push the JIT past the break-even point across enough of the benchmark suite to make it meaningful.

The PyPy Comparison and Why It Does Not Apply

PyPy’s tracing JIT is the reference point that comes up in every discussion of Python performance. It delivers 4-7x speedups over CPython on CPU-bound pure Python workloads. The technique is different: PyPy records the actual execution trace of a hot loop, including calls into Python functions, and compiles that trace to native code with full optimization. This includes escape analysis, unboxing integers to machine registers without heap allocation, and inlining across call boundaries.

CPython’s JIT will not reach those numbers, and it is not trying to. The copy-and-patch approach is deliberately conservative. Each uop template corresponds to a C function that handles the general case with reference counting and heap allocation. The optimizer can eliminate guards and dead operations within a trace, but it cannot replace heap-allocated Python integers with unboxed machine integers the way a tracing JIT with escape analysis can. The ceiling on what copy-and-patch can achieve is lower, but the compilation cost is also orders of magnitude lower.

The practical advantage CPython’s JIT has over PyPy is that it is still CPython underneath. C extension modules, the NumPy ecosystem, anything using the CPython C API: all of it works without modification. PyPy’s compatibility story has improved through cffi and its cpyext layer, but the ecosystem friction is real for projects with deep C extension dependencies. CPython’s JIT accepts modest gains in exchange for full compatibility with the existing ecosystem.

JIT Approach             Compilation Cost    Peak Speedup    C Extension Compatibility
CPython copy-and-patch   Microseconds        5-30%           Full
PyPy tracing JIT         Milliseconds        400-700%        Partial (cffi/cpyext)
LuaJIT tracing           Sub-millisecond     200-500%        N/A
V8 TurboFan              10-100 ms           Variable        N/A

What to Expect

The realistic target for Python 3.15’s JIT is a 5-15% improvement on CPU-bound workloads relative to running CPython without JIT. Benchmarks with tight integer arithmetic or floating-point loops will land at the higher end. Benchmarks dominated by C extension calls, I/O, or highly polymorphic attribute access will see less, and some may remain near-neutral.

That range matters most for a specific category of Python code: hot enough that performance is felt, pure Python enough that NumPy or a C extension is not the obvious answer, and with enough type stability that the adaptive interpreter’s specializations hold. Web framework templating, certain numerical algorithms, parsers written in pure Python. These are the use cases where a 10% improvement at no cost, no rewrite, no dependency on a separate runtime, is genuinely useful.

The more significant milestone is what getting to break-even makes possible. Once the JIT’s accounting is reliably positive, improvements to the optimizer compound. Better guard elision, longer trace superblocks, abstract interpretation that can propagate type information across loop iterations: all of these have higher payoff once the baseline JIT overhead is not competing against the gains. The Faster CPython project has been methodical since PEP 659, and the JIT’s slow path to profitability fits that pattern. Getting to neutral was the hard part.
