Twenty Years of Python JIT Failures, and Why CPython 3.15 Avoids Them

Source: lobsters

The news that Python 3.15’s JIT is back on track is easier to appreciate with some history behind it. CPython’s path to a functioning JIT has spanned more than twenty years and included at least two serious prior attempts that never made it to a production release. Each failed for a different reason, and the current copy-and-patch approach in Python 3.13 through 3.15 is explicitly designed to avoid each of those failure modes. Looking at the prior attempts is the clearest way to understand what the current team is actually trying to do.

Psyco: The First Serious Attempt

Psyco was the first production-quality JIT for CPython. Armin Rigo wrote it starting around 2002. It worked through object-level specialization: Psyco compiled Python code to specialized machine code for particular object types, falling back to interpreted execution when types did not match expectations. For tight numeric loops it produced real speedups, with figures of 10x to 100x over CPython 2.x reported on the benchmarks people ran at the time.
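The guard-and-fallback idea can be sketched in a few lines of Python. This illustrates the concept only and is not Psyco’s actual machinery: `specialize`, `register`, and `fast_paths` are hypothetical names, and the "specialized" versions here are ordinary Python callables standing in for generated machine code.

```python
def specialize(generic):
    """Wrap `generic` with a cache of fast paths keyed by argument types.

    Hypothetical sketch: a real JIT would emit machine code for each
    type signature; here a fast path is just another Python callable.
    """
    fast_paths = {}

    def register(*types):
        def decorator(fast):
            fast_paths[types] = fast
            return fast
        return decorator

    def dispatch(*args):
        # Guard: look at the exact runtime types of the arguments.
        signature = tuple(type(a) for a in args)
        fast = fast_paths.get(signature)
        if fast is not None:
            return fast(*args)   # types matched a specialization
        return generic(*args)    # guard failed: fall back to the general path

    dispatch.register = register
    dispatch.fast_paths = fast_paths
    return dispatch


@specialize
def add(a, b):
    return ("generic", a + b)


@add.register(int, int)
def _add_int_int(a, b):
    return ("int-specialized", a + b)
```

Calling `add(2, 3)` takes the specialized path, while `add(2.0, 3)` or `add("a", "b")` fails the guard and runs the generic version. The guard compares exact types, so a `bool` argument also falls back even though `bool` subclasses `int`.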

Psyco never made it into CPython proper for several reasons. It was architecture-specific: only 32-bit x86 was supported, at a time when 64-bit computing was becoming standard. It had its own memory management for compiled code that did not integrate cleanly with the rest of the interpreter. And maintaining a full machine-code emitter as a Python extension module, written partly in C and partly in a Python DSL for generating assembly, was ultimately unsustainable for a small project without institutional support. Armin Rigo, recognizing that the architecture had hit its ceiling, moved his effort to the PyPy project, and Psyco was officially unmaintained by 2012.

Unladen Swallow: The LLVM Bet That Did Not Pay Off

Unladen Swallow was a Google project that ran from 2009 to 2011 with the goal of making CPython 5x faster using LLVM as a JIT backend. The approach was technically plausible at first glance: compile Python bytecode to LLVM IR, run LLVM’s optimization passes, emit native code.

The project failed to meet its performance targets and was never merged into CPython. The post-mortems written by participants identified several root causes. LLVM’s strength is optimizing statically typed code with well-understood value lifetimes. Python’s dynamic semantics resist the optimizations LLVM’s IR is built to perform: objects can change type at any moment, reference counting operations appear everywhere, and calls can have arbitrary side effects. The optimization passes ran but found little to optimize. Meanwhile, the overhead of JIT compilation via LLVM was substantial for the short, hot code paths that matter most in typical Python programs. The break-even point, where improved code quality paid back the compilation cost, was rarely reached.
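The break-even argument is simple arithmetic, and a small sketch makes it concrete. The function name and every number below are invented for illustration; they are not measurements from Unladen Swallow or from any other JIT.

```python
import math


def break_even_iterations(compile_cost_us, interp_us_per_iter, jit_us_per_iter):
    """Iterations needed before compiled code repays its compile cost.

    All arguments are in microseconds. Returns None when the compiled
    code is no faster per iteration, so the cost is never repaid.
    """
    saving = interp_us_per_iter - jit_us_per_iter
    if saving <= 0:
        return None
    return math.ceil(compile_cost_us / saving)


# Hypothetical figures: the compiled loop saves 0.5 us per iteration.
slow_compiler = break_even_iterations(50_000, 1.0, 0.5)  # 50 ms to compile
fast_compiler = break_even_iterations(5, 1.0, 0.5)       # 5 us to compile
```

Under these assumed numbers the heavyweight compiler needs 100,000 iterations to pay for itself, while the cheap one needs 10. A code path that runs a few thousand times never reaches the first threshold.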

What Unladen Swallow established was that a powerful general-purpose compiler backend is not automatically a good JIT for a dynamic language. The gains come from knowing types, and LLVM’s passes could not derive types that Python does not expose statically.

PyPy: A Performance Success That Did Not Become the Standard

PyPy has a tracing JIT that genuinely works. On CPU-bound pure Python code it runs 3 to 10 times faster than CPython. The engineering is impressive: PyPy records a linear trace of a hot loop’s execution, including type specializations, and compiles that trace to optimized native code. It can unbox integers to machine registers, eliminate allocation for short-lived objects using escape analysis, and inline across call boundaries.

PyPy is not the default Python implementation, and the reason is the C extension ecosystem rather than performance. CPython exposes a C API that tens of thousands of packages use: NumPy, SciPy, PIL, lxml, cryptography, most database drivers. All of these have C extension components built against CPython’s internal object representation. PyPy’s cpyext compatibility layer implements that API through a translation shim, because PyPy’s internal object layout differs from CPython’s. The shim works, but at a cost that eliminates much of the JIT’s gain on code that touches C extensions heavily.

This matters because Python’s dominance in scientific computing, machine learning, and data processing is built on the C extension ecosystem. A faster runtime that cannot run NumPy and SciPy efficiently at the C level is not a viable replacement for CPython in those domains, regardless of its benchmark numbers on pure Python workloads.

What the Current Approach Is Designed to Avoid

The copy-and-patch JIT introduced in PEP 744 by Brandt Bucher is structured to avoid each of the above failure modes.

Against Psyco’s maintainability problem: the uop templates for copy-and-patch are generated at build time by Clang from ordinary C code. There is no bespoke assembler to maintain. Clang emits machine code for each uop template, and a Python script processes the output to identify relocatable holes. Adding a new uop means writing a C function; it does not require low-level assembly knowledge or architecture-specific code generation logic in the runtime.

Against Unladen Swallow’s compilation cost problem: no compiler runs at runtime at all. Template stubs are precompiled at build time. JIT compilation at runtime is a memcpy followed by patching concrete addresses and constants into placeholder locations. This takes microseconds, not milliseconds, which changes the break-even calculus entirely. Short, hot functions can be JIT-compiled without the compilation overhead dominating the gains from better code.
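The copy-then-patch step can be modeled with a byte buffer. This is a toy sketch, not CPython’s code generator: the `Stencil` class, the `STENCILS` table, the `emit` function, and the template bytes are all invented here, and the bytes are placeholders rather than real machine code.

```python
import struct
from dataclasses import dataclass


@dataclass(frozen=True)
class Stencil:
    """A precompiled template: opaque bytes plus offsets of 8-byte holes."""
    body: bytes
    holes: tuple  # byte offsets into `body`, one per patch value


# Invented templates; the filler bytes stand in for compiled instructions.
HOLE = b"\x00" * 8
STENCILS = {
    "LOAD_CONST": Stencil(b"\xAA\xAA" + HOLE + b"\xAB", holes=(2,)),
    "CALL_HELPER": Stencil(b"\xBB" + HOLE + b"\xBC" + HOLE, holes=(1, 10)),
}


def emit(trace):
    """Concatenate stencils for `trace`, patching each hole with a value.

    `trace` is a list of (uop_name, patch_values) pairs.
    """
    code = bytearray()
    for name, values in trace:
        stencil = STENCILS[name]
        if len(values) != len(stencil.holes):
            raise ValueError(f"{name} expects {len(stencil.holes)} patch values")
        base = len(code)
        code += stencil.body  # the "copy" step
        for offset, value in zip(stencil.holes, values):
            # The "patch" step: write a 64-bit little-endian value in place.
            struct.pack_into("<Q", code, base + offset, value)
    return bytes(code)
```

The point of the sketch is that `emit` does no analysis at all. Its cost is proportional to the number of bytes copied plus the number of holes filled, which is why this style of code generation can be so cheap at runtime.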

Against PyPy’s ecosystem incompatibility problem: it is still CPython. The C extension API is identical. NumPy, SciPy, and every other package that targets CPython’s C API work without modification, without a compatibility shim, without any porting effort. The JIT is an optimization layer on top of the same interpreter that already runs the existing ecosystem.

The Register Allocation Problem and Why It Took Until 3.15

Avoiding the previous failure modes did not mean avoiding all problems. The 3.13 JIT shipped and benchmarks showed it near-neutral to mildly negative on real workloads. The explanation involves what copy-and-patch cannot do without additional infrastructure: without a register allocator, each uop stub loads its inputs from memory and stores its outputs back to memory. Even when the tier 2 optimizer has proven that a value flows directly from one uop to the next with no intervening code, the emitted machine code still bounces that value through a stack slot.
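A toy cost model shows how much traffic the bounce creates. This is a simplification invented for illustration, not a description of CPython’s emitted code: `memory_ops` and the value names are hypothetical, and the only forwarding modeled is from a uop to the one immediately after it.

```python
def memory_ops(trace, forward=False):
    """Count simulated loads and stores for a linear uop trace.

    `trace` is a list of (output, inputs) pairs of value names.
    With forward=False every input is loaded and every output stored.
    With forward=True an input that is the previous uop's output is
    assumed to still sit in a register, so its load is skipped.
    """
    loads = stores = 0
    previous_output = None
    for output, inputs in trace:
        for name in inputs:
            if forward and name == previous_output:
                continue  # value is still in a register
            loads += 1
        stores += 1
        previous_output = output
    return loads, stores


# Three chained uops: t0 = f(a, b); t1 = g(t0, c); t2 = h(t1, t1)
chain = [("t0", ["a", "b"]), ("t1", ["t0", "c"]), ("t2", ["t1", "t1"])]
```

For this chain the model counts six loads when nothing is forwarded and three when adjacent results stay in registers. The stores are untouched in this model; removing those as well requires knowing which values are never read again.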

On modern x86-64 hardware, register-to-register moves are close to free, while a stack slot access costs L1 cache latency at minimum, and the cost compounds in tight loops. This memory traffic absorbed the savings from eliminating interpreter dispatch overhead, leaving the JIT at roughly break-even. The pyperformance benchmark suite reflected this clearly through the 3.13 and early 3.14 cycles.

The fix is a liveness analysis pass and a register allocator in the JIT backend, which is what the 3.15 work adds. The tier 2 optimizer already tracked type and liveness information; the code generator can now use it to keep live values in registers across uop boundaries. This converts metadata that was previously unused at emit time into actual cycles saved. According to Ken Jin’s writeup, the combination pushes pyperformance numbers to consistent positive improvement over running without the JIT, which is what makes enabling it by default defensible for 3.15.
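Liveness over a straight-line trace is a single backward pass. The sketch below is a textbook version under simplifying assumptions (one output per uop, no branches or side exits) and does not reflect how CPython’s optimizer represents this information; `live_after` and the trace format are hypothetical.

```python
def live_after(trace, live_out=frozenset()):
    """Backward liveness over a linear trace of (output, inputs) pairs.

    Returns a list where entry i is the set of values still needed
    after uop i runs. `live_out` names values needed after the trace.
    """
    live = set(live_out)
    result = [None] * len(trace)
    for i in range(len(trace) - 1, -1, -1):
        output, inputs = trace[i]
        result[i] = frozenset(live)
        live.discard(output)  # defined here, so not live before this uop
        live.update(inputs)   # used here, so live before this uop
    return result


# t0 = f(a, b); t1 = g(t0, c); t2 = h(t1, t1); only t2 is needed afterwards
chain = [("t0", ["a", "b"]), ("t1", ["t0", "c"]), ("t2", ["t1", "t1"])]
liveness = live_after(chain, live_out={"t2"})
```

Each set tells a code generator which values are worth keeping in registers at that point. A value that is absent from the set after its last use, like `t0` after the second uop here, no longer needs a register or a stack slot.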

The Conservative Path Has Costs

What copy-and-patch trades away is ceiling. PyPy can unbox integers to machine words, eliminating heap allocation for arithmetic entirely in loops where escape analysis can confirm objects do not escape. CPython’s JIT cannot, because each uop template handles the general reference-counted Python object case, and there is no escape analysis operating at the right level to elide allocation. Compute-bound code that does a lot of pure integer arithmetic will always run faster on PyPy.

For the Python ecosystem as it actually exists, that trade is likely worth taking. The benchmark that matters for most Python users is not a tight numeric loop in pure Python; it is a web request handler that touches SQLAlchemy, serializes JSON, and calls into C extension libraries for cryptography or parsing. For that workload, a 10-15% improvement from a JIT that requires no special build flags and fully supports the C extension ecosystem is more useful than a 5x improvement that requires porting to a separate runtime.

The 3.15 cycle is the first point where CPython’s JIT looks likely to deliver that 10-15% consistently across a broad workload mix. Getting there required building the three-tier execution system incrementally across three releases, discovering the register allocation gap through real-world benchmarking rather than speculation, and fixing it at the right level. That is slower than a big-bang redesign would have been. It is also more likely to result in a JIT that stays in CPython permanently, because it was built on the architecture that the rest of the project already understood and maintained.

Twenty years of attempts produce a clear pattern: the JITs that showed the best benchmark numbers failed to ship in CPython, and each failure had a specific architectural cause. The current approach chose a lower performance ceiling in exchange for avoiding each of those failure modes. Whether 3.15 makes that tradeoff look good depends on whether the register allocator delivers the benchmark improvements the team is targeting. The early results suggest it will.
