Python 3.15's JIT Improvement Is Real, but Most of Your Code Won't See It
Source: hackernews
Ken Jin’s announcement that Python 3.15’s JIT is back on track is a genuine engineering milestone. After two cycles where the copy-and-patch JIT shipped as an experimental flag and barely broke even on pyperformance, the 3.15 work on type inference and cross-stencil register allocation is showing consistent positive results. The problem is that “positive results on pyperformance” and “faster for your application” describe two different things, and the gap between them is wider than most coverage acknowledges.
The Counter Threshold and What It Filters Out
Before the JIT touches any code, that code has to clear several gates. The specializing adaptive interpreter introduced in Python 3.11 already rewrites hot bytecodes in place after roughly eight executions: BINARY_OP becomes BINARY_OP_ADD_INT, LOAD_ATTR becomes LOAD_ATTR_INSTANCE_VALUE. The tier-2 optimizer that feeds traces to the JIT only engages after a backward branch has been taken substantially more times than that. Across CPython 3.13 and 3.14 development, the order of magnitude is several hundred to a few thousand loop iterations before a trace is formed, with additional iterations before the JIT compiles that trace to native code.
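The first of these gates can be observed from Python itself. The sketch below uses the dis module's adaptive flag (Python 3.11 and later); the function names are illustrative, and the specialized opcode names that show up vary between CPython versions.

```python
import dis


def add_ints(n):
    """A tight loop over ints: a typical target for bytecode specialization."""
    total = 0
    for i in range(n):
        total = total + i
    return total


def specialized_opnames(func):
    """Return the opcode names currently installed in func's bytecode.

    adaptive=True asks dis to report specialized instructions instead of
    the generic ones the compiler emitted.
    """
    try:
        instructions = dis.get_instructions(func, adaptive=True)
    except TypeError:  # Python < 3.11 has no adaptive flag
        instructions = dis.get_instructions(func)
    return [ins.opname for ins in instructions]


before = set(specialized_opnames(add_ints))
add_ints(10_000)  # run the loop long enough for the interpreter to specialize
after = set(specialized_opnames(add_ints))

# On CPython 3.11+ this typically lists names such as BINARY_OP_ADD_INT;
# other versions or implementations may print an empty list.
print(sorted(after - before))
```

This only shows tier-1 specialization. Trace formation and JIT compilation happen later and are not visible through dis.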
Code that does not run in tight loops for many iterations simply never reaches JIT compilation. A request handler that runs once per HTTP request, parses some JSON, queries a database, and returns a response may have hot spots in routing or serialization logic, but those spots are unlikely to cross the trace-formation threshold unless traffic is extremely high and the process is long-lived. Short-running scripts are excluded entirely. If your Python process exits before any backward branch crosses the compilation threshold, the JIT contributes zero speedup and a small amount of overhead in the tier-2 bookkeeping.
V8 faced the same tradeoff: it launched as a pure JIT compiler in 2008 and later added Ignition as a bytecode interpreter tier. Short-lived code needs fast startup more than it needs optimized execution, and the JIT tier only pays for itself on code that runs long enough to justify compilation cost.
What the JIT Actually Targets
The workloads that benefit clearly from the 3.15 improvements are tight Python loops with type-stable operands. The pyperformance benchmark suite includes things like floating-point arithmetic loops, n-body simulation written in pure Python, numerical integration routines, and benchmarks that compile regular expressions using CPython’s re module. These run for hundreds of thousands or millions of iterations over values whose types do not change mid-execution. For those inputs, the JIT’s improvements in type inference and register allocation translate directly into speedup.
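As a rough illustration of that shape, here is a minimal pure-Python numerical integration loop. It is a sketch written for this article and is not taken from pyperformance; the function names are made up.

```python
def integrate_midpoint(f, a, b, steps):
    """Midpoint-rule integration in pure Python.

    Every value in the loop body (x, total, width) stays a float for the
    whole run, which is the type-stable shape described above.
    """
    width = (b - a) / steps
    total = 0.0
    for i in range(steps):
        x = a + (i + 0.5) * width
        total += f(x)
    return total * width


def square(x):
    return x * x


# One million iterations over floats: long-running and type-stable.
area = integrate_midpoint(square, 0.0, 3.0, 1_000_000)
print(f"{area:.4f}")  # close to 9, the exact integral of x*x over [0, 3]
```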
They are real benchmarks, but they are not representative of most production Python code. The pyperformance geometric mean is a useful aggregate for tracking CPython progress across releases. It is not a predictor of what happens when you deploy a Django application or a Celery worker.
The Categories That Don’t Qualify
I/O-bound code spends most of its time waiting on network sockets, database queries, or file reads. The Python interpreter is idle during those waits. The JIT can speed up Python code surrounding I/O calls, but that code is typically not a hot loop. It is a sequence of function calls, dictionary lookups, and string formatting that runs once per I/O round-trip, never reaching the trace-formation threshold.
Async Python is a significant portion of modern workloads: asyncio event loops, frameworks like FastAPI and Starlette, anything built on Python’s coroutine model. Coroutines have a different execution profile from synchronous loops: they yield frequently, re-enter at different suspension points, and may execute only a few dozen Python instructions per await. Trace formation through coroutine suspension points is more complex, and the tier-2 optimizer handles them conservatively. The 3.15 improvements target synchronous hot loops; async code paths remain a separate and harder problem.
C extension-heavy code is arguably the most important category for deployed Python. NumPy, Pandas, PyTorch, SQLAlchemy, Pillow: all of these spend their time inside C or C++ code reached through function calls from Python. The CPython JIT operates on Python-level uops. When a Python loop calls into NumPy inside its body, the JIT can compile the surrounding Python overhead, but it cannot touch what happens inside NumPy itself. For code that spends 95% of its wall time inside C extensions, a 15% improvement on the remaining 5% of Python is roughly a 0.75% improvement end-to-end (0.05 × 0.15 = 0.0075).
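The arithmetic is Amdahl's law applied to the Python share of wall time. A small sketch, with helper names invented for illustration, shows both ways of reading "15% improvement": as a 15% cut in Python time, or as Python code running 1.15x faster.

```python
def end_to_end_reduction(python_fraction, python_time_reduction):
    """Fraction of total wall time saved when only the Python share shrinks.

    python_fraction: share of wall time spent running Python code (0..1)
    python_time_reduction: fractional cut in that Python time (0..1)
    """
    return python_fraction * python_time_reduction


def end_to_end_speedup(python_fraction, python_speedup):
    """Amdahl's law: overall speedup when the Python share runs faster."""
    new_time = (1.0 - python_fraction) + python_fraction / python_speedup
    return 1.0 / new_time


saved = end_to_end_reduction(0.05, 0.15)
print(f"wall time saved: {saved:.2%}")      # 0.75%

overall = end_to_end_speedup(0.05, 1.15)
print(f"overall speedup: {overall:.4f}x")   # 1.0066x
```

Either reading lands well under 1%, which is below the noise floor of most production measurements.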
Generator pipelines used in data transformation are also handled conservatively. A generator expression creates a generator object that yields lazily; tracing through generator boundaries involves capturing suspended frame state and re-entering at a yield point, which the tier-2 optimizer approaches carefully. Comprehensions that materialize into lists are better candidates than lazy generators, but list comprehensions with complex conditions tend to terminate before crossing the hot threshold.
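To make the distinction concrete, here is the same computation written as a lazy generator pipeline and as a single synchronous loop. This is an illustrative sketch with made-up function names; which form runs faster on a given build is something to measure, not assume.

```python
def squares_of_evens_lazy(values):
    """Generator pipeline: each stage yields lazily, suspending its frame."""
    evens = (v for v in values if v % 2 == 0)
    return sum(v * v for v in evens)


def squares_of_evens_loop(values):
    """Same computation as one synchronous loop with no generator frames."""
    total = 0
    for v in values:
        if v % 2 == 0:
            total += v * v
    return total


data = range(10)
print(squares_of_evens_lazy(data), squares_of_evens_loop(data))  # 120 120
```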
The Boxing Ceiling
There is a harder limit beneath the scope conditions. Even when the JIT compiles a trace, it cannot eliminate Python object boxing the way PyPy’s tracing JIT can. PyPy includes escape analysis: if an integer object is created, used within a loop body, and never referenced outside the trace, PyPy eliminates the heap allocation entirely and keeps the value as an unboxed machine integer in a register. A loop computing x = x + 1 a million times never allocates a million PyLongObject instances. The integer lives in a CPU register throughout.
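The cost of boxing is easy to see from inside CPython. The sketch below is illustrative; the size printed is a CPython implementation detail and differs across builds and platforms.

```python
import sys

# A machine integer fits in 8 bytes on a 64-bit CPU. A CPython int is a
# heap object that also carries a reference count and a type pointer.
value = 1_000_000
print(sys.getsizeof(value))  # typically 28 on 64-bit CPython


def count_up(n):
    """x = x + 1, n times; every intermediate x is a boxed int object
    (CPython serves very small values from a preallocated cache)."""
    x = 0
    for _ in range(n):
        x = x + 1
    return x


result = count_up(1_000_000)
# The result is still a full Python object with a live reference count.
print(type(result).__name__, sys.getrefcount(result))
```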
CPython’s copy-and-patch JIT cannot do this. The uop stencils are compiled against CPython’s reference-counted object model. The type inference and register allocation improvements in 3.15 let the optimizer skip redundant type checks and keep values in registers between adjacent operations, but the stencils still assume they receive and produce Python objects at their boundaries. A value known to be an integer gets faster handling from one operation to the next, but the PyLongObject representation and its reference count remain for values that cross trace-visible boundaries.
This is why the 3.15 gains cluster around 5-20% rather than the 300-700% PyPy achieves on equivalent benchmarks. The gains come from eliminating interpreter dispatch overhead and redundant type checks. They do not come from eliminating allocation or reference count operations.
What Already Exists for the High-Performance Cases
Python’s ecosystem has had JIT-adjacent solutions for years, targeting the cases CPython’s built-in JIT cannot reach.
Numba is an LLVM-based JIT for NumPy arrays and type-stable Python functions. It compiles decorated functions to native code at first call, unboxes arrays to raw C pointers, and generates vectorized SIMD instructions. For numerical loops over NumPy arrays, Numba can reach within 2x of hand-written C. It requires explicit decorators and constrains which Python features the decorated function can use, but for the right workloads its performance ceiling is far above what CPython’s JIT will deliver.
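A minimal sketch of the decorator workflow follows. It uses a scalar loop instead of NumPy arrays to stay dependency-light, and it falls back to plain Python when Numba is not installed, so the fallback decorator is an addition for this example, not part of Numba.

```python
try:
    from numba import njit  # third-party; assumed installed for the fast path
except ImportError:
    def njit(func):
        """Fallback so the sketch still runs (uncompiled) without Numba."""
        return func


@njit
def sum_of_squares(n):
    """Type-stable numeric loop: n is an int, total stays a float."""
    total = 0.0
    for i in range(n):
        total += i * i
    return total


# With Numba present, the first call triggers compilation and later calls
# reuse the compiled machine code.
print(sum_of_squares(10))  # 285.0
```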
Mypyc compiles type-annotated Python to C extensions ahead of time. The mypy type checker uses it to compile itself, achieving roughly 4x the throughput of running mypy as interpreted Python. Mypyc needs thorough type annotations and a compilation step, but because its output is an ordinary C extension module, compiled code still runs on standard CPython alongside the rest of the ecosystem.
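The input to mypyc is ordinary annotated Python. The sketch below is an illustrative module, and the file name mentioned in the comment is hypothetical.

```python
def fib(n: int) -> int:
    """Iterative Fibonacci with explicit annotations for the compiler."""
    a: int = 0
    b: int = 1
    for _ in range(n):
        a, b = b, a + b
    return a


# Runs unchanged under the normal interpreter. Saved as a module (say,
# fib.py), it could be compiled ahead of time with:  mypyc fib.py
print(fib(30))  # 832040
```

The same source works interpreted or compiled, which is what keeps the annotation effort low-risk.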
Cython occupies a similar space with a longer track record. Writing Python with Cython’s type declaration syntax and compiling to a C extension can bring specific hot functions close to native performance. None of these tools become obsolete when CPython’s JIT matures; they occupy a different point on the tradeoff curve, trading annotation effort and compilation steps for larger speedups.
What 3.15 Actually Delivers
Crossing from “experimental overhead” to “consistent improvement” is a prerequisite for enabling the JIT by default, and that is the milestone that makes all of this relevant to Python developers who are not manually building CPython with --enable-experimental-jit. Once the JIT is on by default, a 10% improvement on CPU-bound pure Python becomes something a developer can encounter without knowing the JIT exists.
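Until then, it helps to check whether a given interpreter has the JIT at all. The sketch below relies on sys._jit, a CPython implementation detail added in 3.14, and guards for its absence; the jit_status helper is made up for this example.

```python
import os
import sys


def jit_status():
    """Report what this interpreter says about its JIT, if anything.

    sys._jit does not exist on older versions or other implementations,
    hence the getattr guard.
    """
    jit = getattr(sys, "_jit", None)
    if jit is None:
        return {"available": False, "enabled": False}
    return {"available": jit.is_available(), "enabled": jit.is_enabled()}


print(sys.version_info[:2], jit_status())

# On builds that include the JIT, the PYTHON_JIT environment variable
# (0 or 1) toggles it at interpreter startup.
print("PYTHON_JIT =", os.environ.get("PYTHON_JIT"))
```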
The realistic expectation for 3.15 is a 5-15% improvement on CPU-bound workloads that spend meaningful time in Python hot loops with type-stable values. That improvement will not appear in applications where I/O, C extensions, or async code dominate execution time, which describes a large fraction of production Python.
What matters is the trajectory. Python 3.11 came in roughly 25% faster than 3.10 on average, with the specializing adaptive interpreter’s per-instruction specialization as a major contributor. The JIT in 3.15 adds another increment above that. Each improvement to the tier-2 optimizer’s type inference increases what the code generator can do without any changes to user code. Escape analysis, heap allocation elimination, cross-call inlining: these are possible future targets once the current layer is stable and trusted. The faster-cpython team’s design documents have been public about the incremental approach, and that approach has so far proven correct in terms of landing real improvements without breaking the C extension ecosystem.
For Python developers on the standard CPython runtime, 3.15 is worth benchmarking against specific hot paths, with the understanding that pure Python loops are the place to look first. For everything else, the JIT is infrastructure that makes the next several years of CPython optimization possible, not a change that shows up in your web server’s p99 latency next quarter.
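A minimal harness for that kind of check might look like the sketch below. The hot_path function is a stand-in for your own code, and the warm-up and repeat counts are arbitrary assumptions. Run the same script under each interpreter build and compare the medians.

```python
import statistics
import time


def hot_path(n):
    """Stand-in for the pure-Python loop you actually care about."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def bench(func, *args, warmup=3, repeats=7):
    """Return the median wall time of func(*args) in seconds.

    Warm-up calls run first so the interpreter's specialization (and the
    JIT, if enabled) has had a chance to kick in before timing starts.
    """
    for _ in range(warmup):
        func(*args)
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        func(*args)
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


seconds = bench(hot_path, 200_000)
print(f"median: {seconds * 1e3:.2f} ms")
```

For publishable numbers, a dedicated tool such as pyperf handles process isolation and system noise better than a hand-rolled timer.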