
The Stack Machine Hiding in Every Python Function Call

Source: lobsters

Python hides a lot of machinery between the code you write and the CPU instructions that eventually run. Most developers know there is a compilation step, that .pyc files exist, and that CPython is slower than languages that compile natively. What fewer people trace is the precise sequence of operations bridging source text to execution. The 500 Lines or Less chapter on a Python interpreter written in Python, by Allison Kaptur, walks through exactly that sequence by building a minimal bytecode interpreter called Byterun. The result is a working Python interpreter in under 500 lines of Python. More than the code itself, the exercise reveals specific design choices embedded deep in CPython that are invisible when you are simply using the language.

From Source to Bytecode

Python’s path from source to execution has two distinct stages. The first is familiar: the source file is parsed into an AST and the AST is compiled into bytecode; for imported modules, that bytecode is cached in .pyc files so parsing and compilation can be skipped on later runs. The second stage is less discussed: the interpreter executes the bytecode at runtime. Byterun sits entirely in this second stage. It takes already-compiled Python bytecode as input, which means you can test it by compiling Python code with the standard compile() built-in and passing the resulting code object to Byterun’s virtual machine.
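As a small illustration of the first stage, the sketch below compiles a one-line program with compile() and then hands the resulting code object to exec(), standing in for the virtual machine. The source string, the "<example>" filename, and the variable names are made up for this example.

```python
# Stage one: turn source text into a code object (parse + compile).
source = "result = 6 * 7"
code = compile(source, "<example>", "exec")

# The code object carries the raw bytecode plus the names it refers to.
bytecode = code.co_code      # a bytes object; contents vary by Python version
names = code.co_names        # names referenced by the bytecode

# Stage two: execute the bytecode. Here CPython's own interpreter does it;
# a bytecode interpreter such as Byterun would take the same code object.
namespace = {}
exec(code, namespace)
print(namespace["result"])   # 42
```

The exact bytes in co_code differ between CPython releases, which is one reason a bytecode interpreter has to target a specific version.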

The dis module lets you inspect this intermediate form directly:

import dis

def add(a, b):
    return a + b

dis.dis(add)

On CPython 3.11 or 3.12 this produces output like the following (opcode names and offsets vary between versions):

  3           0 RESUME                   0

  4           2 LOAD_FAST                0 (a)
              4 LOAD_FAST                1 (b)
              6 BINARY_OP                0 (+)
             10 RETURN_VALUE

Every instruction name here is a real opcode that CPython’s interpreter loop handles in Python/ceval.c. Byterun handles a subset of these, enough to execute non-trivial Python programs, but each one maps directly onto what CPython does in that main loop.

The Stack Machine Model

The most consequential architectural choice in CPython’s interpreter is that it is a stack machine, not a register machine. When you write a + b, CPython does not assign registers and issue an add instruction. It pushes a onto a value stack, pushes b onto the same stack, then executes BINARY_OP, which pops both values, adds them, and pushes the result back.

This is not the only viable design. The Lua VM is a register machine. Dalvik, the original Android runtime, is a register machine. Register machines tend to issue fewer total instructions because values move through memory less; stack machines produce simpler, more compact bytecode because instructions need not specify operands. CPython chose simplicity of bytecode over execution efficiency, a reasonable trade for a language never expected to compete with compiled code on raw throughput.

Byterun makes this explicit because you can read the value stack as a Python list. Each frame holds its data stack as an ordinary list, and the VirtualMachine class’s push and pop helpers are thin wrappers over that list’s own methods. Watching LOAD_FAST push a local variable onto a Python list, then watching BINARY_OP pop two items and push the result back, makes the model concrete in a way that reading CPython’s C source does not.
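The push-and-pop model can be sketched in a few lines. This is not Byterun’s code: the run() function, the tuple-based instruction format, and the BINARY_ADD name are simplifications chosen for illustration, and real bytecode encodes operands differently.

```python
def run(instructions, local_vars):
    """Execute a list of (opname, argument) pairs on a list-based value stack."""
    stack = []
    for opname, arg in instructions:
        if opname == "LOAD_FAST":
            # Push the value of a local variable.
            stack.append(local_vars[arg])
        elif opname == "LOAD_CONST":
            # Push a constant directly.
            stack.append(arg)
        elif opname == "BINARY_ADD":
            # Pop the two operands (right one first), push their sum.
            right = stack.pop()
            left = stack.pop()
            stack.append(left + right)
        elif opname == "RETURN_VALUE":
            # The result is whatever is on top of the stack.
            return stack.pop()
        else:
            raise ValueError(f"unknown instruction: {opname}")
    raise RuntimeError("program ended without RETURN_VALUE")


# Hand-written equivalent of `return a + b`.
program = [
    ("LOAD_FAST", "a"),
    ("LOAD_FAST", "b"),
    ("BINARY_ADD", None),
    ("RETURN_VALUE", None),
]
result = run(program, {"a": 2, "b": 3})
print(result)  # 5
```

Note that no instruction names a destination: operands are implied by stack position, which is what keeps stack-machine bytecode compact.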

Frames as First-Class Objects

Every function call in Python creates a frame object. The frame holds the local variables for that call, a reference to the code object (bytecode plus constants and variable names), a pointer to the enclosing frame, and the current instruction pointer. When the function returns, the frame is discarded.

Byterun’s Frame class exposes this directly as a Python object with Python attributes. CPython’s PyFrameObject in Objects/frameobject.c does the same thing in C, and Python exposes it through sys._getframe() and tracing hooks. The fact that frames are objects rather than implicit stack entries has real consequences: it is precisely why Python generators work the way they do.
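To see that frames really are ordinary objects, the sketch below uses sys._getframe() to look at the current frame and its caller. The function names inner and outer and the variable secret are invented for the example; sys._getframe() is a CPython implementation detail and may not exist on other interpreters.

```python
import sys


def inner():
    frame = sys._getframe()      # the frame for this call to inner()
    caller = frame.f_back        # the enclosing (calling) frame
    return {
        "name": frame.f_code.co_name,
        "caller": caller.f_code.co_name,
        # Copy the caller's locals into a plain dict for inspection.
        "caller_locals": dict(caller.f_locals),
    }


def outer():
    secret = 42
    return inner()


info = outer()
print(info)
# {'name': 'inner', 'caller': 'outer', 'caller_locals': {'secret': 42}}
```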

A generator is a suspended frame. When you call next() on a generator, CPython restores the frame’s state and resumes execution from the last YIELD_VALUE instruction. The frame never goes away between next() calls; it sits on the heap, preserving the value stack, the instruction pointer, and all local variables. This is why generators have a gi_frame attribute and why you can inspect a suspended generator’s current locals by reading gen.gi_frame.f_locals. Byterun’s implementation of YIELD_VALUE illustrates this by literally returning control to the caller while leaving the Frame object intact, resuming from the saved instruction pointer on the next call. The mechanism is transparent because the implementation language is the same as the subject language.
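The suspended frame can be observed directly. In this sketch, countdown is a made-up generator; the snapshots show its local variable surviving between next() calls, and the frame being released once the generator finishes.

```python
def countdown(n):
    while n > 0:
        yield n
        n -= 1


gen = countdown(3)

first = next(gen)                              # runs until the first yield
locals_after_first = dict(gen.gi_frame.f_locals)

second = next(gen)                             # resumes, decrements n, yields again
locals_after_second = dict(gen.gi_frame.f_locals)

rest = list(gen)                               # drain the generator
frame_after_exhaustion = gen.gi_frame          # CPython drops the frame when done

print(first, locals_after_first)     # 3 {'n': 3}
print(second, locals_after_second)   # 2 {'n': 2}
print(rest, frame_after_exhaustion)  # [1] None
```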

What the Remaining 97% of CPython Does

The 500-line constraint means Byterun omits essentially everything beyond the core interpreter loop: no garbage collector, no memory management beyond Python’s own, no C extension API, no thread state, no GIL, no optimization passes. This gap is instructive in its own right, because it shows how much of a real interpreter is not the instruction dispatch loop.

CPython manages memory with reference counting for the common case, plus a cyclic garbage collector for the reference cycles that reference counting alone cannot reclaim. The GIL, the Global Interpreter Lock, serializes bytecode execution across threads by allowing only one thread to run the interpreter loop at a time. This is the primary reason CPU-bound Python multithreading does not scale across cores. The C extension API, exposed through Python.h, is what NumPy, Pandas, and hundreds of other packages use to work with Python objects from C and to expose C code to Python. Python 3.13 began offering an experimental free-threaded build that removes the GIL, but doing this correctly without breaking the C extension ecosystem is an enormous engineering project that the CPython team is still working through.
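The split between reference counting and the cyclic collector is easy to observe on CPython. The sketch below uses a hypothetical Node class and weak references as probes; it temporarily disables automatic collection so the cycle is only reclaimed by the explicit gc.collect() call. The immediate-free behavior shown for the lone object is specific to reference-counting implementations such as CPython.

```python
import gc
import weakref


class Node:
    def __init__(self, name):
        self.name = name
        self.partner = None


def make_cycle():
    """Create two nodes that reference each other; return a weak probe to one."""
    a = Node("a")
    b = Node("b")
    a.partner = b
    b.partner = a
    return weakref.ref(a)


was_enabled = gc.isenabled()
gc.disable()  # keep the cyclic collector from running on its own
try:
    # No cycle: the object is freed as soon as its last reference goes away.
    lone = Node("lone")
    lone_probe = weakref.ref(lone)
    del lone
    lone_freed_immediately = lone_probe() is None

    # A cycle: reference counts never reach zero, so the pair lingers...
    cycle_probe = make_cycle()
    cycle_alive_before = cycle_probe() is not None

    # ...until the cyclic collector finds and frees it.
    gc.collect()
    cycle_alive_after = cycle_probe() is not None
finally:
    if was_enabled:
        gc.enable()

print(lone_freed_immediately, cycle_alive_before, cycle_alive_after)
# True True False
```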

None of this appears in Byterun, yet all of it is load-bearing for the Python you actually run in production.

Alternative Implementations and the Trade-offs They Reveal

PyPy addresses the performance gap with a tracing JIT compiler that identifies hot bytecode paths and compiles them to native code at runtime. Depending on the workload, PyPy reaches 4 to 10 times CPython’s throughput on CPU-bound code, though with a slower startup time and higher memory use. PyPy’s interpreter is itself written in RPython, a restricted Python subset that can be compiled to C, which means the self-hosting concept Byterun demonstrates is not purely educational: PyPy uses it for actual production purposes.

Jython compiles Python source to JVM bytecode, gaining access to the JVM’s JIT and the entire Java ecosystem, but losing compatibility with C extensions. GraalPy also targets the Java ecosystem, but as an interpreter built on GraalVM’s Truffle framework, and achieves competitive performance through partial evaluation. Each of these alternatives exists precisely because the reference interpreter optimizes for clarity and portability over raw throughput. Understanding what Byterun implements, and what it deliberately skips, maps directly to understanding why each alternative made the trade-offs it did.

The Debugging and Introspection Layer

One thing the 500 Lines chapter spends little time on is that CPython’s frame model is intentionally exposed for debuggers and profilers. The sys.settrace() function installs a callback that CPython calls on each line executed, each function call, and each function return. The callback receives the current frame object, giving you local variables, the code object, and the current line number. This is how pdb, coverage.py, and most Python profilers work at a fundamental level.
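A minimal tracer shows what the callback actually receives. In this sketch, trace_calls and sample are hypothetical names; the tracer records only events from the target function’s own frame and restores whatever trace function was previously installed.

```python
import sys


def trace_calls(func, *args):
    """Run func(*args) and record the trace events raised inside its frame."""
    events = []

    def tracer(frame, event, arg):
        # Only follow the frame that is executing func's code object.
        if frame.f_code is func.__code__:
            events.append((event, frame.f_lineno, dict(frame.f_locals)))
            return tracer   # keep receiving 'line' and 'return' events
        return None         # ignore every other frame

    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        result = func(*args)
    finally:
        sys.settrace(previous)
    return result, events


def sample(x):
    y = x + 1
    return y * 2


result, events = trace_calls(sample, 3)
print(result)  # 8
for event, lineno, local_vars in events:
    print(event, lineno, local_vars)
```

Each recorded tuple pairs an event name with the line number and a copy of the locals at that moment, which is essentially the raw material a debugger or coverage tool works from.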

Byterun could support a similar mechanism without much additional code: the Frame object is already a Python object, and adding a trace callback to the main dispatch loop would be straightforward. The fact that CPython’s introspection story maps so cleanly onto a minimal interpreter implementation is not accidental. Python’s design has consistently prioritized giving developers access to runtime state, even at some cost to performance. The tracing and profiling hooks are not retrofitted debugging features; they follow directly from the frame-as-object model that Byterun makes visible.

Why the Self-Hosting Angle Matters

There is something specifically useful about implementing an interpreter in the language it interprets. The abstraction boundaries become visible. When Byterun handles BUILD_LIST by calling Python’s list(), you see that the bytecode instruction is a thin wrapper over the host runtime’s own list type. When it handles MAKE_FUNCTION by constructing a Python function object, you see that code objects and function objects are not mysterious engine internals; they are Python objects the language itself can construct and inspect at any time.
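The claim that function objects can be built from ordinary Python is easy to check with types.FunctionType. The sketch below is an illustration rather than Byterun’s implementation: it compiles a small module, pulls the function body’s code object out of the module’s constants, and wraps it in a function by hand. The greet function and the "<example>" filename are made up.

```python
import types

module_source = "def greet(name):\n    return 'hello, ' + name\n"
module_code = compile(module_source, "<example>", "exec")

# The compiled function body is stored as a constant of the module's code object.
body_code = next(
    const for const in module_code.co_consts
    if isinstance(const, types.CodeType)
)

# Build the function object directly: code object + globals dict + name.
greet = types.FunctionType(body_code, {}, "greet")

print(greet("world"))             # hello, world
print(greet.__code__ is body_code)  # True
```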

This is what the 500 Lines or Less book aims at more broadly: taking systems that seem opaque or massive and showing that the core ideas fit in a page. The Python interpreter chapter works because the core ideas genuinely do fit in under 500 lines when you borrow everything else from the host runtime. What remains is the interesting part: how bytecode maps to behavior, how frames model call stacks, how a Python list serves as the value stack for the entire expression evaluation model.

For anyone who writes Python regularly and wants a more accurate mental model of what the runtime is actually doing, reading Kaptur’s implementation alongside dis.dis() output on functions you write is a more concrete exercise than any conceptual overview. The gap between the phrase ‘Python is interpreted’ and the reality of an instruction pointer advancing through opcodes while a list serves as the value stack is smaller than it looks, and Byterun is the shortest path across it.
