
What It Actually Means to Execute Programs Inside a Transformer

Source: hackernews

A Percepta AI post from earlier this week makes a striking claim: you can execute arbitrary programs inside a transformer’s forward pass, and doing so yields exponentially faster inference for certain problem classes compared to standard autoregressive generation. That sounds like marketing, but the theoretical foundation is solid and the lineage of research behind it is worth understanding carefully.

The claim does not mean what most people might initially read into it. It is not saying that a GPT-4-class model secretly executes Python when you prompt it. It is saying something more specific: that transformer architectures are expressive enough to implement arbitrary computation, given enough depth or loop iterations, and that for problems amenable to parallel evaluation, the computational depth required is logarithmically smaller than the number of serial steps required by token-by-token generation.

To understand why that is true, you need to briefly visit circuit complexity theory.

Transformers and the TC⁰ Connection

Theoretical ML researchers have spent several years characterizing the computational class that transformers belong to. The key result: standard fixed-depth, log-precision transformers correspond to TC⁰, threshold circuits of constant depth. TC⁰ is a subclass of NC (the efficiently parallelizable problems), which sits inside P (polynomial-time decidable problems) and is widely conjectured, though not proven, to be strictly smaller. TC⁰ already contains some surprising things: sorting, integer multiplication, iterated addition, and even integer division all fall within it.

Merrill and Sabharwal’s 2023 paper in TACL, “The Parallelism Tradeoff,” formalizes this correspondence precisely. The depth of a transformer maps to circuit depth; the width (embedding dimension and attention heads) maps to circuit width. Fixed-depth transformers with finite precision are not Turing-complete, but they compute substantially more than people often assume.

The constraint is depth. A 96-layer transformer has 96 rounds of computation, and those rounds are sequential even if each round is massively parallel internally. For problems that require more than a constant number of serial dependencies, a fixed-depth transformer cannot compute the answer in a single forward pass.

Chain-of-Thought as Depth Extension

Before getting to the looped transformer approach, the intermediate result worth knowing is what chain-of-thought (CoT) reasoning actually does from a complexity standpoint. Merrill and Sabharwal showed that each CoT token effectively adds a layer of computation. A transformer with L layers generating T scratchpad tokens before its final answer has effectively L × T depth, allowing it to simulate any algorithm solvable in O(L × T) serial steps. Put another way, with polynomially many scratchpad tokens, CoT lets the transformer escape TC⁰ and simulate polynomial-time computation, at the cost of linear serial token generation.

This is the theoretical grounding for why CoT improves performance on multi-step reasoning. It is not metaphorical. Each step of the CoT scratchpad is a genuine additional round of computation, implemented in transformer weights. The practical value of this result is that it explains the scaling behavior of reasoning models without requiring any new theoretical machinery.
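The depth-extension view can be sketched in a few lines of toy Python (purely illustrative; the function names are mine, not from any paper). Here the per-layer "computation" is one modular squaring, a task believed to be inherently serial, so L layers run once give depth L, while L layers re-run across T scratchpad tokens give depth L × T:

```python
def layer_stack(state, depth=3):
    """Stand-in for one forward pass of an L-layer transformer:
    a fixed update applied `depth` times. Each 'layer' performs one
    modular squaring, a step that cannot be parallelized away."""
    x, m = state
    for _ in range(depth):
        x = (x * x) % m
    return (x, m)

def cot_generate(state, num_tokens):
    """Each emitted scratchpad token re-runs the full stack on the
    updated state, so the total serial depth is L * T."""
    for _ in range(num_tokens):
        state = layer_stack(state)
    return state

# One forward pass: depth 3. Four scratchpad tokens: depth 12,
# i.e. x -> x^(2^12) mod m, which no single pass of this stack computes.
result, _ = cot_generate((3, 97), num_tokens=4)
```

The point of the sketch is only the composition structure: generation does not add parameters, it adds rounds.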

Looped Transformers as Programmable Computers

The more exotic direction, and the one the Percepta article builds on, is looped transformers. The key paper here is Giannou et al.’s “Looped Transformers as Programmable Computers”, published at ICML 2023. They show that a single transformer block, applied T times in a loop, can simulate a universal computer by executing an arbitrary program encoded in the weights.

The construction maps cleanly onto standard computer architecture:

  • The residual stream (the running accumulation of layer outputs) serves as the register file and memory. Different positions in the sequence correspond to named registers.
  • Attention heads implement memory addressing. By constructing query, key, and value matrices that attend precisely to specific positions, you get deterministic register reads.
  • The feedforward MLP implements the arithmetic logic unit. MLPs are universal function approximators, so with sufficient width you can implement comparison, addition, conditional branching logic, and arbitrary fixed functions.
  • The loop count T corresponds to program runtime, with each iteration executing one instruction.
  • A program counter register tracks the current instruction, and attention-weighted updates to it implement conditional jumps.

The result is a formal construction proving that any program expressible in their instruction set can be compiled into a fixed transformer block that executes it correctly when looped. This is Turing-complete in the limit of unbounded loop count.
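The register-machine framing can be made concrete with a non-neural toy interpreter (my own illustrative sketch of the correspondence, not Giannou et al.'s construction, which encodes all of this in attention and MLP weights):

```python
def run_looped(program, registers, T):
    """One loop iteration = one executed instruction. `registers`
    plays the role of the residual stream; `pc` is the program
    counter register updated by conditional jumps."""
    pc = 0
    for _ in range(T):          # loop count T = runtime budget
        if pc >= len(program):  # program fell off the end: halt
            break
        op, *args = program[pc]
        if op == "add":         # MLP-style arithmetic on registers
            dst, a, b = args
            registers[dst] = registers[a] + registers[b]
            pc += 1
        elif op == "jnz":       # conditional jump: pc update gated on a register
            reg, target = args
            pc = target if registers[reg] != 0 else pc + 1
    return registers

# Multiply 5 * 4 by repeated addition: 12 instructions suffice.
prog = [
    ("add", "acc", "acc", "x"),    # acc += x
    ("add", "n", "n", "minus1"),   # n -= 1
    ("jnz", "n", 0),               # loop while n != 0
]
regs = run_looped(prog, {"acc": 0, "x": 5, "n": 4, "minus1": -1}, T=12)
```

In the actual paper, the dispatch on `op` is done by attention reading the instruction at the program counter, and the arithmetic by a fixed MLP; the Python `for` loop corresponds to re-applying the same transformer block.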

The RASP Lineage

Giannou et al. built on a foundation laid by Weiss, Goldberg, and Yahav’s RASP language (ICML 2021). RASP is a programming language designed to have a one-to-one correspondence with transformer circuits: each primitive operation in RASP maps to an attention head or MLP layer, and RASP programs can be compiled into transformer weights. The Tracr library from Google DeepMind implements this compilation, turning RASP programs into actual weight matrices that you can load and run. That is a genuinely useful tool for mechanistic interpretability research, because it lets you construct transformers with known, interpretable computations and then study how those computations are represented internally.

A simple RASP program that computes whether each token has been seen before looks like this:

has_seen_before = aggregate(
    select(tokens, tokens, ==),
    indicator(True)
)

This compiles directly into an attention pattern where each position attends to all earlier positions with matching token values. The circuit is human-readable. Tracr turns it into weight matrices.
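The behavior of that compiled attention pattern can be spelled out in plain Python (an interpretation sketch of what the circuit computes, not Tracr's actual output):

```python
def has_seen_before(tokens):
    """Mimics the compiled circuit: position i 'attends' to earlier
    positions j with tokens[j] == tokens[i]; aggregating an indicator
    over that selection yields 1 iff any such j exists."""
    out = []
    for i, tok in enumerate(tokens):
        # selector row for position i: which earlier positions match
        matches = [tokens[j] == tok for j in range(i)]
        out.append(1 if any(matches) else 0)
    return out

has_seen_before(list("abcab"))  # -> [0, 0, 0, 1, 1]
```

Everything here is a single attention-plus-aggregate step: no recurrence, no scratchpad, one round of parallel lookups across the sequence.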

The earlier theoretical anchor is Pérez, Marinković, and Barceló’s “Attention Is Turing-Complete” (JMLR 2021), which proved that transformers with hard attention and unbounded depth can simulate Turing machines via an attention-based tape read/write construction. The Giannou result is both more constructive and more practically specified.

What “Exponentially Faster” Actually Means

The exponential speedup claim is real, but it requires careful scoping. It applies to problems in NC, the efficiently parallelizable class. For these problems, there exist algorithms with O(log n) parallel depth using polynomially many processors. Examples include prefix sums, sorting networks, matrix-vector products, tree reductions, and FFT-like computations.

Autoregressive generation handles these sequentially: computing the nth element of a prefix sum autoregressively requires n token generation steps, each dependent on the previous. A looped transformer executing a parallel prefix algorithm needs only O(log n) loop iterations, with all sequence positions updated simultaneously in each iteration. That is where the exponential factor comes from, n versus log n.

The speedup is relative to serial token generation, not to other parallel hardware. A well-implemented CUDA kernel for prefix sums also achieves O(log n) parallel depth. The point is that the transformer’s forward pass can encode and execute this parallel structure without requiring an external parallel computer or a chain of autoregressive steps.
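The n-versus-log-n gap is easy to see in a Hillis-Steele inclusive scan, where each round doubles the reach of every position and all positions update simultaneously (a minimal sketch; the looped-transformer version would implement each round as one block iteration):

```python
def scan_rounds(xs):
    """Inclusive prefix sum via Hillis-Steele: ceil(log2 n) rounds,
    with every position updated 'in parallel' within each round."""
    xs = list(xs)
    rounds = 0
    offset = 1
    while offset < len(xs):
        # one parallel round: position i adds the value `offset` slots back
        xs = [xs[i] + (xs[i - offset] if i >= offset else 0)
              for i in range(len(xs))]
        offset *= 2
        rounds += 1
    return xs, rounds

# 8 elements: 3 rounds, versus 8 dependent steps if generated token by token.
sums, rounds = scan_rounds([1] * 8)
```

The same structure underlies GPU scan kernels; the theoretical observation is that attention over the full sequence can play the role of the "offset slots back" read.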

This matters for inference because autoregressive generation is currently the binding constraint on LLM throughput. Each token requires a full forward pass, and the tokens are serial. If certain subproblems can be offloaded to fixed-weight looped execution rather than token-by-token generation, the savings for those specific problem classes are substantial.

The Gap Between Theory and Practice

The construction is elegant; the gap to deployed models is significant.

First, the weight matrices required to implement a programmable computer are precisely engineered, not learned through gradient descent. Standard training does not produce these structures. Mechanistic interpretability research finds that trained transformers do implement recognizable algorithmic circuits (induction heads, copy circuits, modular arithmetic in small models), but these are sparse emergent structures, not systematically programmable computers.

Second, real transformers use bfloat16 or float16 arithmetic. Many theoretical results require arbitrary or log-precision, which 16-bit floats cannot faithfully represent. Merrill and Sabharwal's 2022 paper showed that saturated transformers, whose attention hard-maxes rather than softly mixing, are limited to constant-depth threshold circuits, so the expressiveness of certain constructions collapses when attention logits saturate at float extremes. The precision assumption is not a minor caveat.

Third, the looping mechanism is not standard. Running the same block T times during inference requires either an architectural choice baked in at training time (as in Universal Transformers, Dehghani et al., ICLR 2019) or a weight-sharing scheme applied post-hoc. Most deployed models do not have this property and cannot directly benefit from the looped construction.

None of these gaps invalidate the theoretical claims. They do mean that the path from “transformer can represent arbitrary computation” to “we should redesign inference around this” involves substantially more engineering than the blog post framing suggests.

Why This Research Direction Matters

The practical interest in this line of work is not primarily about deploying hand-crafted algorithmic transformers. It is about understanding the computational structure of the forward pass precisely enough to design better architectures and inference strategies.

Universal transformers, adaptive computation time, and mixture-of-experts routing are all architectural responses to the fixed-depth bottleneck identified by the theoretical work. If you understand that depth is the constraint, you can design mechanisms that allocate more depth where it is needed, rather than applying a fixed 96 layers uniformly across all tokens and problem types.

More speculatively, if you know which subproblems are in NC, you can design hybrid inference pipelines that route those subproblems to looped fixed-weight execution while reserving autoregressive generation for tasks that genuinely require sequential dependent reasoning. The Percepta approach is exploring this boundary in a practical context.

The broader point is that LLMs are circuits. The theoretical computer science community has characterized those circuits with increasing precision over the past five years, and that characterization has direct consequences for what kinds of inference improvements are principled versus illusory. The RASP lineage, the Giannou looped transformer construction, and the Merrill-Sabharwal circuit complexity results collectively give us a much clearer map of what transformers can and cannot compute than intuition alone provides.

The “executing programs inside transformers” framing is not a metaphor. It is a description of something the weights can provably do, under conditions that remain practically challenging but no longer theoretically speculative. That distinction is worth keeping in mind as inference-time compute becomes an increasingly active research frontier.
