
The Recursive Machine: What It Takes to Build a CPU Out of Neural Network Weights

Source: lobsters

The premise is deliberately recursive: use a GPU, a piece of silicon purpose-built for neural network inference, to run a neural network that implements a CPU. nCPU by robertcprice does exactly this, and its existence is a working demonstration that digital logic and learned computation are not as categorically different as they might appear.

Boolean Logic Is Linear Algebra With a Threshold

Every introductory machine learning course covers the XOR problem: you cannot draw a single straight line to separate the truth table of XOR, which is why Minsky and Papert’s 1969 critique of perceptrons landed so hard. But the flip side of that observation is equally important. With two layers, you can implement XOR. With enough layers and neurons, you can implement any Boolean function.

A single neuron with a step activation implements AND or OR depending on its threshold and weight configuration. NOT is a weight of -1 with an appropriate bias. Even NAND, the universal gate from which all digital logic can be constructed, fits in a single neuron, since its truth table is linearly separable; it is XOR that requires two layers. The chain from NAND to a full adder to an ALU to a complete CPU is long but fully traversable in weight space.
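A minimal numpy sketch makes this concrete (illustrative code, not taken from nCPU itself): each gate is a weight vector, a bias, and a threshold, and XOR is built by composing two of them.

```python
import numpy as np

def step(x):
    # Heaviside step activation: 1 if the pre-activation is positive
    return (x > 0).astype(int)

def neuron(x, w, b):
    # A single neuron: weighted sum plus bias, then threshold
    return step(x @ w + b)

x = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])

AND  = neuron(x, np.array([1, 1]),  -1.5)   # fires only when both inputs are 1
OR   = neuron(x, np.array([1, 1]),  -0.5)   # fires when at least one input is 1
NAND = neuron(x, np.array([-1, -1]), 1.5)   # negated AND: still one neuron

# XOR is not linearly separable, so it takes a second layer:
# XOR(a, b) = OR(a, b) AND NAND(a, b)
hidden = np.stack([OR, NAND], axis=1)
XOR = neuron(hidden, np.array([1, 1]), -1.5)

print(AND, OR, NAND, XOR)  # [0 0 0 1] [0 1 1 1] [1 1 1 0] [0 1 1 0]
```

The same construction scales: any truth table decomposes into NAND gates, and each NAND gate is one row of a weight matrix.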

This is not a new observation. What makes it technically interesting is whether you can run all of it as tensor operations on a GPU, at what fidelity, and with what performance characteristics.

Why a GPU Is a Natural Substrate for This

GPUs were not designed to run CPUs. They were designed for shader pipelines, then repurposed for scientific computing, then retooled around the GEMM kernels (general matrix-matrix multiplication) that neural network training demands. The result is hardware that excels at one class of operation: multiply a large matrix by a vector, add a bias, apply a pointwise nonlinearity, and repeat.

If you encode CPU state as a vector and CPU operations as weight matrices, forward passes through a neural network become a way to simulate state transitions. A register file holding N bits can be represented as an N-dimensional binary vector. An instruction’s effect on that register file can be expressed as a hand-coded or learned weight matrix. Running the forward pass advances the CPU by one clock cycle.
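As a toy illustration of that encoding (hypothetical, not nCPU's actual representation): a program counter can live in a one-hot vector, and "increment" becomes a fixed permutation matrix, so one matrix-vector product advances the machine by one tick.

```python
import numpy as np

N = 8                       # addressable PC values in this toy example
pc = np.zeros(N)
pc[3] = 1.0                 # one-hot encoding: PC = 3

# "Increment PC" as a fixed weight matrix: a cyclic permutation.
# inc[i, j] = 1 exactly when j == (i - 1) mod N.
inc = np.roll(np.eye(N), 1, axis=0)

pc_next = inc @ pc          # one forward pass = one clock tick
print(pc_next.argmax())     # 4
```

Register writes, ALU results, and branches all have the same shape: state in, matrix multiply (plus nonlinearity), state out.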

The practical problem is that floating point matrix multiplication does not naturally produce clean binary outputs. If you represent register values at the bit level, you need your activations to behave like step functions at bit boundaries. This is the domain of binarized neural networks, explored in depth by Courbariaux, Hubara, and colleagues in their 2016 NeurIPS paper, though their goal was compression rather than logic emulation. The technique is the same: force weights and activations to ±1, and use the straight-through estimator to allow gradients to flow through the discontinuity.
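A numpy sketch of that forward/backward pair (the mechanics of the technique, not the BNN authors' code) looks like this:

```python
import numpy as np

def binarize_forward(w):
    # Forward pass: hard sign, so every weight/activation is exactly +/-1
    return np.where(w >= 0, 1.0, -1.0)

def binarize_backward(grad_out, w, clip=1.0):
    # Straight-through estimator: treat sign() as the identity when
    # backpropagating, but zero the gradient where |w| > clip so the
    # latent full-precision weights stay bounded
    return grad_out * (np.abs(w) <= clip)

w_real = np.array([0.3, -0.2, 1.4])        # latent full-precision weights
w_bin = binarize_forward(w_real)           # what the forward pass uses
g = binarize_backward(np.ones(3), w_real)  # gradient routed back to w_real
```

The gradient of sign() is zero almost everywhere, so without the straight-through trick no learning signal would ever reach the latent weights.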

Prior Work and the Differentiable Computing Lineage

nCPU sits in a lineage of work probing the boundary between discrete computation and continuous, differentiable models.

The Neural Turing Machine, introduced by Graves, Wayne, and Danihelka at DeepMind in 2014, showed that a neural network augmented with external memory could learn algorithmic tasks: copying, sorting, associative recall. The NTM was differentiable end-to-end, meaning it could be trained with gradient descent. The controller was a learned neural network; memory addressing was a soft, continuous approximation of discrete read/write operations.

The Differentiable Neural Computer extended this in 2016 with more sophisticated memory management, including temporal linkage between memory locations. Both systems are neural networks that learn to compute, rather than neural networks that are configured to implement specific computations.

nCPU takes a different approach: rather than training a network to perform some high-level task, it constructs a network whose weights implement the literal gate-level logic of a CPU. The network is not learning anything in the usual sense; it is a fixed computational graph expressed in the vocabulary of neural network operations.

There is also a connection to the transformer-as-state-machine literature. The 2021 paper Thinking Like Transformers by Weiss, Goldberg, and Yahav formalized the computational class of transformer models through a programming language called RASP, showing that transformers can simulate finite state machines and certain classes of formal grammars. The thread connecting these projects is the same question: what kinds of discrete computation can be expressed as feedforward tensor operations, and how far does that equivalence extend?

What Implementing a CPU in Weights Actually Requires

Consider what even a minimal CPU needs. A fetch-decode-execute loop requires a program counter that increments, an instruction fetch from memory, decoding to distinguish opcodes, ALU operations (addition, subtraction, bitwise AND/OR/XOR, shifts), load/store access, and conditional branching.

Each of these involves Boolean operations at the bit level. Addition of two N-bit numbers requires N full adder circuits chained through carry propagation. In a neural network, this means encoding each bit position as a separate dimension, with weight matrices routing carry bits through the chain. The network depth for a single add operation scales with bit width because ripple carry is inherently sequential.
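The adder case can be sketched directly (a minimal construction, not nCPU's implementation): each full adder is two layers of threshold neurons over the bit count a + b + cin, and the carry chain forces one such stage per bit.

```python
import numpy as np

def step(x):
    # Heaviside threshold, the idealized activation
    return (np.asarray(x) > 0).astype(float)

def full_adder(a, b, cin):
    # Layer 1: three threshold neurons on the bit count a + b + cin
    total = a + b + cin
    t1 = step(total - 0.5)   # count >= 1
    t2 = step(total - 1.5)   # count >= 2 -- this is exactly carry-out
    t3 = step(total - 2.5)   # count == 3
    # Layer 2: the sum bit is the parity of the count: t1 - t2 + t3
    s = step(t1 - t2 + t3 - 0.5)
    return s, t2

def ripple_add(a_bits, b_bits):
    # Little-endian bit lists; the carry chain makes depth scale with width
    carry = 0.0
    out = []
    for a, b in zip(a_bits, b_bits):
        s, carry = full_adder(a, b, carry)
        out.append(float(s))
    out.append(float(carry))
    return out

# 5 + 3 = 8: [1,0,1] + [1,1,0] little-endian gives [0,0,0,1]
print(ripple_add([1, 0, 1], [1, 1, 0]))
```

Vectorizing each stage across many additions is where the GPU earns its keep; the sequential depth of the carry chain is what it cannot remove.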

There are two approaches to this. The first is to hand-code the weights: set them deterministically to implement known logic circuits and use the GPU purely as a matrix multiply accelerator. This is conceptually clean, but it raises the question of why you would do this rather than just simulate the CPU directly in software using conventional arithmetic.

The second approach is more interesting: use gradient-based optimization to discover weights that implement target behaviors. You define the desired input-output relationship for each instruction and train the network to match it. This is closer to what electronic design automation tools do, and it opens the door to hardware-software co-optimization in ways that conventional toolchains do not support. Tools like Yosys compile HDL descriptions to gate-level netlists through deterministic synthesis passes. A differentiable version of this process would let you backpropagate through the circuit, which is a qualitatively different kind of design tool.
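The smallest version of the second approach fits in a few lines (a sketch of the idea, not the project's training loop): specify a gate as input-output pairs and let gradient descent find weights that reproduce it.

```python
import numpy as np

rng = np.random.default_rng(0)

# Behavioral specification: the NAND truth table as input-output pairs
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([1., 1., 1., 0.])

w = rng.normal(size=2)
b = 0.0

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Gradient descent on cross-entropy loss; for this loss the gradient
# with respect to the pre-activation is simply (p - y)
for _ in range(2000):
    p = sigmoid(X @ w + b)
    grad = p - y
    w -= 0.5 * X.T @ grad
    b -= 0.5 * grad.sum()

# Threshold the smooth output to recover a discrete gate
print((sigmoid(X @ w + b) > 0.5).astype(int))  # [1 1 1 0]
```

Scaling this from one gate to a full datapath is exactly where the hard problems live, but the shape of the workflow, behavioral spec in, weights out, is the same.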

The Performance Ceiling Is Beside the Point

The practical utility of nCPU as a CPU is close to zero. A single instruction on a modern x86 processor executes in a fraction of a nanosecond. Running that same instruction as a forward pass through a multi-layer network on a GPU will be slower by orders of magnitude and will consume far more power. Nobody should run production workloads on this.

But performance is the wrong frame for evaluating this kind of project. The interesting question is not how fast it is, but what it demonstrates.

First, it makes the abstraction boundary between hardware and software more porous than it is usually presented. A CPU is one physical instantiation of a computation graph, optimized for latency and energy efficiency through decades of semiconductor engineering. A neural network running on a GPU is a different physical instantiation of a computation graph, optimized for throughput on dense matrix operations. If the same computations can be expressed in both, the distinction between them is engineering and economics rather than principle.

Second, there is genuine educational value in making this connection explicit. Students learning about CPUs encounter transistors, logic gates, and timing diagrams. Students learning about neural networks encounter matrix multiplication and gradient descent. Projects like nCPU make the underlying equivalence concrete in a way that textbooks rarely do. A NAND gate is a neuron with particular weights. A ripple carry adder is a computation graph with N layers of carry propagation. The boundary dissolves when you look at it directly.

Third, if the weights implementing a CPU are discoverable via gradient descent, you have a path toward synthesizing hardware logic from behavioral specifications automatically. Specify what you want the circuit to do, provide input-output pairs, and let an optimizer find the weights. The result is something closer to program synthesis than circuit design, and it runs on the same infrastructure used to train language models.

What Comes Next

The tractable goal for a project at this stage is a simple in-order pipeline implementing a minimal instruction set. An architecture like RISC-V’s RV32I base integer set, with 47 instructions and fixed-width 32-bit encoding, is a reasonable target: small enough to be implemented completely, well-specified enough to verify correctness, and widely understood enough that the results are interpretable.

Getting that to run correctly on a GPU, even slowly, would be a meaningful result. It would confirm that the mapping from gate-level logic to tensor operations is complete enough to run real programs rather than toy examples, and it would produce a differentiable CPU model that could be used as a component in larger learned systems.

What nCPU represents is an invitation to think about computation as a substrate-agnostic phenomenon. The weights that implement a NAND gate work whether they are running on a GPU, a TPU, or a cluster of neurons. That observation does not change what computers are, but it does change what you think they could be.
