The Forward Pass That Executes Instructions: How a CPU Fits Inside Neural Network Weights
Source: lobsters
nCPU implements a CPU as a collection of neural network weights and runs it on GPU. The premise sounds paradoxical: GPUs exist to run neural networks, and CPUs are what we use to train and coordinate those networks. Running a CPU inside a neural network that runs on a GPU is the kind of thing that might seem like a stunt until you work through the mathematics, at which point it becomes something more interesting.
Boolean Logic Has Always Been Linear Algebra
The connection between threshold neurons and boolean gates predates modern deep learning by decades. Warren McCulloch and Walter Pitts proposed their neuron model in a 1943 paper precisely because they wanted to show that nervous systems could compute any boolean function. The McCulloch-Pitts neuron computes a weighted sum of binary inputs and fires if that sum exceeds a threshold. That is, structurally, what a logic gate does.
The concrete mapping is exact, not approximate:
AND(x, y) = step(x + y - 1.5) # fires only when both inputs are 1
OR(x, y) = step(x + y - 0.5) # fires when at least one input is 1
NOT(x) = step(0.5 - x) # fires when input is 0
NAND(x, y) = step(1.5 - x - y) # universal gate; 0 only when both inputs are 1
Here step(t) returns 1 if t > 0 and 0 otherwise. These are not approximations: given binary inputs, each expression produces exactly the correct boolean output.
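As a sanity check, the four mappings can be written down in a few lines of plain Python (the function names mirror the gates above; none of this is taken from the nCPU codebase):

```python
def step(t):
    return 1 if t > 0 else 0

def AND(x, y):  return step(x + y - 1.5)
def OR(x, y):   return step(x + y - 0.5)
def NOT(x):     return step(0.5 - x)
def NAND(x, y): return step(1.5 - x - y)  # threshold 1.5: NAND(1,1) = step(-0.5) = 0

# Exhaustive check over all binary inputs confirms the mappings are exact.
for x in (0, 1):
    for y in (0, 1):
        assert AND(x, y) == (x & y)
        assert OR(x, y) == (x | y)
        assert NAND(x, y) == 1 - (x & y)
    assert NOT(x) == 1 - x
```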
Because NAND is functionally complete, every combinational logic circuit is representable as a feedforward neural network with threshold activations and integer weights. A CPU datapath is a combinational logic circuit. Therefore a CPU is, exactly and precisely, a neural network. The nCPU project implements this directly.
What CPU Architecture Looks Like as Network Layers
A simplified RISC pipeline has five stages: fetch, decode, execute, memory access, writeback. Each stage is a combinational logic block, and each logic block maps to a layer in the network. Registers hold state between cycles; their contents can be represented as binary vectors carried from one forward pass to the next.
The ALU is a particularly clean example. A 1-bit full adder for inputs A, B and carry-in Cin produces outputs Sum and Cout:
Sum = XOR(XOR(A, B), Cin)
Cout = OR(AND(A, B), AND(XOR(A, B), Cin))
Unrolling this into threshold neurons produces a small two-layer network. A 32-bit ripple-carry adder is thirty-two of these chained together; a carry-lookahead adder trades depth for parallelism by computing carry signals concurrently. Either way, the neural network representation is structurally identical to the hardware: neurons as gates, layer depth as propagation delay, weight magnitude as wire strength.
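The unrolling can be sketched directly from the threshold gates. A minimal Python version, chaining full adders into a ripple-carry adder (illustrative only, not the nCPU implementation):

```python
def step(t):
    return 1 if t > 0 else 0

# Threshold-gate primitives; XOR is built as AND(OR, NAND), two gate layers deep.
def AND(x, y): return step(x + y - 1.5)
def OR(x, y):  return step(x + y - 0.5)
def XOR(x, y): return AND(OR(x, y), step(1.5 - x - y))

def full_adder(a, b, cin):
    s = XOR(XOR(a, b), cin)
    cout = OR(AND(a, b), AND(XOR(a, b), cin))
    return s, cout

def ripple_carry_add(a_bits, b_bits):
    # a_bits, b_bits: little-endian bit lists of equal length
    carry, out = 0, []
    for a, b in zip(a_bits, b_bits):
        s, carry = full_adder(a, b, carry)
        out.append(s)
    return out, carry

# 5 + 6 = 11 in 4 bits, little-endian: [1,0,1,0] + [0,1,1,0] -> [1,1,0,1]
bits, carry = ripple_carry_add([1, 0, 1, 0], [0, 1, 1, 0])
```

The depth of the gate composition here is exactly the propagation delay of the hardware circuit, which is the structural correspondence the paragraph above describes.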
The control unit follows the same principle. Instruction decoding maps an opcode to a set of control signals, which is a lookup that a single linear layer with threshold activation handles directly. The whole pipeline becomes a fixed-weight network where one forward pass corresponds to one clock cycle.
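A toy version of that decode lookup, with a hypothetical 2-bit opcode and four invented control signals (the ISA and signal names here are made up for illustration; the point is the two fixed matrices):

```python
import numpy as np

def step(t):
    return (t > 0).astype(int)

# Layer 1: one-hot-decode the opcode bits (b1, b0).
# Rows are threshold detectors for {00: ADD, 01: SUB, 10: LOAD, 11: STORE}.
W_onehot = np.array([
    [-1, -1],   # fires only for 00
    [-1,  1],   # fires only for 01
    [ 1, -1],   # fires only for 10
    [ 1,  1],   # fires only for 11
])
b_onehot = np.array([0.5, -0.5, -0.5, -1.5])

# Layer 2: control signals are a 0/1 matrix applied to the one-hot vector.
# Columns: [alu_enable, alu_sub, mem_read, mem_write]
W_ctrl = np.array([
    [1, 0, 0, 0],   # ADD
    [1, 1, 0, 0],   # SUB
    [0, 0, 1, 0],   # LOAD
    [0, 0, 0, 1],   # STORE
])

def decode(b1, b0):
    onehot = step(W_onehot @ np.array([b1, b0]) + b_onehot)
    return onehot @ W_ctrl
```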
The Memory Problem and Its Neural Solution
Arithmetic is straightforward. Memory is harder, because real memory access involves addressing, which is a computation over addresses rather than just over data values.
DeepMind’s Neural Turing Machine (2014) solved this problem for learned systems by replacing hard address indexing with differentiable attention over a memory matrix. The NTM controller outputs a query vector; the memory bank contains key-value pairs; a softmax over dot products between the query and all keys produces a probability distribution over memory locations. Reading is a weighted sum of all values under that distribution. When the softmax temperature is low, this approaches a hard lookup at the nearest matching address.
The Differentiable Neural Computer (2016) extended this with dynamic allocation and temporal link matrices, so the network could traverse sequences of writes in order. Both systems use the same underlying operation that transformer attention uses:
Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
The V matrix is memory contents, K is addresses or address representations, Q is the current read request. With sufficiently sharp softmax, the behavior is equivalent to a traditional load or store instruction.
For a neural-network CPU, this is the natural mechanism for the memory bus. Each memory access is an attention computation over the full address space. The weight matrices that implement the attention can be constructed directly from the desired addressing behavior; no gradient descent required.
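A minimal sketch of that construction, assuming one-hot address codes as keys and a hand-picked inverse temperature (nothing here is taken from the nCPU code):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

n_addr, word = 8, 4
K = np.eye(n_addr)   # keys: one-hot address codes, constructed, not learned
V = np.arange(n_addr * word).reshape(n_addr, word).astype(float)  # memory contents

def load(addr, beta=50.0):
    # beta is an inverse temperature; large beta sharpens the softmax
    # toward a hard lookup at the matching address.
    q = np.eye(n_addr)[addr]
    w = softmax(beta * (K @ q))   # distribution over memory locations
    return w @ V                  # weighted sum of all stored values
```

At beta=50 the off-address weights are on the order of e^-50, so the read is numerically indistinguishable from a hard `V[addr]` load.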
Why GPU Execution Follows Naturally
The reason nCPU runs on GPU rather than CPU is the same reason neural networks generally run on GPU: each operation is a matrix-vector multiply, and GPUs execute matrix multiplications orders of magnitude faster than CPUs for large matrices.
A modern H100 delivers on the order of 1,000 TFLOPS of dense FP16 throughput, and its tensor cores are specifically designed for the matmul pattern. Each clock cycle of the neural CPU corresponds to a forward pass through a fixed weight matrix, which is a single large matrix-vector product mapping directly onto what the GPU’s SIMT architecture does best.
The bottleneck is serial dependency. A CPU program is a sequence of instructions where instruction N typically depends on the result of instruction N-1. The GPU cannot batch across clock cycles because of this dependency chain; every cycle requires the full result of the previous one before it can begin. The GPU’s massive parallelism only helps within a single clock cycle, not across them.
Where this becomes interesting is parallel instantiation. Running ten thousand independent copies of the same CPU simultaneously on a single GPU is efficient, because each clock cycle is now a batched matrix multiply with batch size ten thousand. Hardware verification and simulation workloads often require exactly this: running the same instruction stream across many independent initial states to verify deterministic behavior or sweep through a parameter space. For that use case, the GPU substrate makes genuine sense.
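The batching argument can be demonstrated with shapes alone. A sketch with random placeholder weights standing in for the fixed per-cycle matrix (dimensions are illustrative):

```python
import numpy as np

state_dim, n_cpus = 256, 10_000
rng = np.random.default_rng(0)

# Stand-in for the fixed weight matrix that implements one clock cycle.
W_cycle = rng.standard_normal((state_dim, state_dim)).astype(np.float32)

states = rng.standard_normal((n_cpus, state_dim)).astype(np.float32)

# One cycle for a single CPU instance: a matrix-vector product.
next_one = states[0] @ W_cycle.T

# One cycle for all 10,000 instances at once: a single (N, d) x (d, d) matmul.
next_all = states @ W_cycle.T
```

Each instance still advances one cycle per matmul, but the weight reads are amortized across the whole batch, which is where the GPU's throughput actually gets used.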
The Theoretical Context
The result that any boolean circuit is a threshold network is classical. What has changed recently is interest in the reverse question: what does it mean for a neural network to be a computer?
The RASP framework from Weiss, Goldberg, and Yahav (2021) showed that transformers can exactly implement RASP programs, a language where operations correspond to parallel computations over sequences. This connects transformers to the computational complexity class of bounded-depth circuits.
Looped Transformers as Programmable Computers (Giannou et al., 2023) went further: a single transformer block applied repeatedly to the same residual stream can simulate an arbitrary computer program. The residual stream acts as working memory, each attention head is a register read, and the MLP block computes the execute stage. The key-value cache of an autoregressive model, in this framing, is a writable memory that persists across cycles.
This is the same architecture nCPU implements, but via the direct boolean-circuit route rather than the attention-as-CAM route. Both paths arrive at the same place. The CPU is not fighting the neural network structure; it was always isomorphic to one.
Connection to Quantization Research
There is an unexpected convergence with recent quantization work. BitNet b1.58 (Microsoft Research, 2024) constrains transformer weights to {-1, 0, +1}, the exact value set that maps to threshold neuron logic. In that scheme, matrix multiplication collapses into conditional addition and zero-skipping, with no floating-point multiply required. The inference kernel ends up doing something structurally similar to what a gate-level simulation does.
The limit case of weight quantization is a network that operates in binary arithmetic, which is a boolean circuit, which is a CPU datapath. The line between a heavily quantized neural network and a traditional digital logic circuit is thinner than it appears. nCPU makes that boundary explicit by building from the logic side rather than the continuous-weights side.
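The collapse from matrix multiply to conditional addition is easy to verify directly (a toy ternary matrix, not BitNet's actual kernel):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.integers(-1, 2, size=(4, 8))   # ternary weights in {-1, 0, +1}
x = rng.standard_normal(8)

# The standard matmul...
y_matmul = W @ x

# ...reduces to adding inputs where the weight is +1, subtracting where it
# is -1, and skipping zeros: no floating-point multiplies anywhere.
y_adds = np.array([x[row == 1].sum() - x[row == -1].sum() for row in W])
```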
What This Is For
nCPU is not a replacement for running programs on real hardware. The latency difference is enormous: a native CPU instruction completes in fractions of a nanosecond; a matrix-multiply-per-cycle implementation running on GPU incurs microseconds of overhead per cycle. No workload that cares about throughput would use this approach for serial computation.
What it demonstrates is the degree to which computation and neural networks are the same underlying mathematics. CPUs were designed with transistors because transistors implement threshold logic cheaply and at scale. Neural networks are built from the same primitives. Running a CPU inside a neural network that runs on a GPU built to accelerate neural networks is less a technical paradox and more a consequence of the fact that these systems were never as different as their histories suggested.
The practical applications, where they exist, are in simulation and verification: cases where you want to run many CPU instances in parallel and leverage the GPU’s throughput for the batch dimension. The conceptual application is more durable. Every time the boundary between classical computing and learned systems blurs a little further, it is worth tracing the mathematics to understand why the boundary was ever thought to be sharp.