When the CPU Becomes a Forward Pass: Neural Networks as Computer Architecture
Source: lobsters
The nCPU project does something that sounds like a stunt: implement a CPU using neural networks and run it entirely on a GPU. The practical use case is not obvious. A GPU-backed neural CPU will never outperform a real processor at sequential computation. But that framing misses the point entirely. nCPU sits at the intersection of two research traditions that have been converging for over a decade, and understanding why anyone would build this requires knowing both.
The Differentiable Computing Lineage
The foundational paper is Alex Graves, Greg Wayne, and Ivo Danihelka’s Neural Turing Machine from 2014. The NTM coupled a recurrent neural network controller with an external memory matrix, where reads and writes used soft attention rather than discrete addressing. Every operation was a weighted sum over memory locations, with weights produced by a softmax over content-similarity scores. The result was a system that could learn, purely from input/output examples, tasks that require something like a working memory: copying sequences, sorting, and associative recall.
The critical property was differentiability. Because every memory access was a soft blend rather than a hard address lookup, gradients could flow backward through the entire execution trace. You could train the system end-to-end by gradient descent without specifying the algorithm in advance.
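The content-based read at the heart of this can be sketched in a few lines of NumPy. This is a minimal illustration, not the NTM's full addressing scheme (which also includes location-based shifts and sharpening); the sharpness parameter `beta` here stands in for the NTM's learned key strength:

```python
import numpy as np

def content_read(memory: np.ndarray, key: np.ndarray, beta: float = 10.0) -> np.ndarray:
    """NTM-style content-based read: softmax over cosine similarity,
    then a weighted sum over memory rows. Fully differentiable."""
    # Cosine similarity between the key and each memory row.
    sims = memory @ key / (np.linalg.norm(memory, axis=1) * np.linalg.norm(key) + 1e-8)
    # beta controls how close the attention weights get to one-hot.
    w = np.exp(beta * sims)
    w /= w.sum()
    return w @ memory  # soft blend of rows, not a hard lookup

memory = np.array([[1.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0],
                   [0.0, 0.0, 1.0]])
r = content_read(memory, np.array([1.0, 0.1, 0.0]))
# With a high beta, r is dominated by the best-matching row.
```

Because every step here is a smooth function (matmul, exp, division), gradients flow through the read, which is exactly the property the next paragraph turns on.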
DeepMind followed in 2016 with the Differentiable Neural Computer, which extended the NTM with temporal link matrices that tracked the order in which memory locations were written. The DNC could answer questions about graph structures and solve navigation tasks by learning programs that used memory in structured ways.
Both systems blurred the line between neural network and computer. They had memory, addressing logic, and controllers that resembled CPUs. But they were trained end-to-end for specific tasks rather than implementing a defined instruction set.
The Arithmetic Problem
A key obstacle in all of this is that standard neural networks are bad at arithmetic. They approximate it, sometimes well, but systematic generalization breaks down. A network trained on sums up to 100 does not reliably extrapolate to sums up to 1000. The weights that produce good in-distribution behavior are not structured in a way that encodes the underlying operation.
Neural Arithmetic Logic Units (NALU), from Trask et al. at DeepMind in 2018, attacked this directly. The core idea is the Neural Accumulator (NAC), where the weight matrix is constrained to values near -1, 0, or 1 by parameterizing it as the elementwise product W = tanh(W_hat) * sigmoid(M_hat). This forces the network to learn identity-like or negation-like transformations rather than arbitrary real-valued weights, which is what you want for counting and accumulation.
The NALU layer extends the NAC to handle multiplication through a log-space trick: represent products as sums of logarithms, apply the NAC, then exponentiate. A learned gate interpolates between the additive and multiplicative paths, so a single layer covers the four basic arithmetic operations with networks that generalize well outside their training range.
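A minimal forward-pass sketch of the NAC/NALU idea in NumPy. The parameters here are hand-set to saturate the constraint rather than learned, and the gate `g` is passed in as a constant; in the actual NALU both are learned from data:

```python
import numpy as np

def nac_weights(w_hat: np.ndarray, m_hat: np.ndarray) -> np.ndarray:
    # Constrains effective weights toward {-1, 0, 1}:
    # tanh saturates at +/-1, sigmoid gates each entry toward 0 or 1.
    return np.tanh(w_hat) * (1.0 / (1.0 + np.exp(-m_hat)))

def nalu_forward(x, w_hat, m_hat, g):
    W = nac_weights(w_hat, m_hat)
    add_path = x @ W                               # sums and differences
    # Multiplicative path: products as sums of logs, then exponentiate.
    mul_path = np.exp(np.log(np.abs(x) + 1e-8) @ W)
    return g * add_path + (1 - g) * mul_path

# Large w_hat/m_hat saturate the parameterization so W ~ [[1], [1]]:
# the additive path computes x0 + x1, the multiplicative path x0 * x1.
w_hat = np.full((2, 1), 10.0)
m_hat = np.full((2, 1), 10.0)
x = np.array([[3.0, 4.0]])
s = nalu_forward(x, w_hat, m_hat, g=1.0)  # ~ 3 + 4
p = nalu_forward(x, w_hat, m_hat, g=0.0)  # ~ 3 * 4
```

The saturated weights are what makes extrapolation work: once W is effectively exactly 1, the layer computes the operation itself rather than a fit to it.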
Subsequent work like iNALU and Real NAC refined these constraints further, addressing instabilities in the original formulation. The research direction is clear: encoding the inductive biases of arithmetic directly into network architecture produces systems that actually compute rather than merely fitting functions.
How You Build a CPU From Tensors
With that context, the nCPU approach becomes intelligible. A conventional CPU maintains a small set of registers, a program counter, and memory. At each clock cycle it fetches an instruction, decodes control signals, executes an operation, accesses memory if needed, and writes results back. Each of these steps can be expressed as tensor operations.
The program counter can be a one-hot vector of length mem_size, pointing to exactly one memory location. Fetching the current instruction is then a matrix product: instruction = PC @ M, where M is the memory matrix. The result is a vector encoding the current instruction word.
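In NumPy, the fetch step is literally one matrix product (sizes here are arbitrary):

```python
import numpy as np

mem_size, word_size = 8, 16
M = np.random.rand(mem_size, word_size)  # memory: one instruction word per row

pc = np.zeros(mem_size)
pc[3] = 1.0                              # one-hot program counter at address 3

instruction = pc @ M                     # fetch: selects exactly row 3 of M
```

With a hard one-hot PC the product returns row 3 exactly; with a soft PC it would return a weighted blend of instruction words, which is where the addressing tension discussed later comes from.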
Decoding is a linear projection: control_signals = instruction @ W_decode, where the weight matrix maps instruction bit patterns to control signals. Because instruction encoding is a fixed mapping, this projection is exact rather than learned. It is just a lookup table expressed as matrix multiplication.
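A toy decode for a hypothetical two-opcode ISA. The opcode layout and control signal names here are invented for illustration; the point is only that a fixed matrix acts as an exact lookup table:

```python
import numpy as np

# Hypothetical ISA: the opcode is a one-hot pair in the first two entries
# of the instruction word. Row i of W_decode holds the control signals
# for opcode i, so decoding is an exact table lookup via matmul.
W_decode = np.array([
    [1.0, 0.0, 0.0],   # opcode 0 -> alu_add=1
    [0.0, 1.0, 1.0],   # opcode 1 -> alu_mul=1, write_mem=1
])

instruction = np.array([0.0, 1.0])        # encoded opcode 1
control_signals = instruction @ W_decode  # -> [0, 1, 1]
```

Nothing is learned here: W_decode is written down once from the instruction set specification, which is why the projection is exact.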
Register files are matrices. An instruction that reads register 3 uses a one-hot selection vector to extract the corresponding row. ALU operations on the selected values use NALU-style arithmetic modules for operations like addition and multiplication, and threshold-based activations for bitwise operations like AND, OR, and XOR.
Memory writes use the outer product: to write value v to address a (both represented as vectors), compute a^T v and add it to the memory matrix (a full overwrite also subtracts an erase term for the old contents first). Reads use the same attention mechanism as the NTM. The program counter advances by applying a shift operation to the one-hot vector.
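A sketch of the write and PC-advance steps in NumPy. This version includes an explicit erase term so that a hard one-hot write overwrites the addressed row rather than accumulating into it, and expresses the PC shift as a permutation matrix so it stays a matmul:

```python
import numpy as np

mem_size, word_size = 4, 3
M = np.zeros((mem_size, word_size))

a = np.zeros(mem_size); a[2] = 1.0   # one-hot write address
v = np.array([5.0, 6.0, 7.0])        # value to write

# Erase the old contents of the addressed row, then add the new value.
# With a hard one-hot address this overwrites row 2 exactly.
M = M - np.outer(a, a @ M) + np.outer(a, v)

# Advance the program counter: multiply the one-hot PC by a shift
# (permutation) matrix whose row i is the basis vector e_{i+1}.
pc = np.zeros(mem_size); pc[1] = 1.0
shift = np.roll(np.eye(mem_size), -1, axis=0)
pc = pc @ shift                      # one-hot at 1 -> one-hot at 2
```

Keeping the shift as a matrix (rather than an index increment) is what lets the PC update live inside the same tensor pipeline as everything else.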
The entire execution of one instruction cycle is a composition of matrix multiplications, elementwise operations, and softmax normalizations. On a GPU, this is native. Every operation maps directly to the primitives that modern deep learning frameworks like PyTorch and JAX are built on.
What Running on GPU Actually Means
The GPU execution aspect is less about performance and more about representation. When your CPU is a series of tensor operations, you can batch it. Running 1024 programs simultaneously takes roughly the same wall-clock time as running one, because the GPU parallelizes across the batch dimension essentially for free. For workloads that require simulating many program traces (evolutionary search over programs, Monte Carlo tree search over instruction sequences, large-scale program synthesis experiments), this matters enormously.
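To make the batching claim concrete, here is a batched fetch: 1024 one-hot program counters read from one shared program memory in a single matmul. Per-trace memories would use a batched matmul (einsum) instead, but the shape of the idea is the same:

```python
import numpy as np

batch, mem_size, word_size = 1024, 64, 32
M = np.random.rand(mem_size, word_size)   # shared program memory

# 1024 independent program counters, each one-hot at a random address.
addrs = np.random.randint(0, mem_size, size=batch)
PC = np.eye(mem_size)[addrs]              # shape (1024, 64)

# One matmul fetches the current instruction for every trace at once.
instructions = PC @ M                     # shape (1024, 32)
```

On a GPU this single operation saturates the hardware far better than 1024 sequential fetches ever could, which is the whole batching argument in miniature.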
GPU memory bandwidth also changes the tradeoff for memory-heavy workloads. The soft attention over memory that makes the whole system differentiable is O(n) in memory size per access, which is expensive. But on a GPU, that cost is absorbed into the parallel execution model in a way that sequential CPU code cannot match.
The deeper point is that representing a CPU as a neural network makes the execution trace itself a first-class object in the ML stack. You can differentiate through it. If you define a loss function on the output of a program execution, you can propagate gradients backward through every instruction, every register write, every memory access, and adjust either the program itself or the initial memory state to minimize the loss. This is program synthesis by gradient descent.
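A toy version of that loop, in NumPy rather than an autograd framework: because a soft read is just a matmul, the gradient of a squared-error loss with respect to the initial memory has a closed form, and plain gradient descent drives the memory toward a state whose "execution" (here, a single read step) produces the target output:

```python
import numpy as np

# One "execution step" is a soft read r = w @ M; the loss compares r
# to a target. dL/dM = outer(w, 2 * (r - target)) for L = ||r - target||^2.
mem_size, word_size = 4, 3
rng = np.random.default_rng(0)
M = rng.normal(size=(mem_size, word_size))   # initial memory state

w = np.array([0.7, 0.2, 0.1, 0.0])           # fixed soft read weights
target = np.array([1.0, -1.0, 0.5])

for _ in range(200):                          # gradient descent on memory
    r = w @ M
    grad = np.outer(w, 2.0 * (r - target))
    M -= 0.5 * grad

final = w @ M                                 # converges to the target
```

A real neural CPU would chain hundreds of such steps and let a framework's autograd handle the backward pass, but the principle is this one: every step is differentiable, so the initial state (or the program) is a trainable parameter.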
The Limits of Soft Addressing
The main tension in any differentiable CPU is between soft and hard addressing. Real CPUs use exact integer addresses. Soft attention is a weighted average over all memory locations, which is differentiable but introduces noise. For the system to behave correctly, attention weights need to be nearly one-hot, sharp enough that each read retrieves essentially a single location.
NTMs and DNCs relied on training pressure to sharpen attention. In practice this works for small memories but becomes increasingly unreliable as memory grows. Temperature scaling the softmax helps, but the fundamental problem is that soft addressing and exact computation are in tension. Alternatives like sparsemax (Martins and Astudillo, 2016) replace softmax with a sparse projection onto the probability simplex, producing exactly zero weight on most locations and allowing exact retrieval from the dominant address. Applying this to a neural CPU would bring its memory semantics closer to a real processor.
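Sparsemax itself is short. A NumPy sketch of the projection, following the sort-based algorithm from Martins and Astudillo (2016):

```python
import numpy as np

def sparsemax(z: np.ndarray) -> np.ndarray:
    """Euclidean projection of z onto the probability simplex.
    Unlike softmax, most output entries come out exactly zero."""
    z_sorted = np.sort(z)[::-1]
    cumsum = np.cumsum(z_sorted)
    k = np.arange(1, len(z) + 1)
    support = 1 + k * z_sorted > cumsum       # entries that stay active
    k_z = k[support][-1]                      # size of the support
    tau = (cumsum[support][-1] - 1) / k_z     # threshold
    return np.maximum(z - tau, 0.0)

scores = np.array([3.0, 1.0, 0.2, 0.1])
w = sparsemax(scores)
# The dominant score gets all the weight; the rest are exactly 0,
# so a read with these weights retrieves a single memory row exactly.
```

The output still sums to 1 and is still differentiable almost everywhere, but the exact zeros mean a memory read touches only the supported addresses, which is precisely the hard/soft compromise the paragraph above describes.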
nCPU represents the current state of an idea that has been building since 2014. The ingredients (differentiable memory, neural arithmetic, GPU tensor primitives) have been available for years. What projects like this demonstrate is that those ingredients compose into something coherent: a computer whose execution is a differentiable function. Whether that function gets used for program synthesis, meta-learning over program spaces, or something else entirely is an open question, but the substrate is now well understood.