Ten Thousand Programs at Once: The Real Use Case for a Neural Network CPU
Source: lobsters
Robert C. Price’s nCPU project has been making the rounds with a headline that sounds like a thought experiment: a CPU implemented as a neural network, running entirely on a GPU. The instinct is to evaluate it as a CPU, note that it is orders of magnitude slower than silicon for any single program, and move on. That evaluation is correct and also misses the point.
The more interesting question is what you get when you express a CPU as a composition of tensor operations. The answer is not a faster CPU. It is the ability to run ten thousand programs simultaneously, with near-zero marginal cost per additional instance, and the option to differentiate through every instruction cycle if you need to.
The CPU as a Tensor Program
The mapping from CPU state to matrix representations is more concrete than it might seem. Consider a minimal in-order processor. The program counter is a one-hot vector of length mem_size, where exactly one entry is 1.0 and the rest are 0.0, pointing to the current instruction address. Fetching the instruction at that address is a matrix product: instruction = PC @ M, where M is the memory matrix. The one-hot vector selects a single row, which is the instruction word. No branching logic, no address decoding, just a dot product.
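The fetch step fits in a few lines of NumPy. This is an illustrative sketch, not nCPU's actual code; the memory contents and sizes are invented:

```python
import numpy as np

# Toy memory: 8 addresses, 16-bit instruction words stored as bit vectors.
mem_size, word_size = 8, 16
rng = np.random.default_rng(0)
M = rng.integers(0, 2, size=(mem_size, word_size)).astype(np.float32)

# Program counter as a one-hot vector pointing at address 3.
pc = np.zeros(mem_size, dtype=np.float32)
pc[3] = 1.0

# Fetch is a single matrix product: the one-hot row selects M[3, :].
instruction = pc @ M
assert np.array_equal(instruction, M[3])
```

Because the selector entry is exactly 1.0 and the rest exactly 0.0, the product reproduces the stored row bit for bit.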
The register file is a matrix: R of shape (num_registers, word_size). Reading register r is R[r, :], selected via another one-hot vector. Writing back to a register after an ALU operation is an outer product: R = R + (reg_select)^T * result, where reg_select is one-hot over register indices. (Strictly, that update accumulates into the old row; an overwrite additionally masks out the old row with the same one-hot vector.) This structure is directly differentiable, which becomes relevant later.
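The register-file read and write can be sketched the same way (sizes and values here are illustrative):

```python
import numpy as np

num_registers, word_size = 4, 8
rng = np.random.default_rng(0)
R = rng.integers(0, 2, size=(num_registers, word_size)).astype(np.float32)

# One-hot selector over register indices: register 2.
reg_select = np.zeros(num_registers, dtype=np.float32)
reg_select[2] = 1.0

# Read: the one-hot vector picks out a single row.
value = reg_select @ R
assert np.array_equal(value, R[2])

# Write-back as a rank-1 update. Masking with (1 - reg_select) first
# makes this an overwrite of the selected row, not an accumulation.
result = np.array([1, 0, 1, 1, 0, 0, 1, 0], dtype=np.float32)
old = R.copy()
R = (1.0 - reg_select)[:, None] * R + np.outer(reg_select, result)
assert np.array_equal(R[2], result)
assert np.array_equal(R[[0, 1, 3]], old[[0, 1, 3]])
```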
The instruction decode stage works similarly. A fixed weight matrix W_decode projects the instruction word into a set of control signals, one per output dimension: ALU operation select, source register selects, destination register select, immediate value, memory access type. The control signal vector is control = instruction @ W_decode, and because this is a fixed projection rather than a learned one, the weights are set deterministically by whatever ISA is being implemented.
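A toy decode matrix makes the fixed projection concrete. The 8-bit instruction layout below is invented for illustration and is not nCPU's ISA:

```python
import numpy as np

# Toy layout: an 8-bit instruction word with four 2-bit fields:
# opcode, source reg, destination reg, immediate.
word_size, num_fields = 8, 4
W_decode = np.zeros((word_size, num_fields), dtype=np.float32)
for f in range(num_fields):
    W_decode[2 * f,     f] = 1.0  # low bit of field f
    W_decode[2 * f + 1, f] = 2.0  # high bit of field f

# Instruction 0b10_01_11_00 as a bit vector, least significant bit first.
instruction = np.array([0, 0, 1, 1, 1, 0, 0, 1], dtype=np.float32)

# Decode is one fixed matrix product; each output is an integer field value.
control = instruction @ W_decode
assert np.array_equal(control, [0.0, 3.0, 1.0, 2.0])
```

The weights encode the ISA's bit layout directly; nothing here is learned.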
The arithmetic operations are where gate-level thinking becomes visible. A single-bit AND gate is a neuron with weights [1, 1] and bias -1.5, using a hard step activation: output is 1 if the weighted sum exceeds zero, else 0. OR is the same network with bias -0.5. NOT is a single weight of -1 with bias 0.5. XOR requires one hidden layer because it is not linearly separable, which is the exact observation that made Minsky and Papert’s 1969 critique of perceptrons so influential at the time.
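These gate definitions can be checked directly. A minimal sketch using a hard step activation, with XOR built from the other three gates via one hidden layer:

```python
import numpy as np

def step(x):
    # Hard threshold activation: 1 if the pre-activation exceeds zero.
    return (np.asarray(x) > 0).astype(np.float32)

def and_gate(a, b):
    return step(1.0 * a + 1.0 * b - 1.5)

def or_gate(a, b):
    return step(1.0 * a + 1.0 * b - 0.5)

def not_gate(a):
    return step(-1.0 * a + 0.5)

def xor_gate(a, b):
    # Not linearly separable: XOR = (a OR b) AND NOT (a AND b).
    return and_gate(or_gate(a, b), not_gate(and_gate(a, b)))

for a in (0.0, 1.0):
    for b in (0.0, 1.0):
        assert xor_gate(a, b) == float(int(a) ^ int(b))
```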
This is the full CPU fetch-decode-execute cycle as a pipeline of matrix multiplications, outer products, and elementwise threshold activations. None of the stages require control flow in the host language. Each instruction cycle is a composition of pure tensor operations.
The Soft Addressing Problem
The representation above works cleanly for registers and the instruction stream, but memory access is harder. Reading an arbitrary memory address using a one-hot vector requires that vector to be exact: one entry is exactly 1.0, all others exactly 0.0. In floating-point arithmetic on a GPU, nothing guarantees the vector stays exactly one-hot once it has been transformed, particularly by the arithmetic a conditional branch applies to the program counter.
The Neural Turing Machine paper from Graves, Wayne, and Danihelka at DeepMind in 2014 chose a different path: soft attention over all memory locations simultaneously. Rather than addressing a single row of the memory matrix, the NTM computes a weighted sum over all rows, where the weights are produced by a content-based similarity operation and a softmax normalization. This is differentiable end-to-end, which is exactly what makes NTM trainable, but it also introduces noise. If the softmax weights are [0.001, 0.002, 0.994, 0.003, ...], the read value is mostly correct but contaminated by the other rows. As memory grows larger, the contamination from small but nonzero weights accumulates, and read accuracy degrades.
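A small numeric example shows the contamination. The memory contents and similarity scores below are made up:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Memory with 4 rows of width 2; the "intended" read target is row 2.
M = np.array([[0., 0.], [10., 0.], [1., 1.], [0., 10.]])

# Content-based similarity scores strongly favoring row 2, then softmax.
scores = np.array([0., 1., 8., 1.])
w = softmax(scores)

read = w @ M   # weighted sum over ALL rows
exact = M[2]   # what a hard one-hot read would return

# The soft read is close to row 2 but contaminated by every other row.
assert not np.array_equal(read, exact)
assert np.linalg.norm(read - exact) < 0.1
```

Even with the winning weight at roughly 0.998, the read picks up a small contribution from the large values in rows 1 and 3; with more rows, or less peaked scores, that error grows.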
The Differentiable Neural Computer, published in Nature in 2016, extended this with temporal link matrices and usage tracking, making the memory management more expressive but not fundamentally resolving the soft-addressing noise problem.
One proposed resolution is Sparsemax, a replacement for softmax that maps its input to a sparse probability distribution, where most entries are exactly zero rather than small positive numbers. With Sparsemax, a memory read can in principle be exact, and the operation remains differentiable in the regions where the output has nonzero support. This is a better match for the semantics of real memory access, but it is also harder to train because the gradient signal is zero for most addresses on most steps.
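A minimal sparsemax implementation, following Martins and Astudillo's projection-onto-the-simplex formulation, shows the exact zeros:

```python
import numpy as np

def sparsemax(z):
    # Euclidean projection of z onto the probability simplex
    # (Martins & Astudillo, 2016); most outputs are exactly zero.
    z_sorted = np.sort(z)[::-1]
    k = np.arange(1, len(z) + 1)
    cumsum = np.cumsum(z_sorted)
    support = 1 + k * z_sorted > cumsum
    k_star = k[support][-1]
    tau = (cumsum[k_star - 1] - 1) / k_star
    return np.maximum(z - tau, 0.0)

scores = np.array([0.1, 0.2, 3.0, 0.3])
w = sparsemax(scores)

# Unlike softmax, the low-scoring entries are exactly 0.0, so a memory
# read with these weights touches only the winning row.
assert np.isclose(w.sum(), 1.0)
assert np.count_nonzero(w) == 1
assert w[2] == 1.0
```

With a sufficiently dominant score, the output collapses to a true one-hot vector, which is what an exact memory read needs.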
For nCPU specifically, using exact integer arithmetic with a hard one-hot program counter avoids the soft-addressing problem entirely at the cost of differentiability. The two requirements pull in opposite directions, which is the core tension in this design space.
The Carry Propagation Bottleneck
Arithmetic operations reveal another structural cost. Adding two N-bit numbers requires N full adder circuits, each computing a sum bit and a carry bit, where each carry depends on the previous carry. This is a ripple-carry adder, and it is inherently sequential: the carry from bit position 0 must be available before bit position 1 can complete, and so on through all N positions.
In a neural network, each layer in the carry chain adds one level of depth to the computation graph. A 32-bit addition is 32 layers of carry propagation. For a 64-bit processor, that is 64 layers just for addition. The network depth grows with bit width, which means the number of matrix operations per instruction cycle grows accordingly. This is one reason that even a single instruction execution involves far more computation than it appears from the high-level description.
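A ripple-carry adder built only from threshold units makes the depth explicit: the Python loop below corresponds to N sequential layers of the network, one per bit position (a sketch, not nCPU's implementation):

```python
import numpy as np

def step(x):
    return (np.asarray(x) > 0).astype(np.float32)

def full_adder(a, b, cin):
    # Threshold neurons: carry = at least two of {a, b, cin} set;
    # sum = odd parity of the three inputs.
    carry = step(a + b + cin - 1.5)
    s = step(a + b + cin - 2.0 * carry - 0.5)
    return s, carry

def ripple_add(a_bits, b_bits):
    # N sequential layers: the carry out of position i feeds position i+1.
    carry = 0.0
    out = np.zeros_like(a_bits)
    for i in range(len(a_bits)):  # LSB first
        out[i], carry = full_adder(a_bits[i], b_bits[i], carry)
    return out

def to_bits(n, width):
    return np.array([(n >> i) & 1 for i in range(width)], dtype=np.float32)

def from_bits(bits):
    return sum(int(b) << i for i, b in enumerate(bits))

assert from_bits(ripple_add(to_bits(13, 8), to_bits(29, 8))) == 42
```

The loop cannot be parallelized away: each iteration consumes the carry the previous one produced, which is exactly the depth cost described above.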
Carry-lookahead adders reduce this depth logarithmically by precomputing carry propagation and generate signals for groups of bits in parallel, but expressing carry-lookahead as a neural network requires a more complex topology than the simple chain structure of ripple carry. The depth scales as O(log N) instead of O(N), at the cost of wider layers.
Why Batch Execution Changes the Calculation
Here is where the GPU’s hardware characteristics become relevant in a non-obvious way. GPU computation is optimized for throughput across batch dimensions, not latency on individual operations. Running one program instance through 32 layers of carry propagation might take the same wall-clock time as running a thousand program instances through the same 32 layers, because the GPU schedules all instances as independent rows of a batch, and the matrix operations are nearly the same cost at batch size 1 as at batch size 1024.
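The ripple-carry addition described above illustrates the point once batched: adding a leading batch dimension to every tensor leaves the loop over bit positions (the depth) unchanged while every iteration processes all instances at once (an illustrative NumPy sketch):

```python
import numpy as np

def step(x):
    return (np.asarray(x) > 0).astype(np.float32)

batch, width = 1000, 8
rng = np.random.default_rng(1)
a = rng.integers(0, 128, size=batch)
b = rng.integers(0, 128, size=batch)

# Each operand set becomes a (batch, width) bit matrix; every tensor op
# below then handles all 1000 instances with no per-instance loop.
A = ((a[:, None] >> np.arange(width)) & 1).astype(np.float32)
B = ((b[:, None] >> np.arange(width)) & 1).astype(np.float32)

carry = np.zeros(batch, dtype=np.float32)
S = np.zeros((batch, width), dtype=np.float32)
for i in range(width):  # depth = width; the batch dimension rides free
    carry_out = step(A[:, i] + B[:, i] + carry - 1.5)
    S[:, i] = step(A[:, i] + B[:, i] + carry - 2.0 * carry_out - 0.5)
    carry = carry_out

sums = (S * (2.0 ** np.arange(width))).sum(axis=1).astype(np.int64)
assert np.array_equal(sums, a + b)
```

The sequential cost is the 8 carry steps; whether the batch holds 1 row or 1000 changes only the width of each matrix operation, which is the dimension GPUs are built to absorb.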
This changes what nCPU is useful for. As a single-program CPU, it is uncompetitive with anything. As a substrate for running thousands of distinct program traces simultaneously, the economics shift. Applications that need to evaluate many program variations concurrently are natural fits:
Evolutionary program synthesis. Maintain a population of candidate programs as different register-file and memory initializations. Run all of them forward in one batched tensor operation. Score the outputs, select for fitness, mutate, repeat. Each generation evaluates the entire population in a single GPU pass rather than sequentially.
Monte Carlo tree search over instruction sequences. Build a search tree over possible programs by sampling instruction sequences and evaluating their outputs. MCTS requires many rollouts, each a full program execution. Batching the rollouts on the GPU makes this tractable at scales that sequential execution cannot reach.
Formal verification of behavioral equivalence. Testing whether two programs produce the same output on a large set of inputs normally requires running each program on each input independently. As a tensor operation, you can test all (program, input) pairs in one shot by batching both dimensions.
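The batched-comparison idea can be sketched with linear maps standing in for programs (purely illustrative; real programs would be full execution traces, but the broadcasting structure is the same):

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy stand-in for "programs": each program p is a linear map W[p] over
# 4-dimensional inputs. Program 3 in the second set differs slightly.
P, I, n = 5, 100, 4
W_a = rng.standard_normal((P, n, n))
W_b = W_a.copy()
W_b[3] += 0.01 * rng.standard_normal((n, n))

X = rng.standard_normal((I, n))  # the shared input set

# All (program, input) pairs in one shot: outputs have shape (P, I, n).
out_a = np.einsum('pij,kj->pki', W_a, X)
out_b = np.einsum('pij,kj->pki', W_b, X)

# One boolean per program: does it match its counterpart on every input?
equivalent = np.all(np.isclose(out_a, out_b), axis=(1, 2))
assert not equivalent[3]
assert equivalent[[0, 1, 2, 4]].all()
```

No loop over programs or inputs appears anywhere; the full (program, input) grid is one einsum per program set plus one elementwise comparison.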
The Neural Arithmetic Logic Unit paper from Trask and colleagues at Google Brain in 2018 was motivated by a related observation: neural networks trained to perform arithmetic operations fail to generalize outside their training distribution. NALU addressed this by constraining weights through learned sigmoid-tanh products, forcing the network toward the exact {-1, 0, 1} weight values that implement real arithmetic. Subsequent refinements like iNALU tightened the constraint further. The through-line from NALU to nCPU is the same idea: exact arithmetic requires exact discrete operations, not approximations.
The Differentiable Path and What It Enables
If you relax the requirement for bit-exact correctness, the system becomes differentiable. Replace the hard step activation with a smooth approximation, like a steeply sloped sigmoid, and gradients can flow backward through every gate in every instruction cycle. The straight-through estimator from Bengio et al. provides another path: use the hard threshold in the forward pass but approximate its gradient as 1.0 in the backward pass, which is incorrect but empirically useful for training binarized networks.
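A minimal numeric sketch of the two gradient paths (the input values and upstream gradients are arbitrary):

```python
import numpy as np

def hard_step(x):
    return (x > 0).astype(np.float64)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.array([-0.3, 0.2, 1.5])
upstream = np.array([0.5, -1.0, 2.0])

# Straight-through: forward uses the exact hard gate...
y = hard_step(x)
# ...but backward pretends d(step)/dx = 1, passing the gradient through.
grad_ste = upstream * 1.0

# The smooth alternative: a steep sigmoid (slope k) whose true gradient
# vanishes for inputs far from the threshold.
k = 10.0
grad_sigmoid = upstream * k * sigmoid(k * x) * (1.0 - sigmoid(k * x))

assert np.array_equal(y, [0.0, 1.0, 1.0])
assert np.array_equal(grad_ste, upstream)
assert abs(grad_sigmoid[2]) < 1e-3  # saturated: almost no signal
```

The trade is visible in the last line: the steep sigmoid is a faithful local approximation but starves saturated units of gradient, while the straight-through estimator keeps the signal flowing at the cost of using a gradient that is formally wrong.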
With a differentiable CPU, you can define a loss on the program’s output and backpropagate through the instruction execution to optimize either the weights implementing the ISA or the program itself. This is program synthesis by gradient descent, which is an active research area with several distinct approaches. The Neural Programmer-Interpreters work from Reed and de Freitas at ICLR 2016 trained networks to generalize execution traces, learning program behavior from examples. What nCPU adds to that tradition is the explicit gate-level structure: rather than learning behavior in some implicit hidden-state space, the network is constrained to implement recognizable Boolean operations.
The Deep Differentiable Logic Gate Networks paper from Petersen et al. at NeurIPS 2022 approaches the same boundary from the training direction: parameterize each gate as a continuous relaxation over all 16 possible two-input Boolean functions, train discriminatively, then snap each gate to its maximum-weight function. The result is a trained network that compiles down to an actual logic circuit. nCPU is the inverse: start from a known-correct circuit, express it in tensor notation. Together these projects bracket the design space from both ends.
A CPU Is a Function
The useful reframe is to treat nCPU not as a CPU implementation but as a demonstration that CPU execution is a computable function in the mathematical sense, where that function is implemented as a composition of differentiable tensor operations. The same infrastructure used to train and run neural networks can host an instruction set architecture. The ISA becomes serializable as a model checkpoint, transmittable across networks, loadable on any GPU. Running one instance of that CPU costs roughly the same as running ten thousand instances in a batch, which is a property that no physical CPU has ever had.
This is not a replacement for semiconductor engineering. It is a different point in the design space, one where the interesting axis is not clock speed but batch throughput and differentiability. That point in the space has applications that conventional hardware and emulation do not reach well, and nCPU makes it concrete in a way that prior work mostly left theoretical.