Computation as Linear Algebra: How nCPU Builds a CPU from Neural Network Weights
Source: lobsters
The idea behind nCPU is straightforward to state and strange to contemplate: implement a CPU entirely as a neural network, with all computation encoded as weights, and run the whole thing as a forward pass on a GPU. The project makes no claims to performance efficiency; it is a precise demonstration of something the theoretical literature has known since 1943: that a suitably constructed neural network can simulate any boolean circuit exactly, and by extension, any computation.
The McCulloch-Pitts Foundation
The lineage starts with Warren McCulloch and Walter Pitts, who showed in their 1943 paper “A Logical Calculus of the Ideas Immanent in Nervous Activity” that threshold neurons, connected with appropriate weights, can compute any boolean function. The proof is constructive. A single neuron with inputs x₁ and x₂, weights [1, 1], bias -1.5, and a Heaviside step activation computes AND. Change the bias to -0.5 and you get OR. A neuron with weight -1 and bias 0.5 computes NOT. From those three primitives you can build any combinational circuit, and from any combinational circuit you can build a CPU.
The encoding is mechanical. A one-bit half-adder requires an XOR gate (two layers: a hidden layer of two neurons and one output neuron) and a carry bit computed by AND. A ripple-carry adder for n-bit integers chains these across bit positions. An ALU adds a multiplexer layer that selects among operations based on opcode inputs. The weights for all of this can be computed directly from truth tables, with no gradient descent involved.
Here is what a two-input AND gate looks like as a PyTorch layer, to make the pattern concrete:
import torch
import torch.nn as nn

class ANDGate(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(2, 1, bias=True)
        # Weights and bias come straight from the truth table;
        # no training is involved.
        with torch.no_grad():
            self.linear.weight.copy_(torch.tensor([[1.0, 1.0]]))
            self.linear.bias.copy_(torch.tensor([-1.5]))

    def forward(self, x):
        # Heaviside step: fires only when both inputs are 1 (1 + 1 - 1.5 >= 0)
        return (self.linear(x) >= 0).float()
Every gate in nCPU is a variation of this pattern. The CPU is a composition of such modules, from gates to half-adders to full adders to an ALU, with registers held as recurrent state passed between forward passes.
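The same hand-weighting extends one level up the hierarchy. As a sketch of the half-adder described above (the class name and layer layout here are illustrative, not taken from nCPU's source): the hidden layer computes OR and AND of the two input bits, the output neuron combines them into XOR, and the AND neuron doubles as the carry bit.

```python
import torch
import torch.nn as nn

def step(x):
    # Heaviside threshold activation
    return (x >= 0).float()

class HalfAdder(nn.Module):
    """Sum = XOR(a, b), Carry = AND(a, b), all weights hand-set."""
    def __init__(self):
        super().__init__()
        self.hidden = nn.Linear(2, 2)
        self.sum_out = nn.Linear(2, 1)
        with torch.no_grad():
            # hidden neuron 0 computes OR, neuron 1 computes AND
            self.hidden.weight.copy_(torch.tensor([[1.0, 1.0], [1.0, 1.0]]))
            self.hidden.bias.copy_(torch.tensor([-0.5, -1.5]))
            # XOR = OR AND (NOT AND)
            self.sum_out.weight.copy_(torch.tensor([[1.0, -1.0]]))
            self.sum_out.bias.copy_(torch.tensor([-0.5]))

    def forward(self, x):
        h = step(self.hidden(x))
        s = step(self.sum_out(h))
        carry = h[..., 1:2]  # the AND neuron is also the carry bit
        return s, carry
```

Chaining these (plus OR gates to merge carries) across bit positions yields the ripple-carry adder, exactly as the truth-table construction promises.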
The Substrate Mismatch
Modern GPUs execute float32 or float16 matrix multiplications at high throughput. An NVIDIA A100 sustains around 312 TFLOPS in float16. In nCPU, each of those floating-point multiplications computes a single bit of boolean logic. You are spending 16 bits of precision to represent a value that is either 0 or 1, which wastes most of the information capacity of each operation.
Beyond precision waste, logical depth compounds the problem. A typical simple CPU might have a critical path of 20 to 30 gate delays per clock cycle for operations like integer multiplication. In nCPU’s model, each gate delay corresponds to at minimum one matrix multiplication layer, so a single clock cycle requires tens of sequential matrix multiplications. GPU parallelism does not reduce that sequential depth; it only helps parallelize across independent instances.
That last point is where the GPU argument holds up. Batching the input lets you simulate thousands of independent nCPU instances simultaneously, each executing different programs or the same program on different data. The per-instance latency is high; the aggregate throughput across a large batch is substantial. For workloads requiring many independent CPU-like computations running in parallel, this maps reasonably well onto GPU architecture, resembling how GPU-based emulators achieve throughput not through fast single-core execution but through volume.
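A rough sketch of that batching argument, reusing the hand-weighted AND gate pattern from earlier (the batch size is arbitrary): one matrix multiplication evaluates the gate for thousands of independent instances at once.

```python
import torch
import torch.nn as nn

# Hand-weighted AND gate, as before
linear = nn.Linear(2, 1)
with torch.no_grad():
    linear.weight.copy_(torch.tensor([[1.0, 1.0]]))
    linear.bias.copy_(torch.tensor([-1.5]))

# 4096 independent "instances", each with its own pair of input bits
batch = torch.randint(0, 2, (4096, 2)).float()

# One matmul computes 4096 gate evaluations; per-instance latency is
# unchanged, but aggregate throughput scales with batch size.
out = (linear(batch) >= 0).float()
```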
Prior Art
nCPU is not the first project in this space, and the research lineage is worth knowing.
Neural Turing Machines (Graves et al., DeepMind, 2014) took the opposite approach: rather than hand-coding weights for known logic, they built a differentiable architecture with a neural controller and an external memory tape accessed through soft attention, then trained the whole system to learn algorithms from input-output examples. NTMs could learn to copy sequences, sort lists, and perform associative recall. The follow-up Differentiable Neural Computer (Nature, 2016) added dynamic memory allocation and extended the approach to more complex algorithmic tasks.
Neural Programmer-Interpreters (Reed and De Freitas, 2016) went further, using a recurrent core that could compose learned sub-programs hierarchically. The system learned an interpreter capable of executing programs represented as learned primitives, without explicit symbolic scaffolding.
The most directly relevant research is the Neural Arithmetic Logic Unit (Trask et al., NeurIPS 2018). NALUs are neural modules designed to learn systematic numerical operations that generalize outside the training distribution. A standard network trained to add numbers in the range [1, 100] fails badly at inputs in [1000, 10000]; a NALU does not, because the module uses a structured weight configuration designed to represent operations like addition and multiplication explicitly rather than approximating them with arbitrary learned functions.
nCPU sits at one end of a spectrum, with NTMs and NALUs toward the other. nCPU hand-codes weights to implement known boolean functions exactly; NTMs and NALUs attempt to learn programs or arithmetic from data. Both approaches are asking the same underlying question: what does it look like when a neural network computes?
Control Flow Without Branching
The framing of “runs completely on GPU” is worth unpacking. For a conventional CPU emulator, GPU execution is typically awkward: branch-heavy control flow, irregular memory access patterns, and sequential state dependencies are all pathological for GPU parallelism. nCPU sidesteps this by encoding control flow into the network structure itself. The program counter, conditional branching, and instruction decode all live inside the weight matrices. There is no conditional dispatch on the GPU; there are only matrix multiplications and activation functions.
Every path through the conditional logic is evaluated simultaneously as part of the forward pass, with the correct result selected by gating weights. This is the same principle behind attention mechanisms and gating in LSTMs, applied to CPU architecture. The clock cycle becomes an unconditional computation: for a minimal instruction set architecture, every forward pass computes all possible next states and selects the right one.
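A minimal sketch of that gating principle, assuming a 2-to-1 multiplexer built from threshold neurons (the layout is illustrative, not nCPU's actual decode logic): both data inputs are always evaluated, and the select bit steers which one reaches the output, with no data-dependent branch anywhere.

```python
import torch

def step(x):
    # Heaviside threshold activation
    return (x >= 0).float()

# 2-to-1 multiplexer: out = a if sel == 0 else b.
# Input vector is [a, b, sel]; hidden neurons compute both gated branches.
W_hidden = torch.tensor([[1.0, 0.0, -1.0],   # a AND (NOT sel)
                         [0.0, 1.0,  1.0]])  # b AND sel
b_hidden = torch.tensor([-0.5, -1.5])
W_out = torch.tensor([[1.0, 1.0]])           # OR of the two branches
b_out = torch.tensor([-0.5])

def mux(x):
    h = step(x @ W_hidden.T + b_hidden)
    return step(h @ W_out.T + b_out)
```

Scaling the same construction to many select lines is what lets instruction decode and next-state selection live entirely inside the weight matrices.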
For small instruction sets this is manageable. For complex architectures, the cost of evaluating every possible next state on every cycle grows combinatorially with the instruction set. This constraint almost certainly shapes nCPU’s choice of a simple, reduced instruction set; the full combinational complexity of x86 would produce a network too deep and wide to run practically.
What This Demonstrates
Projects like nCPU function as foundational exercises. By re-expressing CPU logic in the language of neural networks, they make visible an equivalence that theory has always known: boolean logic and matrix multiplication are different implementations of the same underlying structure, with different performance profiles on different hardware substrates.
While the practical applications are limited to demonstration and education, the theoretical grounding that makes nCPU possible carries real weight. When researchers claim that transformers develop “internal algorithms” or that large language models learn to “simulate computation,” the credibility of those claims rests on this foundation. nCPU demonstrates concretely that computation can live inside weights, encoded exactly rather than approximated. The gap between a hand-coded boolean CPU and a learned computational system reduces to a question of how weights were set, which is a tractable problem once you accept that the weights encoding exact computation can exist at all.