When a CPU Is Just a Very Long Forward Pass

Source: lobsters

A CPU, at the register-transfer level, is a fixed computation graph. Each clock cycle takes an input state, applies a deterministic transformation defined by the current instruction, and produces a new state. That transformation, broken down far enough, is nothing but AND gates, OR gates, and a handful of latches. The nCPU project by Robert Price takes that observation to its logical end: implement every gate as a neuron, batch the neurons into tensor operations, and run the entire CPU as a sequence of forward passes on GPU hardware.

This is a demonstration rather than a research artifact, and the point it makes is sharp: the same tensor arithmetic that runs a transformer attention head can also run a fetch-decode-execute loop. The substrate does not care.

How Boolean Logic Becomes Tensor Arithmetic

The mathematical basis is classical. A single neuron with a Heaviside step activation can compute any unary or binary Boolean function given appropriate weights and bias. AND is a neuron with both input weights set to 1 and a bias of -1.5: the output exceeds the threshold only when both inputs are 1. OR uses a bias of -0.5. NOT is a single input with weight -1 and bias 0.5. NAND and NOR follow immediately. XOR requires one hidden layer, which is the exact example Minsky and Papert analyzed in Perceptrons (1969) to characterize the expressive limits of single-layer networks.
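The weight and bias settings above can be checked in a few lines. This is a minimal numpy sketch, not nCPU's actual code; `numpy` stands in for the GPU tensor framework, and the gate names are just illustrative:

```python
import numpy as np

def neuron(x, w, b):
    """A single Heaviside-activated neuron: fires iff w.x + b > 0."""
    return np.heaviside(np.dot(x, w) + b, 0.0)

# The weight/bias settings described in the text, one neuron per gate.
AND = lambda a, b: neuron(np.array([a, b]), np.array([1.0, 1.0]), -1.5)
OR  = lambda a, b: neuron(np.array([a, b]), np.array([1.0, 1.0]), -0.5)
NOT = lambda a:    neuron(np.array([a]),    np.array([-1.0]),      0.5)

# XOR needs one hidden layer: XOR(a, b) = AND(OR(a, b), NOT(AND(a, b)))
XOR = lambda a, b: AND(OR(a, b), NOT(AND(a, b)))

for a in (0, 1):
    for b in (0, 1):
        print(a, b, "->", int(AND(a, b)), int(OR(a, b)), int(XOR(a, b)))
```

The XOR composition makes the Minsky-Papert point concrete: no single choice of `w` and `b` produces it, but one hidden layer of thresholded neurons does.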

That observation translates directly to tensor operations. If you stack many gates into a single layer and represent all wire values as a float tensor, one matrix multiply plus an elementwise threshold activation computes an entire layer of Boolean gates simultaneously. Stack enough of these layers and you have an adder; stack adders and you have an ALU. The full CPU datapath becomes a deep but fixed-topology network where the weights encode the Boolean functions of each gate rather than any learned relationship.
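As a sketch of that batching step (again illustrative numpy, not the project's code): pack the per-gate weights into the columns of one matrix, the biases into one vector, and a single matmul plus threshold evaluates the whole gate layer at once. Here one layer computes AND, OR, NAND, and NOR of the same two wires:

```python
import numpy as np

# One layer of gates = one matmul plus a threshold.
# Each column of W holds one gate's input weights; b holds its bias.
# Gate order: [AND, OR, NAND, NOR] over the same two input wires.
W = np.array([[1.0, 1.0, -1.0, -1.0],
              [1.0, 1.0, -1.0, -1.0]])
b = np.array([-1.5, -0.5, 1.5, 0.5])

def gate_layer(x):
    """x: (batch, 2) wire values in {0, 1}. Returns (batch, 4) gate outputs."""
    return np.heaviside(x @ W + b, 0.0)

# All four input combinations, evaluated in one call.
x = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
print(gate_layer(x))
```

Adding a gate to the layer adds a column to `W`; the cost of the layer is one matrix product regardless of how many gates it contains.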

What GPU hardware does well is exactly this: massive parallelism over elementwise operations and matrix products. A layer of 10,000 gates runs no slower than a layer of 10 if the batch dimension is large enough. This is the property that makes the approach interesting beyond novelty. The goal is not to execute a single CPU instance faster than silicon; it is to run thousands of independent instances in parallel, one per element in the batch. Each has different register contents, a different program, a different memory image. The GPU schedules them all simultaneously.
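The batch dimension is where that parallelism shows up. As a hedged sketch, stacking the thresholded layers from above into a one-bit full adder and feeding it thousands of independent input triples evaluates every instance in one vectorized call; the layer structure here (unary-threshold hidden units plus a majority gate) is one standard construction, not necessarily the one nCPU uses:

```python
import numpy as np

step = lambda z: np.heaviside(z, 0.0)

def full_adder(bits):
    """bits: (batch, 3) array of [a, b, cin] in {0, 1}.
    Returns (sum_bit, carry) per batch element."""
    # Hidden layer: unary thresholds h_k = [a + b + cin >= k] for k = 1, 2, 3.
    s = bits.sum(axis=1, keepdims=True)
    h = step(s - np.array([0.5, 1.5, 2.5]))
    # Output layer: parity of the three bits = h1 - h2 + h3, thresholded.
    sum_bit = step(h @ np.array([1.0, -1.0, 1.0]) - 0.5)
    # Carry is a single majority neuron: fires when at least two inputs are 1.
    carry = step(s[:, 0] - 1.5)
    return sum_bit, carry

# Thousands of independent adder instances, one vectorized call.
batch = np.random.randint(0, 2, size=(4096, 3)).astype(float)
sum_bit, carry = full_adder(batch)
```

Each row of `batch` is an independent instance with its own inputs, which is the same scheduling property that lets the full simulator run one CPU per batch element.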

The Computation Graph as Circuit Netlist

A standard CPU datapath, without the complications of out-of-order execution, maps cleanly to the kind of static computation graph that tensor frameworks optimize well. Each pipeline stage is a transformation on the state vector:

  1. Instruction fetch: index into memory at the address stored in the program counter
  2. Decode: demultiplex the instruction word into control signals
  3. Execute: apply the ALU function to the selected operands
  4. Writeback: conditionally update register file entries and flags
  5. PC update: add an offset or load a branch target

Each stage is a function of the current state tensor. Implementing it in PyTorch or JAX means writing each stage as tensor operations: indexing for fetch, bitwise extraction and masking for decode, arithmetic and shift operations for execute, and scatter for writeback. The state itself, the register file and memory, lives in tensors updated in place each cycle.
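The shape of that cycle loop can be sketched with a deliberately tiny toy machine; this is an illustration of the fetch/decode/execute/writeback/PC-update structure, not nCPU's ISA, and numpy's predicated `where` stands in for the framework's select operations:

```python
import numpy as np

MEM_SIZE = 16

def cycle(mem, regs, pc):
    """One clock cycle of a toy 2-instruction machine.
    Encoding (hypothetical): top bit = opcode, low 7 bits = immediate.
      0x00-0x7F: ADD imm   (r0 += imm)      0x80-0xFF: JMP imm  (pc = imm)"""
    instr  = mem[pc]                         # fetch: index memory at the PC
    opcode = instr >> 7                      # decode: bit extraction...
    imm    = instr & 0x7F                    # ...and masking
    add_result = (regs[0] + imm) & 0xFF      # execute: the only ALU op
    regs = regs.copy()
    regs[0] = np.where(opcode == 0, add_result, regs[0])          # predicated writeback
    pc = np.where(opcode == 0, (pc + 1) % MEM_SIZE, imm % MEM_SIZE)  # PC update
    return mem, regs, pc

mem = np.zeros(MEM_SIZE, dtype=np.uint8)
mem[0] = 5       # ADD 5
mem[1] = 3       # ADD 3
mem[2] = 0x80    # JMP 0
regs = np.zeros(1, dtype=np.uint8)
pc = np.uint8(0)
for _ in range(3):
    mem, regs, pc = cycle(mem, regs, pc)
print(int(regs[0]), int(pc))  # → 8 0
```

Every branch of control flow becomes a `where` select rather than a Python `if`, which is what keeps the whole cycle expressible as a fixed tensor graph and, with an added leading batch dimension, lets every instance take a different path through the same operations.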

The cycle loop itself is not differentiable in the usual sense; you are stepping through discrete time, not unrolling into a single computation graph. But each individual cycle is differentiable with respect to the state, which matters if you want to train anything against the simulation output or use the simulator as a component in a larger learned system.

Prior Art: A Long Lineage

The concept of differentiable or neural-network-based computation has a substantial history. The Neural Turing Machine, introduced by Graves et al. at DeepMind in 2014, couples a recurrent controller with an external memory bank accessed through soft attention. Every read and write operation is a weighted sum over all memory locations, keeping the whole system end-to-end differentiable. The Differentiable Neural Computer, published in Nature in 2016, extended this with dynamic memory allocation and temporal linking between written locations.

Both of these are learned systems trained to perform algorithms. nCPU is different: it is a hand-coded implementation of a CPU in neural-network form, not a system trained to mimic one. The architecture is fixed by the circuit design, not by gradient descent.

The more direct parallel is the differentiable logic gate network work from Petersen et al., presented at NeurIPS 2022 under the title “Deep Differentiable Logic Gate Networks.” That paper parameterizes each gate as a continuous relaxation over all 16 possible 2-input Boolean functions. During training, each gate is a soft mixture of all 16 functions; after training, it snaps to whichever Boolean function has the highest learned weight. The result is an actual logic circuit trained discriminatively. nCPU inverts this: start from the circuit, express it in tensor notation, and skip the training entirely.
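The soft-mixture idea from that paper can be sketched compactly. A 2-input gate has 16 possible truth tables; a differentiable gate holds a softmax distribution over all 16 and outputs the probability-weighted mixture, then snaps to the argmax after training. This is a schematic reconstruction of the mechanism, not the paper's implementation:

```python
import numpy as np

# Truth tables for all 16 two-input Boolean functions: row k, column 2a+b
# holds function k's output on inputs (a, b).
TABLES = np.array([[(k >> (2 * a + b)) & 1 for (a, b) in
                    [(0, 0), (0, 1), (1, 0), (1, 1)]]
                   for k in range(16)], dtype=float)

def soft_gate(logits, a, b):
    """Training-time gate: softmax-weighted mixture over all 16 functions."""
    probs = np.exp(logits) / np.exp(logits).sum()
    return probs @ TABLES[:, 2 * a + b]

def hard_gate(logits, a, b):
    """Post-training gate: snap to the highest-weighted Boolean function."""
    k = int(np.argmax(logits))
    return TABLES[k, 2 * a + b]

# Logits with most of their mass on function k = 6, which is XOR.
logits = np.full(16, -4.0)
logits[6] = 4.0
print([int(hard_gate(logits, a, b))
       for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]])  # [0, 1, 1, 0]
```

nCPU's inversion of this scheme amounts to writing the `logits` down by hand, one-hot per gate, so the "trained" circuit is the known-correct one from the start.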

Reed and de Freitas’s Neural Programmer-Interpreters (ICLR 2016) took yet another angle, training recurrent networks to learn execution traces of programs by watching them run. These systems generalize across programs but trade off the bit-exact correctness that a direct circuit implementation provides.

The Determinism Problem

Running a fundamentally deterministic system on floating-point infrastructure introduces a practical tension. GPU floating-point arithmetic is not fully reproducible across runs: reduction orders depend on thread scheduling, and accumulated rounding error grows with the number of clock cycles simulated.

For a design that uses hard integer or fixed-point values throughout, this is manageable. If every neuron holds an exact 0 or 1, encoded as an integer, and the matrix multiplies use integer arithmetic, the result is bit-exact. The soft approximation problem only arises if you want the simulation to be differentiable with respect to inputs or state, at which point you need continuous relaxations of the hard Boolean operations, and those relaxations accumulate error across cycles.

This is an inherent cost of expressing discrete systems in continuous arithmetic, not a flaw unique to nCPU. The straight-through estimator from Bengio et al. (2013) is the standard workaround for training through hard thresholds, but it introduces gradient bias. For a system that is executed rather than trained, integer representations sidestep the problem entirely, and the “neural network” framing becomes a description of the execution substrate rather than a claim about learning.
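The straight-through trick itself is small. In the forward pass the threshold stays hard; in the backward pass the gradient is passed through as if the threshold were the identity, often clipped to the region |x| <= 1. A minimal sketch with the forward and backward rules written out by hand (numpy has no autograd, so the two passes are explicit functions):

```python
import numpy as np

def ste_forward(x):
    """Forward pass: the hard 0/1 threshold, exactly as executed."""
    return np.heaviside(x, 0.0)

def ste_backward(x, grad_out):
    """Backward pass: pretend the threshold was the identity, clipped to
    |x| <= 1 (the common 'clipped STE' variant)."""
    return grad_out * (np.abs(x) <= 1.0)

x = np.array([-2.0, -0.3, 0.4, 1.7])
y = ste_forward(x)                      # hard bits
g = ste_backward(x, np.ones_like(x))    # gradient flows only where |x| <= 1
```

The bias the text mentions is visible here: the backward rule is not the derivative of the forward rule (which is zero almost everywhere), so gradients are a useful fiction rather than an exact quantity, and an executed-only simulator can skip the fiction entirely by staying in integers.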

What This Points Toward

The immediate practical application for a batched CPU simulator running in a tensor framework is formal verification at scale: execute millions of distinct program traces simultaneously and check for behavioral equivalences or invariant violations. Software-defined hardware research already uses FPGA simulation in similar ways; doing it inside a standard ML framework lowers the barrier to integration with trained models that reason about hardware behavior.

There is also the hardware synthesis direction. If you can express a CPU as a differentiable function, even a soft approximation, you can in principle apply gradient-based optimization to explore ISA design spaces, minimize energy per operation, or search for implementations that satisfy multiple objectives simultaneously. The difflogic line of work points directly at this: circuits trained to minimize classification error while subject to area or latency constraints. nCPU builds the bridge from the other side, starting from a known-correct circuit and expressing it in the language that optimization tooling understands.

What the project demonstrates, finally, is that the conceptual distance between a GPU tensor library and a CPU datapath is smaller than most working programmers assume. The same operations that power transformer attention, matrix factorization, and image convolution can also implement fetch-decode-execute. The substrate is general enough that the distinction between running a model and running a program starts to look like a matter of how you initialize the weights. nCPU makes that compression visible, and that is what makes it worth studying even if you would never run production code on it.