
The Inverted Stack: Running a CPU Through Neural Network Gates on a GPU


The relationship between neural networks and digital logic has always been intimate in theory and mostly ignored in practice. A single perceptron, given the right weights, implements any linearly separable boolean function. A two-layer network handles XOR. Compose enough of these and you have a full combinational circuit, which means you have the kernel of a CPU. nCPU, a project by Robert C. Price, turns this theoretical relationship into a running artifact: a CPU where every logic operation is a neural network evaluation, executed entirely on GPU hardware.

This is worth understanding on its own terms, rather than treating it as a novelty.

Boolean Logic Is a Special Case of Neural Networks

The foundational claim is not controversial. Take a neuron with weights w1 and w2, bias b, and a step activation function (output 1 if the weighted sum plus bias exceeds zero, else 0). Setting w1 = 1, w2 = 1, b = -1.5 gives you an AND gate; setting b = -0.5 instead gives OR. NOT is a single neuron with w1 = -1, b = 0.5.
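The weights above can be checked directly. A minimal sketch in plain Python (the function names are mine, not nCPU's):

```python
# Single-neuron logic gates with a hard step activation.
# Weights taken from the text: AND uses b = -1.5, OR uses b = -0.5,
# NOT uses w = -1, b = 0.5.

def step(z):
    return 1 if z > 0 else 0

def neuron(x, w, b):
    # Weighted sum plus bias, pushed through the step activation.
    return step(sum(wi * xi for wi, xi in zip(w, x)) + b)

def AND(a, b):
    return neuron([a, b], [1, 1], -1.5)

def OR(a, b):
    return neuron([a, b], [1, 1], -0.5)

def NOT(a):
    return neuron([a], [-1], 0.5)
```

Each gate is exact over {0, 1} inputs: the threshold sits strictly between the weighted sums of the accepting and rejecting input patterns, so floating-point rounding never flips a bit.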

The classic 1969 result from Minsky and Papert established that XOR is not linearly separable, requiring one hidden layer with two intermediate neurons to implement correctly. But beyond that limitation, any boolean function over n variables can be expressed as a feedforward network with at most two layers, since any boolean function can be written in disjunctive normal form, and each term in that form is a threshold-gate conjunction.
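One standard two-intermediate-neuron construction (OR and NAND feeding an AND, my choice of decomposition, not necessarily nCPU's) looks like this:

```python
# XOR via one hidden layer of two threshold neurons: XOR(a, b) = OR(a, b) AND NAND(a, b).

def step(z):
    return 1 if z > 0 else 0

def neuron(x, w, b):
    return step(sum(wi * xi for wi, xi in zip(w, x)) + b)

def XOR(a, b):
    h1 = neuron([a, b], [1, 1], -0.5)    # OR
    h2 = neuron([a, b], [-1, -1], 1.5)   # NAND
    return neuron([h1, h2], [1, 1], -1.5)  # AND of the two hidden units
```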

These are not approximations. Given the right weights and a hard threshold activation, a neural network computes exactly the same boolean function as a silicon gate. The distinction is execution substrate: silicon gates switch physically; the neural network version evaluates via floating-point arithmetic on a programmable processor.

From individual gates, you build half adders, full adders, ripple-carry adders, and from there arithmetic logic units. Register files follow from multiplexers; control logic from decoders and enable signals. The entire digital logic stack underlying every conventional CPU is, at each level, composable from networks of this form.
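The first rungs of that ladder can be sketched directly from the neuron-based gates. This is an illustration of the composition, not nCPU's actual code:

```python
# Half adder, full adder, and ripple-carry adder built entirely from
# threshold-neuron gates.

def step(z):
    return 1 if z > 0 else 0

def neuron(x, w, b):
    return step(sum(wi * xi for wi, xi in zip(w, x)) + b)

def AND(a, b):
    return neuron([a, b], [1, 1], -1.5)

def OR(a, b):
    return neuron([a, b], [1, 1], -0.5)

def XOR(a, b):
    # OR and NAND hidden units feeding an AND.
    return AND(neuron([a, b], [1, 1], -0.5), neuron([a, b], [-1, -1], 1.5))

def half_adder(a, b):
    return XOR(a, b), AND(a, b)          # (sum, carry)

def full_adder(a, b, cin):
    s1, c1 = half_adder(a, b)
    s, c2 = half_adder(s1, cin)
    return s, OR(c1, c2)

def ripple_add(xs, ys):
    # Little-endian bit lists of equal length; returns (sum bits, carry out).
    carry, out = 0, []
    for a, b in zip(xs, ys):
        s, carry = full_adder(a, b, carry)
        out.append(s)
    return out, carry
```

Every arithmetic operation in the resulting machine bottoms out in the same `neuron` call, which is the property that makes the whole stack GPU-friendly.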

Why the GPU Is the Natural Target

Running this on GPU is not just an aesthetic choice. Neural network inference is fundamentally a series of matrix multiplications followed by element-wise activation functions, and GPUs were optimized specifically for this class of operation. NVIDIA’s tensor cores perform small matrix multiply-accumulate operations in a single instruction across wide data paths; the CUDA memory hierarchy is designed to keep data close to the compute units that consume it.

When nCPU encodes gate operations as small neural network layers, those layers evaluate as batched matrix-vector products. Thousands of gates can run in parallel across the GPU’s shader units. The hardware architecture that emerged from graphics rendering, and was later co-opted for machine learning, turns out to be well-suited for a simulation where the underlying primitive is a dot product and a threshold comparison.
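The source does not describe nCPU's actual layer shapes, but the batching idea can be sketched with NumPy standing in for the GPU: stack gate weights into a matrix, stack inputs into another, and one matmul evaluates every gate over every input at once.

```python
import numpy as np

# Two gate types as rows of a shared weight matrix (AND and OR use the
# same weights, different biases), evaluated over all four input pairs
# in a single batched matrix product.
W = np.array([[1.0, 1.0],    # AND
              [1.0, 1.0]])   # OR
b = np.array([-1.5, -0.5])

X = np.array([[0, 0],
              [0, 1],
              [1, 0],
              [1, 1]], dtype=float)

# One matmul + threshold evaluates both gates on all inputs:
# out[i, 0] is AND over row i, out[i, 1] is OR over row i.
out = (X @ W.T + b > 0).astype(int)
```

On a GPU the same pattern extends to thousands of gates and wide batches; the threshold is the only non-matmul step.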

There is a notable irony here. The industry has spent years designing dedicated hardware, from Google’s TPUs to Apple’s Neural Engine, to run neural networks efficiently on silicon. nCPU inverts this: it runs neural networks on GPU hardware that was already optimized for neural networks, but in service of simulating a classical CPU. The abstraction stack has eaten itself.

The Distance from Traditional Emulation

A conventional CPU emulator like QEMU represents each target instruction as a function in the host language. The emulator matches the opcode, dispatches to the corresponding handler, updates the virtual register file and memory, and moves to the next instruction. The logic is explicit: each instruction’s semantics is written out by a human programmer as host code.
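For contrast, the conventional shape can be sketched as a toy dispatch loop (a hypothetical two-instruction ISA of my own invention, not QEMU's internals):

```python
# Dispatch-style emulation: each opcode maps to a hand-written handler
# that mutates explicit machine state.

def op_add(state, rd, rs):
    state["regs"][rd] = (state["regs"][rd] + state["regs"][rs]) & 0xFF

def op_load_imm(state, rd, imm):
    state["regs"][rd] = imm & 0xFF

HANDLERS = {"ADD": op_add, "LI": op_load_imm}

def run(program):
    state = {"regs": [0, 0, 0, 0]}
    for opcode, *args in program:
        HANDLERS[opcode](state, *args)   # explicit, human-written semantics
    return state
```

Every instruction's meaning lives in a named function; nothing about the machine is learned or implicit.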

nCPU’s approach is structurally different. The instruction semantics are encoded in weight matrices. The same matrix multiplication pathway that evaluates gate behavior for one part of an addition also handles carry propagation and register writeback. No human wrote the logic gate by gate in the conventional sense; the architecture is specified by the network topology and the weight values.

This has a consequence that is easy to overlook. In a conventional emulator, adding a new instruction means writing new code. In a neural-network CPU, the natural mechanism for change is the same gradient descent that neural networks use for everything else: define the desired input-output behavior, compute the loss, backpropagate. The instruction set becomes a trained behavior rather than a compiled function.

The Differentiability Angle

The most interesting implication of this architecture is what happens when you treat the weights as learnable parameters rather than fixed constants. A CPU with learnable weights is a differentiable program. You can define a loss function over the CPU’s output, propagate gradients back through the gate network, and update the weights via an optimizer.
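The training loop itself is ordinary. As a minimal sketch, here is gradient descent recovering the AND gate from its truth table, using a sigmoid relaxation of the step activation (an illustration of the mechanism, not nCPU's training code):

```python
import math

# Learn AND-gate weights from input-output examples by gradient descent
# on a sigmoid relaxation of the hard threshold.

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]

w1, w2, b = 0.0, 0.0, 0.0
lr = 1.0
for _ in range(2000):
    for (x1, x2), y in data:
        p = sigmoid(w1 * x1 + w2 * x2 + b)
        g = p - y          # gradient of cross-entropy loss w.r.t. the pre-activation
        w1 -= lr * g * x1
        w2 -= lr * g * x2
        b -= lr * g

# After training, thresholding the sigmoid at 0.5 reproduces AND exactly.
```

Scaling this from one gate to a full datapath is exactly the open problem the architecture raises.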

This connects nCPU to a line of research going back to Neural Turing Machines (Graves et al., 2014), which added differentiable memory access to recurrent networks, allowing a network to learn programs rather than just pattern-match inputs. The Differentiable Neural Computer extended this with richer memory operations, including temporal linkage and content-based addressing, making the memory more expressive and the learning more stable.

What nCPU proposes, implicitly, is a further step: a full register-machine architecture where the gate logic itself is a gradient target. Feed the network a dataset of instruction traces and let it converge on an instruction set. The ISA becomes a learned artifact, inferred from examples of correct computation.

Whether this produces useful results in practice is still an open question. Training neural networks to implement exact boolean functions requires careful architecture choices. Smooth activations like sigmoid or tanh approximate the step function but never replicate it exactly, which means evaluation may produce incorrect bits at the margins. Binarized neural networks, which constrain weights and activations to ±1 during both training and inference, are a more faithful approach, though they introduce their own training dynamics. The straight-through estimator is typically used to pass gradients through the non-differentiable binarization step.
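The straight-through estimator is simple to state: the forward pass uses the hard sign function, and the backward pass pretends binarization was the identity, usually with the gradient clipped outside |x| ≤ 1. A minimal forward/backward sketch:

```python
import numpy as np

# Straight-through estimator: hard binarization forward, near-identity backward.

def binarize_forward(x):
    # Hard sign: every value becomes exactly +1 or -1.
    return np.where(x >= 0, 1.0, -1.0)

def binarize_backward(x, grad_out):
    # STE backward: pass the gradient through unchanged, but only
    # where the input lies in [-1, 1]; zero it elsewhere.
    return grad_out * (np.abs(x) <= 1.0)

x = np.array([-1.7, -0.3, 0.4, 2.0])
y = binarize_forward(x)                      # exact +/-1 outputs
g = binarize_backward(x, np.ones_like(x))    # gradient survives only in [-1, 1]
```

In an autograd framework this would be a custom backward rule; the NumPy version above just makes the two passes explicit.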

Where This Sits in the Broader Landscape

Kaiser and Sutskever’s Neural GPU work demonstrated in 2016 that GPU-native neural architectures could learn algorithms like long addition and multiplication from examples alone, with parallelism that scales with input length. That work focused on high-level algorithmic behavior rather than gate-level simulation, but the thread connecting it to nCPU is clear: both explore what computation looks like when expressed as learned neural network behavior rather than explicit procedural code.

Projects like nCPU are not trying to replace hardware CPUs, and evaluating them on performance grounds misses the point. A neural-network implementation of a 32-bit adder is orders of magnitude slower and more energy-intensive than the CMOS equivalent. The interesting territory is what becomes possible when CPU-like computation is expressed as differentiable operations. Gradient-based ISA search, CPU behavior specification by example, fine-tuning an existing simulated architecture to extend its instruction set by continuing training, all of these are applications that have no analog in conventional emulation.

From a systems programming perspective, the idea that an instruction set could be a set of trained weights rather than a hardware specification shifts some important assumptions. You could in principle serialize a CPU the same way you serialize a model checkpoint, transmit it, load it on different GPU hardware, and run it. The ISA becomes a blob of floating-point values rather than a silicon die or a fixed software table. Whether that property is useful depends on the application, but it is a genuinely different thing from what emulation currently offers.
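The round trip is literally the same one used for model checkpoints. A sketch with NumPy's archive format (the key names are illustrative, not nCPU's serialization scheme):

```python
import io

import numpy as np

# "ISA as a blob of floats": save gate weights like a model checkpoint,
# reload them, and verify the restored weights compute the same gate.
weights = {"and_w": np.array([1.0, 1.0]), "and_b": np.array([-1.5])}

buf = io.BytesIO()               # stands in for a file or network transfer
np.savez(buf, **weights)
buf.seek(0)
restored = dict(np.load(buf))

# The restored weights still implement AND.
x = np.array([1.0, 1.0])
out = int(x @ restored["and_w"] + restored["and_b"][0] > 0)
```

Nothing about the consumer needs to know the architecture ahead of time beyond the tensor shapes, which is precisely the portability claim.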

nCPU is a concrete existence proof that a CPU can be expressed as a neural network running on a GPU. The gap between that existence proof and a practical learned ISA is large, but the foundation is now a concrete software artifact rather than a theoretical sketch, and that is a meaningful step.
