
The CPU as a Weight Matrix: What nCPU Reveals About Computation

Source: lobsters

A CPU, reduced to its lowest level, is a function from one machine state to the next. Every clock cycle takes a tuple of register values, flags, a program counter, and some memory, and maps it to a new tuple according to the current instruction. That function is built from Boolean gates: AND, OR, NOT, and their combinations. The nCPU project by Robert Price asks a direct question: what stops you from expressing that function as a neural network and running it on a GPU?

The answer is: nothing. The result is educational, not practical, but the gap it closes between two domains that rarely speak to each other is worth examining carefully.

Logic Gates Are Neurons

The foundational observation is that a single neuron with a step activation function is a logic gate. Given two binary inputs and a threshold rule, you can reproduce any gate from the standard set. An AND gate fires when the weighted sum of its inputs exceeds 1.5, with both input weights set to 1:

import torch

def threshold(x):
    # Step activation: accepts Python floats or tensors,
    # returns 1.0 when the pre-activation is non-negative.
    return (torch.as_tensor(x) >= 0).float()

# AND: w1=1, w2=1, bias=-1.5
def and_gate(a: float, b: float) -> float:
    return threshold(1.0 * a + 1.0 * b - 1.5).item()

# OR: same weights, bias=-0.5
def or_gate(a: float, b: float) -> float:
    return threshold(1.0 * a + 1.0 * b - 0.5).item()

# NAND: negated weights, bias=1.5
def nand_gate(a: float, b: float) -> float:
    return threshold(-1.0 * a - 1.0 * b + 1.5).item()

These are not approximations of logic gates. They are logic gates, expressed using the same primitives a neural network uses: a linear combination followed by an elementwise nonlinearity. The weights are set by hand, not learned, but they occupy the same data structures.

XOR is the interesting case. It is not linearly separable, so a single neuron cannot compute it. A two-layer network can, by feeding a NAND and an OR into an AND:

# XOR: hidden layer with NAND and OR units, output AND
h_weights = torch.tensor([[-1.0, -1.0],  # NAND row
                           [ 1.0,  1.0]])  # OR row
h_bias    = torch.tensor([1.5, -0.5])
out_weights = torch.tensor([[1.0, 1.0]])
out_bias    = torch.tensor([-1.5])

def xor_gate(a: float, b: float) -> float:
    x = torch.tensor([[a, b]])
    h = threshold(x @ h_weights.T + h_bias)
    return threshold(h @ out_weights.T + out_bias).item()

This is structurally identical to a small neural network classifier. The weights encode digital logic instead of a decision boundary, but the computation graph is the same. The fact that Minsky and Papert highlighted XOR in Perceptrons (1969) as the failure mode of single-layer networks makes this feel like a correction: XOR is not a problem for two-layer networks, and two-layer networks are sufficient to build any combinational circuit.

Since NAND is functionally complete, meaning any Boolean function can be expressed as a composition of NAND gates, a neural network with threshold neurons can represent any digital logic circuit exactly. The CPU is a digital logic circuit.
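As a concrete check of that completeness claim, here is a small sketch (same threshold convention as above) recovering NOT, AND, and OR from a single NAND neuron:

```python
import torch

def threshold(x):
    # Step activation: 1.0 when the pre-activation is non-negative.
    return (torch.as_tensor(x) >= 0).float()

def nand(a, b):
    # One threshold neuron: fires unless both inputs are 1.
    return threshold(-1.0 * a - 1.0 * b + 1.5).item()

def not_(a):
    # NOT(a) = NAND(a, a)
    return nand(a, a)

def and_(a, b):
    # AND(a, b) = NOT(NAND(a, b))
    return not_(nand(a, b))

def or_(a, b):
    # OR(a, b) = NAND(NOT(a), NOT(b)), by De Morgan
    return nand(not_(a), not_(b))

for a in (0.0, 1.0):
    for b in (0.0, 1.0):
        assert and_(a, b) == float(bool(a) and bool(b))
        assert or_(a, b) == float(bool(a) or bool(b))
```

Every derived gate is still nothing but compositions of the same NAND neuron, which is the sense in which threshold networks inherit NAND's universality.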

State and the Sequential Problem

Combinational circuits are stateless: given inputs, they produce outputs, with no memory of prior inputs. A CPU is stateful. Registers, flags, and a program counter all persist across cycles. In silicon, this is handled by flip-flops and latches that hold values between clock edges. In a neural network, it requires recurrent connections: outputs fed back as inputs at the next time step.
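To make the recurrence concrete, here is a minimal sketch (not nCPU's code) of a one-bit register: a gated D latch built from threshold neurons, whose output is fed back in as `state` on the next call:

```python
import torch

def threshold(x):
    # Step activation over floats or tensors.
    return (torch.as_tensor(x) >= 0).float()

def and_g(a, b): return threshold(a + b - 1.5).item()
def or_g(a, b):  return threshold(a + b - 0.5).item()
def not_g(a):    return threshold(-a + 0.5).item()

def register_step(state, data, load):
    # next_state = (load AND data) OR ((NOT load) AND state):
    # a gated D latch as a feed-forward circuit. The recurrence is
    # the caller feeding the output back in as `state` next cycle.
    return or_g(and_g(load, data), and_g(not_g(load), state))

state = 0.0
state = register_step(state, data=1.0, load=1.0)  # write 1
state = register_step(state, data=0.0, load=0.0)  # hold: load=0 keeps old value
assert state == 1.0
```

The feedback loop is exactly what distinguishes this from the combinational gates above: the same network evaluated repeatedly, with part of its output wired to part of its input.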

Hava Siegelmann and Eduardo Sontag proved in their 1992 paper that recurrent neural networks with rational-valued weights are Turing-complete. They construct a direct simulation of a Turing machine inside an RNN, encoding the tape and head state in the activation vector and advancing through time steps. With real-valued weights, RNNs are super-Turing, capable of computing functions no Turing machine can.

nCPU sits well within the rational-weight, Turing-complete regime. The registers and program counter live in a state vector that is fed back at each step. The forward pass reads the current state, computes the next state according to the encoded instruction set, and writes the result. Each forward pass is one clock cycle.
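As an illustrative toy, not nCPU's actual encoding: a one-hot program counter advanced each cycle by a fixed permutation matrix shows the forward-pass-as-clock-cycle shape:

```python
import torch

# Toy state vector: a one-hot program counter over 4 addresses.
# One matrix multiply = one clock cycle. (nCPU's real state vector
# also carries registers, flags, and memory.)
increment = torch.tensor([
    [0., 0., 0., 1.],   # address 0 follows address 3 (wraparound)
    [1., 0., 0., 0.],
    [0., 1., 0., 0.],
    [0., 0., 1., 0.],
])

state = torch.tensor([1., 0., 0., 0.])  # PC = 0
for _ in range(3):
    state = increment @ state           # advance the PC one cycle

# After 3 cycles the one-hot PC sits at address 3.
assert torch.equal(state, torch.tensor([0., 0., 0., 1.]))
```

The "increment the program counter" logic is just a weight matrix; a real instruction decoder is the same idea with many more rows.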

GPU as Execution Substrate

The GPU does not know or care what the weight matrices represent. It runs matrix multiplications. For a neural network encoding Boolean logic, those matrix multiplications compute gate evaluations across the entire CPU datapath in parallel. All gates in a given pipeline stage evaluate simultaneously, which mirrors real hardware: in an actual CPU, the AND gate computing the carry bit and the OR gate computing a flags condition are both switching at the same time.

The GPU’s natural advantage lies in the batch dimension. Running one CPU instance as a neural network is extremely slow compared to native execution. Running 10,000 CPU instances simultaneously, each with different register contents and different programs loaded in memory, costs approximately the same as running one. This is the setting where the approach becomes interesting for anything beyond demonstration: large-scale parallel CPU simulation for verification, fuzzing, or hardware design exploration.
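A rough sketch of that batch advantage, reusing the XOR network from earlier: one pair of matrix multiplies evaluates 10,000 independent instances at once:

```python
import torch

def threshold(x):
    return (x >= 0).float()

# XOR as a two-layer threshold network (NAND + OR hidden units, AND output),
# evaluated for a whole batch of instances in one pass.
h_w = torch.tensor([[-1., -1.], [1., 1.]])
h_b = torch.tensor([1.5, -0.5])
o_w = torch.tensor([[1., 1.]])
o_b = torch.tensor([-1.5])

# 10,000 independent "machines", each with its own pair of input bits.
batch = torch.randint(0, 2, (10_000, 2)).float()
h = threshold(batch @ h_w.T + h_b)
out = threshold(h @ o_w.T + o_b)        # shape (10_000, 1)

# Every row is the XOR of that row's inputs, computed simultaneously.
expected = (batch[:, 0] != batch[:, 1]).float()
assert torch.equal(out.squeeze(1), expected)
```

The per-gate cost is amortized across the batch; the GPU's scheduler neither knows nor cares that each row is a separate simulated machine.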

Frameworks like PyTorch and JAX are optimized for exactly this workload. The static computation graph of a single CPU pipeline stage, with no data-dependent branching within the gate-level implementation, maps well to the XLA or CUDA compilation models. Branches in the simulated program are handled inside the model as multiplexer logic, not as host-side control flow.
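A sketch of that multiplexer idea: a 2-to-1 mux built from three threshold neurons selects between two candidate values without any host-side `if` (the variable names are illustrative, not nCPU's):

```python
import torch

def threshold(x):
    return (torch.as_tensor(x) >= 0).float()

def mux(sel, a, b):
    # 2-to-1 multiplexer in threshold logic:
    # out = (sel AND a) OR ((NOT sel) AND b), three neurons total.
    sel_and_a  = threshold(sel + a - 1.5)
    nsel_and_b = threshold(-sel + b - 0.5)   # one neuron for (NOT sel) AND b
    return threshold(sel_and_a + nsel_and_b - 0.5)

# A taken/not-taken branch becomes an unconditional select between two
# candidate next-PC bits; both sides are always evaluated.
sel         = torch.tensor([0., 1., 1., 0.])
taken_pc    = torch.tensor([1., 1., 0., 0.])
fallthrough = torch.tensor([0., 0., 1., 1.])
next_pc = mux(sel, taken_pc, fallthrough)
assert torch.equal(next_pc, torch.tensor([0., 1., 0., 1.]))
```

Because both branch targets are computed every cycle and the mux discards one, the computation graph stays static, which is what XLA- and CUDA-style compilation want.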

What Came Before

The connection between neural networks and universal computation has a long theoretical lineage. Beyond Siegelmann and Sontag, DeepMind’s Neural Turing Machine (2014) showed that a learned controller with external soft-attention memory could learn algorithms like sorting and copying from examples alone. Its successor, the Differentiable Neural Computer, added dynamic memory allocation and demonstrated navigation of graph structures. Both systems were trained rather than hand-coded, which trades bit-exact correctness for generalization.

A different thread comes from cellular automata. Conway’s Game of Life is Turing-complete, and several implementations have expressed the Game of Life update rule as a convolutional neural network with fixed weights. Mordvintsev et al. used similar ideas for neural cellular automata that learn to self-organize. A CNN running Game of Life is, by transitivity, a universal computer, though at similarly impractical speeds.
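For illustration, here is one way the Game of Life step can be written as a fixed-weight convolution; the post-convolution rule is written with elementwise comparisons for brevity, though it too could be expressed as threshold neurons:

```python
import torch
import torch.nn.functional as F

# Game of Life as a fixed-weight CNN: a 3x3 kernel counts live
# neighbours, then the update rule is applied elementwise.
kernel = torch.ones(1, 1, 3, 3)
kernel[0, 0, 1, 1] = 0.0   # exclude the cell itself

def life_step(grid):
    # grid: (H, W) tensor of 0./1. cells
    n = F.conv2d(grid[None, None], kernel, padding=1)[0, 0]
    # Born with exactly 3 neighbours; survive with 2 or 3.
    return ((n == 3) | ((grid == 1) & (n == 2))).float()

# A blinker oscillates with period 2.
grid = torch.zeros(5, 5)
grid[2, 1:4] = 1.0
assert torch.equal(life_step(life_step(grid)), grid)
```

The convolution weights are fixed by hand, never trained, which is the same move nCPU makes one level down, at the gate level.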

The work most directly adjacent to nCPU is probably Petersen et al.’s Deep Differentiable Logic Gate Networks from NeurIPS 2022. That work parameterizes each gate as a continuous relaxation over all 16 possible 2-input Boolean functions and trains the entire circuit discriminatively. After training, each gate snaps to its winning Boolean function, yielding a real digital circuit. nCPU inverts this: start from a known correct circuit, translate it into neural-network notation, skip the training. The end products are different, but both demonstrate the same underlying equivalence between circuit logic and neural computation.

The Floating-Point Tension

Running a deterministic Boolean system on floating-point GPU hardware introduces one genuine complication. GPU floating-point arithmetic is not fully reproducible across runs: reduction order depends on thread scheduling, and rounding accumulates differently across hardware generations. For a system that should produce exact 0s and 1s, this is a latent source of corruption.

The practical solution is integer representation throughout. If all activations are stored as integers 0 or 1, and the matrix multiplications use integer arithmetic or are structured to avoid rounding sensitivity, the simulation is bit-exact. The “neural network” framing then describes the execution infrastructure rather than the precision regime.
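A sketch of the bit-exact integer variant: with 0/1 activations in integer tensors, the fractional biases become integer thresholds and the arithmetic is exact:

```python
import torch

# Bit-exact gate evaluation: integer activations and weights mean no
# rounding and no dependence on floating-point reduction order.
def threshold_int(x):
    return (x >= 0).to(torch.int64)

a = torch.tensor([0, 0, 1, 1])
b = torch.tensor([0, 1, 0, 1])

# The fractional biases from the float version become integer
# thresholds: AND fires at sum >= 2, OR at sum >= 1.
and_out  = threshold_int(a + b - 2)
or_out   = threshold_int(a + b - 1)
nand_out = threshold_int(1 - a - b)

assert torch.equal(and_out,  torch.tensor([0, 0, 0, 1]))
assert torch.equal(or_out,   torch.tensor([0, 1, 1, 1]))
assert torch.equal(nand_out, torch.tensor([1, 1, 1, 0]))
```

Nothing about the gate semantics changes; only the precision regime does, which is the sense in which "neural network" here describes infrastructure rather than numerics.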

Differentiability is the cost of the integer choice. If you want to train against the simulation output, backpropagate through clock cycles, or use the CPU model as a component in a larger learned system, you need continuous activations and soft approximations to the hard Boolean operations. The straight-through estimator from Bengio et al. (2013) handles gradient flow through hard thresholds at the cost of gradient bias. This is the same trade-off that appears throughout quantization research: exactness and differentiability pull against each other, and you choose based on what the downstream use case requires.
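A minimal sketch of the straight-through estimator in PyTorch: a hard step on the forward pass, an identity gradient on the backward pass (some variants also clip the gradient outside a window; this sketch keeps pure identity):

```python
import torch

class HardThresholdSTE(torch.autograd.Function):
    """Hard step forward; straight-through (identity) gradient backward."""

    @staticmethod
    def forward(ctx, x):
        # Exact 0/1 output, exactly like the threshold gates above.
        return (x >= 0).float()

    @staticmethod
    def backward(ctx, grad_output):
        # Pretend the step was the identity: pass the gradient through
        # unchanged. Biased, but it lets training proceed.
        return grad_output

x = torch.tensor([-0.3, 0.7], requires_grad=True)
y = HardThresholdSTE.apply(x)
y.sum().backward()

assert torch.equal(y, torch.tensor([0., 1.]))        # exact forward
assert torch.equal(x.grad, torch.tensor([1., 1.]))   # usable backward
```

The forward pass stays bit-exact; only the backward pass lies, which is exactly the compromise the paragraph above describes.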

The Conceptual Compression

What nCPU actually demonstrates is a compression of conceptual distance. Programmers working with GPU tensor libraries and programmers working with CPU microarchitecture occupy very different mental models. The former thinks in batches, dimensions, and broadcast rules; the latter thinks in opcodes, pipeline stages, and forwarding paths. These feel like different disciplines.

nCPU shows that the operations at the bottom of each discipline are the same: multiply, add, compare, threshold. The difference is in what the tensors represent. Transformer attention weights encode learned associations between tokens. nCPU weight matrices encode Boolean truth tables. The GPU does not distinguish between these use cases at the level of arithmetic.

This matters because it sets a practical floor on how different a “neural” system and a “classical” system can really be. If you can translate one into the other by changing weight initializations, the two are the same class of machine at the level of what they can compute. The boundary between running a program and evaluating a model, which has grown blurrier as language models have become better at following multi-step instructions, is not a fundamental boundary at all. It is a matter of what the weights say.
