
The Original Neural Network Was a Logic Gate: How nCPU Closes the Loop

Source: lobsters

McCulloch and Pitts published their 1943 paper “A Logical Calculus of the Ideas Immanent in Nervous Activity” with an explicit goal: to show that neurons could compute. Not learn, but compute. Their threshold neuron model, a unit that fires when a weighted sum of inputs crosses a value, was designed to demonstrate that the brain’s mechanism could in principle implement any boolean function. The “neural network” they described was a logical machine first, a biological model second.

nCPU by Robert Price arrives in 2024 from exactly that starting point. Every logic gate in its CPU is implemented as a threshold neuron with hand-set weights. Executing a clock cycle is a forward pass through a PyTorch network. The project makes no claims about performance; it is recovering something from the 1943 paper that the subsequent eight decades of machine learning largely set aside.

The First Paper Already Had the Answer

The 1943 formulas are simple enough to reproduce in a few lines. A Heaviside step activation applied to a linear combination gives exact boolean operations:

# AND(x, y) = step(x + y - 1.5)
# OR(x, y)  = step(x + y - 0.5)
# NOT(x)    = step(0.5 - x)

Given binary inputs (exactly 0 or 1), each expression produces the correct boolean output. No approximation, no rounding error. In PyTorch, an AND gate is a two-input linear layer with weight tensor [[1.0, 1.0]] and bias -1.5, followed by a threshold comparison:

import torch
import torch.nn as nn

class ANDGate(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(2, 1, bias=True)
        # Hand-set weights: the sum exceeds the 1.5 threshold only when both inputs are 1
        with torch.no_grad():
            self.linear.weight.copy_(torch.tensor([[1.0, 1.0]]))
            self.linear.bias.copy_(torch.tensor([-1.5]))

    def forward(self, x):
        # Heaviside threshold: outputs 1.0 iff the weighted sum is non-negative
        return (self.linear(x) >= 0).float()
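A quick check of the gate against its truth table (a usage sketch, repeating the class so the snippet is self-contained; it assumes a working torch install):

```python
import torch
import torch.nn as nn

class ANDGate(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(2, 1, bias=True)
        with torch.no_grad():
            self.linear.weight.copy_(torch.tensor([[1.0, 1.0]]))
            self.linear.bias.copy_(torch.tensor([-1.5]))

    def forward(self, x):
        return (self.linear(x) >= 0).float()

gate = ANDGate()
truth_table = torch.tensor([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
print(gate(truth_table).squeeze(1).tolist())  # [0.0, 0.0, 0.0, 1.0]
```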

McCulloch and Pitts proved that any boolean function is computable by a finite network of such neurons, which makes any digital circuit expressible as a neural network with fixed, deterministic weights. This was not a footnote; it was the paper’s central claim.

What the paper did not contain was learning. Weights were specified by the designer to implement a desired logical function. The Hebbian learning rule (1949), the Perceptron (1958), and backpropagation (formalized in 1986) came later, and the field progressively shifted attention from “neural networks as logic” toward “neural networks as learning systems.” The two threads, computation and learning, share the same architecture but diverged completely in their research programs.

The XOR Detour and What It Actually Shows

Minsky and Papert’s Perceptrons (1969) is famous for demonstrating that a single-layer perceptron cannot compute XOR. That result damaged the field’s confidence for a decade. The result also contains what nCPU needs: XOR requires exactly two layers. A single hidden layer with two neurons is sufficient.
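The two-layer construction can be written directly with the same step activation as the 1943 formulas (a minimal sketch in plain Python; the hidden neurons compute x AND NOT y and y AND NOT x, and the output neuron ORs them):

```python
def step(v: float) -> int:
    """Heaviside step: fire (1) iff the weighted sum is non-negative."""
    return 1 if v >= 0 else 0

def xor(x: int, y: int) -> int:
    # Hidden layer: two threshold neurons with hand-set weights
    h1 = step(x - y - 0.5)   # fires iff x=1 and y=0
    h2 = step(y - x - 0.5)   # fires iff y=1 and x=0
    # Output layer: OR of the two hidden neurons
    return step(h1 + h2 - 0.5)

print([xor(x, y) for x, y in [(0, 0), (0, 1), (1, 0), (1, 1)]])  # [0, 1, 1, 0]
```

No single-layer assignment of weights can produce this output, which is exactly the Minsky-Papert result.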

In terms of circuit depth, XOR is already the hardest basic gate. A ripple-carry adder chains 32 XOR operations for a 32-bit sum, with carry propagation adding depth. A carry-lookahead adder trades layer depth for width. These tradeoffs translate directly into neural network architecture choices: deeper networks for area-constrained logic, wider networks for latency-constrained logic.

The Minsky-Papert result, read in this light, is not a limitation but a characterization. It says that every combinational logic circuit has a minimum network depth, and that minimum is determined by the circuit’s critical path. nCPU’s network is that circuit, layer by layer.

What Machine Learning Forgot and Hardware Remembered

For most of the period between 1943 and 2015, artificial neurons were treated as learning systems. Hardware designers kept using boolean logic without reference to the neural network formulation, because the formulation added nothing useful for CMOS fabrication.

The two branches reconverged in unexpected ways. Binarized Neural Networks, introduced by Courbariaux, Hubara, and colleagues at NeurIPS 2016, forced neural network weights to ±1 and activations to binary during the forward pass. The motivation was inference speed and model compression, but the result was a trained network where every multiply-accumulate collapses to XNOR-popcount, counting matching bits between binary vectors. Binarized networks are, at inference time, combinational logic circuits discovered through gradient descent rather than designed by hand.
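The XNOR-popcount identity can be checked in a few lines of plain Python (a sketch under one packing assumption: each ±1 vector is stored as an integer bitmask, with a set bit encoding +1):

```python
def pack(vec):
    """Pack a ±1 vector into an int: bit i set iff element i is +1."""
    bits = 0
    for i, v in enumerate(vec):
        if v == 1:
            bits |= 1 << i
    return bits

def binary_dot(a_bits: int, b_bits: int, n: int) -> int:
    """Dot product of two ±1 vectors of length n via XNOR + popcount."""
    mask = (1 << n) - 1
    matches = bin(~(a_bits ^ b_bits) & mask).count("1")  # XNOR, then popcount
    # Each matching position contributes +1, each mismatch -1:
    return 2 * matches - n

a, b = [1, -1, 1, 1], [1, 1, -1, 1]
print(binary_dot(pack(a), pack(b), 4))                # 0
print(sum(x * y for x, y in zip(a, b)))               # 0 (reference dot product)
```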

Microsoft Research’s BitNet b1.58 (2024) extends this with ternary weights {-1, 0, +1}, so that matrix multiplication collapses to conditional addition and zero-skipping. A ternary weight matrix applied to a binary activation vector is structurally equivalent to gate-level simulation: each output is a sum of conditionally added inputs, with zero-weight connections simply absent. BitNet is a language model architecture converging toward circuit representation from the training direction; nCPU is a circuit representation expressed in the tensor framework from the design direction. They meet in the middle.
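The ternary collapse is the same structure with zero-skipping; a sketch in plain Python, where each weight in {-1, 0, +1} selects add, subtract, or skip:

```python
def ternary_matvec(W, x):
    """Multiply a ternary weight matrix by a vector using only
    conditional addition: +1 adds, -1 subtracts, 0 is skipped."""
    out = []
    for row in W:
        acc = 0
        for w, v in zip(row, x):
            if w == 1:
                acc += v
            elif w == -1:
                acc -= v
            # w == 0: the connection is simply absent
        out.append(acc)
    return out

W = [[1, 0, -1],
     [0, 1,  1]]
print(ternary_matvec(W, [3, 4, 5]))  # [-2, 9]
```

No multiplication occurs anywhere in the inner loop, which is the point: applied to binary activations, this is gate-level accumulation.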

Looped Transformers and the Computation Angle

A parallel research thread asks whether transformers can simulate full computer programs. Weiss, Goldberg, and Yahav’s RASP formalism (2021) demonstrated that transformer computations can be described using a restricted-access sequence processor, showing that the transformer’s attention mechanism is sufficient to simulate finite state machines. Each attention head reads from specific sequence positions according to learned patterns; the stack of heads computes a program over the sequence.

Giannou et al.’s Looped Transformers as Programmable Computers (2023) goes further: a single transformer block applied repeatedly to a residual stream can simulate arbitrary computer programs. The residual stream is working memory; each forward pass is a clock cycle; the weights encode the instruction set. This is structurally identical to what nCPU does with explicit circuit weights. The looped transformer research asks which architectures are learnable under this structure; nCPU demonstrates that a specific known-correct architecture is expressible under it by construction.

Both lines of work share the same underlying claim: transformer computation and CPU computation are not fundamentally different in their mathematical structure. They differ in parameterization and whether the weights were designed or trained.

Where Differentiability Changes the Calculus

The property that conventional emulators like QEMU, Unicorn, and gem5 cannot provide is differentiability through execution. Because nCPU is expressed entirely as tensor operations, you can define a loss on program output and compute gradients with respect to the program, the initial register state, or (through soft addressing) the memory contents.

This enables gradient-based program synthesis. Given a target behavior, you can search for a program by gradient descent rather than symbolic search or exhaustive enumeration. The straight-through estimator from Bengio et al. (2013) handles the hard boolean thresholds by passing gradients through as if the threshold were the identity function, accepting gradient bias in exchange for a differentiable training path. The approach is noisy and approximate, but it opens a route to optimization-based program search unavailable from a traditional emulator.
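The straight-through estimator fits in a few lines of PyTorch (a minimal sketch, not nCPU's actual code): the forward pass applies the hard threshold, while `detach()` routes the backward pass through the identity.

```python
import torch

def ste_step(x: torch.Tensor) -> torch.Tensor:
    """Hard Heaviside threshold in the forward pass,
    identity gradient in the backward pass."""
    hard = (x >= 0).float()
    # (hard - x).detach() is constant to autograd, so the forward value
    # is `hard` but the gradient flows through the bare `x` term.
    return x + (hard - x).detach()

x = torch.tensor([-0.3, 0.7], requires_grad=True)
y = ste_step(x)
print(y.tolist())          # [0.0, 1.0] — exact boolean outputs
y.sum().backward()
print(x.grad.tolist())     # [1.0, 1.0] — identity gradient, despite the step
```

The gradient is biased by construction (the true derivative of a step is zero almost everywhere), which is the accepted trade for a usable training signal.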

The batch parallelism point is more immediately practical. Running nCPU inside a GPU framework means 10,000 independent CPU instances, each with different programs and memory images, cost roughly the same wall time as one instance. Hardware verification, fuzzing, and formal trace analysis all benefit from this structure. These are not hypothetical applications: QEMU supports multiprocessing, but the instances are independent OS processes, not batched operations on shared tensor resources.
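The batching point can be sketched with the same hand-weighted AND neuron (assuming torch; the batch dimension turns one gate evaluation into many independent ones in a single forward pass):

```python
import torch
import torch.nn as nn

# One threshold-neuron AND gate, weights set by hand as in the 1943 formulation
gate = nn.Linear(2, 1)
with torch.no_grad():
    gate.weight.copy_(torch.tensor([[1.0, 1.0]]))
    gate.bias.copy_(torch.tensor([-1.5]))

# 10,000 independent input pairs evaluated in one forward pass
x = torch.randint(0, 2, (10_000, 2)).float()
out = (gate(x) >= 0).float().squeeze(1)

# Every instance matches the boolean reference
expected = torch.logical_and(x[:, 0].bool(), x[:, 1].bool()).float()
assert torch.equal(out, expected)
```

Scaling this from one gate to a full CPU changes the tensor shapes, not the structure: the batch dimension rides along through every layer.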

The NeurIPS 2022 work on Deep Differentiable Logic Gate Networks by Petersen et al. offers the most direct comparison. That work parameterizes each gate as a continuous relaxation over all 16 possible two-input boolean functions, training to convergence before snapping to discrete gate assignments. The direction is opposite to nCPU: start from tensor operations and converge toward a circuit. The research question is also different, focusing on classification accuracy rather than instruction execution. Together, the two projects bracket the space: you can arrive at a logic circuit through training, and you can express a logic circuit through tensor operations by hand, and both use the same mathematical substrate.

The Substrate Does Not Care

What nCPU makes visible is that the conceptual distance between a tensor operation and a logic gate is zero. The gate is literally a neuron. McCulloch and Pitts stated this in 1943 and the field moved on to learning. The GPU hardware that arrived to accelerate matrix multiplies for machine learning turns out to also be efficient substrate for logic simulation, because the underlying primitive is the same: dot product followed by threshold.

This does not mean nCPU is competitive with FPGA simulation or specialized hardware emulators for any throughput metric. Single-instance sequential execution will always be slower when each clock cycle is a matrix multiply, and the floating-point substrate requires care to maintain bit-exact determinism (integer representations sidestep this entirely by removing floating-point from the gate computations). The value is in differentiability, batch parallelism, and integration with the rest of the ML toolchain.

The distinction between “running a program” and “running a model” starts to look like a matter of whether the weights were designed or trained. nCPU does not blur that distinction; it just makes explicit that both sit on the same mathematical foundation that McCulloch and Pitts laid out before anyone was training anything.
