The Serial Dependency Problem at the Heart of nCPU, and How It Was Solved Twice
Source: lobsters
The premise of nCPU is that a CPU is a circuit, a circuit is a collection of boolean gates, and a boolean gate is a neuron with fixed weights. Robert Price’s project encodes those weights directly and runs the entire fetch-decode-execute loop as a sequence of GPU forward passes. The project has been discussed extensively as a theoretical existence proof: evidence that the same tensor operations driving transformer inference can also run binary logic. The technical detail that receives less attention is the performance bottleneck, which turns out to be one of the oldest problems in computer architecture, solved by hardware engineers in the 1950s and solved again by neural network researchers in 2017 without either group knowing they were working on the same problem.
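The premise can be sketched in a few lines. This is an illustrative NumPy fragment with hand-picked weights, not nCPU's actual encoding: a hard-threshold neuron with fixed weights computes AND and OR exactly over {0, 1} inputs.

```python
import numpy as np

def neuron(x, w, b):
    """A neuron with a hard threshold activation: fires iff w.x + b > 0."""
    return (x @ w + b > 0).astype(np.float32)

# Fixed weights encoding boolean gates over inputs in {0, 1}.
AND_W, AND_B = np.array([1.0, 1.0]), -1.5   # fires only when both inputs are 1
OR_W,  OR_B  = np.array([1.0, 1.0]), -0.5   # fires when at least one input is 1

x = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=np.float32)
print(neuron(x, AND_W, AND_B))  # [0. 0. 0. 1.]
print(neuron(x, OR_W,  OR_B))   # [0. 1. 1. 1.]
```

XOR is not linearly separable, so it needs a second layer of such neurons; that composition is exactly where gate depth, and hence tensor depth, comes from.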
The Sequential Carry Chain
The most basic operation a CPU performs is integer addition. In any binary ripple-carry adder, each bit position must wait for the carry from the previous position before it can compute its sum. For a 32-bit adder, this creates a chain of 32 full-adder stages: bit 0 computes first, produces a carry to bit 1, which waits for that carry, produces a carry to bit 2, and so on through bit 31. The critical path is O(n) gate delays for an n-bit adder.
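The serial chain is easy to see in code. A plain-Python sketch (bit lists LSB first; not nCPU's tensor formulation):

```python
def ripple_carry_add(a_bits, b_bits):
    """Add two n-bit numbers (LSB first). Each iteration waits on the
    previous carry, so the loop body runs n times strictly sequentially."""
    carry, out = 0, []
    for a, b in zip(a_bits, b_bits):
        out.append(a ^ b ^ carry)
        carry = (a & b) | (carry & (a ^ b))   # carry into the next bit
    return out, carry

# 5 + 3 = 8 with 4-bit operands, LSB first
s, c = ripple_carry_add([1, 0, 1, 0], [1, 1, 0, 0])
print(s, c)  # [0, 0, 0, 1] 0
```

The loop cannot be vectorized across bit positions: iteration i reads `carry` written by iteration i-1, which is the O(n) critical path in miniature.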
In nCPU’s execution model, gate depth maps directly to tensor depth. Each layer of the neural network corresponds to one level of boolean logic, and those layers must execute sequentially when data flows between them. A ripple-carry adder occupies O(n) sequential layers, each stalled waiting for the previous result. For 32-bit addition, that is 32 sequential matrix multiplications. The GPU, a machine built for massive parallelism across the batch dimension, sits mostly idle through all of them.
The problem compounds beyond addition. A complete instruction cycle includes instruction fetch, decode, ALU execution, optional memory access, and register writeback. The total gate depth runs somewhere between 20 and 30 gate delays for a typical simple ISA. In nCPU’s model, each of those delays requires at least one tensor layer, meaning a single simulated clock cycle needs tens of sequential forward passes. Single-instance latency is therefore far worse than any conventional emulator, and the gap from real hardware is not a constant factor but a multiplier applied at every clock cycle across the full execution trace.
The Carry-Lookahead Solution
IBM engineers designing the 709 in the late 1950s understood this problem precisely. Waiting for carries to ripple through all 36 bits of the word made addition a bottleneck. The solution, formalized by Weinberger and Smith in 1956 and extended by Kilburn and colleagues, is the carry-lookahead adder.
The core observation is that for each bit position i, you can compute two signals independently of the carry chain: the propagate signal P_i = A_i XOR B_i (does this position pass a carry through if it receives one?) and the generate signal G_i = A_i AND B_i (does this position produce a carry regardless of carry-in?). These signals can be computed for all bit positions simultaneously in a single parallel layer. From them, you can precompute carry signals for multiple positions ahead by expanding the recurrence:
C_{i+1} = G_i + P_i · C_i
across multiple levels of lookahead. A four-bit carry-lookahead group reduces the serial depth from 4 stages to 2: one parallel pass to compute P and G for all bits, one pass to produce carry-out using the precomputed signals. A 32-bit adder built from eight such groups, with a second level of lookahead on top, reduces total depth from O(n) to O(log n).
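A sketch of a single 4-bit lookahead group in NumPy (illustrative, LSB first; variable names are mine, not nCPU's). P and G for every bit come from one vectorized step, and each carry is then a flat sum-of-products over precomputed signals rather than a chain:

```python
import numpy as np

def cla_add(a, b, c_in=0):
    """4-bit carry-lookahead group (LSB first)."""
    a, b = np.asarray(a), np.asarray(b)
    P = a ^ b          # propagate: passes an incoming carry through
    G = a & b          # generate: produces a carry regardless of carry-in
    # The recurrence C_{i+1} = G_i + P_i*C_i, fully expanded so no carry
    # waits on another carry:
    c1 = G[0] | (P[0] & c_in)
    c2 = G[1] | (P[1] & G[0]) | (P[1] & P[0] & c_in)
    c3 = G[2] | (P[2] & G[1]) | (P[2] & P[1] & G[0]) \
         | (P[2] & P[1] & P[0] & c_in)
    c4 = G[3] | (P[3] & G[2]) | (P[3] & P[2] & G[1]) \
         | (P[3] & P[2] & P[1] & G[0]) | (P[3] & P[2] & P[1] & P[0] & c_in)
    carries = np.array([c_in, c1, c2, c3])   # carry into each bit position
    return (P ^ carries).tolist(), int(c4)

print(cla_add([1, 0, 1, 0], [1, 1, 0, 0]))  # ([0, 0, 0, 1], 0), i.e. 5 + 3 = 8
```

The product terms grow wider with position (c4 has five terms), which is the depth-for-width trade the section describes.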
Translated into nCPU’s tensor terms: a ripple-carry adder for 32-bit integers requires 32 sequential tensor layers; a carry-lookahead adder expressed as the same kind of fixed-weight network requires around 5. The tradeoff is depth for width. The lookahead layers are wider, with each neuron receiving inputs from more positions, but the total number of sequential steps falls by more than a factor of six. For a design targeting GPU execution, where width is cheap and sequential depth is the bottleneck, the carry-lookahead structure is unambiguously better.
The Same Tradeoff, Sixty Years Later
When recurrent neural networks were the dominant sequence model, they suffered the same bottleneck. An LSTM processing a sequence of length n required n sequential hidden state updates, each depending on the previous. Parallelizing training across the sequence dimension was impossible; the serial dependency chain meant long sequences were slow regardless of available GPU hardware.
The transformer’s attention mechanism broke this dependency. By replacing serial state-passing with a global attention computation over all positions simultaneously, transformers eliminated the O(n) serial depth. Instead of n sequential matrix multiplies, attention does one O(n²) operation attending to all positions at once. The tradeoff is the same as carry-lookahead: sequential depth falls from O(n) to O(1) while total work grows from O(n) to O(n²).
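The structural contrast fits in a few lines of NumPy. This is a toy illustration, not a faithful LSTM or transformer; it only shows where the serial dependency sits:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 8, 4
x = rng.normal(size=(n, d))

# Recurrent: n sequential steps, each waiting on the previous hidden state.
W = rng.normal(size=(d, d)) * 0.1
h = np.zeros(d)
for t in range(n):                  # serial depth O(n)
    h = np.tanh(x[t] + h @ W)

# Attention: one global mixing step over all positions at once.
scores = x @ x.T / np.sqrt(d)       # O(n^2) pairwise work
weights = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
out = weights @ x                   # serial depth O(1)
```

The recurrent loop body depends on its own previous output; the attention path is a single dependency-free matmul over all n positions.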
The engineers at Google who wrote “Attention Is All You Need” were not thinking about carry-lookahead adders. The circuit architects of the 1950s were not thinking about sequence models. The mathematical problem was the same in both cases, and the solution structure was the same: precompute intermediate signals that break the serial dependency chain, accept wider computation at each parallel step, reduce total sequential depth.
Both are instances of the parallel prefix technique. The prefix sum structure, described in the parallel algorithms literature by Ladner and Fischer in 1980 and applied to GPU scan operations, circuit design, and sequence modeling, is the general form: given a recurrence x_i = f(x_{i-1}, a_i), compute all outputs simultaneously in O(log n) depth by building a binary tree of partial applications. Carry propagation is one instance; attention over a sequence is another. The shape of the solution is identical in both fields because the problem is the same problem.
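The carry recurrence fits the prefix-scan mold because the combine operator on (generate, propagate) pairs is associative: (g2, p2) ∘ (g1, p1) = (g2 | p2 & g1, p2 & p1). A Kogge-Stone-style scan in NumPy (my own illustration, not from the project) computes every carry in O(log n) rounds:

```python
import numpy as np

def carry_scan(a, b):
    """Parallel-prefix addition (LSB first). After log2(n) rounds,
    (G[i], P[i]) summarises bits 0..i, so all carries appear at once."""
    a, b = np.asarray(a, bool), np.asarray(b, bool)
    G, P = a & b, a ^ b
    n, d = len(a), 1
    while d < n:                       # O(log n) sequential rounds
        g_prev = np.concatenate([np.zeros(d, bool), G[:-d]])
        p_prev = np.concatenate([np.zeros(d, bool), P[:-d]])
        G = G | (P & g_prev)           # combine each pair with the pair
        P = P & p_prev                 # d positions earlier, all at once
        d *= 2
    carries = np.concatenate([[False], G[:-1]])  # carry into bit i
    return ((a ^ b) ^ carries).astype(int).tolist(), bool(G[-1])

print(carry_scan([1, 0, 1, 0], [1, 1, 0, 0]))  # ([0, 0, 0, 1], False)
```

Every operation inside the loop is a full-width vector step, so the sketch maps one-to-one onto wide tensor layers: log₂(n) sequential layers instead of n.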
What the Batch Dimension Actually Enables
Even with O(log n) adders reducing per-cycle tensor depth, nCPU will never match real hardware for single-instance execution. Floating-point arithmetic representing binary values carries irreducible overhead: 16 bits of FP16 precision to encode a value that is either 0 or 1. On a single instance the approach simply cannot compete, and improving the adder architecture narrows the gap without closing it.
What is genuinely efficient in this substrate is parallel execution across many independent instances. The GPU’s batch dimension is essentially free up to memory bandwidth limits. A batch of 10,000 independent CPU states costs nearly the same GPU compute as a batch of 1. Each instance can run a different program, hold different register contents, start from different initial memory.
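A sketch of why the batch dimension is cheap: one fixed weight matrix, many independent states, one matmul. The dimensions and random weights here are arbitrary placeholders, not nCPU's actual layout:

```python
import numpy as np

rng = np.random.default_rng(0)
state_dim, batch = 64, 10_000

# One set of fixed gate weights shared by every instance.
W = rng.normal(size=(state_dim, state_dim)).astype(np.float32)
b = rng.normal(size=state_dim).astype(np.float32)

# 10,000 independent machine states: different registers, memory, programs.
states = rng.integers(0, 2, size=(batch, state_dim)).astype(np.float32)

# One forward pass advances all 10,000 instances in a single matmul;
# the batch dimension adds no sequential depth at all.
next_states = (states @ W + b > 0).astype(np.float32)
```

The sequential cost is identical for a batch of 1 and a batch of 10,000; only the width of the matmul changes, and that is the dimension GPUs are built to saturate.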
This property maps naturally onto hardware verification. Formal verification tools like CBMC and KLEE check program correctness symbolically: they represent state as logical formulas over symbolic variables and use SMT solvers to prove properties or find counterexamples. Symbolic execution is exact and exhaustive within its bounded scope but is expensive for programs with large state spaces or loops. It also does not integrate naturally with ML infrastructure; gradients do not flow through SMT solvers.
Batch neural CPU simulation covers the space statistically rather than exhaustively. It can miss corner cases that symbolic tools would catch, but it scales to millions of traces on a single device and produces a differentiable execution function. Define a loss over program outputs, backpropagate through thousands of simultaneous execution traces, and you have a gradient signal pointing toward initial states or instruction sequences that satisfy behavioral constraints. The straight-through estimator from Bengio et al. handles gradients through the hard threshold activations at the cost of some gradient bias. CBMC finds exact bugs through bounded symbolic search; batch neural simulation validates typical-case behavior across a distribution of inputs and opens the door to gradient-based program synthesis that symbolic tools cannot reach.
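A minimal sketch of the straight-through idea (framework-agnostic; a real implementation would attach this as a custom gradient in PyTorch or JAX rather than return a closure):

```python
import numpy as np

def hard_threshold_ste(x):
    """Forward: exact hard 0/1 threshold. Backward: pretend the threshold
    was the identity (the straight-through estimator), so upstream
    gradients pass through unchanged. Biased, but gives a usable signal."""
    y = (x > 0).astype(np.float32)   # exact binary forward pass
    def grad(upstream):
        return upstream              # identity in the backward pass
    return y, grad

y, grad = hard_threshold_ste(np.array([-1.0, 2.0], dtype=np.float32))
print(y)                             # [0. 1.]
print(grad(np.array([3.0, 3.0])))    # [3. 3.]
```

The true derivative of the step function is zero almost everywhere, which would kill the gradient signal entirely; the STE trades that exactness for a nonzero, if biased, descent direction.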
Measuring Circuits in Tensor Terms
The practical consequence of nCPU’s design is that it forces you to measure circuit properties in the language of tensor computation: depth as sequential layer count, width as matrix dimension, critical path as the minimum number of sequential forward passes per simulated clock cycle. Hardware architecture intuitions developed over 60 years of circuit design translate directly into neural network architecture intuitions under this measurement system.
The implication runs in both directions. Circuit techniques like carry-lookahead, which hardware architects developed to minimize critical path depth, apply directly to improving nCPU’s per-cycle efficiency. And neural architecture techniques developed for sequence modeling, including the prefix attention structures in Transformer-XL and the linear attention variants that reduce O(n²) width back down, offer guidance for circuit design problems that have the same structure.
The two fields have been using the same mathematical toolkit for the same class of problem. nCPU makes that overlap concrete by expressing a CPU in tensor terms, which strips away the substrate differences and leaves the shared structure visible.