GPU Architecture Has No Nand2Tetris, and That's the Real Problem This Game Is Solving

Someone built a browser game where you construct a GPU and posted it to Hacker News with a one-line description: “Thought the resources for GPU arch were lacking, so here we are.” It landed nearly 800 points. That reception is not really about the game. It is about the sentence.

The GPU architecture education gap is real, and it has been real for a long time. If you want to understand how a CPU works at the circuit level, the ecosystem is generous. Nand2Tetris hands you a hardware description language, starts from a NAND gate, and walks you through a complete computer. Ben Eater builds a working 8-bit CPU on breadboards across a YouTube series that has become the canonical self-taught path for a generation of systems programmers. Instruction-set simulators like MARS (for MIPS) and Venus (for RISC-V) are standard undergraduate curriculum. Logisim lets you wire up registers and ALUs with a mouse. The list goes on.

Now try to find the equivalent for GPUs. You will find Kayvon Fatahalian’s CMU 15-418 lecture slides, which are genuinely excellent and freely available. You will find NVIDIA’s PTX ISA documentation and the 2008 Lindholm et al. paper describing the Tesla architecture, which is still the clearest published description of SIMT execution. You will find GPGPU-Sim, an academic cycle-accurate simulator used in research papers. What you will not find is an interactive tool that lets a curious developer actually build a GPU piece by piece and understand why each piece exists.

Why GPU Architecture Is Harder to Teach Interactively

The CPU teaching tradition works because a CPU is, at its conceptual core, a sequential machine. You can explain fetch, decode, execute, and writeback as a pipeline that processes one instruction at a time, then add caches, branch prediction, and superscalar execution as complications on top of that foundation. The mental model scales.

GPUs broke from that model in a fundamental way. The whole point of GPU architecture is to execute thousands of threads simultaneously, and the design decisions that make that possible are deeply interconnected. You cannot teach SIMT execution in isolation. The reason NVIDIA’s GPUs group 32 threads into a warp and execute them in lockstep is inseparable from the reason that warp divergence kills performance, which is inseparable from the way the register file is partitioned, which is inseparable from why occupancy is the lever you reach for when a kernel underperforms.

The execution model is also not a single clean abstraction. Modern GPUs simultaneously handle a rasterization pipeline for graphics workloads, compute dispatch for GPGPU workloads, tensor operations for matrix multiply, and ray tracing hardware. These share silicon but operate through different programming models. A teaching tool has to make a choice about which slice to expose.

What the Core Concepts Actually Are

Any serious GPU architecture resource has to cover a small set of ideas that do most of the explanatory work.

SIMT, single instruction multiple threads, is the central one. Where a CPU SIMD unit applies one operation to every lane of a vector register under a single thread of control, a GPU’s SIMT model runs independent threads that happen to execute the same instruction at the same time on different data. Each thread has its own registers and, from the programmer’s point of view, its own control flow, which means branches are possible. (Since Volta, NVIDIA hardware tracks a program counter per thread; earlier generations kept one per warp alongside an active mask.) When threads in a warp diverge at a branch, the hardware masks off the threads that are not on the current path and runs the two paths one after the other. The performance cost of divergence is not just overhead; it is the defining constraint that shapes how GPU code is written.
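To make the masking concrete, here is a minimal Python sketch of one warp hitting an if/else. It is a toy model, not how any real scheduler is implemented: the function name run_warp, the branch bodies, and the "passes" counter are all made up for illustration. The only thing it borrows from real hardware is the warp width of 32.

```python
WARP_SIZE = 32  # warp width on NVIDIA GPUs


def run_warp(values, threshold):
    """Toy model of one warp executing: if v < threshold: v * 2 else: v + 1.

    Returns (results, passes). `passes` counts how many branch bodies the
    warp had to issue: 1 if every thread agrees, 2 if the warp diverges.
    """
    assert len(values) == WARP_SIZE
    # Every lane evaluates the condition; together they form the active mask.
    taken = [v < threshold for v in values]
    results = list(values)
    passes = 0

    # First path: lanes that took the branch run, the rest are masked off.
    if any(taken):
        passes += 1
        for lane in range(WARP_SIZE):
            if taken[lane]:
                results[lane] = values[lane] * 2

    # Second path: the mask is inverted and the remaining lanes run.
    if not all(taken):
        passes += 1
        for lane in range(WARP_SIZE):
            if not taken[lane]:
                results[lane] = values[lane] + 1

    return results, passes


data = list(range(WARP_SIZE))
_, uniform_passes = run_warp(data, threshold=100)   # every lane takes the branch
_, divergent_passes = run_warp(data, threshold=16)  # half the lanes take it
print(uniform_passes, divergent_passes)  # prints "1 2"
```

The divergent warp does the same amount of useful work as the uniform one but occupies the execution units for two passes instead of one, with half the lanes idle in each. That idle half is the cost the paragraph above is describing.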

The memory hierarchy is the second major concept. GPU registers are per-thread and extremely fast, shared memory is per thread-block and programmer-managed, L2 cache is shared across the whole device, and global VRAM sits behind a high-bandwidth but high-latency bus. An RTX 4090 has 1008 GB/s of memory bandwidth, but global memory latency is hundreds of cycles. The entire programming model around coalesced memory access, where threads in a warp access contiguous addresses so the memory controller can serve them in as few transactions as possible, exists to extract that bandwidth while hiding that latency.
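A rough way to see why contiguous access matters is to count how many memory segments a single warp touches for a given access pattern. The sketch below assumes a 128-byte transaction size and 4-byte elements. Both numbers are assumptions chosen for illustration, as is the function name; real transaction granularity depends on the architecture and the cache level involved.

```python
WARP_SIZE = 32
SEGMENT_BYTES = 128  # assumed transaction size for this sketch
ELEM_BYTES = 4       # one 32-bit value per thread


def transactions(stride):
    """Count distinct segments touched when lane i reads element i * stride."""
    addresses = [lane * stride * ELEM_BYTES for lane in range(WARP_SIZE)]
    segments = {addr // SEGMENT_BYTES for addr in addresses}
    return len(segments)


for stride in (1, 2, 32):
    n = transactions(stride)
    useful = WARP_SIZE * ELEM_BYTES          # bytes the warp actually wanted
    fetched = n * SEGMENT_BYTES              # bytes moved to satisfy it
    print(stride, n, useful / fetched)
# prints:
# 1 1 1.0
# 2 2 0.5
# 32 32 0.03125
```

With a stride of 1 the warp's 128 bytes arrive in one segment. With a stride of 32, every lane lands in its own segment, so the hardware moves 4 KB to deliver 128 useful bytes. Under these assumptions, that is the difference between using the advertised bandwidth and wasting about 97% of it.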

Occupancy ties these two concepts together. An SM (Streaming Multiprocessor) can have many warps in flight simultaneously, and the hardware switches between them with zero overhead when one warp stalls on memory. The number of warps you can run at once is limited by register file size and shared memory allocation. High register usage per thread means fewer threads fit, which means fewer warps, which means less latency hiding, which means worse throughput. This three-way constraint between registers, shared memory, and warp count is the kind of thing that only becomes intuitive after you have watched your kernel’s occupancy collapse because you added one too many local variables.
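The arithmetic behind that collapse is simple enough to sketch. The function below estimates how many warps fit on one SM given per-thread register use and per-block shared memory. The default limits (65,536 registers, 48 KB of shared memory, 64 resident warps) are illustrative assumptions rather than the specification of any particular GPU, and the model ignores allocation granularity and per-SM block limits that real hardware imposes.

```python
WARP_SIZE = 32


def resident_warps(regs_per_thread, smem_per_block, threads_per_block,
                   regs_per_sm=65536, smem_per_sm=49152, max_warps_per_sm=64):
    """Estimate warps resident on one SM under three independent limits.

    All default limits are hypothetical round numbers for illustration.
    """
    warps_per_block = -(-threads_per_block // WARP_SIZE)  # ceiling division
    regs_per_block = regs_per_thread * warps_per_block * WARP_SIZE

    blocks_by_warps = max_warps_per_sm // warps_per_block
    blocks_by_regs = regs_per_sm // regs_per_block
    # A block that uses no shared memory is not limited by it.
    blocks_by_smem = (smem_per_sm // smem_per_block
                      if smem_per_block > 0 else blocks_by_warps)

    # Whole blocks are scheduled, so the tightest limit wins.
    blocks = min(blocks_by_warps, blocks_by_regs, blocks_by_smem)
    return blocks * warps_per_block


def occupancy(*args, max_warps_per_sm=64, **kwargs):
    """Resident warps as a fraction of the SM's maximum."""
    warps = resident_warps(*args, max_warps_per_sm=max_warps_per_sm, **kwargs)
    return warps / max_warps_per_sm


print(occupancy(32, 0, 256))      # 32 regs/thread, no shared memory -> 1.0
print(occupancy(64, 0, 256))      # doubling register use            -> 0.5
print(occupancy(32, 16384, 256))  # 16 KB shared memory per block    -> 0.375
```

Doubling the register count per thread halves the number of warps available to hide memory latency, and a modest shared memory allocation can cut it further, even though neither change touches the kernel's arithmetic.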

The Pedagogy Case for Games

Zachtronics spent a decade demonstrating that game mechanics and low-level programming concepts are compatible. TIS-100 teaches assembly-style data flow. Shenzhen I/O teaches PCB design and embedded systems. The games work because constraint is the game. You are given limited resources, a target behavior, and the requirement that your solution runs correctly. The feedback loop is immediate.

GPU architecture maps well onto this format. Shader core budgets, register file limits, shared memory allocations, pipeline stages, memory bandwidth constraints: these are all resource constraints with measurable consequences. A game that lets you allocate compute units, wire up a memory hierarchy, and then watch a workload run through your design, with visible performance consequences for every decision, would teach the intuition that reading papers does not.

The Nand2Tetris approach of “build it from nothing” is compelling but exhausting at full fidelity. What matters for GPU architecture intuition is not transistor-level correctness but decision-level feedback: what happens when you add more shared memory and reduce register budget, what happens when you change warp size, what happens when you split a kernel into two passes to improve cache reuse. A game can abstract away everything below the architectural level while still teaching the tradeoffs that matter for real work.

Where the Gap Remains

The project at jaso1024.com is a starting point, not a curriculum. Getting 788 upvotes on Hacker News means the developer hit a nerve; it does not mean the resource problem is solved. A single game, however well-constructed, covers one slice of a very wide subject.

What the field actually needs is something closer to what Nand2Tetris did for CPU architecture: a layered progression that starts from a minimal working GPU, introduces each complication with a clear motivation, and ends with something recognizable as a modern compute device. That project does not exist yet. The Mvidia game is the first move in a conversation that the hardware education community should have been having for years.

The fact that someone built it in a weekend and posted it with a sentence about lacking resources, and hundreds of developers immediately recognized what they meant, is the most useful signal here. There is demand. The question is whether anyone builds the thing that fully answers it.