The Missing nand2tetris for GPU Architecture

The execution model of a modern GPU is one of those things that takes a long time to really click. I remember reading NVIDIA’s CUDA programming guide expecting the mental model to arrive neatly, and finding instead a taxonomy of terms (warps, thread blocks, streaming multiprocessors, occupancy, coalescing) that referenced each other in ways that made sense only once you already understood the whole system. That circularity is exactly the problem a project called mvidia is trying to solve, by making you build a GPU yourself.

The project appeared on Hacker News with the creator’s honest framing: the resources for GPU architecture were lacking, so they built an interactive game to fill the gap. It hit 788 points and generated substantial technical discussion, which suggests the gap is genuine and the approach resonates.

Why GPU Architecture Is Harder to Teach Than CPU Architecture

CPU architecture has an unusually good education story. nand2tetris takes you from NAND gates to a working operating system across a semester. The Turing Complete game covers similar ground through puzzle mechanics. Patterson and Hennessy’s Computer Organization and Design has served as the standard university text for decades. The mental model is sequential: fetch, decode, execute, writeback, repeat. It maps onto how humans naturally reason about programs running one instruction at a time.

GPUs do not map cleanly onto sequential intuition. The architecture exists to solve a specific problem: performing the same operation on thousands of data elements simultaneously, at the cost of control flow flexibility. That design decision propagates through every layer of the hardware in ways that have no clean CPU analogue.

A modern NVIDIA GPU groups 32 threads into a unit called a warp. All threads in a warp execute the same instruction at the same time, operating on different data, under a model NVIDIA calls SIMT (Single Instruction, Multiple Threads). This is essentially SIMD with a threading abstraction layered on top. When threads in a warp take different branches, the hardware serializes the divergent paths: it runs each path in turn and uses predicate masks to suppress writes from the threads that did not take the path currently executing. Code that appears parallel on the surface becomes partly sequential whenever threads in the warp diverge and do not reconverge quickly.
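The serialization described above can be sketched in a few lines of Python. This is a toy model, not how any real scheduler is implemented: the function name `run_branch` and the branch body are made up for illustration, and the only thing it tracks is how many passes a warp needs when its threads disagree about a branch.

```python
WARP_SIZE = 32  # threads per warp on NVIDIA hardware, as described above


def run_branch(values):
    """Simulate one warp executing `if v % 2 == 0: v * 2 else: v + 1`.

    Returns the per-thread results and the number of serialized passes
    the warp needed (1 if all threads agree, 2 if they diverge).
    """
    assert len(values) == WARP_SIZE
    taken = [v % 2 == 0 for v in values]  # predicate mask for the "if" side
    results = list(values)
    passes = 0

    # Pass over the "if" path: the whole warp steps through it, but only
    # lanes whose mask bit is set are allowed to write a result.
    if any(taken):
        passes += 1
        for lane in range(WARP_SIZE):
            if taken[lane]:
                results[lane] = values[lane] * 2

    # Pass over the "else" path with the inverted mask.
    if not all(taken):
        passes += 1
        for lane in range(WARP_SIZE):
            if not taken[lane]:
                results[lane] = values[lane] + 1

    return results, passes


uniform = [2 * i for i in range(WARP_SIZE)]  # every lane takes the "if" path
mixed = list(range(WARP_SIZE))               # lanes alternate between paths

_, uniform_passes = run_branch(uniform)
mixed_results, mixed_passes = run_branch(mixed)
print(uniform_passes, mixed_passes)  # prints "1 2"
```

The warp with uniform input finishes in one pass, while the warp with alternating input needs two, even though both run the same source code on the same number of threads.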

That behavior is not intuitive and does not translate directly from any CPU concept most programmers know. The closest comparison is vector intrinsics with masking, but even experienced systems programmers who write AVX2 code sometimes need a moment to internalize what warp divergence really costs in terms of throughput. A game that forces you to reason about this at the hardware level, where the consequences of your architectural choices are visible in the system’s behavior, is a more effective teaching tool than any amount of prose.

The Memory Hierarchy Is the Real Puzzle

If the warp execution model is what surprises people first, the memory hierarchy is what determines whether code is actually fast in practice.

A GPU Streaming Multiprocessor (SM in NVIDIA’s terminology, Compute Unit in AMD’s) has a register file, shared memory accessible to all threads in a block, and an L1 cache. Behind the SMs sits an L2 cache shared by the whole chip, which in turn feeds from global VRAM. Bandwidth and latency can differ by orders of magnitude between the top and bottom of that hierarchy. A register read is essentially free. Shared memory access is fast. Global memory access carries a latency on the order of hundreds of clock cycles, and if that access is not coalesced, meaning threads in a warp are not reading consecutive addresses that can be merged into a single wide transaction, the memory controller has to issue multiple smaller transactions and most of the available bandwidth goes to waste.
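To make the cost of uncoalesced access concrete, here is a minimal sketch that counts how many memory segments a single warp touches for a given stride. The 128-byte segment size and 4-byte element size are assumptions chosen for illustration; real transaction sizes depend on the architecture and cache configuration, and the helper names are hypothetical.

```python
WARP_SIZE = 32
SEGMENT_BYTES = 128  # assumed transaction size; real hardware varies
ELEMENT_BYTES = 4    # e.g. one 32-bit float per thread


def transactions(stride):
    """Count memory segments touched when lane i reads element i * stride."""
    addresses = [lane * stride * ELEMENT_BYTES for lane in range(WARP_SIZE)]
    segments = {addr // SEGMENT_BYTES for addr in addresses}
    return len(segments)


def efficiency(stride):
    """Fraction of the fetched bytes that the warp actually asked for."""
    useful = WARP_SIZE * ELEMENT_BYTES
    fetched = transactions(stride) * SEGMENT_BYTES
    return useful / fetched


for stride in (1, 2, 8, 32):
    print(stride, transactions(stride), efficiency(stride))
```

Under these assumptions a stride of 1 needs a single transaction and uses every fetched byte, while a stride of 32 needs 32 transactions and uses about 3% of what it fetched.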

This is the mechanism behind performance differences that can look inexplicable at the source level. Two kernels with nearly identical code can differ by a factor of ten because one has a strided access pattern that defeats coalescing. Understanding why requires knowing what the memory controller is doing at the transaction level. Documentation explains this with diagrams and access pattern tables. A game that makes you design the memory subsystem yourself and observe what happens under coalesced versus strided access builds the intuition in a more direct way.

Occupancy adds another layer. Each SM has a limited pool of registers, shared memory, and thread slots. If your kernel uses many registers per thread, fewer warps can be resident on the SM simultaneously, which reduces the GPU’s ability to hide memory latency by switching to another warp while a load is in flight. The relationship between register pressure, shared memory usage, thread block size, and resulting occupancy is exactly the kind of tradeoff that takes time to internalize from documentation but becomes clear when you are deciding how to allocate hardware resources in a simulation.
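The tradeoff can be sketched as a small calculation. The limits below (`SMLimits`) are hypothetical round numbers, not the specification of any particular GPU, and the model ignores the allocation granularity that real occupancy calculators account for. It assumes every thread uses at least one register.

```python
from dataclasses import dataclass

WARP_SIZE = 32


@dataclass(frozen=True)
class SMLimits:
    """Hypothetical per-SM resource limits, chosen for illustration only."""
    registers: int = 65536
    shared_bytes: int = 49152
    max_warps: int = 64
    max_blocks: int = 32


def occupancy(threads_per_block, regs_per_thread, shared_per_block,
              sm=SMLimits()):
    """Return resident warps / max warps for one SM under the limits above."""
    warps_per_block = -(-threads_per_block // WARP_SIZE)  # ceiling division
    regs_per_block = regs_per_thread * warps_per_block * WARP_SIZE

    # Each resource independently caps how many blocks fit on the SM.
    limits = [
        sm.max_blocks,
        sm.max_warps // warps_per_block,
        sm.registers // regs_per_block,
    ]
    if shared_per_block > 0:
        limits.append(sm.shared_bytes // shared_per_block)

    resident_blocks = min(limits)  # the scarcest resource wins
    return resident_blocks * warps_per_block / sm.max_warps


print(occupancy(256, 32, 0))      # 1.0: registers and warp slots both fit
print(occupancy(256, 64, 0))      # 0.5: doubling registers halves the warps
print(occupancy(256, 32, 16384))  # 0.375: shared memory becomes the limit
```

The point of the sketch is that occupancy is set by whichever resource runs out first, so a change that looks local, such as a few extra registers per thread, can halve the number of warps available to hide memory latency.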

The Interactive Model and What nand2tetris Got Right

The success of nand2tetris is worth examining carefully. The course does not just explain how computers work; it asks you to build each layer from scratch, with each layer’s output feeding into the next. Logic gates lead to an ALU, an ALU leads to a CPU, a CPU leads to an assembler, and so on up to a running virtual machine. The insight behind the pedagogy is that building a simplified version of a system gives you the intuition that reading about the full system does not.

Turing Complete applies this to a game format, adding progression and puzzle mechanics to keep the process engaging. Both approaches rely on the same mechanism: you cannot fake your way through building something. If you do not understand how carry propagation works in your ALU, the adder will not compute correct sums, and the game will block your progress until it does.
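As a small illustration of that "you cannot fake it" property, here is a ripple-carry adder built only from a NAND primitive, in the spirit of the nand2tetris exercises. It is a sketch written for this article, not code from either project. If the carry wiring in `full_adder` is wrong, the sums come out wrong.

```python
def nand(a, b):
    """The only primitive gate; inputs and output are 0 or 1."""
    return 1 - (a & b)


def xor(a, b):
    """XOR built from four NAND gates."""
    n = nand(a, b)
    return nand(nand(a, n), nand(b, n))


def full_adder(a, b, carry_in):
    """Add three bits; return (sum_bit, carry_out)."""
    partial = xor(a, b)
    total = xor(partial, carry_in)
    # carry_out = (a AND b) OR (partial AND carry_in), written as NAND of NANDs
    carry_out = nand(nand(a, b), nand(partial, carry_in))
    return total, carry_out


def add(x, y, width=8):
    """Ripple-carry addition: each bit's carry feeds the next bit."""
    carry = 0
    result = 0
    for bit in range(width):
        s, carry = full_adder((x >> bit) & 1, (y >> bit) & 1, carry)
        result |= s << bit
    return result, carry


print(add(3, 5))    # prints "(8, 0)"
print(add(255, 1))  # prints "(0, 1)": the carry ripples out of the top bit
```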

Designing a GPU equivalent of this is harder than designing a CPU equivalent. The challenge is that the concepts that matter most in GPU architecture (warp scheduling, memory coalescing, occupancy, and the pipeline from vertex processing through rasterization to fragment shading) form a more complicated dependency graph than the sequential CPU pipeline. A CPU can be built up in a clean vertical stack. GPU architecture branches horizontally into parallel execution units, fixed-function pipeline stages, and a memory subsystem that interacts with all of them.

Building a simplified model that captures real tradeoffs without drowning in vendor-specific microarchitecture details requires careful judgment about what to leave in and what to abstract away. Getting that balance right is the hard design problem in any educational tool of this kind.

The Existing Landscape Is Thin

For CPU architecture, the options stack up well: nand2tetris, Turing Complete, the Ben Eater breadboard series on YouTube, the Visual6502 simulator, and a shelf of textbooks spanning multiple decades. For GPU architecture, the landscape thins out quickly.

Fabian Giesen’s A Trip Through the Graphics Pipeline remains one of the best long-form technical resources available, covering the fixed-function rendering pipeline from API call to final pixel with real depth. It is excellent, but it is a long-form read with no interactivity, and it predates much of the compute workload context that dominates GPU discussion in 2026. The CUDA C Programming Guide is thorough, but it assumes you are already writing CUDA code and approaching architecture from the programmer’s side rather than building up from hardware. Vendor whitepapers describing SM microarchitecture require enough background to parse that they are effectively inaccessible to someone without prior exposure.

The gap the mvidia creator identified is real. There is no clean, progressive, interactive path for someone who understands CPUs and wants to build genuine GPU architecture intuition. Most resources assume you are either a graphics programmer learning an API or a hardware engineer reading silicon documentation. The space between those two positions is sparsely populated.

What a Game Teaches That Documentation Cannot

The value of the game format is not just engagement. It is the feedback loop. When you misunderstand how warp scheduling works, the game shows you the consequence: low occupancy, poor utilization, a specific bottleneck you can trace back to a decision you made about the hardware. That feedback compresses what would otherwise take weeks of CUDA kernel profiling, Nsight Compute sessions, and careful reading of occupancy calculator documentation into something you can experience in a single sitting.

The simplifications are real, and they matter. Whatever hardware mvidia models differs from what is inside an actual RTX 5090. Not every behavior that matters in production (the specifics of L2 cache partitioning, how shared memory bank conflicts interact with different access widths, how successive GPU generations have changed warp divergence handling) will be present in a simplified educational model. That is not a criticism; it is a necessary property of any educational abstraction. The alternative is not a perfect simulation of current hardware. The alternative is the sparse landscape that already exists.

For someone trying to build real GPU architecture intuition before diving into CUDA programming, graphics pipeline optimization, or GPU-side ML inference work, a game that makes you think about warp schedulers, memory hierarchies, and execution units as concrete design choices you make and observe is more valuable than another long-form read. mvidia is a genuine attempt to fill a gap that has been sitting open for a long time.