The Missing Middle Ground in GPU Architecture Education

The GPU education landscape has a peculiar shape. On one side you have shader programming tutorials: WebGPU Fundamentals, The Book of Shaders, Shadertoy, all teaching you to write code that runs on a GPU without much explanation of what the GPU is doing with it. On the other side sit vendor whitepapers and academic papers describing microarchitecture at the level of transistor counts and memory subsystem topology. The middle ground, where someone explains how the hardware actually executes your wavefront or dispatch or draw call, is thin.

mvidia starts from that frustration. The author’s stated motivation is spare: “Thought the resources for GPU arch were lacking, so here we are.” The project is a browser-based game where you construct a GPU by assembling its components, learning the architecture through the act of building it. The Hacker News (HN) reception was immediate: 788 points and 161 comments, which suggests the gap resonates with a lot of people who have spent time around GPUs without ever feeling they properly understood them.

The nand2tetris Model and Why GPUs Need Their Own Version

The gold standard for bottom-up hardware education is probably nand2tetris, the course that takes students from a single NAND gate through to a functioning computer with an assembler and operating system. The course works because CPUs, at their conceptual core, are sequential state machines. The fundamental abstractions (a register that holds a value, an ALU that transforms it, a program counter that advances) map cleanly onto how most programmers already think about computation.

GPUs break that model almost immediately. The entire point of a modern GPU is to execute thousands of threads simultaneously, and the hardware is structured around that constraint in ways that make sequential reasoning about execution unreliable. The concepts that matter most in GPU architecture are specifically the ones that have no direct CPU analogue: warp and wavefront execution, SIMD lane masking for divergent branches, the distinction between shared memory and global memory and why it matters for latency, and occupancy as a function of register pressure and shared memory allocation per thread block.

Ben Eater’s breadboard CPU series shows how far physical construction can take you in understanding a sequential processor. The registers, the bus, the control logic, all of it becomes legible when you wire it together. A GPU equivalent cannot be physical in the same way, because the scale is the point. A modern NVIDIA H100 in its SXM5 form has 132 streaming multiprocessors, each containing 128 FP32 CUDA cores (16,896 in total, alongside 528 Tensor Cores), with a memory subsystem running at 3.35 TB/s aggregate bandwidth. The architecture is only meaningful at scale, which makes a game or simulation a natural teaching format: you can compress the scale while preserving the structural relationships between components.

What GPU Architecture Actually Requires You to Understand

The execution model that distinguishes GPUs from CPUs is worth unpacking in detail, because it is the thing most shader tutorials skip entirely.

A GPU does not run individual threads independently. Threads are grouped into warps (on NVIDIA hardware, 32 threads) or wavefronts (on AMD hardware, 64 threads on GCN and CDNA, with RDNA defaulting to 32). All threads within a warp execute the same instruction in lockstep, each on different data: SIMD in the strictest sense. When threads in a warp take different branches through an if/else, the GPU serializes the paths: it executes the if branch with lanes masked off for threads that took else, then executes the else branch with the opposite mask. Both branches execute, and the wasted cycles compound as divergence increases. This is why GPU programmers think about branch divergence in a way that CPU programmers do not.
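
The masking behavior described above can be sketched in a few lines of Python. This is a toy model rather than a description of any real scheduler: the function name `run_warp`, the branch bodies, and the idea of counting "issue slots" are all assumptions made for illustration. It only shows that a warp whose lanes disagree on a branch has to step through both bodies.

```python
WARP_SIZE = 32  # NVIDIA warp width, used here purely for illustration


def run_warp(values):
    """Toy model of one warp running: if v is even, v * 2, else v + 1.

    Returns (results, issue_slots), where issue_slots counts how many
    branch bodies the warp had to step through.
    """
    assert len(values) == WARP_SIZE
    results = [None] * WARP_SIZE
    # Every lane evaluates the predicate together, producing an active mask.
    take_if = [v % 2 == 0 for v in values]
    issue_slots = 0

    # Pass 1: the `if` body, with lanes that chose `else` masked off.
    if any(take_if):
        issue_slots += 1
        for lane in range(WARP_SIZE):
            if take_if[lane]:
                results[lane] = values[lane] * 2

    # Pass 2: the `else` body, with the inverted mask.
    if not all(take_if):
        issue_slots += 1
        for lane in range(WARP_SIZE):
            if not take_if[lane]:
                results[lane] = values[lane] + 1

    return results, issue_slots


uniform_input = [2 * i for i in range(WARP_SIZE)]  # every lane takes `if`
divergent_input = list(range(WARP_SIZE))           # lanes alternate branches

uniform_results, uniform_slots = run_warp(uniform_input)
divergent_results, divergent_slots = run_warp(divergent_input)
print(uniform_slots, divergent_slots)
```

In this model the uniform warp pays for one branch body and the divergent warp pays for two, even though each lane only needed one of them.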

Memory is the other axis. The GPU memory hierarchy rewards spatial and temporal locality even more aggressively than CPU caches. Registers are per-thread and extremely fast, but the total register file size per streaming multiprocessor is fixed, so using more registers per thread reduces the number of active warps, which reduces the hardware’s ability to hide memory latency by switching between them when one stalls. Shared memory is a manually managed cache that all threads in a block can access, and careful use of it is often the difference between a kernel that saturates memory bandwidth and one that stalls on every global load.
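
The register-pressure tradeoff is easier to see as arithmetic. The sketch below estimates how many warps can be resident on one SM given per-thread register use and per-block shared memory. The function name `active_warps_per_sm` is hypothetical, and the default limits (65,536 registers, 48 KB of shared memory, 64 resident warps) are assumed round numbers rather than the specification of any particular GPU. Real hardware also applies allocation granularity rules that this model ignores.

```python
def active_warps_per_sm(regs_per_thread, threads_per_block, smem_per_block,
                        regs_per_sm=65536, smem_per_sm=49152,
                        max_warps_per_sm=64, warp_size=32):
    """Estimate resident warps on one SM under assumed resource limits."""
    # Blocks are scheduled in whole warps, so round the thread count up.
    warps_per_block = -(-threads_per_block // warp_size)
    regs_per_block = regs_per_thread * warps_per_block * warp_size

    # Each resource independently caps how many blocks fit on the SM.
    blocks_by_warps = max_warps_per_sm // warps_per_block
    blocks_by_regs = regs_per_sm // regs_per_block
    if smem_per_block:
        blocks_by_smem = smem_per_sm // smem_per_block
    else:
        blocks_by_smem = blocks_by_warps  # shared memory is not a limit

    blocks = min(blocks_by_warps, blocks_by_regs, blocks_by_smem)
    return blocks * warps_per_block


light = active_warps_per_sm(regs_per_thread=32, threads_per_block=256,
                            smem_per_block=0)
heavy = active_warps_per_sm(regs_per_thread=128, threads_per_block=256,
                            smem_per_block=0)
smem_bound = active_warps_per_sm(regs_per_thread=32, threads_per_block=256,
                                 smem_per_block=16384)
print(light, heavy, smem_bound)
```

Under these assumed limits, quadrupling register use per thread cuts the resident warps to a quarter, which leaves the scheduler far fewer warps to switch to when one stalls on memory.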

Understanding why all of this is true requires understanding the hardware that enforces it. The streaming multiprocessor (SM on NVIDIA, compute unit on AMD) is the fundamental building block: a collection of SIMD execution units, a register file, a block of banked shared memory, warp schedulers, and texture units. A full GPU is dozens to hundreds of these units connected to a shared memory subsystem. The rasterization pipeline adds another layer: vertex shaders feed into primitive assembly, rasterization converts triangles to fragments, fragment shaders run per-fragment, and the output merger blends results into the framebuffer. Each stage has its own fixed-function and programmable components, and understanding the handoff between them is not obvious from writing shaders.
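
The stage handoff in the rasterization pipeline can be made concrete with a deliberately tiny software model. Everything here is an illustrative assumption: the function names, the 8x8 framebuffer, the identity vertex shader, the constant-color fragment shader, and `max` standing in for a real blend. It runs serially on a CPU, so it shows only the order of the stages and what passes between them, not how hardware parallelizes them.

```python
def vertex_shader(vertex):
    """Programmable stage: transform one vertex (identity transform here)."""
    x, y = vertex
    return (float(x), float(y))


def edge(a, b, p):
    """Signed area test: which side of edge a->b the point p lies on."""
    return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])


def rasterize(triangle, width, height):
    """Fixed-function stage: yield the pixels whose centers the triangle covers."""
    a, b, c = triangle
    for y in range(height):
        for x in range(width):
            p = (x + 0.5, y + 0.5)  # sample at the pixel center
            w0, w1, w2 = edge(b, c, p), edge(c, a, p), edge(a, b, p)
            inside_ccw = w0 >= 0 and w1 >= 0 and w2 >= 0
            inside_cw = w0 <= 0 and w1 <= 0 and w2 <= 0
            if inside_ccw or inside_cw:
                yield (x, y)


def fragment_shader(fragment):
    """Programmable stage: compute a color for one fragment (constant here)."""
    return 1


def draw(vertices, width=8, height=8):
    framebuffer = [[0] * width for _ in range(height)]
    transformed = [vertex_shader(v) for v in vertices]  # vertex stage
    triangle = tuple(transformed[:3])                    # primitive assembly
    for x, y in rasterize(triangle, width, height):      # rasterization
        color = fragment_shader((x, y))                  # fragment stage
        # Output merger: max() is a stand-in for a real blend operation.
        framebuffer[y][x] = max(framebuffer[y][x], color)
    return framebuffer


framebuffer = draw([(0, 0), (8, 0), (0, 8)])
covered = sum(sum(row) for row in framebuffer)
print(covered)
```

Note that the two shader functions are the only parts a shader author normally writes. Primitive assembly, rasterization, and the merge step happen around them, which is why the handoff is easy to miss.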

Why Interactive Construction Works as a Teaching Format

The game format is a reasonable answer to a genuine pedagogical problem. Reading a description of a warp scheduler does not produce the same mental model as seeing how it connects to the register file, the execution units, and the memory system. The spatial relationships between components carry information that prose descriptions compress out.

This is the same insight behind Visual6502, which renders a transistor-level simulation of the 6502 in the browser, and behind the various logic gate simulators that have made combinational circuit design accessible without dedicated EDA software. The browser has become a reasonable substrate for interactive hardware education precisely because the rendering and interactivity costs are low, and the distribution is frictionless.

The timing also makes sense. GPU programming has expanded well beyond graphics. CUDA powers most of the ML training infrastructure in production today. WebGPU has brought compute shaders to the browser with a modern API. ROCm has made AMD hardware substantially more programmable. The number of developers who write GPU code has grown considerably over the last five years, and most of them learned to write shaders or CUDA kernels without developing a solid model of the hardware those programs run on. That mismatch shows up in performance: kernels that look correct are often slow because the programmer did not account for occupancy, or wrote a memory access pattern that serializes rather than coalesces, or chose a thread block size that underutilizes the SM.
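
The difference between a coalesced and a serialized access pattern can be approximated by counting how many memory segments one warp touches in a single load. This is a simplified model under stated assumptions: the helper name `transactions` is hypothetical, the 128-byte segment size is a common figure rather than a universal one, and real hardware has caches and sector-level behavior that are not modeled.

```python
def transactions(addresses, segment_bytes=128):
    """Count the distinct memory segments touched by one warp-wide load."""
    return len({address // segment_bytes for address in addresses})


WARP_SIZE = 32
ELEMENT_BYTES = 4  # one 32-bit float per lane

# Coalesced: lane i reads element i, so the warp covers 128 contiguous bytes.
coalesced = [lane * ELEMENT_BYTES for lane in range(WARP_SIZE)]

# Strided: lane i reads element 32 * i, as when walking down a column of a
# row-major matrix that is 32 floats wide.
strided = [lane * 32 * ELEMENT_BYTES for lane in range(WARP_SIZE)]

coalesced_count = transactions(coalesced)
strided_count = transactions(strided)
print(coalesced_count, strided_count)
```

Both patterns load the same 32 floats, but in this model the strided one needs 32 segment fetches where the coalesced one needs a single fetch. Both kernels would look equally correct in source form.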

The Resources That Exist and What They Leave Out

The existing GPU architecture resources are good but specialized. NVIDIA’s Ampere whitepaper is detailed and accurate, but written for an audience already fluent in GPU terminology. Programming Massively Parallel Processors by Hwu, Kirk, and Hajj covers CUDA programming and underlying architecture thoroughly, but it is a textbook with the density that implies. The CUDA C++ Programming Guide is comprehensive reference documentation. None of these are designed for someone who wants to build intuition from scratch, starting from components and working up.

There are simulation-adjacent tools. GPGPU-Sim is an academic GPU simulator used in computer architecture research, capable of running real CUDA binaries against a modeled GPU, but it is a research tool, not a learning one. MacSim is similar in spirit. These projects confirm that simulating GPU execution in software is tractable; what they lack is any pedagogical layer.

What mvidia represents is an attempt to fill the bottom of the stack: a resource that lets someone who has never thought carefully about GPU architecture start building one, make decisions about how components connect, and see how those decisions constrain what the hardware can do. The project name, a deliberate play on NVIDIA, signals the target audience clearly. Whether the game mechanics are tuned well enough to reinforce the right intuitions is something only working through it can answer. The gap it is aimed at is genuine, and the response it received on HN suggests that a meaningful number of developers who work with GPUs have run into that gap and wished something like this existed.