
GPU Architecture Finally Has a Toy Model Worth Playing

Source: hackernews

GPU architecture education has always had a strange asymmetry with CPU education. You can learn how a CPU works from first principles using a dozen well-worn paths: NAND to Tetris, Ben Eater’s breadboard 8-bit computer, NandGame, Visual6502. Each of these gives you a toy model, something small enough to hold in your head, interactive enough to poke at, and faithful enough to the real thing that the intuitions transfer. GPU education has had almost nothing comparable. The documentation exists, but it assumes you are already a GPU engineer or a graduate student working through a parallel computing course.

A project that appeared on Hacker News recently is trying to fix this. A developer named Jason built Mvidia, a browser-based game where you build and run a GPU. The name is a deliberate near-homophone of Nvidia. The premise is simple: the existing resources for GPU architecture education are sparse, so here is something interactive. The post scored 788 points and drew 161 comments, which suggests that the frustration with existing resources is widely shared.

Why GPU Architecture Is Hard to Learn

The difficulty is not that GPUs are especially complex compared to CPUs. It is that GPU execution models are deeply counterintuitive without a concrete mental model of what the hardware is doing. CPU concepts like pipelining, branch prediction, and cache hierarchies have been explained through interactive simulators, games, and hardware kits for decades. GPU concepts like warp divergence, memory coalescing, and occupancy remain largely confined to dense documentation.

The fundamental execution model of a modern GPU is SIMT, Single Instruction, Multiple Threads. On Nvidia hardware, 32 threads form a warp, and those 32 threads execute the same instruction in lockstep on a Streaming Multiprocessor (SM). This sounds like standard SIMD, and at the hardware level it largely is, but SIMT differs in the programming model: each thread has its own registers and is written as if it were an independent scalar thread. (A warp historically shared one program counter; since Volta, Nvidia hardware also tracks a program counter per thread.) When threads diverge by taking different branches of an if/else block, the warp serializes: the hardware executes both paths in turn, masking off the inactive threads on each pass. This serialization is the source of a class of GPU performance problems that are invisible without a concrete execution model to reason about.
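To make the masking concrete, here is a minimal sketch in plain Python of one warp hitting an if/else. It is a toy model, not how any real scheduler is implemented: the names `run_branch` and `lane_utilization` are made up for illustration, and it assumes both paths cost one pass each.

```python
WARP_SIZE = 32  # warp width on Nvidia hardware, as described above


def run_branch(condition, warp_size=WARP_SIZE):
    """Simulate one if/else executed by a single warp (toy model).

    `condition` maps a thread index to True (take the `if` path) or
    False (take the `else` path). Returns the passes the warp makes,
    each as (path_name, number_of_active_threads).
    """
    # One predicate bit per thread: the active mask for the `if` side.
    if_mask = [bool(condition(tid)) for tid in range(warp_size)]
    else_mask = [not taken for taken in if_mask]

    passes = []
    # A path is issued only if at least one thread needs it; threads
    # whose mask bit is off sit idle for that pass.
    if any(if_mask):
        passes.append(("if", sum(if_mask)))
    if any(else_mask):
        passes.append(("else", sum(else_mask)))
    return passes


def lane_utilization(passes, warp_size=WARP_SIZE):
    """Fraction of issue slots that did useful work across all passes."""
    useful = sum(active for _, active in passes)
    return useful / (len(passes) * warp_size)


uniform = run_branch(lambda tid: True)            # every thread agrees
divergent = run_branch(lambda tid: tid % 2 == 0)  # even and odd threads split

print(uniform, lane_utilization(uniform))      # [('if', 32)] 1.0
print(divergent, lane_utilization(divergent))  # [('if', 16), ('else', 16)] 0.5
```

The divergent case does the same amount of useful work in twice as many passes, which is the throughput drop the paragraph above describes.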

The memory hierarchy compounds the difficulty. GPU registers are fast but scarce; spilling to local memory, which is backed by VRAM, is expensive. Shared memory (called LDS on AMD hardware) is a programmer-managed scratchpad within an SM or Compute Unit. It is fast but small, ranging from about 48 KB to a couple of hundred KB depending on the generation. Global memory access latency on modern hardware runs to several hundred cycles, commonly quoted as 400 to 700, which is orders of magnitude worse than register access. The hardware hides this latency by keeping many warps in flight simultaneously and switching between them when one stalls on a memory operation. The ratio of active warps to the maximum an SM can hold is called occupancy, and tuning it is one of the core skills in GPU optimization.
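The occupancy definition is easy to turn into arithmetic. The sketch below is a simplification under stated assumptions: the per-SM limits are hypothetical round numbers chosen for illustration, the function name `occupancy` is made up, and real hardware also applies allocation granularities and a cap on resident blocks that are ignored here.

```python
WARP_SIZE = 32

# Hypothetical per-SM limits, chosen only for illustration. Real values
# vary by architecture; check the vendor documentation for a given chip.
MAX_WARPS_PER_SM = 64
REGISTERS_PER_SM = 65536
SHARED_MEM_PER_SM = 96 * 1024  # bytes


def occupancy(threads_per_block, regs_per_thread, shared_per_block):
    """Return (active_warps, occupancy) for one SM under the limits above."""
    warps_per_block = -(-threads_per_block // WARP_SIZE)  # ceiling division

    # Each resource independently caps how many blocks can be resident.
    by_warps = MAX_WARPS_PER_SM // warps_per_block
    by_regs = REGISTERS_PER_SM // (regs_per_thread * warps_per_block * WARP_SIZE)
    by_shared = (SHARED_MEM_PER_SM // shared_per_block
                 if shared_per_block else by_warps)

    # The scarcest resource decides how many blocks actually fit.
    blocks = min(by_warps, by_regs, by_shared)
    active_warps = blocks * warps_per_block
    return active_warps, active_warps / MAX_WARPS_PER_SM


print(occupancy(256, 32, 0))          # (64, 1.0)  nothing is limiting
print(occupancy(256, 64, 0))          # (32, 0.5)  registers are limiting
print(occupancy(256, 32, 48 * 1024))  # (16, 0.25) shared memory is limiting
```

Doubling the registers per thread halves the number of warps available to hide memory latency, which is why register pressure and shared-memory use both show up as occupancy problems.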

None of this is secret knowledge. CMU’s 15-418/618 course covers it thoroughly and the lecture materials are freely available online. But lecture slides are passive. Reading that warp divergence causes serialization is different from watching it happen in a simulation and seeing throughput collapse in real time.

The Lineage of Interactive Architecture Education

The most successful projects in this space share a design pattern: they give you a minimal but accurate model of the hardware, let you build it piece by piece, and make the consequences of each design decision immediately visible. NAND to Tetris, the course by Shimon Schocken and Noam Nisan, walks you from boolean gates to a working computer in 12 chapters with a hardware simulator. The browser-based NandGame covers similar ground with less prerequisite knowledge. Visual6502 takes a different approach, simulating the MOS 6502 at the transistor level in the browser and letting you watch the actual die layout animate.

Sasha Rush’s GPU Puzzles is the closest prior art to what Mvidia is attempting. It presents a series of Python and Numba puzzles requiring you to write correct parallel code: start with a simple map, work up to matrix multiplication, use shared memory for tiling. The progression is well-designed and genuinely instructive. But GPU Puzzles requires writing code, which means it also requires knowing Python and Numba, and it never visualizes the hardware itself. You learn the programming model without seeing the scheduler.

Mvidia takes the hardware visualization approach. By framing it as a game, it lowers the entry barrier: you do not need to know any GPU programming language to engage with the execution model. This mirrors what NandGame did relative to NAND to Tetris, covering the same concepts with less prerequisite knowledge, immediately playable in a browser.

What the Game Format Gets Right

Games impose constraints that documentation does not. To make a level completable, the designer must decide which concepts are essential and which are simplifications. Every simplification is a pedagogical choice. The decision to model warps as groups of 32 threads rather than some arbitrary number encodes a real fact about Nvidia hardware. The decision to show memory stalls visually, rather than reporting them as a percentage after the fact, forces the player to confront the latency problem as it unfolds rather than in a post-mortem.

The game format also creates failure states, which documentation cannot offer. When your warp diverges and throughput drops, you experience the cost of divergence rather than reading about it. This is the same reason Ben Eater’s breadboard computer has been so effective: when a wire is wrong, nothing works, and you have to figure out why. Passive resources tell you what goes wrong; interactive ones make you feel why it matters.

There is also a compounding effect worth noting. GPU architecture is becoming relevant to a much wider audience than GPU engineers. The explosion of ML workloads has put GPU programming on the radar of developers who would never have touched it five years ago. Understanding what CUDA kernels are doing, why a matrix multiply benefits from tiling, why memory layout matters, why batch size interacts with occupancy, is increasingly practical knowledge rather than specialized trivia. Educational tools that reach this wider audience while the concepts are still unfamiliar do more work than documentation aimed at engineers who are already professionals.
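The tiling claim can be checked with back-of-the-envelope counting. The sketch below counts global-memory loads for an N×N matrix multiply under a deliberately simple model: no hardware caches, square tiles that divide N evenly, and each tile staged once in shared memory per step. The function names are illustrative, not taken from any library.

```python
def naive_loads(n):
    """Loads when each of the n*n outputs reads a full row of A and
    a full column of B straight from global memory (2n loads each)."""
    return 2 * n * n * n


def tiled_loads(n, tile):
    """Loads when each thread block stages tile x tile sub-matrices of
    A and B in shared memory and reuses them for every output in the tile."""
    if n % tile:
        raise ValueError("this sketch assumes tile divides n evenly")
    blocks = (n // tile) ** 2         # one thread block per output tile
    steps = n // tile                 # tiles walked along the shared dimension
    loads_per_step = 2 * tile * tile  # one tile of A plus one tile of B
    return blocks * steps * loads_per_step


n, tile = 1024, 32
print(naive_loads(n))                          # 2147483648
print(tiled_loads(n, tile))                    # 67108864
print(naive_loads(n) // tiled_loads(n, tile))  # 32
```

In this model the reduction in global traffic equals the tile width, which is why tile size is bounded in practice by how much shared memory a block can claim.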

What Would Make It More Complete

Mvidia makes a good start. The natural extension is coverage of the full graphics pipeline, including vertex shading, rasterization, and the render output units (ROPs), alongside the compute model. Apple’s GPUs, including those in the M-series chips, use tile-based deferred rendering, a design with roots in mobile hardware that defers pixel shading until a full tile’s geometry is known. This substantially reduces memory bandwidth and differs structurally from the immediate-mode rendering traditionally associated with Nvidia’s desktop GPUs. A thorough GPU simulator would cover both rendering models.

Ray tracing, tensor cores, and mesh shaders are further complications, though they are probably scope creep for a first version. The scope as it stands, modeling the core parallel execution model that underlies everything else, is the right place to start. Teaching someone what a warp is and why it diverges will serve them better than teaching them about RT cores they will not touch for years.

The CMU lecture slides exist. The CUDA Programming Guide exists. Nvidia’s architecture whitepapers for Ampere, Ada, and Hopper are publicly available and detailed. What was missing was something you could open in a browser and break, something where the cost of a bad decision shows up immediately rather than buried in profiler output three hours later. Mvidia is that thing, and the reception it received on HN suggests that a lot of developers have been waiting for it.
