
Building a GPU to Learn One: The Educational Gap mvidia Fills


Someone built a browser game where you construct a GPU from its component parts, and it landed on Hacker News with nearly 400 points. The creator, Jason, put it plainly in the submission: the resources for GPU architecture were lacking, so he made one. The project is called mvidia, an obvious nod to NVIDIA, and the name tells you most of what you need to know about the tone.

The gap he’s pointing at is real. GPU architecture sits in a strange position in the educational landscape: extremely relevant now that nearly every machine learning pipeline runs on one, and yet poorly served by anything that prioritizes learning over reference material.

What Exists and What It Misses

The standard texts for GPU programming are technically thorough but pedagogically awkward. Kirk and Hwu’s Programming Massively Parallel Processors is the canonical university textbook; it covers warps, occupancy, memory coalescing, and tiling with real rigor. NVIDIA’s own CUDA Programming Guide and architecture whitepapers for Volta, Ampere, and Hopper are accurate and detailed. But all of them assume you are already motivated, and none gives you the sense of building something up from scratch.

The research simulators fare worse. GPGPU-Sim, developed at UBC, is a cycle-accurate simulator of NVIDIA-style GPUs used extensively in academic work. It is not something you install in an afternoon. Accel-Sim, from Purdue, extends the approach with trace-driven simulation, and gem5 has a GPU model. All of these are research infrastructure, not learning tools.

The most interesting prior art for learners is probably GPU Puzzles by Sasha Rush, released in 2023. It uses Python and Numba CUDA with visual output to teach parallelism concepts through a sequence of exercises: mapping, zipping, broadcasting, shared memory, tiling, matrix multiply. It accumulated thousands of GitHub stars quickly, which suggests people were hungry for something structured. The limitation is environmental: you need a Python setup, a working Numba installation, and ideally an actual GPU to run it meaningfully.

On the games side, Turing Complete does something similar for CPU architecture. You build a computer from logic gates up through an ALU, register file, and instruction set decoder. The game has accumulated over 10,000 positive Steam reviews. It is the closest thing to a playable nand2tetris, and it works because the medium fits the subject: the incremental feedback loop of a game maps well onto the incremental construction of a processor.

The GPU equivalent has been missing. mvidia is a browser-native attempt at filling it.

Why GPU Architecture Is Harder to Teach

The CPU model has a natural narrative arc. You start with gates, build an adder, build an ALU, attach registers, connect a program counter, and eventually you have something that can execute instructions. The abstraction stack from transistors to programs is steep but linear. nand2tetris formalizes this arc as a course, and it works because each layer has a clear purpose before the next one is introduced.

GPU architecture resists that linearity. The interesting properties of a GPU are not about sequence; they are about parallelism at every level simultaneously, and the constraints that parallelism imposes on memory access, branching, and scheduling. A Streaming Multiprocessor on NVIDIA’s A100, the data-center Ampere part, contains 64 FP32 CUDA cores, 4 warp schedulers, a 256 KB register file, and 192 KB of combined L1 cache and shared memory, of which up to 164 KB can be configured as shared memory. The A100 has 108 such SMs, giving 6,912 CUDA cores in total, backed by 40 MB of L2 cache and, in the 80 GB model, HBM2e running at roughly 2 TB/s of bandwidth.
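The totals follow directly from the per-SM figures. A quick check in Python, using only the numbers quoted above plus the assumption that each register is 32 bits wide:

```python
# Back-of-the-envelope totals for an A100, from the per-SM figures in the text.
SM_COUNT = 108
FP32_CORES_PER_SM = 64
REGISTER_FILE_KB_PER_SM = 256
BYTES_PER_REGISTER = 4  # assumption: registers are 32 bits wide

total_cores = SM_COUNT * FP32_CORES_PER_SM
registers_per_sm = REGISTER_FILE_KB_PER_SM * 1024 // BYTES_PER_REGISTER
total_register_file_mb = SM_COUNT * REGISTER_FILE_KB_PER_SM / 1024

print(total_cores)             # 6912
print(registers_per_sm)        # 65536
print(total_register_file_mb)  # 27.0
```

The last figure is worth a second look: summed across the chip, the register files hold 27 MB, which is in the same range as the 40 MB L2 cache.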

Those numbers mean nothing without understanding what a warp is, why there are 32 threads in one, and how the scheduler uses them to hide latency. When a warp stalls waiting on a memory load, the scheduler does not spin; it issues instructions from a different resident warp instead. An SM on Ampere can hold up to 64 warps simultaneously, and maintaining enough resident warps to keep the execution units busy is the central engineering problem of GPU programming. Occupancy is the ratio of active warps to the maximum the SM can hold, and it is limited by how many registers each thread uses and how much shared memory each block requests. Push those numbers up and fewer warps fit.
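The trade-off can be sketched as a small calculator. This is a simplified model, not NVIDIA’s occupancy calculator: the function name `estimate_occupancy` is made up for illustration, the 164 KB shared memory limit is an assumed A100-style carve-out, and real hardware also applies allocation granularity and a cap on blocks per SM, both ignored here.

```python
# Simplified occupancy estimate for one SM, using A100-style limits.
WARP_SIZE = 32
MAX_WARPS_PER_SM = 64
REGISTERS_PER_SM = 65536        # 256 KB of 32-bit registers
SHARED_MEM_PER_SM = 164 * 1024  # assumed shared memory carve-out, in bytes


def estimate_occupancy(threads_per_block, regs_per_thread, smem_per_block):
    """Return (resident_warps, occupancy) under the simplified model."""
    warps_per_block = -(-threads_per_block // WARP_SIZE)  # ceiling division

    # How many blocks fit under each resource limit.
    by_warps = MAX_WARPS_PER_SM // warps_per_block
    by_regs = REGISTERS_PER_SM // (regs_per_thread * warps_per_block * WARP_SIZE)
    by_smem = SHARED_MEM_PER_SM // smem_per_block if smem_per_block else by_warps

    blocks = min(by_warps, by_regs, by_smem)
    resident_warps = blocks * warps_per_block
    return resident_warps, resident_warps / MAX_WARPS_PER_SM


print(estimate_occupancy(256, 32, 0))          # (64, 1.0)
print(estimate_occupancy(256, 64, 0))          # (32, 0.5)
print(estimate_occupancy(256, 32, 48 * 1024))  # (24, 0.375)
```

Doubling the registers per thread from 32 to 64 halves the resident warps, and requesting 48 KB of shared memory per block cuts them to 24 out of 64 under this model.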

This is the kind of system where the pieces only make sense together, and no sequence of static documentation makes the dependencies obvious. You need to manipulate them.

The Memory Hierarchy Problem

For any game or simulator teaching GPU architecture to be worth the time, it has to make the memory hierarchy concrete. This is where most programmers lose time on GPUs, and it is the piece most documentation treats as an afterthought.

The latency numbers span nearly three orders of magnitude. Registers are available in a single cycle. L1 cache and shared memory sit at roughly 20 to 30 cycles. L2 cache adds another order of magnitude, landing in the low hundreds of cycles. Global DRAM, even fast HBM2e, runs at around 600 cycles of latency. The bandwidth numbers are similarly dramatic: the aggregate shared memory bandwidth across all SMs of an A100 dwarfs the L2 bandwidth, which in turn dwarfs what you get from DRAM.

Memory coalescing makes this concrete. When 32 threads in a warp each access a 4-byte float at addresses that are contiguous in memory, the hardware coalesces those accesses into a single 128-byte transaction. If the threads access scattered or strided addresses instead, that one transaction becomes many, and most of the bytes fetched go unused. Bank conflicts in shared memory introduce a different failure mode: shared memory is divided into 32 banks, and if multiple threads in a warp access different addresses in the same bank simultaneously, those accesses serialize. Threads that read the same address are served by a single broadcast and do not conflict.
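Both effects can be counted with a few lines of Python. This is a rough model under stated assumptions: the helper names `global_transactions` and `bank_conflict_degree` are hypothetical, global memory is treated as fixed 128-byte segments, and shared memory as 32 banks of 4-byte words with same-word reads merged into one broadcast.

```python
WARP_SIZE = 32
SEGMENT_BYTES = 128  # transaction size used in the paragraph above
NUM_BANKS = 32
BANK_WIDTH = 4       # assumed bytes per bank word


def global_transactions(byte_addresses):
    """Count the distinct 128-byte segments touched by one warp's loads."""
    return len({addr // SEGMENT_BYTES for addr in byte_addresses})


def bank_conflict_degree(byte_addresses):
    """Largest number of distinct words requested from any single bank.

    1 means conflict-free; N means the access is serialized into N passes.
    Threads reading the same word share one broadcast, so words are deduplicated.
    """
    words_per_bank = {}
    for addr in byte_addresses:
        word = addr // BANK_WIDTH
        words_per_bank.setdefault(word % NUM_BANKS, set()).add(word)
    return max(len(words) for words in words_per_bank.values())


contiguous = [4 * t for t in range(WARP_SIZE)]    # thread t reads float t
strided = [4 * 32 * t for t in range(WARP_SIZE)]  # thread t reads float 32*t

print(global_transactions(contiguous))   # 1
print(global_transactions(strided))      # 32
print(bank_conflict_degree(contiguous))  # 1
print(bank_conflict_degree(strided))     # 32
```

The same stride of 32 floats is the worst case for both: as a global access it touches 32 separate segments, and as a shared memory access it lands every thread in bank 0.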

A game mechanic built around coalescing could be genuinely effective. You are arranging threads and data layouts in a visual environment, and you can watch the number of memory transactions change in response. That kind of immediate feedback is exactly what documentation cannot give.

Warp Divergence as a Teachable Moment

Warp divergence is the other concept that clicks better through play than through description. In SIMT execution, all 32 threads in a warp run the same instruction at the same program counter. When a branch splits the warp, with some threads taking the true path and some taking the false path, the hardware serializes them: first all threads where the condition is true execute with the others masked off, then the reverse. Both paths run, so the section costs roughly the sum of the two paths; when they are of equal length, execution time doubles.
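The serialization rule is simple enough to model directly. The sketch below is a toy cost model, not a description of any real scheduler: `divergent_cost` is a hypothetical helper, and the branch costs are made-up instruction counts.

```python
WARP_SIZE = 32


def divergent_cost(predicate, true_cost, false_cost):
    """Instructions a warp issues for an if/else under simple serialization.

    predicate: one boolean per thread in the warp.
    true_cost / false_cost: assumed instruction counts of each branch body.
    """
    cost = 0
    if any(predicate):      # at least one thread takes the true path
        cost += true_cost
    if not all(predicate):  # at least one thread takes the false path
        cost += false_cost
    return cost


uniform = [True] * WARP_SIZE                    # every thread agrees
split = [t % 2 == 0 for t in range(WARP_SIZE)]  # even and odd threads disagree

print(divergent_cost(uniform, 10, 10))  # 10
print(divergent_cost(split, 10, 10))    # 20
```

The cost depends only on whether the warp disagrees, not on how many threads take each side. That is why branching on something that is uniform across a warp, such as the warp index, avoids the penalty in this model while branching on the thread index does not.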

NVIDIA introduced Independent Thread Scheduling in the Volta architecture, giving each thread its own program counter and enabling more flexible reconvergence. But the fundamental cost of divergence remains, and writing branch-heavy code without understanding warps produces kernels that perform far below what the hardware is capable of.

A game that puts you in the position of a warp scheduler, where you watch threads diverge and need to restructure your code to minimize it, would encode this intuition in a way that no amount of reading achieves.

Why This Matters Now

GPU architecture used to be something graphics engineers cared about. That changed when CUDA became the substrate for deep learning, and it changed again when WebGPU shipped in browsers, giving web developers access to GPU compute for the first time without a plugin. The community of people who need to understand what happens below the shader is much larger than it was a decade ago.

The demand shows up in the reception of projects like GPU Puzzles and in the HN points for mvidia. People want a path from zero to functioning mental model that does not require reading a 500-page textbook or setting up a research simulator. The game format is a reasonable bet for making that path accessible; the precedent from Turing Complete and nand2tetris supports it.

What Jason has built at jaso1024.com/mvidia is an early experiment in that direction. Whether the mechanics successfully encode the concepts that actually matter, warps and scheduling and memory coalescing and divergence, is something you find out by playing it. The fact that it exists at all, that someone noticed the gap and built something rather than waiting for NVIDIA to write better documentation, is the more interesting signal.
