Learning GPU Architecture by Building One: Why This Approach Works

GPU architecture sits in an awkward educational position. There is plenty of material at the extremes: you can read GPU vendor whitepapers full of marketing diagrams, or you can read Hennessy and Patterson cover-to-cover for rigorous computer architecture fundamentals. What’s harder to find is a middle path that makes the hardware model concrete and interactive without requiring access to real silicon or a PhD-level course. A new browser game called mvidia is trying to fill that gap by putting the player in the role of GPU designer, and the 788-point reception on Hacker News suggests the gap is real.

The Actual Problem with GPU Architecture Learning

Most GPU education resources are really CUDA education resources in disguise. You learn about grids, blocks, and threads; you learn to write kernels; you learn that warps are 32 threads that execute in lockstep. But the hardware underneath those abstractions stays opaque. Why 32 threads per warp? Why does a bank conflict in shared memory cost what it costs? Why does occupancy matter, and what is actually competing for what?

NVIDIA’s own documentation is thorough but structured around the programming model, not the hardware model. The CUDA C Programming Guide explains what happens at each abstraction layer without showing you how the layers connect to physical execution units. AMD’s RDNA architecture whitepapers are more candid about the hardware, but they assume you already know what a Compute Unit is and why it’s organized the way it is.

Academic simulators exist. GPGPU-Sim can simulate full GPU workloads at a microarchitectural level, but its entry cost is high: you need to compile it, configure it for the architecture you’re studying, and interpret its output without much hand-holding. It’s a research tool, not a learning tool.

What Building Actually Teaches

The mvidia game follows an approach that has proven effective in other hardware education contexts: make the learner assemble the system from components and observe the consequences of their design choices.

This is the same intuition behind Nand2Tetris, the course and book by Noam Nisan and Shimon Schocken that walks you from basic logic gates up through a complete computer. The course works because each layer of abstraction is built by the student rather than handed down as a given. The moment you implement a multiplexer yourself, you stop thinking of it as a black box. Ben Eater’s breadboard computer series on YouTube does something similar in physical hardware, and its popularity is a reasonable proxy for how many programmers want exactly this kind of bottom-up understanding.

GPUs introduce challenges that CPUs don’t, which makes the build-it-yourself approach even more valuable. The central design idea in GPU architecture is to hide latency with parallelism rather than to reduce it: while one warp waits on a memory request, the scheduler switches to another warp that is ready to run. Understanding this at a visceral level requires understanding what a warp scheduler actually does, how many in-flight warps are needed to saturate the execution units, and how occupancy limits that number.
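The "how many in-flight warps" question can be sketched with a back-of-the-envelope model. The version below assumes a deliberately simplified scheduler: each warp issues a fixed number of cycles of work, then stalls for a fixed memory latency. The function names and the example numbers are illustrative assumptions, not measurements from any real GPU or from the game.

```python
import math


def warps_needed_to_hide_latency(memory_latency_cycles, compute_cycles_per_request):
    """Estimate the resident warps one scheduler needs to stay busy.

    Simplified model (an assumption, not vendor data): every warp does
    `compute_cycles_per_request` cycles of work, then waits
    `memory_latency_cycles` for memory. Other warps must fill that wait.
    """
    if compute_cycles_per_request <= 0:
        raise ValueError("compute_cycles_per_request must be positive")
    # One warp covers its own compute time; the others cover the stall.
    return 1 + math.ceil(memory_latency_cycles / compute_cycles_per_request)


def scheduler_utilization(resident_warps, memory_latency_cycles, compute_cycles_per_request):
    """Fraction of cycles in which the scheduler has a warp ready to issue."""
    period = memory_latency_cycles + compute_cycles_per_request
    busy = resident_warps * compute_cycles_per_request
    return min(1.0, busy / period)


if __name__ == "__main__":
    # Hypothetical numbers: 400-cycle memory latency, 20 cycles of work per request.
    print(warps_needed_to_hide_latency(400, 20))
    print(scheduler_utilization(8, 400, 20))
```

In this model, fewer resident warps than the estimate means the scheduler sits idle for part of every memory round trip, which is why the number of warps an SM can hold matters so much.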

Occupancy, specifically, is one of those concepts that makes perfect sense once you see the hardware it maps to. Each Streaming Multiprocessor on an NVIDIA GPU has a fixed register file and a fixed amount of shared memory. Kernels that use more registers per thread can run fewer concurrent warps. Kernels that use more shared memory per block leave less for other blocks. The hardware is a set of physical resources being divided among concurrent workloads, and a tool that makes those physical resources visible and configurable teaches the concept in a way that documentation simply cannot.
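The resource division described above can be sketched as a small calculator. The per-SM limits used as defaults (65,536 registers, 100 KB of shared memory, 48 resident warps, 16 resident blocks) are assumptions chosen to resemble a recent NVIDIA consumer part. The sketch also ignores allocation granularity, so treat its output as an approximation rather than what a vendor occupancy tool would report.

```python
def occupancy(threads_per_block, regs_per_thread, smem_per_block_bytes,
              regs_per_sm=65536, smem_per_sm_bytes=100 * 1024,
              max_warps_per_sm=48, max_blocks_per_sm=16, warp_size=32):
    """Return (occupancy, limiting_resource) for one kernel configuration.

    All per-SM defaults are illustrative assumptions, not queried from hardware.
    """
    warps_per_block = -(-threads_per_block // warp_size)  # ceiling division
    regs_per_block = regs_per_thread * warp_size * warps_per_block

    # How many blocks each resource allows to be resident at once.
    limits = {
        "registers": regs_per_sm // regs_per_block if regs_per_block else max_blocks_per_sm,
        "shared_memory": (smem_per_sm_bytes // smem_per_block_bytes
                          if smem_per_block_bytes else max_blocks_per_sm),
        "warp_slots": max_warps_per_sm // warps_per_block,
        "block_slots": max_blocks_per_sm,
    }
    limiter = min(limits, key=limits.get)
    resident_warps = limits[limiter] * warps_per_block
    return resident_warps / max_warps_per_sm, limiter


if __name__ == "__main__":
    print(occupancy(256, 32, 0))          # light kernel
    print(occupancy(256, 128, 0))         # register-heavy kernel
    print(occupancy(256, 32, 48 * 1024))  # shared-memory-heavy kernel
```

Raising registers per thread or shared memory per block changes which resource runs out first, and that is the same tradeoff a build-it-yourself tool exposes when it lets you resize the register file.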

The Architecture Itself

For readers who haven’t spent time with GPU internals, a brief sketch of what a modern GPU actually contains helps contextualize what a game like this might simulate.

NVIDIA’s Ampere architecture organizes compute into Streaming Multiprocessors (SMs). Each SM in the GA102 die, for instance, contains four processing blocks, each with 32 FP32 CUDA cores, one Tensor Core, one warp scheduler, and one dispatch unit. Shared memory and the L1 cache share a 128 KB pool per SM that the programmer can configure. A full GA102 has 84 SMs, giving 84 × 4 × 32 = 10,752 CUDA cores.
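The totals follow directly from the per-SM figures quoted above. A minimal sketch of that arithmetic, with the constants copied from the paragraph and the function name chosen only for illustration:

```python
# Figures quoted in the text for a full GA102 die.
SMS_PER_GPU = 84
PROCESSING_BLOCKS_PER_SM = 4
FP32_CORES_PER_BLOCK = 32
L1_SHARED_KB_PER_SM = 128


def gpu_totals(sms=SMS_PER_GPU, blocks_per_sm=PROCESSING_BLOCKS_PER_SM,
               cores_per_block=FP32_CORES_PER_BLOCK,
               l1_shared_kb_per_sm=L1_SHARED_KB_PER_SM):
    """Multiply per-block and per-SM counts up to whole-chip totals."""
    cores_per_sm = blocks_per_sm * cores_per_block
    return {
        "cores_per_sm": cores_per_sm,
        "total_cores": sms * cores_per_sm,
        # One warp scheduler per processing block.
        "warp_schedulers": sms * blocks_per_sm,
        "total_l1_shared_kb": sms * l1_shared_kb_per_sm,
    }


if __name__ == "__main__":
    print(gpu_totals())
```

Changing any one argument shows how quickly a per-block design decision multiplies across the chip.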

The memory hierarchy adds another layer of complexity. L1 is per-SM and fast. L2 is shared across all SMs and slower. VRAM (GDDR6X on high-end consumer parts) has high bandwidth but significant latency, which is where the warp-switching strategy pays off. Getting data movement right at each level is the central challenge of GPU optimization.
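One common way to reason about whether data movement or arithmetic limits a kernel is the roofline model, which the discussion above does not name but which captures the same tradeoff. The sketch below assumes hypothetical round numbers for peak throughput and VRAM bandwidth; they are placeholders, not specifications of any particular GPU.

```python
def attainable_flops(arithmetic_intensity, peak_flops, memory_bandwidth):
    """Roofline estimate: performance is capped by compute or by memory traffic.

    arithmetic_intensity: FLOPs performed per byte moved to or from memory.
    peak_flops:           peak arithmetic throughput in FLOP/s.
    memory_bandwidth:     sustained memory bandwidth in bytes/s.
    """
    return min(peak_flops, arithmetic_intensity * memory_bandwidth)


def ridge_point(peak_flops, memory_bandwidth):
    """Intensity (FLOP/byte) above which a kernel becomes compute-bound."""
    return peak_flops / memory_bandwidth


if __name__ == "__main__":
    # Hypothetical hardware: 30 TFLOP/s FP32 peak, 900 GB/s VRAM bandwidth.
    PEAK = 30e12
    BANDWIDTH = 900e9

    # FP32 vector add: 1 FLOP per element, 12 bytes moved (two loads, one store).
    vector_add_intensity = 1 / 12
    print(ridge_point(PEAK, BANDWIDTH))
    print(attainable_flops(vector_add_intensity, PEAK, BANDWIDTH))
```

Under these assumed numbers, a streaming kernel like vector add reaches only a small fraction of peak arithmetic throughput, which is why keeping data in faster levels of the hierarchy dominates GPU optimization.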

A rendering-focused GPU simulation would also need to model the fixed-function units that sit alongside the shader cores: the rasterizer that converts triangle geometry to fragments, the ROP (Render Output Unit) that handles blending and depth testing, and the texture units that sample from texture memory with filtering. These units are why “compute” and “graphics” workloads have different bottlenecks on the same hardware.

Teaching all of this through building means the learner has to make decisions: how many SMs to include, how to size the register file, how to balance shared memory against cache, where to place texture units relative to shader clusters. Each decision creates a simulated consequence, which is a much tighter feedback loop than reading about the same tradeoffs in a whitepaper.

Prior Art in Interactive Hardware Education

The visual simulation space for processors has some history worth knowing. The Visual6502 project built a transistor-level simulation of the MOS 6502 processor from reverse-engineered die photographs. You can step through instructions and watch individual transistors change state. It is not a learning tool in the pedagogical sense, but it demonstrates that there is real interest in hardware that you can watch work.

Nand2Tetris is the strongest precedent for the approach mvidia is taking. The course has a large following, a Coursera presence, and a companion book. Its success argues that programmers are willing to invest significant time in bottom-up hardware understanding when the material is well-structured and the feedback loop is tight.

For GPU-specific education, Simon’s GPU Talk from Strange Loop and similar conference presentations have reached large audiences by doing the work of explaining hardware from first principles for an audience of software engineers. The demand clearly exists; the supply of good introductory material has not kept up.

What Makes This Timing Interesting

The context around GPU architecture literacy is different now than it was five years ago. GPU programming used to be primarily a concern for graphics programmers and scientific computing practitioners. The explosion of ML training and inference workloads has put it in the path of a much larger population of developers. People who would previously have been satisfied with a high-level PyTorch understanding now have reasons to care about tensor core utilization, memory bandwidth, kernel fusion, and flash attention’s memory access patterns.

Flash attention, to take one concrete example, is fundamentally a memory hierarchy optimization. The algorithm restructures the attention computation into tiles that fit in on-chip shared memory, so the full attention matrix is never written out to VRAM and read back. You can read the paper and follow the math, but the design insight is much more legible once you have a mental model of what shared memory is, how large it is, why it’s fast, and what it costs to use it. A game that builds that mental model from scratch makes subsequent algorithmic papers more accessible.
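The tiling idea can be illustrated on the CPU with a running ("online") softmax over blocks of keys. This is a sketch of the principle only, not the paper’s kernel: it handles a single query row, omits the 1/sqrt(d) scaling, masking, and the backward pass, and `block_size` is a stand-in for however many keys would fit in shared memory. All function names are illustrative.

```python
import math


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def naive_attention_row(q, keys, values):
    """Reference: materialize every score, then apply softmax and weight values."""
    scores = [_dot(q, k) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]


def tiled_attention_row(q, keys, values, block_size):
    """Same result, but only one block of scores exists at any time."""
    dim = len(values[0])
    running_max = float("-inf")
    running_sum = 0.0
    acc = [0.0] * dim
    for start in range(0, len(keys), block_size):
        k_block = keys[start:start + block_size]
        v_block = values[start:start + block_size]
        scores = [_dot(q, k) for k in k_block]
        new_max = max(running_max, max(scores))
        # Rescale earlier partial results to the new maximum (0.0 on the first block).
        scale = math.exp(running_max - new_max)
        weights = [math.exp(s - new_max) for s in scores]
        running_sum = running_sum * scale + sum(weights)
        for d in range(dim):
            acc[d] = acc[d] * scale + sum(w * v[d] for w, v in zip(weights, v_block))
        running_max = new_max
    return [a / running_sum for a in acc]


if __name__ == "__main__":
    q = [1.0, 0.5]
    keys = [[0.2, 0.1], [1.5, -0.3], [0.0, 2.0], [-1.0, 0.4], [0.7, 0.7]]
    values = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [-1.0, 3.0], [0.5, 0.5]]
    print(naive_attention_row(q, keys, values))
    print(tiled_attention_row(q, keys, values, block_size=2))
```

The tiled version never holds more than `block_size` scores at once, which is the property that lets a real kernel keep its working set in shared memory.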

This is the downstream value of tools like mvidia. They are not replacements for CUDA programming experience or for reading the architecture guides, but they can build the conceptual scaffolding that makes everything else easier to absorb. The learner who has configured a warp scheduler, observed the tradeoff between register file size and occupancy, and watched a simulated kernel stall on memory latency will read NVIDIA’s architecture documentation differently.

What to Watch For

The game is early, as the creator’s framing makes clear. “Thought the resources for GPU arch were lacking, so here we are” is a dev scratching their own itch and shipping it. Whether it develops into something with the pedagogical depth of Nand2Tetris or stays a lighter demonstration depends on where the author takes it.

The two things that matter most are simulation fidelity (do the tradeoffs in the game correspond to real tradeoffs in real hardware?) and feedback clarity (does the player understand why a configuration choice produced the result it did?). Getting both right is genuinely hard. Hardware simulation that’s too faithful becomes GPGPU-Sim; simulation that’s too abstract teaches intuitions that don’t transfer. The middle path requires care.

For now, it is worth playing through and watching the HN thread develop. The comments on posts like this often surface related tools, course recommendations, and architecture clarifications that are hard to find through search. The fact that it hit 788 points with 161 comments tells you something about the audience size for this kind of material: it is larger than you might expect, and it is not getting smaller.