GPU Architecture Has Always Been Hard to Teach. A Browser Game Is Changing That.

Source: hackernews

There is a well-worn path for learning CPU architecture. You read Patterson and Hennessy, you work through nand2tetris, you maybe implement a RISC-V core in a hardware description language. By the end, you have a working mental model of the fetch-decode-execute cycle, register files, pipeline hazards, branch prediction, and cache hierarchies. The GPU has no equivalent path. It has CUDA documentation, a handful of vendor architecture whitepapers that read like marketing, and courses like CMU 15-418 that touch the subject but start from the programming model rather than the hardware.

That gap is exactly what mvidia is trying to close. The project, which surfaced on Hacker News with nearly 800 points, is a browser-based game where you design and assemble a GPU from its constituent parts. The author’s stated motivation was blunt: existing resources for GPU architecture are lacking, so they built something better.

The framing is right, and it points to a real structural problem in how GPU internals get taught.

Why GPU Architecture Is Hard to Teach

CPU architecture has a natural pedagogical ladder. Logic gates compose into adders, adders into ALUs, ALUs into datapaths, and datapaths into pipelines. The abstractions stack cleanly, and the key ideas (hazards, forwarding, branch prediction, cache replacement policies) map reasonably well onto the programmer’s experience of writing sequential code.

GPUs break the ladder at nearly every rung. The execution model is SIMD at the hardware level but exposed through abstractions like warps (NVIDIA terminology) or wavefronts (AMD) that have no real CPU equivalent. A warp is a group of 32 threads that execute the same instruction in lockstep across 32 SIMD lanes. When threads within a warp diverge at a branch, the hardware serializes both paths, executing one set of threads while masking the others. This warp divergence concept is fundamental to understanding GPU performance, but it only makes sense once you understand why the hardware was built that way in the first place.
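The cost of divergence is easy to put numbers on. The following is a minimal sketch in Python, not a model of any real scheduler: it assumes a single if/else, a 32-lane warp, and that each branch body costs a fixed number of issue slots. The function name `run_branch` and its parameters are made up for illustration.

```python
WARP_SIZE = 32

def run_branch(conditions, then_len, else_len):
    """Count issue slots for one warp at an if/else branch.

    conditions: one boolean per lane (True = lane takes the 'then' path).
    then_len / else_len: instruction count of each branch body (assumed fixed).
    Returns (issue_slots, lane_utilization).
    """
    assert len(conditions) == WARP_SIZE
    then_mask = list(conditions)
    else_mask = [not c for c in conditions]

    slots = 0    # instructions the warp has to issue
    useful = 0   # lane-instructions that did real work
    for mask, length in ((then_mask, then_len), (else_mask, else_len)):
        active = sum(mask)
        if active == 0:
            continue  # no lane wants this path, so it is skipped entirely
        slots += length             # the whole warp spends these slots...
        useful += active * length   # ...but only the unmasked lanes do work

    utilization = useful / (slots * WARP_SIZE) if slots else 1.0
    return slots, utilization

# Every lane agrees: only one path is executed.
uniform = run_branch([True] * WARP_SIZE, then_len=10, else_len=10)

# Even lanes go one way, odd lanes the other: both paths run back to back.
diverged = run_branch([lane % 2 == 0 for lane in range(WARP_SIZE)],
                      then_len=10, else_len=10)

print(uniform)   # (10, 1.0)
print(diverged)  # (20, 0.5)
```

In this toy model a perfectly split warp takes twice as many issue slots and leaves half the lanes idle in each one, which is the serialization described above.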

The reason is throughput. A CPU hides memory latency through deep caches and out-of-order execution. A GPU hides memory latency by keeping thousands of threads in flight and switching to another warp whenever one stalls. This is called latency hiding through thread-level parallelism, and it explains why GPUs have enormous register files (to hold the state of many concurrent threads without spilling to memory), why shared memory exists as a programmer-controlled scratchpad at the streaming multiprocessor (SM) level, and why occupancy, the ratio of active warps to the maximum possible warps on an SM, matters so much to performance.
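A back-of-the-envelope version of that argument can be written down directly. The sketch below assumes a deliberately simple model: every warp alternates between a fixed stretch of compute and a fixed memory stall, and the scheduler can always switch to any ready warp for free. The cycle counts are illustrative, not measurements of real hardware, and the function names are hypothetical.

```python
import math

def issue_utilization(num_warps, compute_cycles, memory_latency):
    """Fraction of cycles in which some warp has work to issue.

    Each warp is busy for compute_cycles, then stalled for memory_latency.
    With num_warps interleaved, the busy stretches can fill the stall
    of the others, up to a ceiling of 100%.
    """
    period = compute_cycles + memory_latency
    return min(1.0, num_warps * compute_cycles / period)

def warps_to_hide_latency(compute_cycles, memory_latency):
    """Smallest warp count that keeps the issue slot busy in this model."""
    return math.ceil((compute_cycles + memory_latency) / compute_cycles)

# Assumed numbers: 20 cycles of work between loads, 400 cycles per load.
needed = warps_to_hide_latency(compute_cycles=20, memory_latency=400)
starved = issue_utilization(4, compute_cycles=20, memory_latency=400)
covered = issue_utilization(needed, compute_cycles=20, memory_latency=400)

print(needed)             # 21
print(round(starved, 3))  # 0.19
print(covered)            # 1.0
```

With only four resident warps the issue slot sits idle about 80% of the time under these assumptions. That is why the hardware is built to keep dozens of warps resident, and why anything that lowers occupancy can show up as lost throughput.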

None of this is intuitive from the programming side. You can write CUDA for years and still be fuzzy on why __syncthreads() only synchronizes within a block rather than globally, or why bank conflicts in shared memory cost what they cost, or what the actual difference between registers, shared memory, L1 cache, L2 cache, and global memory is at the hardware level. The programming model leaks hardware concepts everywhere, but it never explains them.
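Bank conflicts are a good example of a cost that becomes obvious once the hardware layout is written out. The sketch below assumes the layout NVIDIA documents for recent GPUs, 32 banks with consecutive 4-byte words in consecutive banks, and a simplified rule that lanes reading the same word are served by one broadcast. The helper name `conflict_degree` is made up for illustration.

```python
from collections import defaultdict

NUM_BANKS = 32  # assumed: 32 banks, one 4-byte word wide each

def conflict_degree(word_indices):
    """Number of serialized passes needed for one warp-wide access.

    word_indices: the 4-byte word each lane touches in shared memory.
    Distinct words in the same bank must be served one after another;
    lanes hitting the same word share a single broadcast.
    """
    words_per_bank = defaultdict(set)
    for word in word_indices:
        words_per_bank[word % NUM_BANKS].add(word)
    return max(len(words) for words in words_per_bank.values())

lanes = range(32)
stride_1 = conflict_degree([lane * 1 for lane in lanes])    # unit stride
stride_2 = conflict_degree([lane * 2 for lane in lanes])    # every other word
stride_32 = conflict_degree([lane * 32 for lane in lanes])  # column of a 32-wide tile
stride_33 = conflict_degree([lane * 33 for lane in lanes])  # same column, padded row
broadcast = conflict_degree([7] * 32)                       # all lanes read one word

print(stride_1, stride_2, stride_32, stride_33, broadcast)  # 1 2 32 1 1
```

The jump from 32 passes to 1 when the stride goes from 32 to 33 is the reason for the familiar trick of padding a shared-memory tile by one column.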

What Building It Actually Teaches

The pedagogical value of a build-it-yourself approach comes from forcing you to make the design decisions that usually stay hidden. When you implement a cache, you have to choose associativity, replacement policy, and line size. When you implement a warp scheduler, you have to decide between round-robin and greedy scheduling and see the effects directly. When you add more CUDA cores to your streaming multiprocessor, you have to weigh that against the register file and shared memory budget.
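To make the cache example concrete, here is a minimal set-associative cache model with LRU replacement. It is a generic textbook sketch, not a description of how mvidia or any real GPU implements its caches. The class name, parameters, and access trace are all chosen for illustration.

```python
from collections import OrderedDict

class Cache:
    """Tiny set-associative cache model with LRU replacement."""

    def __init__(self, num_sets, ways, line_size):
        self.num_sets = num_sets
        self.ways = ways
        self.line_size = line_size
        # One ordered dict per set; the oldest entry is least recently used.
        self.sets = [OrderedDict() for _ in range(num_sets)]
        self.hits = 0
        self.misses = 0

    def access(self, address):
        line = address // self.line_size
        index = line % self.num_sets
        tag = line // self.num_sets
        entries = self.sets[index]
        if tag in entries:
            entries.move_to_end(tag)  # mark as most recently used
            self.hits += 1
            return True
        self.misses += 1
        if len(entries) >= self.ways:
            entries.popitem(last=False)  # evict the least recently used line
        entries[tag] = True
        return False

def hit_rate(cache, addresses):
    for address in addresses:
        cache.access(address)
    return cache.hits / (cache.hits + cache.misses)

# Two addresses that collide in set 0, accessed alternately.
trace = [0, 512] * 8

# Same capacity (8 lines of 64 bytes), different associativity.
direct_mapped = hit_rate(Cache(num_sets=8, ways=1, line_size=64), trace)
two_way = hit_rate(Cache(num_sets=4, ways=2, line_size=64), trace)

print(direct_mapped)  # 0.0
print(two_way)        # 0.875
```

With identical capacity, the direct-mapped configuration thrashes on this trace while the two-way one misses only on the first touch of each line. Seeing that difference fall out of a choice you made yourself is the kind of feedback the build-it-yourself approach is meant to provide.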

This is how nand2tetris works at its best. The book by Noam Nisan and Shimon Schocken does not just explain what a multiplexer does; it makes you implement one and then use it to implement a register and then use that to implement a RAM module. By the time you have a working computer, you have built intuitions that reading alone cannot give you.

For GPUs, the equivalent insight chain might look like this: understand why having 2048 threads per SM requires a large register file, understand why a file that large is divided into banks rather than served through a single huge crossbar, understand why its fixed size puts pressure on the compiler to minimize per-thread register usage, and understand why that pressure leads nvcc to sometimes spill to local memory and why spilling hurts. Each step explains a real behavior you will observe when writing GPU code.
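The first links in that chain are plain arithmetic. The sketch below assumes an SM with 65,536 32-bit registers and a 2048-thread limit, figures NVIDIA publishes for several recent architectures, and ignores allocation granularity and every other occupancy limiter (shared memory, block size, block count). The function name is hypothetical.

```python
REGS_PER_SM = 65536        # assumed: 32-bit registers per SM
MAX_THREADS_PER_SM = 2048  # assumed: resident thread limit per SM
WARP_SIZE = 32

def register_limited_occupancy(regs_per_thread):
    """Resident warps and occupancy when registers are the only limit."""
    max_warps = MAX_THREADS_PER_SM // WARP_SIZE
    regs_per_warp = regs_per_thread * WARP_SIZE
    warps_that_fit = REGS_PER_SM // regs_per_warp
    active = min(max_warps, warps_that_fit)
    return active, active / max_warps

# Size of the register file in bytes: 4 bytes per 32-bit register.
register_file_bytes = REGS_PER_SM * 4
print(register_file_bytes // 1024, "KiB")  # 256 KiB

for regs in (32, 64, 128, 255):
    print(regs, register_limited_occupancy(regs))
# 32 (64, 1.0)
# 64 (32, 0.5)
# 128 (16, 0.25)
# 255 (8, 0.125)
```

Under these assumptions a kernel can use at most 32 registers per thread and still reach full occupancy, and each doubling halves the number of resident warps. That trade is the pressure that pushes a compiler toward spilling.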

Mvidia seems to be targeting exactly this chain. Building the GPU as a game, with components you assemble and configure, means the design constraints are legible in a way that a whitepaper never makes them.

The State of GPU Architecture Education

The best existing resources each cover a slice of the problem. Kirk and Hwu’s Programming Massively Parallel Processors is the standard GPU programming textbook, now in its fourth edition, and it does cover some hardware concepts, but it leads with the programming model. NVIDIA’s architecture whitepapers for Volta, Turing, Ampere, and Hopper are detailed and technically accurate, but they require substantial background to parse and they are not designed to build understanding from scratch. Fabian Giesen’s writing on GPU pipelines remains one of the best available explanations of the graphics pipeline, over a decade old now and still widely cited. Asahi Lina’s work reverse-engineering Apple’s GPU architecture for the Linux driver project has produced detailed public documentation that covers hardware details most vendor resources never publish.

But none of these is interactive, and none of them is structured around the act of designing the hardware yourself.

There is also a growing set of GPU architecture simulators in academic research: GPGPU-Sim, Accel-Sim, and MGPUSim are all used in computer architecture research to model GPU behavior. GPGPU-Sim in particular has been used in hundreds of papers and models NVIDIA GPU microarchitecture in substantial detail. These are serious tools, but they are also research infrastructure, not learning tools. The configuration surface is large, the documentation assumes prior knowledge, and there is no feedback loop designed to build intuition.

Why This Moment Makes Sense

GPU architecture education is more urgent now than it has ever been. The explosion of machine learning workloads over the last several years has pushed GPU programming from a specialty skill into something that a much broader population of engineers needs to understand. But most of that population arrived via PyTorch or JAX rather than via graphics or CUDA, and they are working with hardware they do not have a mental model for.

Telling a memory bandwidth bottleneck from a compute bottleneck in a training run requires knowing the hardware’s limits: a modern data center GPU like the H100 has roughly 3.35 TB/s of HBM3 bandwidth and about 989 TFLOPS of dense FP16 tensor compute, so a kernel needs on the order of 300 floating-point operations per byte moved before compute becomes the limit. For many transformer operations the arithmetic intensity is far lower than that, and you exhaust bandwidth before you saturate the compute units. When you are trying to fuse kernels to improve performance, you need to understand why crossing a kernel boundary forces a round-trip through global memory. These are architectural facts, and you cannot reason about them effectively without the underlying model.
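The bandwidth-versus-compute question can be sketched as a roofline calculation using the two H100 figures quoted above. This is a simplified model: it assumes each operand is read or written exactly once from HBM and ignores caches, and the helper names and matrix sizes are illustrative.

```python
PEAK_FLOPS = 989e12      # FP16 tensor throughput quoted above, FLOP/s
PEAK_BANDWIDTH = 3.35e12 # HBM3 bandwidth quoted above, bytes/s

def ridge_point():
    """Arithmetic intensity (FLOP/byte) where the two limits meet."""
    return PEAK_FLOPS / PEAK_BANDWIDTH

def attainable_flops(intensity):
    """Roofline bound: limited by bandwidth below the ridge, compute above."""
    return min(PEAK_FLOPS, intensity * PEAK_BANDWIDTH)

def matmul_intensity(m, n, k, bytes_per_element=2):
    """FLOP per byte for C[m,n] = A[m,k] @ B[k,n], each matrix moved once."""
    flops = 2 * m * n * k
    bytes_moved = bytes_per_element * (m * k + k * n + m * n)
    return flops / bytes_moved

ridge = ridge_point()
gemv = matmul_intensity(1, 4096, 4096)     # one token against a weight matrix
gemm = matmul_intensity(4096, 4096, 4096)  # large batched matrix multiply

print(round(ridge, 1))                         # 295.2
print(round(gemv, 2))                          # 1.0
print(round(gemm, 1))                          # 1365.3
print(attainable_flops(gemv) / PEAK_FLOPS < 0.01)  # True: bandwidth-bound
print(attainable_flops(gemm) == PEAK_FLOPS)        # True: compute-bound
```

The matrix-vector case sits at about one FLOP per byte, hundreds of times below the ridge, so in this model it cannot reach even 1% of peak compute no matter how well the kernel is written.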

A game that builds that model through direct construction is a genuinely useful addition to what is currently available. The gamification matters less than the interactivity. What makes build-it-yourself learning work is not points or levels but the fact that you have to make choices and observe their consequences. When adding a component costs something, you start to internalize why real GPU designers made the tradeoffs they did.

The CPU architecture world has a rich ecosystem of tools at every level of abstraction: HDL simulators, pipeline visualizers, cache simulators, the nand2tetris platform, RISC-V emulators that let you run real code on simulated hardware. The GPU world is catching up slowly. Mvidia is a step in the right direction, and given that it came from someone who was simply frustrated that the resources did not exist, the broader ecosystem probably has more projects like it waiting to be built.
