Learning GPU Architecture by Building One: Why the Resources Gap Exists and What to Do About It

Someone on Hacker News posted a browser game called mvidia last week that lets you build a GPU from its architectural components. The author’s motivation was blunt: GPU architecture resources are lacking. The post hit nearly 800 points, which tells you the frustration is widely shared.

That frustration is legitimate, and the reasons for it are worth examining before evaluating whether gamification is the right fix.

Why GPU Architecture Is Hard to Learn

CPU architecture has a relatively clear pedagogical path. You start with logic gates, build adders and flip-flops, assemble a simple ALU, add a register file, wire up a control unit, and eventually you have something like the machine described in Noam Nisan and Shimon Schocken’s nand2tetris course. The abstraction layers stack cleanly. Each level is comprehensible before you move to the next.

GPU architecture resists this treatment for several structural reasons.

First, the interesting behavior is emergent. A single CUDA core is not complicated. What makes a GPU powerful is what happens when you have thousands of them, executing in lockstep under a warp scheduler, sharing a cache hierarchy that was designed around the assumption of massive spatial and temporal locality in parallel workloads. The unit of understanding is not the component but the interaction.

Second, NVIDIA and AMD have historically been reluctant to publish detailed microarchitectural documentation. Intel’s CPU architectures have had decades of public optimization guides with cycle-accurate latency tables. NVIDIA’s whitepaper for the Hopper architecture gives you marketing diagrams and feature lists; it does not tell you how the warp scheduler arbitrates between eligible warps, what the actual L1 cache replacement policy is, or how register bank conflicts manifest at the hardware level. Researchers at institutions like the University of Texas at Austin have had to reverse-engineer this information through microbenchmarks.

Third, the terminology is a mess. “Shader” means something different in the graphics pipeline context than in the compute context, and both usages coexist in most documentation. AMD calls its execution units Compute Units; NVIDIA calls its own Streaming Multiprocessors. Both contain SIMD execution engines but with different lane widths and scheduling policies. Wavefronts are 64 threads wide on AMD’s GCN and CDNA hardware (RDNA defaults to 32); warps on NVIDIA hardware are 32. These differences matter when you’re reading optimization advice.

What You Actually Need to Understand

A game that teaches GPU architecture needs to convey a handful of core concepts that most tutorials either gloss over or assume you already know.

The warp execution model. A GPU does not execute individual threads independently. On NVIDIA hardware, 32 threads are grouped into a warp and executed in lockstep on a single SM: all 32 threads are issued the same instruction together. If those threads diverge at a branch, the hardware serializes both paths, masking off the threads that should not execute each path, so a divergent branch costs the time of both paths combined. This is why branch divergence in GPU code is expensive in a way that ordinary scalar CPU code is not; the closest CPU analog is masked SIMD execution, not a mispredicted branch.
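The masking behavior can be sketched as a toy model in Python. This is an illustration only, not how any real scheduler is implemented: the function name run_warp, the branch bodies, and the idea of counting one "issue slot" per executed path are all assumptions made for the example.

```python
WARP_SIZE = 32  # NVIDIA warp width


def run_warp(lane_values):
    """Toy model of one warp executing:
        if x % 2 == 0: x = x // 2
        else:          x = 3 * x + 1
    Returns the per-lane results and how many branch bodies were issued.
    """
    assert len(lane_values) == WARP_SIZE
    # Every lane evaluates the condition; the outcome is an active mask.
    take_if = [v % 2 == 0 for v in lane_values]
    results = list(lane_values)
    issue_slots = 0

    # Pass 1: the "if" body runs with only the lanes that took it enabled.
    if any(take_if):
        issue_slots += 1
        for lane in range(WARP_SIZE):
            if take_if[lane]:
                results[lane] = lane_values[lane] // 2

    # Pass 2: the "else" body runs with the complementary mask.
    if not all(take_if):
        issue_slots += 1
        for lane in range(WARP_SIZE):
            if not take_if[lane]:
                results[lane] = 3 * lane_values[lane] + 1

    return results, issue_slots


uniform = [2 * i for i in range(WARP_SIZE)]  # every lane takes the "if" path
divergent = list(range(WARP_SIZE))           # even and odd lanes split

_, uniform_slots = run_warp(uniform)
_, divergent_slots = run_warp(divergent)
print(uniform_slots, divergent_slots)  # prints "1 2"
```

When every lane agrees, only one body is issued. When the lanes split, both bodies are issued back to back and half the lanes sit idle during each one.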

Occupancy and the register file. Each SM has a fixed-size register file; on recent NVIDIA hardware it holds 64K 32-bit registers, which is 256KB. The number of threads you can run simultaneously on an SM is bounded by how many registers your kernel uses. A kernel that uses 64 registers per thread on an SM with a 256KB register file can keep at most 1024 threads resident (65,536 registers divided by 64), which is half of what an SM with a 2048-thread limit could otherwise hold. Understanding this relationship is prerequisite to understanding why GPU performance is so sensitive to register pressure, and why compilers for GPU targets go to unusual lengths to minimize register usage.
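The register arithmetic is easy to work through by hand, and a small Python sketch makes the relationship explicit. It assumes 4-byte registers and a 2048-thread-per-SM hardware limit (the limit varies by architecture), and it ignores the allocation granularity that real hardware applies per warp. The function name is hypothetical.

```python
def register_limited_threads(register_file_bytes, regs_per_thread,
                             bytes_per_register=4, max_threads_per_sm=2048):
    """Return (resident threads, occupancy) when registers are the only limit.

    Assumptions for illustration: 32-bit registers and a 2048-thread SM limit.
    Real hardware also rounds register allocation up per warp.
    """
    total_registers = register_file_bytes // bytes_per_register
    by_registers = total_registers // regs_per_thread
    resident = min(by_registers, max_threads_per_sm)
    return resident, resident / max_threads_per_sm


for regs in (32, 64, 128):
    threads, occupancy = register_limited_threads(256 * 1024, regs)
    print(regs, threads, occupancy)
# prints:
# 32 2048 1.0
# 64 1024 0.5
# 128 512 0.25
```

With a 256KB register file, 64 registers per thread leaves room for 1024 resident threads. Doubling register usage halves the number of threads available to hide memory latency.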

Memory hierarchy and bandwidth. The gap between compute throughput and memory bandwidth is wider on GPUs than on CPUs, and the architecture reflects this. Shared memory (called Local Data Share on AMD) is a software-managed scratchpad that sits at roughly the same level as L1 cache in the hierarchy but is explicitly addressed by the programmer. Using it well requires understanding how to tile matrix operations or other access patterns to maximize reuse before going to the slower global memory. The NVIDIA Ampere GA102 whitepaper documents the L2 cache as 6MB on the GA102 die; the RTX 3090 has 24GB of GDDR6X behind it, and L2 delivers several times the bandwidth of that GDDR6X, with the exact ratio depending on the access pattern.
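To see why tiling matters, it helps to count global memory reads for a square matrix multiplication. The Python sketch below is a back-of-the-envelope model, not a measurement: it assumes every operand read in the naive kernel goes to global memory (no cache hits) and that a tiled kernel stages square tiles of A and B in shared memory. The function names are hypothetical.

```python
def global_loads_naive(n):
    """Each of the n*n outputs reads a full row of A and a full column of B."""
    return 2 * n ** 3


def global_loads_tiled(n, tile):
    """Each thread block computes a tile x tile patch of the output.

    It walks n // tile steps along the shared dimension, and at each step
    loads one tile of A and one tile of B from global memory.
    """
    assert n % tile == 0, "sketch assumes the tile size divides n"
    blocks = (n // tile) ** 2
    steps = n // tile
    loads_per_step = 2 * tile * tile
    return blocks * steps * loads_per_step


n, tile = 1024, 32
naive = global_loads_naive(n)
tiled = global_loads_tiled(n, tile)
print(naive // tiled)  # prints 32
```

Under these assumptions the reduction in global traffic equals the tile width, which is why larger tiles help until shared memory capacity or register pressure becomes the limit.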

The tensor core pipeline. Starting with Volta, NVIDIA added tensor cores that operate on matrix fragments rather than individual values. Through CUDA’s WMMA API, a warp on Ampere hardware cooperatively performs a 16x16x16 matrix multiply-accumulate in mixed precision; the underlying hardware instructions work on smaller tiles. This is qualitatively different from the CUDA cores underneath it, and understanding the distinction matters for anyone trying to understand why transformer inference throughput scales the way it does.
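What "matrix multiply-accumulate in mixed precision" means can be shown with a numerical sketch in plain Python. It computes D = A x B + C on one 16x16x16 tile with inputs rounded to FP16 and the running sum rounded to FP32. This models only the arithmetic, not the hardware: the helper names are hypothetical, and real tensor cores may accumulate in a different order and with different intermediate rounding.

```python
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


def to_fp32(x):
    """Round a Python float to the nearest IEEE single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def mma_tile(a, b, c):
    """D = A @ B + C on one tile, FP16 inputs with FP32 accumulation."""
    m, k, n = len(a), len(b), len(b[0])
    d = [[0.0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            acc = to_fp32(c[i][j])
            for p in range(k):
                acc = to_fp32(acc + to_fp16(a[i][p]) * to_fp16(b[p][j]))
            d[i][j] = acc
    return d


T = 16
ones = [[1.0] * T for _ in range(T)]
d = mma_tile(ones, ones, ones)
print(d[0][0])    # prints 17.0
print(T * T * T)  # prints 4096, the multiply-adds in one 16x16x16 tile
```

One tile operation covers 4096 multiply-adds, which gives a sense of how much more work a tensor core operation represents than a single scalar fused multiply-add.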

The Case for Learning by Building

The appeal of the mvidia game is that it forces you to confront these components as things that have to be wired together, not as a list of facts to memorize. This approach has a strong track record.

Nand2tetris, mentioned earlier, is the canonical example. The insight behind it is that you understand a CPU differently when you have had to implement the multiplexer that selects between the ALU output and the program counter. The act of deciding how to connect components forces you to reason about why they are connected that way.

Ben Eater’s breadboard computer series on YouTube is another example of the same principle applied to physical hardware. People who have built his 8-bit computer, which is based on the SAP-1 architecture, often report understanding control signals and the shared bus in a way that reading about them never produced.

For GPU architecture specifically, there have been a few attempts at accessible simulators. Accel-Sim is a GPU architectural simulator used in academic research, but it’s a research tool first and a learning tool second; the barrier to entry is significant. The GPGPU-Sim project from UBC has similar characteristics. What has been missing is something with the approachability of nand2tetris: a structured path from nothing to something functional, with the complexity introduced incrementally.

That gap is what mvidia is trying to fill. A browser-based game removes the environment setup friction entirely. If it manages to introduce the warp execution model, the register file constraints, and the memory hierarchy in a sequence that builds intuition rather than just vocabulary, it is doing something genuinely useful.

What Good GPU Education Looks Like

The HN comments on the post surfaced some reasonable criticism alongside the enthusiasm. Several commenters noted that the hardest part of GPU programming is not understanding the architecture in isolation but understanding how a specific workload maps onto it, and questioned whether a game can teach that mapping without also teaching CUDA or Metal or WGSL.

This is fair. The architecture knowledge and the programming model knowledge are not separable in practice. Understanding that a warp executes 32 threads in lockstep matters precisely because you are writing a kernel where you need to decide whether to diverge those threads. Understanding shared memory matters because you are trying to decide whether to tile a matrix multiplication.

The best GPU architecture resources thread this needle. NVIDIA’s cuda-samples repository includes matrix multiplication examples that range from a tiled shared-memory kernel to tensor-core-accelerated GEMM. Simon Toth’s writing on GPU optimizations and Mark Harris’s Optimizing Parallel Reduction in CUDA both treat architecture and programming as mutually illuminating.

A game that taught you to build a GPU and then had you run simple kernels on it, observing how your architectural choices affected throughput on different workloads, would be close to ideal. Whether mvidia gets there is something you will have to find out by playing it.

The Broader Ecosystem

The author’s complaint that GPU architecture resources are lacking is accurate, but the situation is improving. Chip Huyen’s writing on GPU infrastructure has reached a wide audience. The release of detailed performance analysis work around LLM inference, from Tim Dettmers and from the vLLM and FlashAttention papers, has forced a more explicit conversation about memory bandwidth, arithmetic intensity, and the roofline model than existed a few years ago.
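For readers who have not met the roofline model, the core idea fits in a few lines of Python: attainable throughput is the smaller of peak compute and memory bandwidth multiplied by arithmetic intensity (FLOPs per byte moved). The peak numbers below are illustrative assumptions, chosen to be roughly in line with published figures for an RTX 3090-class card; substitute the values for your own hardware.

```python
def attainable_flops(intensity, peak_flops, peak_bandwidth):
    """Roofline bound: min(peak compute, bandwidth * arithmetic intensity)."""
    return min(peak_flops, peak_bandwidth * intensity)


# Illustrative assumptions, not measurements.
PEAK_FLOPS = 35.6e12     # FP32 FLOP/s
PEAK_BANDWIDTH = 936e9   # bytes/s of device memory bandwidth

# Intensity at which a kernel stops being memory-bound (about 38 FLOPs/byte).
ridge = PEAK_FLOPS / PEAK_BANDWIDTH

# FP32 vector add c[i] = a[i] + b[i]: 1 FLOP per 12 bytes moved.
vector_add = attainable_flops(1 / 12, PEAK_FLOPS, PEAK_BANDWIDTH)

# A kernel with 100 FLOPs per byte sits under the compute roof instead.
dense = attainable_flops(100, PEAK_FLOPS, PEAK_BANDWIDTH)

print(vector_add < PEAK_FLOPS, dense == PEAK_FLOPS)  # prints "True True"
```

Under these assumed peaks, a vector add can reach only a fraction of a percent of the compute roof no matter how well it is written, which is the kind of reasoning the LLM inference literature has pushed into the mainstream.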

The demand for GPU architecture literacy has grown because the number of people writing GPU code has grown, and the number of people reasoning about GPU-bound system performance has grown faster still. A browser game that lowers the floor on that education is a reasonable response to that demand.

The framing of it as a game rather than a tutorial is also worth noting. Tutorials require motivation that you have to bring with you. Games generate motivation as a byproduct of play. For a subject where the first several hours of study feel abstract and unrewarding, that difference in structure is not trivial.