Building a GPU Is a Better Architecture Lesson Than Any Whitepaper

The GPU architecture education gap has been a persistent frustration for developers who want to understand what happens below the CUDA or shader API surface. NVIDIA publishes architecture whitepapers, but these describe features and throughput numbers rather than how the hardware actually works. The CUDA Programming Guide documents the programming model thoroughly, but it deliberately abstracts away the microarchitecture. Patterson and Hennessy’s computer architecture textbook added GPU coverage in recent editions, but it reads as an appendix rather than a first-class treatment. So when someone ships a browser game that teaches GPU architecture by having you build one, and it lands near the top of Hacker News with nearly 800 points, the reaction makes sense.

Why GPU Architecture Is Hard to Learn

The difficulty comes from a structural mismatch. CPU architecture education has decades of scaffolding: from the von Neumann model through pipelining, out-of-order execution, branch prediction, and cache hierarchies, each concept builds on the last. The mental model maps reasonably well to how you write sequential code. GPU architecture lacks that foundation for most developers.

The entry point for most is the CUDA or OpenCL programming model, which exposes threads, blocks, and grids. The documentation tells you these map onto hardware resources, but the mapping is indirect. You learn that threads are grouped into warps of 32, that warps execute in lockstep on a Streaming Multiprocessor, that divergent branches serialize execution, that shared memory is banked and bank conflicts hurt performance. These facts are learnable from documentation and blog posts. What takes longer to build is intuition for why the hardware is designed this way, and what the tradeoffs look like when you are the one making design decisions.
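To make the cost of divergence concrete, here is a minimal Python sketch of how a single warp might spend issue slots on an if/else. The function name `warp_issue_slots` and the flat per-side instruction costs are assumptions made for illustration; real hardware handles reconvergence in more sophisticated ways.

```python
WARP_SIZE = 32  # threads per warp on NVIDIA hardware


def warp_issue_slots(branch_taken, then_cost, else_cost):
    """Issue slots one warp spends on an if/else (toy SIMT model).

    branch_taken: list of WARP_SIZE booleans, one per lane.
    then_cost / else_cost: instruction count on each side of the branch.
    """
    if len(branch_taken) != WARP_SIZE:
        raise ValueError("expected one predicate per lane")
    slots = 0
    if any(branch_taken):
        # At least one lane needs the 'then' side, so the whole warp
        # steps through it with the other lanes masked off.
        slots += then_cost
    if not all(branch_taken):
        # At least one lane needs the 'else' side as well.
        slots += else_cost
    return slots


uniform = [True] * WARP_SIZE
divergent = [lane % 2 == 0 for lane in range(WARP_SIZE)]

print(warp_issue_slots(uniform, 10, 10))    # 10
print(warp_issue_slots(divergent, 10, 10))  # 20
```

In this model a warp whose lanes all agree pays for one side of the branch, while a warp that splits pays for both, regardless of how many lanes sit on each side.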

What GPU Architecture Actually Covers

A complete mental model of a modern GPU requires understanding several distinct layers. The compute units, called Streaming Multiprocessors on NVIDIA hardware and Compute Units on AMD, contain groups of arithmetic lanes (CUDA cores, in NVIDIA's terminology) that execute arithmetic operations. Each SM keeps many warps resident at once; when one warp stalls waiting for memory, the scheduler switches to another warp with no context-switch overhead, because every resident warp's state stays in the SM's register file and nothing has to be saved or restored. This latency-hiding through massive parallelism is the central architectural idea, and it drives nearly every other design decision downstream.
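The latency-hiding idea can be illustrated with a toy simulation. The sketch below assumes a single issue slot per cycle, a fixed memory latency, and a greedy scheduler that picks the lowest-numbered ready warp; the function `issue_utilization` and all of its parameters are hypothetical and do not describe any real scheduler.

```python
def issue_utilization(num_warps, compute_cycles, memory_latency,
                      total_cycles=10_000):
    """Fraction of cycles in which the single issue slot did useful work.

    Each warp issues `compute_cycles` instructions, then stalls for
    `memory_latency` cycles on a load, then repeats.
    """
    ready_at = [0] * num_warps                 # cycle each warp can issue again
    remaining = [compute_cycles] * num_warps   # instructions left before next load
    busy = 0
    for cycle in range(total_cycles):
        for w in range(num_warps):
            if ready_at[w] <= cycle:
                busy += 1
                remaining[w] -= 1
                if remaining[w] == 0:
                    # The warp reaches its load and stalls; switching to
                    # another warp costs nothing in this model.
                    ready_at[w] = cycle + 1 + memory_latency
                    remaining[w] = compute_cycles
                break  # only one warp issues per cycle
    return busy / total_cycles


for warps in (1, 4, 11):
    print(warps, round(issue_utilization(warps, 4, 40), 2))
```

With 4 instructions of work per 40-cycle stall, one warp leaves the issue slot idle most of the time, while 11 warps are enough to cover the stall completely in this model.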

The memory hierarchy is layered similarly to a CPU hierarchy but with different tradeoffs. Registers are fast and plentiful, but the total register count is fixed per SM; using more registers per thread means fewer threads can run concurrently, which reduces the pool of warps available to hide latency. Shared memory sits on-chip per SM and serves as a programmer-managed scratchpad, in effect a software-controlled counterpart to an L1 cache, giving explicit control over data locality. L2 cache is shared across the full chip. Device memory, GDDR6 or HBM depending on the product tier, is large but has much higher latency than on-chip storage; accessing it efficiently requires coalesced access patterns where threads in a warp read from contiguous addresses.
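The register tradeoff is easy to work through numerically. The sketch below assumes an SM with 65,536 registers and a limit of 64 resident warps, figures chosen for illustration; it ignores allocation granularity, block size, and shared memory limits, all of which also constrain occupancy on real hardware.

```python
WARP_SIZE = 32


def resident_warps(regs_per_thread, regs_per_sm=65_536, max_warps_per_sm=64):
    """Warps that fit on one SM when registers are the only limit.

    The default register file size and warp cap are illustrative
    assumptions, not a description of a specific GPU.
    """
    regs_per_warp = regs_per_thread * WARP_SIZE
    return min(max_warps_per_sm, regs_per_sm // regs_per_warp)


for regs in (32, 64, 128, 255):
    print(regs, resident_warps(regs))
# 32 64
# 64 32
# 128 16
# 255 8
```

Doubling the registers each thread uses halves the number of warps the scheduler has available to cover memory stalls, which is the tradeoff the paragraph above describes.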

Fixed-function pipeline hardware, covering rasterization, texture sampling, and render output, sits alongside the programmable shader cores. Modern GPU design has steadily enlarged the programmable fraction, but dedicated fixed-function blocks still handle the most common operations faster than a general-purpose shader could. The tension between flexibility and fixed-function throughput is a design constraint that anyone building a simulated GPU has to confront directly.

Prior Art in GPU Simulation and Education

The idea of learning GPU architecture through simulation has a history. GPGPU-Sim, developed at the University of British Columbia, is a cycle-level simulator for NVIDIA GPUs used in academic research since 2009. It is detailed enough to correlate closely with measured performance on real hardware, but that detail makes it impractical as an introductory learning tool; getting it configured and running is a project in itself, and it assumes familiarity with GPU microarchitecture before you start.

TinyGPU by Adam Majmudar takes a different approach, implementing a minimal GPU in Verilog with enough features to run basic parallel computations. It covers shader cores, a basic memory interface, and a simplified dispatch mechanism. The code is readable and the README walks through the design, making it a reasonable starting point for developers comfortable with hardware description languages. Most application developers are not comfortable with Verilog, which limits the audience considerably.

In the graphics education space, Scratchapixel teaches the rasterization pipeline by having you implement it in software. “A Trip Down the Graphics Pipeline” by Jim Blinn, a 1996 collection of his IEEE Computer Graphics and Applications columns, remains one of the clearest explanations of fixed-function pipeline stages despite its age. Resources like the Vulkan Tutorial and GPU architecture talks from NVIDIA’s GTC conference have filled in gaps over the years, but they still stop short of explaining hardware implementation choices and the constraints that produced them.

Simulation as Pedagogy

Hardware architecture courses in universities use simulators for a reason. When you are forced to implement a scheduling policy, you discover its tradeoffs concretely rather than abstractly. A warp scheduler that prioritizes warps with ready operands performs differently under different memory pressure profiles. Shared memory bank conflicts become intuitive when you are the one deciding how banks are laid out and watching throughput drop when multiple threads hit the same bank simultaneously.
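Bank conflicts are one of the easier effects to model yourself. The sketch below assumes 32 banks of 4-byte words, with consecutive words mapped to consecutive banks, and treats several lanes reading the same word as a single broadcast. The helper names `conflict_degree` and `strided` are made up for this example.

```python
from collections import defaultdict

NUM_BANKS = 32   # assumed bank count
WORD_BYTES = 4   # assumed bank width in bytes


def conflict_degree(byte_addresses):
    """Largest number of distinct words any single bank must serve.

    1 means conflict-free; N means the access is serialized into N passes.
    Lanes that read the same word count once (broadcast).
    """
    words_per_bank = defaultdict(set)
    for addr in byte_addresses:
        word = addr // WORD_BYTES
        words_per_bank[word % NUM_BANKS].add(word)
    return max(len(words) for words in words_per_bank.values())


def strided(stride_words):
    """Byte addresses touched by 32 lanes reading with a fixed word stride."""
    return [lane * stride_words * WORD_BYTES for lane in range(32)]


print(conflict_degree(strided(1)))   # 1
print(conflict_degree(strided(2)))   # 2
print(conflict_degree(strided(32)))  # 32
print(conflict_degree(strided(33)))  # 1
```

The last two lines show why padding a 32-wide shared array to 33 columns is a common trick: a stride of 32 words sends every lane to the same bank, while a stride of 33 spreads the lanes across all of them.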

Games add a layer beyond simulation by introducing feedback loops and goals. A game that lets you configure SM count, register file size, shared memory allocation, and memory bus width, then shows how those choices affect performance on different workloads, encodes architectural knowledge in a way that documentation cannot match. The constraint of a limited transistor budget forces reasoning about tradeoffs rather than accumulation of isolated facts.

The mvidia game sits in a relatively sparse space. Most hardware education games target CPU architecture or digital logic. Nandgame, for instance, builds a computer from NAND gates upward through a series of progressively more complex abstractions. GPU architecture has been largely absent from that genre. The creator’s stated motivation, that resources for GPU architecture were lacking, is accurate; most of what is available either requires significant hardware background or stays at the level of programming model documentation without touching the implementation layer.

What This Represents for the Field

Interest in GPU architecture has grown well beyond graphics programming. CUDA and GPU compute underpin the majority of current machine learning infrastructure. Understanding why a given kernel is slow, or why a memory access pattern matters at the hardware level, requires understanding the machine. The audience for GPU architecture education is larger than it has ever been, and the existing materials have not kept pace with that demand.

Projects like this game, TinyGPU, and GPGPU-Sim represent different points on a spectrum from accessible to rigorous. A browser game with no setup barrier can reach developers who would never download a Verilog toolchain. If the game models the key tradeoffs faithfully, the intuitions it builds are the same intuitions that matter when writing CUDA kernels or debugging occupancy issues in a production workload.

The constraint-based design model, where you are given resources and must allocate them intelligently, is a good fit for teaching GPU architecture. The hardware itself was designed under tight area and power budgets, and the architectural decisions that make modern GPUs fast are direct consequences of those constraints. Building a simplified version, even in a browser, surfaces those consequences in a way that reading a whitepaper does not.