Learning GPU Architecture by Being Forced to Build One

The “build hardware from first principles” genre of programming education has produced some genuinely excellent work on the CPU side. nand2tetris takes you from a single NAND gate to a working computer complete with a compiler and operating system. Turing Complete on Steam walks you through logic gates, adders, ALUs, register files, and full ISA design in a puzzle game format that hundreds of thousands of players have completed. Neither of them touches the GPU.

The gap is not accidental. GPU architecture is structurally different from CPU architecture in ways that make it hard to teach incrementally. A CPU is a sequential machine with latency-hiding tricks bolted on; a GPU inverts that relationship entirely. It is a massively parallel machine designed to hide latency through sheer thread count, with sequential execution as the edge case. Explaining that inversion in text is straightforward enough, but building the intuition for it requires working through the tradeoffs in practice. mvidia, a browser game where you construct a GPU from component pieces, is a direct attempt to fill that gap. The creator was blunt about the motivation: “Thought the resources for GPU arch were lacking, so here we are.”

The CPU-Centric Default in Architecture Education

Most computer architecture education routes through the CPU. CPUs are the natural starting point because the concepts build sequentially: you construct an ALU, add registers, wire up a control unit, introduce pipelining, and the machine makes sense as a progression. Each concept has a clear predecessor.

The canonical texts on the GPU side, like Kirk and Hwu’s “Programming Massively Parallel Processors”, are excellent but approach the hardware from the programming model outward. You learn about warps and shared memory because you are writing CUDA code and need to understand why your kernel is slow. The architecture serves as backdrop for the programming, not as the subject itself. NVIDIA’s own architecture whitepapers, including the Ampere Architecture Technical Brief, are thorough but dense, written for hardware engineers rather than developers approaching the concepts for the first time.

The closest equivalent to a “build from scratch” resource on the GPU side is GPGPU-Sim, an academic microarchitecture simulator used in graduate courses. It models NVIDIA-like SM pipelines, warp schedulers, and memory systems at a cycle-accurate level. It is also a C++ research tool with a multi-thousand-line configuration file, not something you open on a Sunday afternoon.

What Makes GPU Architecture Hard to Teach Interactively

The core challenge is that the key GPU design decisions are responses to a latency problem that does not manifest at small scales. The insight driving everything else in GPU design is this: a trip to DRAM costs hundreds of cycles (600 is a reasonable ballpark), but if you have hundreds of independent threads in flight, you can fill those cycles with useful work from other threads. The GPU’s entire hierarchy, from warp scheduling to register file sizing to the structure of shared memory, follows from that single observation.

In a CPU, you build a cache and an out-of-order execution engine to minimize stall cycles. In a GPU, you accept the stall and schedule around it. This is called latency hiding, and it works only if you have enough independent threads to keep execution units busy while others wait on memory. A streaming multiprocessor (SM) on a consumer NVIDIA Ampere GPU (GA10x, compute capability 8.6) can hold up to 1,536 concurrent threads organized into 48 warps of 32 threads each; the data-center A100 raises that to 2,048 threads in 64 warps. Four warp schedulers each pick a ready warp on every clock cycle. When one warp stalls waiting on a memory load that will not return for 600 cycles, the scheduler moves to the next ready warp immediately, with zero overhead, because the state of every resident warp stays in the register file for as long as that warp is resident.
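The scheduling argument above can be sketched as a toy cycle-by-cycle model. Everything here is an assumption made for illustration: a single scheduler, a fixed-priority pick (lowest-numbered ready warp first), and a made-up workload where each warp issues 10 arithmetic instructions and then one load that stalls it for 600 cycles. Real warp schedulers are considerably more elaborate; the function name `simulate` and its parameters are hypothetical.

```python
def simulate(num_warps, mem_latency=600, alu_per_load=10, cycles=50_000):
    """Fraction of cycles in which a single scheduler found a warp to issue.

    Toy model: each warp repeats `alu_per_load` one-cycle ALU instructions
    followed by one load that makes it unavailable for `mem_latency` cycles.
    Switching warps costs nothing, mirroring the zero-overhead switch.
    """
    ready_at = [0] * num_warps               # cycle at which each warp may issue again
    alu_left = [alu_per_load] * num_warps    # ALU instructions left before the next load
    issued = 0
    for cycle in range(cycles):
        for w in range(num_warps):           # fixed priority: first ready warp wins
            if ready_at[w] <= cycle:
                issued += 1
                if alu_left[w] > 0:
                    alu_left[w] -= 1                    # ALU op, warp stays ready
                else:
                    ready_at[w] = cycle + mem_latency   # load issued, warp stalls
                    alu_left[w] = alu_per_load
                break                        # one issue per cycle
    return issued / cycles


if __name__ == "__main__":
    for n in (1, 12, 48, 64):
        print(f"{n:2d} warps -> scheduler busy {simulate(n):.1%} of cycles")
```

Under these assumed numbers a lone warp keeps the scheduler busy about 2% of the time (11 issue cycles out of every 610), 48 warps push that well past 80%, and 64 warps leave no idle cycles at all. The exact figures depend entirely on the made-up instruction mix; the shape of the curve is the point.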

That zero-overhead context switch is why the register file on a single Ampere SM is 256 KB, or 65,536 32-bit registers, which is tens of times larger than the physical register file of a typical modern CPU core. It is not sized for performance in the conventional sense; it is the physical implementation of the latency-hiding strategy. Every active warp needs its full register state available simultaneously.
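The sizing argument reduces to simple arithmetic, sketched below. The 256 KB register file and the 48-warp limit come from the text; the helper name `resident_warps` is hypothetical, and the model deliberately ignores the allocation granularity that real hardware applies when handing out registers.

```python
REGISTER_FILE_BYTES = 256 * 1024   # per-SM register file (from the text)
BYTES_PER_REGISTER = 4             # 32-bit registers
WARP_SIZE = 32
MAX_WARPS_PER_SM = 48              # scheduler limit quoted in the text


def resident_warps(regs_per_thread):
    """Warps that fit on one SM if registers are the only constraint.

    Simplification: ignores register allocation granularity, shared memory,
    and block-size limits, all of which matter on real hardware.
    """
    total_regs = REGISTER_FILE_BYTES // BYTES_PER_REGISTER   # 65,536 registers
    regs_per_warp = regs_per_thread * WARP_SIZE
    return min(MAX_WARPS_PER_SM, total_regs // regs_per_warp)


if __name__ == "__main__":
    for regs in (32, 64, 128, 255):
        print(f"{regs:3d} registers/thread -> {resident_warps(regs)} resident warps")
```

At 32 registers per thread the SM reaches its 48-warp limit; at 64 it drops to 32 warps, at 128 to 16, and at 255 to 8. Each step down leaves the scheduler fewer warps to switch to while others wait on memory, which is the tradeoff usually discussed under the name register pressure.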

Explaining this in text produces understanding. Building a scheduler that has to allocate register file space across competing warps produces intuition.

The Warp Execution Model as a Design Puzzle

The SIMT (Single Instruction, Multiple Threads) execution model is where GPU architecture becomes genuinely counterintuitive. A warp is a group of 32 threads that execute the same instruction simultaneously, but each thread operates on independent data and maintains its own registers. The hardware treats them as a unit; the programmer addresses them as individuals.

The divergence problem follows directly. When threads in a warp reach a conditional branch and take different paths, the hardware cannot execute both paths simultaneously. It runs one path with the other threads masked off, then runs the second path with the first group masked. Both paths execute serially, and you pay the cost of both:

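// In the first warp of a block, threadIdx.x coincides with the lane index (0-31).
if (threadIdx.x < 16) {
    do_work_a();  // lanes 0-15 execute, lanes 16-31 are masked off
} else {
    do_work_b();  // lanes 16-31 execute, lanes 0-15 are masked off
}
// time for that warp: roughly time(do_work_a) + time(do_work_b)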

With a fully uniform warp taking the same path, you pay once. Warp divergence is not a subtle optimization concern; it is a first-order performance cost that you have to reason about whenever you write a branch in GPU code.
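The masking rule can be written down as a small cost model. This is a sketch under stated assumptions: each path has a fixed cycle cost, the warp pays for a path if any lane takes it, and reconvergence is free. The function `branch_cycles` and its parameters are hypothetical names, not part of any CUDA API.

```python
WARP_SIZE = 32


def branch_cycles(predicate, cost_a, cost_b):
    """Cycles one warp spends on `if predicate(lane): A else: B`.

    Simplified SIMT model: a path is executed (with the other lanes masked
    off) whenever at least one lane takes it, and skipped otherwise.
    """
    takes_a = [predicate(lane) for lane in range(WARP_SIZE)]
    cycles = 0
    if any(takes_a):        # some lane wants path A, so the warp runs A
        cycles += cost_a
    if not all(takes_a):    # some lane wants path B, so the warp runs B
        cycles += cost_b
    return cycles


if __name__ == "__main__":
    print(branch_cycles(lambda lane: lane < 16, 100, 80))   # half and half
    print(branch_cycles(lambda lane: True, 100, 80))        # uniform warp
    print(branch_cycles(lambda lane: lane == 0, 100, 80))   # one stray lane
```

The last case is the instructive one: a single lane taking the other path costs the warp exactly as much as an even split, because the hardware cannot skip a path that any lane needs.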

Starting with Volta in 2017, NVIDIA introduced independent thread scheduling, where each thread maintains its own program counter rather than sharing one per warp. Divergent paths are still executed one group of threads at a time, but the scheduler can now interleave them at a finer grain and guarantee forward progress for every thread. The change also broke code that relied on implicit warp-synchronous behavior, which now needs explicit synchronization such as __syncwarp(), as NVIDIA’s Volta tuning guide documents in some detail.

Memory Coalescing and the Cost of Scattered Access

The second concept that benefits most from hands-on construction is memory coalescing. GPU global memory is accessed in transactions of 32 to 128 bytes. When the 32 threads of a warp each request a 4-byte value at consecutive addresses, the hardware services the entire warp with a single 128-byte transaction. When those 32 threads request values at scattered or strided addresses, each may require a separate transaction, multiplying the number of memory transactions by up to 32x and wasting most of the bytes each one fetches.
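Counting transactions for a given access pattern is easy to sketch. The model below assumes that a warp-wide load costs one transaction per distinct aligned segment it touches and that every access is a 4-byte value that does not straddle a segment boundary. The helper names `transactions` and `strided_addresses` are made up for this example.

```python
WARP_SIZE = 32


def transactions(addresses, segment_bytes=128):
    """Distinct aligned segments touched by one warp-wide load."""
    return len({addr // segment_bytes for addr in addresses})


def strided_addresses(stride_elems, elem_bytes=4, base=0):
    """Byte address requested by each lane when lane i reads element i * stride."""
    return [base + lane * stride_elems * elem_bytes for lane in range(WARP_SIZE)]


if __name__ == "__main__":
    for stride in (1, 2, 8, 32):
        addrs = strided_addresses(stride)
        print(f"stride {stride:2d}: "
              f"{transactions(addrs, 128):2d} x 128-byte segments, "
              f"{transactions(addrs, 32):2d} x 32-byte sectors")
```

With a stride of 1 the warp touches one 128-byte segment (or four 32-byte sectors); with a stride of 32 elements every lane lands in its own segment, giving 32 transactions at either granularity. Which granularity applies depends on the GPU generation and cache configuration, so the two columns bracket the cost rather than predict it.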

The practical implication is that access patterns which seem equivalent in sequential code have wildly different performance characteristics on a GPU. A row-major traversal of a 2D array can be either fully coalesced or completely uncoalesced depending on which dimension the threads are indexed along, and the performance difference on a bandwidth-bound kernel can be an order of magnitude or more.
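The 2D case can be made concrete with the same kind of counting. The sketch assumes a row-major array of 4-byte floats that is 1,024 columns wide and a 128-byte transaction size; `warp_transactions` and the two index mappings are hypothetical names chosen for illustration.

```python
WARP_SIZE = 32
ELEM_BYTES = 4        # assumed float elements
SEGMENT_BYTES = 128   # assumed transaction size
WIDTH = 1024          # assumed number of columns in the row-major array


def warp_transactions(index_fn, width=WIDTH):
    """128-byte segments touched when lane i loads element index_fn(i) = (row, col)."""
    segments = set()
    for lane in range(WARP_SIZE):
        row, col = index_fn(lane)
        byte_addr = (row * width + col) * ELEM_BYTES   # row-major layout
        segments.add(byte_addr // SEGMENT_BYTES)
    return len(segments)


# Lanes mapped along a row: neighbouring lanes read neighbouring addresses.
along_row = warp_transactions(lambda lane: (0, lane))
# Lanes mapped down a column: neighbouring lanes are WIDTH elements apart.
down_column = warp_transactions(lambda lane: (lane, 0))

if __name__ == "__main__":
    print(f"lanes along a row:    {along_row} transaction(s)")
    print(f"lanes down a column:  {down_column} transaction(s)")
```

Both mappings read 32 floats from the same array, yet the first needs 1 transaction and the second needs 32. In sequential code the two loops would look like an arbitrary choice of loop order.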

Sasha Rush’s GPU Puzzles, released in 2022, takes a different pedagogical angle on this problem. It is a series of CUDA-in-Python exercises built on a custom simulator that forces you to reason about thread indexing and memory access patterns to produce correct outputs. It is excellent for programming intuition but does not go below the CUDA abstraction layer into the hardware that makes those patterns matter.

mvidia goes a layer deeper. Putting players in the position of constructing the memory controller and observing how access patterns translate into transaction counts makes the coalescing behavior visible rather than treating it as an emergent property of code that happens to run slowly.

The Broader Context

GPU architecture knowledge has moved from a specialty domain to a broadly relevant skill set. The hardware that runs large language model training and inference is the same hardware that runs games; the fundamental architectural concepts are the same whether you are optimizing a CUDA kernel for transformer attention or a graphics pipeline for shadow mapping. Engineers who understand warp occupancy, register pressure, and memory bandwidth constraints make meaningfully better hardware utilization decisions than those who treat the GPU as an opaque accelerator.

The gap in interactive architecture education for GPUs has persisted partly because GPU architecture is more complex than CPU architecture at the entry level, and partly because the historically dominant use case was graphics programming, where the abstractions in OpenGL and DirectX were designed to hide the hardware. CUDA shifted that by exposing the programming model more directly, and the growth of GPU compute for machine learning has created a large new audience for this material.

A browser game is a sensible delivery mechanism for a first exposure. The existing simulators and textbooks serve audiences that already know enough to use them. Something you can open without installing anything, that teaches by making you construct the machine rather than read about it, serves a different and larger audience. The project is on Hacker News with nearly 800 points, and the comment thread covers comparisons to Turing Complete, questions about what gets simplified versus modeled accurately, and suggestions for what to add next. The creator’s framing of the project as a response to inadequate resources seems accurate. GPU Puzzles, GPGPU-Sim, and the NVIDIA whitepapers are all good, but they assume different starting points. mvidia is aimed at the beginning of the learning curve, which is where most people who need this material are starting.