Learning GPU Architecture by Building One in Your Browser

Someone built a game called Mvidia where you construct a GPU, and it landed near the top of Hacker News with nearly 800 points. The creator’s stated reason was simple: GPU architecture resources are sparse. That observation is accurate, and the response the project received suggests a lot of people have felt the same absence.

The gap between CPU and GPU architecture education is substantial. Resources for learning how a CPU works are extensive. You can read Patterson and Hennessy, work through Nand2Tetris, implement a RISC-V core in a weekend using Verilog, or follow any number of university computer architecture courses posted online. The conceptual path from logic gates to a functioning processor is well-trodden and well-documented. GPU architecture has no equivalent on-ramp. NVIDIA publishes whitepapers for each generation, AMD publishes RDNA architecture guides, and researchers publish papers describing specific subsystems, but none of these constitute a coherent learning path. They assume you already know what a streaming multiprocessor is and why warp divergence is expensive.

What You Are Actually Building

A GPU is not just a wider CPU, and that misconception is part of what makes the architecture hard to learn from static descriptions. The fundamental organizational unit in a modern NVIDIA GPU is the Streaming Multiprocessor (SM), called a Compute Unit in AMD’s terminology. Each SM contains multiple CUDA cores, which are simple ALUs, arranged to operate in lockstep on a group of 32 threads called a warp. The warp is the scheduling primitive: the hardware does not schedule individual threads, it schedules warps, and all 32 threads in a warp execute the same instruction simultaneously.

This is SIMT, Single Instruction Multiple Threads, which differs from classical SIMD in that threads nominally have their own program counter and registers, but in practice converge to execute together. When threads in a warp take different branches, the hardware serializes the divergent paths and masks inactive threads, which is the root cause of warp divergence overhead.
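The cost of divergence can be sketched with a toy model. The following is a minimal illustration, not a description of any real scheduler: it assumes a single if/else, and the function name and the per-path instruction counts are hypothetical. It counts how many instruction issue slots a warp spends when the hardware has to run each taken path in turn with the other lanes masked off.

```python
WARP_SIZE = 32

def issue_slots(branch_taken, then_len, else_len):
    """Count issue slots a warp spends on one if/else (toy model).

    branch_taken: WARP_SIZE booleans, one per thread in the warp.
    then_len / else_len: hypothetical instruction counts of each path.
    """
    assert len(branch_taken) == WARP_SIZE
    slots = 0
    if any(branch_taken):        # at least one lane needs the 'then' path
        slots += then_len
    if not all(branch_taken):    # at least one lane needs the 'else' path
        slots += else_len
    return slots

uniform = [True] * WARP_SIZE                              # every lane agrees
divergent = [lane % 2 == 0 for lane in range(WARP_SIZE)]  # lanes alternate

print(issue_slots(uniform, 10, 10))    # 10
print(issue_slots(divergent, 10, 10))  # 20
```

In this model a single disagreeing lane is enough to make the whole warp pay for both paths, which is why branch conditions that are uniform across a warp are so much cheaper than ones that vary per thread.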

The other major thing a GPU builder has to confront is the memory hierarchy, and it starts inside the SM. Each SM has a register file that is much larger than what you find on a CPU: a modern NVIDIA SM has 256KB of register space (65,536 32-bit registers), because registers are the mechanism for latency hiding. When a warp stalls waiting on memory, the warp scheduler switches to another warp without any OS-level context switch overhead, since every resident warp’s state already lives in hardware. This technique, latency hiding through warp switching, requires keeping many warps resident simultaneously, which in turn requires keeping all their registers on chip. Occupancy, the ratio of warps resident on an SM to the maximum the SM supports, is a central performance concept that flows directly from this constraint.
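The register arithmetic behind occupancy is short enough to write down. This is a rough sketch under stated assumptions: the 256KB register file is the figure from the text, the limit of 64 resident warps per SM is an assumption that varies by architecture, and the model ignores register allocation granularity, shared memory, and block-size limits, all of which also constrain occupancy on real hardware.

```python
REGISTER_FILE_BYTES = 256 * 1024  # per-SM register file, figure from the text
BYTES_PER_REGISTER = 4            # 32-bit registers
WARP_SIZE = 32
MAX_WARPS_PER_SM = 64             # assumption: varies by architecture

def occupancy(regs_per_thread):
    """Return (resident_warps, occupancy) when registers are the only limit."""
    total_regs = REGISTER_FILE_BYTES // BYTES_PER_REGISTER  # 65536
    regs_per_warp = regs_per_thread * WARP_SIZE
    resident = min(MAX_WARPS_PER_SM, total_regs // regs_per_warp)
    return resident, resident / MAX_WARPS_PER_SM

for regs in (32, 64, 128):
    print(regs, occupancy(regs))
# 32 (64, 1.0)
# 64 (32, 0.5)
# 128 (16, 0.25)
```

Under these assumptions, doubling the registers each thread uses halves the number of warps the scheduler has available to hide latency.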

Shared memory sits at the same level as the L1 cache and is explicitly managed by the programmer. Global memory, what CUDA calls device memory, sits off-chip and is accessed through the L2 cache. The bandwidth and latency ratios between these levels are extreme compared to a CPU: high-end GPUs can sustain over 3 TB/s of VRAM bandwidth, but the latency for a global memory access is several hundred cycles. Writing GPU code that performs well means writing code that the hardware can schedule around those latencies, which requires understanding the hierarchy well enough to reason about access patterns.

Why a Game Works Here

The Nand2Tetris comparison is useful. That course works not because building a simple CPU is the most efficient way to learn digital logic, but because the building process creates a mental model that abstract descriptions cannot. When you have wired together a half-adder, a full adder, an ALU, and watched carry propagate through gates you placed yourself, the behavior of arithmetic hardware becomes concrete in a way that reading about it does not achieve.

The same principle applies to GPU architecture, but the stakes are higher because the architecture is more counterintuitive. CPU architecture roughly maps to how you think about sequential programs. GPU architecture requires a fundamentally different frame: you are writing a program that runs on thousands of threads, where the hardware’s ability to make progress depends on keeping those threads convergent and their memory access patterns regular. Simulation and game-based approaches are well-suited to building this intuition because they let you observe the consequences of architectural decisions directly.

There is a project on GitHub called tiny-gpu that takes a complementary approach: a minimal GPU implementation in SystemVerilog, small enough to trace through completely, designed explicitly for learning. It implements a simple shader pipeline with a warp scheduler, a register file, and a memory interface, and it runs in simulation. Projects like that and Mvidia are attacking the same problem from different angles. The hardware-description route gives you precise control and forces you to understand every wire, but requires familiarity with HDL toolchains. A browser game removes that barrier entirely.

The Pieces That Are Hard to Convey in Text

Warp scheduling is one of the things that benefits most from interactive exploration. The scheduler selects a warp to issue each cycle based on which warps have their operands ready. When you have many warps competing for issue slots, the scheduler can usually find one that is ready, hiding the latency of the ones that are waiting on memory. When you have few warps, because your kernel uses many registers or a large amount of shared memory, the scheduler has fewer options and the pipeline stalls. This relationship between resource usage, occupancy, and throughput is straightforward to state but takes time to internalize.
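A few lines of simulation make the relationship visible. This is a deliberately crude sketch, not a model of any real scheduler: it assumes one issue slot per cycle, a fixed-priority pick of the first ready warp, and that every instruction is followed by a fixed stall. The function name and the 15-cycle latency are hypothetical, chosen so the numbers are easy to read. Real global memory latency is in the hundreds of cycles.

```python
def issue_utilization(num_warps, mem_latency, cycles=16_000):
    """Fraction of cycles in which the scheduler finds a ready warp.

    Toy model: each warp issues one instruction, then is not ready
    again until mem_latency further cycles have passed.
    """
    ready_at = [0] * num_warps   # cycle at which each warp can issue again
    issued = 0
    for cycle in range(cycles):
        for w in range(num_warps):
            if ready_at[w] <= cycle:
                ready_at[w] = cycle + 1 + mem_latency
                issued += 1
                break            # only one issue slot per cycle
    return issued / cycles

for warps in (1, 8, 16, 32):
    print(warps, issue_utilization(warps, mem_latency=15))
# 1 0.0625
# 8 0.5
# 16 1.0
# 32 1.0
```

In this model utilization climbs linearly with the number of resident warps until there are enough to cover the latency, then flattens. Additional warps beyond that point buy nothing, and too few leave the issue slot idle.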

Memory coalescing is another. When threads in a warp access global memory, the hardware combines those accesses into as few transactions as possible, which works best when the addresses are contiguous. Accessing 4-byte elements with a stride of 32 instead of 1 puts consecutive threads 128 bytes apart, so each thread’s access falls in a different cache line, the accesses cannot be coalesced, and bandwidth utilization collapses. The difference between coalesced and uncoalesced access on real hardware can be an order of magnitude in performance. Seeing that play out in a simulation, where you can deliberately break or restore the access pattern and observe the result, builds understanding faster than working through the math.
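The math itself is small. The sketch below counts how many distinct cache lines one warp touches for a given stride, assuming 4-byte elements and 128-byte lines. Both sizes are assumptions for illustration, as are the function names, and real hardware may split transactions at a finer granularity.

```python
WARP_SIZE = 32
LINE_BYTES = 128   # assumed cache-line size
ELEM_BYTES = 4     # assumed 32-bit elements

def lines_touched(stride, base=0):
    """Distinct cache lines hit when lane i reads element i * stride."""
    addrs = [base + lane * stride * ELEM_BYTES for lane in range(WARP_SIZE)]
    return len({addr // LINE_BYTES for addr in addrs})

def efficiency(stride):
    """Useful bytes divided by bytes fetched, for one warp-wide load."""
    useful = WARP_SIZE * ELEM_BYTES
    fetched = lines_touched(stride) * LINE_BYTES
    return useful / fetched

for stride in (1, 2, 32):
    print(stride, lines_touched(stride), efficiency(stride))
# 1 1 1.0
# 2 2 0.5
# 32 32 0.03125
```

With a stride of 32, the warp pulls in 4096 bytes to use 128 of them, which is where the collapse in effective bandwidth comes from.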

Bank conflicts in shared memory are a related case. Shared memory is divided into 32 banks, each 4 bytes wide, with successive 4-byte words assigned to successive banks. If multiple threads in a warp access different addresses in the same bank simultaneously, the accesses serialize. A broadcast, where all threads read the same address, is fine. A conflict pattern, like all threads accessing distinct addresses that map to the same bank, serializes into 32 sequential accesses. The hardware behavior is well-defined, but the rules are specific enough that most people need to encounter the pattern a few times before it becomes automatic to avoid.
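The rule can be checked mechanically. This is a minimal sketch of the 32-bank, 4-byte-word mapping described above. The function name is hypothetical, and the model treats the worst bank's count of distinct words as the number of serialized passes.

```python
from collections import defaultdict

NUM_BANKS = 32
BANK_BYTES = 4
WARP_SIZE = 32

def conflict_degree(byte_addrs):
    """Serialized passes needed for one warp-wide shared memory access.

    Lanes reading the same word share a broadcast, so only distinct
    words within a bank count toward a conflict.
    """
    words_per_bank = defaultdict(set)
    for addr in byte_addrs:
        word = addr // BANK_BYTES
        words_per_bank[word % NUM_BANKS].add(word)
    return max(len(words) for words in words_per_bank.values())

def strided(word_stride):
    return [lane * word_stride * BANK_BYTES for lane in range(WARP_SIZE)]

print(conflict_degree(strided(1)))       # 1  (one word per bank)
print(conflict_degree([0] * WARP_SIZE))  # 1  (broadcast)
print(conflict_degree(strided(32)))      # 32 (every lane hits bank 0)
print(conflict_degree(strided(33)))      # 1  (stride 33 spreads across banks)
```

The last line shows why padding a 32-wide row to 33 words is a common fix for column-wise access: the extra word shifts each row into a different bank.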

Where This Fits

GPU programming has become more central to software in the past decade than most people would have predicted in 2010. CUDA and ROCm underpin the compute infrastructure for machine learning. Vulkan and WebGPU have made shader programming accessible outside of the traditional graphics pipeline. WebGPU in particular brings GPU compute to the browser with a coherent API that reflects modern GPU architecture more honestly than WebGL did. Understanding the hardware is no longer a specialty concern for graphics engineers; it is relevant to anyone running inference workloads, writing custom CUDA kernels for deep learning research, or trying to understand why their PyTorch code is slow.

The educational infrastructure has not kept pace with that shift in relevance. NVIDIA’s own documentation for CUDA programming is thorough but assumes you are already motivated and reasonably expert. Academic courses on GPU architecture exist but are not widely accessible. The GPGPU-Sim cycle-accurate simulator used in research is powerful but oriented toward PhD-level work, not introductory learning.

A game that puts the pieces of a GPU in your hands and asks you to assemble them, observe their behavior, and understand why they are designed as they are fills a real gap. The Mvidia project is one person’s attempt at that, built because the person wanted a resource like it and it did not exist. That is a reasonable motivation, and the reception it received on Hacker News suggests there is genuine demand for more of this kind of thing.