
Learning GPU Architecture Through a Game That Makes Warp Divergence Hurt

Source: hackernews

Most programmers learn that GPUs are fast. Fewer learn why GPUs are fast in the specific situations where they excel, and why they fall apart in others. That gap exists because the resources for GPU microarchitecture education are either deeply academic (cycle-accurate simulators requiring days of setup) or too abstract (diagrams that show boxes labeled “SM” without conveying the cost of a divergent branch).

A project called mvidia takes a different approach: it puts you inside a simplified GPU and asks you to manage it. The name is a deliberate lowercase jab at NVIDIA, and the game runs entirely in the browser. The author’s stated reason for building it was that the resources for GPU architecture were lacking, and after spending time with it, that framing is accurate.

What the Game Actually Teaches

The core of mvidia is the Streaming Multiprocessor (SM), the fundamental compute unit of an NVIDIA-style GPU. A modern H100 has 132 SMs; each one is itself a parallel processor with four warp schedulers, a large register file, shared memory, and 128 FP32 cores. The game renders a stylized SM and asks you to watch, and in some cases influence, how warps execute through it.

A warp is a group of 32 threads that execute in lockstep. This is NVIDIA’s SIMT model: Single Instruction, Multiple Threads. Every thread in a warp executes the same instruction every cycle, but each operates on its own data and has its own register state. The hardware executes all 32 threads simultaneously, which is where the throughput comes from.

The problem the game focuses on is warp divergence. When threads within a warp hit a conditional branch and evaluate it differently, some threads need to take the true branch and some need to take the false branch. The hardware cannot do both at once. It serializes them: first the true-path threads execute with the false-path threads masked off (disabled, but still occupying their execution lanes), then the false-path threads execute with the mask inverted. If both paths are about the same length, a divergent warp runs that code region at roughly half throughput no matter how the threads split. What changes is lane utilization on each pass: a warp split cleanly in half keeps 16 of 32 lanes busy on both paths, while a warp where 31 threads take one branch and 1 thread takes the other runs the minority path at 1/32 utilization.
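The serialization cost described above can be put into numbers with a small model. This is a minimal sketch, not how mvidia or real hardware accounts for it: the function name `divergent_branch_cost` and the path lengths are hypothetical, and the model assumes one issue slot per instruction per path.

```python
WARP_SIZE = 32  # NVIDIA warp width, as described above

def divergent_branch_cost(takes_true, true_len, false_len):
    """Return (issue_slots, lane_utilization) for one warp running an if/else.

    takes_true: list of WARP_SIZE booleans, one branch result per thread.
    true_len / false_len: instruction counts of each path (hypothetical).
    """
    assert len(takes_true) == WARP_SIZE
    n_true = sum(takes_true)
    n_false = WARP_SIZE - n_true

    # A path is issued only if at least one thread needs it.
    slots = (true_len if n_true else 0) + (false_len if n_false else 0)

    # Useful work: each thread only does real work on its own path.
    useful = n_true * true_len + n_false * false_len
    utilization = useful / (slots * WARP_SIZE) if slots else 1.0
    return slots, utilization

uniform = [True] * WARP_SIZE
half = [i % 2 == 0 for i in range(WARP_SIZE)]
one_off = [i != 0 for i in range(WARP_SIZE)]

for name, mask in [("uniform", uniform), ("half", half), ("31/1", one_off)]:
    slots, util = divergent_branch_cost(mask, true_len=10, false_len=10)
    print(f"{name:8s} slots={slots:3d} utilization={util:.2f}")
```

With equal path lengths, the half-and-half warp and the 31/1 warp both take twice the issue slots of the uniform warp. A single stray thread is enough to pay for the whole second path.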

mvidia makes this visible and visceral in a way that a paragraph in a textbook never quite does. The inactive threads literally go gray. You can see the utilization drop.

SIMT vs. SIMD: The Distinction That Gets Papered Over

CPU programmers are often told that GPUs are “just SIMD at scale.” This is approximately true and precisely wrong in the way that matters.

SIMD on a CPU is explicit. You operate on a 256-bit AVX register containing eight 32-bit floats, and every element gets the same operation. Divergence is your problem to manage; the hardware does not handle it for you. The programmer (or compiler) must explicitly predicate instructions or restructure the code to avoid branching within a vector.

SIMT is implicit. You write scalar-looking code, and the hardware groups your threads into warps and handles the divergence automatically via predicate masking. The abstraction is higher. The cost is that the hardware is performing that masking for you, and if your code diverges heavily, the efficiency penalty is paid silently unless you profile.
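To make the masking concrete, here is a toy emulation in plain Python. It is an illustrative sketch only: `scalar_kernel` and `warp_execute` are made-up names, and real hardware tracks the active mask in the scheduler rather than in loops like these.

```python
WARP_SIZE = 32

def scalar_kernel(x):
    # What the programmer writes: ordinary per-thread branching.
    if x % 2 == 0:
        return x * 2
    else:
        return x + 100

def warp_execute(inputs):
    """Run the same logic the way a SIMT warp would: one path at a time,
    with an active mask deciding which lanes commit results."""
    assert len(inputs) == WARP_SIZE
    mask = [x % 2 == 0 for x in inputs]  # branch predicate, one per lane
    out = [None] * WARP_SIZE

    # Pass 1: true path. Lanes whose mask bit is False are disabled.
    for lane in range(WARP_SIZE):
        if mask[lane]:
            out[lane] = inputs[lane] * 2

    # Pass 2: false path. The mask is inverted.
    for lane in range(WARP_SIZE):
        if not mask[lane]:
            out[lane] = inputs[lane] + 100

    return out

inputs = list(range(WARP_SIZE))
# The masked two-pass execution matches the scalar-looking code.
assert warp_execute(inputs) == [scalar_kernel(x) for x in inputs]
```

The results are identical to the scalar version; the only difference is that the warp walked both paths to get them, which is the silent cost the paragraph above refers to.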

In Volta and later NVIDIA architectures, NVIDIA added Independent Thread Scheduling, which gives each thread in a warp its own program counter. This allows the scheduler to interleave instructions from divergent branches at the instruction level rather than serializing whole paths. It does not eliminate the fundamental throughput cost, because divergent paths still cannot issue in the same cycle. Its main benefit is correctness: threads in the same warp can wait on one another (for example, through a lock) without deadlocking. The flip side is that older code relying on implicit warp-level synchronization is no longer safe and needs explicit __syncwarp() calls.

AMD’s equivalent to the warp is the wavefront, which is 64 threads wide on GCN and CDNA hardware rather than 32. A wider group amortizes instruction fetch and decode over more threads on non-divergent code, but it also makes it more likely that a given branch splits the group, and it leaves more lanes idle when that happens. It is a genuine architectural tradeoff, not just a different constant.
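One way to get a feel for the width tradeoff is a back-of-the-envelope probability. The sketch below assumes each thread independently takes a rare branch with probability `p_rare`, which is a simplifying assumption (real workloads are usually correlated across neighboring threads); the function name `p_diverge` is hypothetical.

```python
def p_diverge(width, p_rare):
    """Probability that a group of `width` threads splits across both paths,
    assuming each thread independently takes the rare path with p_rare."""
    all_common = (1.0 - p_rare) ** width  # nobody takes the rare path
    all_rare = p_rare ** width            # everybody takes the rare path
    return 1.0 - all_common - all_rare

for width in (32, 64):
    print(f"width {width}: P(divergent) = {p_diverge(width, 0.01):.3f}")
```

Under that independence assumption, the wider group diverges noticeably more often for the same per-thread branch probability.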

Memory: Where Most Real Performance Goes

Warp divergence gets more attention in introductory material, but memory access patterns are responsible for a larger fraction of real-world GPU performance problems.

The SM’s memory hierarchy has roughly these latency characteristics on Ampere-class hardware:

  • Registers: effectively zero cycles (operand forwarding)
  • L1 cache / shared memory: approximately 28 cycles (these share one configurable pool per SM: 128 KB on consumer GA10x parts, 192 KB on A100)
  • L2 cache: approximately 200 cycles (40 MB on A100)
  • Global memory (HBM or GDDR): approximately 600 cycles
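The latencies in the list above combine into an expected cost per load once hit rates are known. The following is a rough sketch, assuming each listed figure is the total latency of a load served at that level; the hit rates passed in are hypothetical inputs, not measurements.

```python
# Approximate latencies in cycles, taken from the list above.
LATENCY = {"l1": 28, "l2": 200, "global": 600}

def average_load_latency(l1_hit_rate, l2_hit_rate):
    """Expected cycles per load. l2_hit_rate applies only to L1 misses."""
    l1_miss = 1.0 - l1_hit_rate
    l2_miss = 1.0 - l2_hit_rate
    return (l1_hit_rate * LATENCY["l1"]
            + l1_miss * l2_hit_rate * LATENCY["l2"]
            + l1_miss * l2_miss * LATENCY["global"])

for l1, l2 in [(0.9, 0.5), (0.5, 0.5), (0.0, 0.0)]:
    print(f"L1 hit {l1:.0%}, L2 hit {l2:.0%}: "
          f"{average_load_latency(l1, l2):.0f} cycles")
```

GPUs hide much of this latency by switching to other resident warps, so the number is a cost to be covered by parallelism rather than a direct stall time.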

When 32 threads in a warp each load a 4-byte float from consecutive aligned addresses, the hardware coalesces all 32 accesses into a single 128-byte cache line transaction. That is the happy path. When the same 32 threads access scattered or strided addresses, the hardware generates up to 32 separate transactions, consuming proportionally more memory bandwidth and pipeline slots.
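The coalescing rule can be approximated by counting how many distinct 128-byte lines a warp touches. This is a simplified model under stated assumptions: real hardware works in finer-grained sectors within a line, and `transactions_for_warp` is a hypothetical helper, not a CUDA or profiler API.

```python
WARP_SIZE = 32
LINE_BYTES = 128  # cache line size used in the paragraph above

def transactions_for_warp(base, stride_elems, elem_bytes=4):
    """Count distinct 128-byte lines touched when thread t loads
    element (base + t * stride_elems) of an array starting at address 0."""
    lines = set()
    for t in range(WARP_SIZE):
        addr = (base + t * stride_elems) * elem_bytes
        first_line = addr // LINE_BYTES
        last_line = (addr + elem_bytes - 1) // LINE_BYTES
        lines.update(range(first_line, last_line + 1))
    return len(lines)

print("stride 1, aligned:   ", transactions_for_warp(0, 1))
print("stride 1, misaligned:", transactions_for_warp(1, 1))
print("stride 2:            ", transactions_for_warp(0, 2))
print("stride 32:           ", transactions_for_warp(0, 32))
```

Consecutive aligned floats fit in one line, a one-element misalignment or a stride of 2 costs a second line, and a stride of 32 floats puts every thread on its own line.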

Shared memory, the on-SM programmable scratchpad (up to roughly 100 KB per SM on consumer Ampere, depending on the configured L1 split), sidesteps the coalescing rules because it is explicitly managed by the programmer via __shared__ in CUDA. It is fast, but it has its own hazard: it is divided into 32 banks, matching the warp width, with successive 32-bit words mapped to successive banks. Two threads in a warp accessing different addresses in the same bank cause a bank conflict, which serializes those accesses (threads reading the same address share a broadcast and do not conflict). A warp that hits a 32-way bank conflict in shared memory gets 1/32 the bandwidth of a conflict-free access pattern.
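Bank conflicts are easy to reason about with a few lines of code. The sketch below assumes 32 banks with successive 32-bit words mapped to successive banks, and treats threads reading the same word as a conflict-free broadcast; `conflict_degree` is an illustrative name.

```python
from collections import defaultdict

NUM_BANKS = 32  # one bank per 32-bit word, wrapping every 32 words

def conflict_degree(word_indices):
    """Worst-case number of serialized passes for one warp's shared-memory
    access. word_indices holds the 32-bit word index each thread reads.
    Threads reading the same word share a broadcast and do not conflict."""
    words_per_bank = defaultdict(set)
    for w in word_indices:
        words_per_bank[w % NUM_BANKS].add(w)
    return max(len(words) for words in words_per_bank.values())

threads = range(32)
print("stride 1: ", conflict_degree([t for t in threads]))
print("stride 2: ", conflict_degree([t * 2 for t in threads]))
print("stride 32:", conflict_degree([t * 32 for t in threads]))
print("stride 33:", conflict_degree([t * 33 for t in threads]))
```

The stride-33 case is the reason kernels often pad a 32-wide shared-memory tile to 33 columns: the extra word shifts each row into a different bank.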

mvidia reportedly visualizes these access patterns, showing the difference between a coalesced global load and scattered accesses. This kind of visualization is genuinely hard to find elsewhere.

The Educational Landscape Before This

GPGPU-Sim, developed at UBC, is the most rigorous open-source GPU simulator. It models the full SM pipeline, warp scheduling, memory hierarchy, and interconnect at cycle accuracy. It is used in computer architecture research and can simulate specific GPU generations. It is not something you open in a browser in twenty minutes. Setup involves building from source, configuring a GPU model, and interpreting dense trace logs.

Accel-Sim, developed at Purdue in collaboration with UBC, extended GPGPU-Sim to use real hardware traces collected via NVIDIA’s NVBit instrumentation framework. It achieves better accuracy but adds more infrastructure.

MGPUSim from Northeastern simulates AMD GCN architecture in Go, with a more modular design. Again, not a browser game.

The practical tool that shows real GPU behavior is NVIDIA Nsight Compute. It profiles actual hardware and surfaces warp occupancy, memory throughput, pipe utilization, and stall reasons with remarkable detail. But it requires an NVIDIA GPU, requires CUDA code to profile, and presents its output in a profiler interface that is itself a learning curve.

Academic courses like Stanford CS149 and CMU 15-418 cover GPU architecture well in slide decks, but slides are static.

Kirk and Hwu’s Programming Massively Parallel Processors remains the standard textbook. The fourth edition covers CUDA 11, the memory hierarchy, and warp execution in depth. It is thorough and worth reading. It is also a textbook.

The pedagogical model that mvidia most resembles is nand2tetris, the project that teaches computer architecture by having you build a computer from NAND gates up through an assembler, a compiler, and a small operating system. nand2tetris works because the cost of wrong decisions is immediate and concrete. mvidia applies that philosophy to the specific domain of GPU execution.

Why the Abstraction Level Matters

GPU performance optimization is unusual in that the most impactful decisions are often invisible to a programmer who lacks specific hardware knowledge. Writing arr[threadIdx.x] versus arr[threadIdx.x * 2] looks like a minor change. The first coalesces into a single cache line; the second spreads the warp across two lines and uses only half of each, which can halve your effective memory bandwidth. Writing an if/else inside a kernel looks harmless. If that branch diverges across threads in a warp, your throughput for that section drops in proportion to how many paths the warp must serialize.

Nsight Compute will tell you, after the fact, that your warp efficiency was 43% and your global load transactions per request were 8.2. Understanding why requires a mental model of the SM that most programmers have not built.

Simulators like GPGPU-Sim build that model, but the setup cost filters out everyone except researchers. Textbooks build it, but passively. mvidia builds it through the feedback loop that games have always used: show consequences immediately, and let the player develop intuition through iteration.

The Hacker News thread for the project collected 161 comments and 788 points, which for a solo educational project is a signal that the gap it fills is real. GPU architecture sits at an awkward intersection: too low-level for most application developers to care about until they hit a performance wall, too specific for general systems programming resources to cover in depth.

What It Does Not Cover

A game that runs in the browser will necessarily simplify, and several parts of a modern GPU fall outside its scope. Tensor cores, which perform matrix multiply-accumulate operations at FP16/BF16/FP8/INT8 precision, are central to any serious ML workload. Register file pressure, where each thread’s register count directly constrains how many warps can be resident on an SM simultaneously (trading occupancy against register spilling), is a subtlety that matters enormously in practice. The L2 cache interconnect, the NVLink fabric, and the copy engine for host-device transfers all sit outside the SM.
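The register pressure tradeoff mentioned above reduces to simple arithmetic. This is a sketch under assumed limits: 65,536 32-bit registers per SM and at most 64 resident warps, which vary by GPU generation. It ignores register allocation granularity, shared memory usage, and block-size limits, all of which also cap occupancy on real hardware.

```python
WARP_SIZE = 32
REGS_PER_SM = 65536     # assumed: 64K 32-bit registers per SM
MAX_WARPS_PER_SM = 64   # assumed hardware limit; varies by GPU

def resident_warps(regs_per_thread):
    """Warps that fit on one SM when registers are the only limit."""
    regs_per_warp = regs_per_thread * WARP_SIZE
    return min(MAX_WARPS_PER_SM, REGS_PER_SM // regs_per_warp)

for regs in (32, 64, 128, 255):
    warps = resident_warps(regs)
    print(f"{regs:3d} regs/thread -> {warps:2d} warps "
          f"({warps / MAX_WARPS_PER_SM:.0%} occupancy)")
```

Under these assumptions, doubling the registers each thread uses halves the number of warps available to hide memory latency, which is why compilers and programmers sometimes accept spilling to keep occupancy up.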

mvidia is also NVIDIA-shaped. AMD’s RDNA architectures execute 32-thread wavefronts natively and can also run shaders in a 64-thread mode, while GCN and CDNA use 64-thread wavefronts throughout. Intel’s Xe architecture organizes execution differently again, with Execution Units grouped into Xe-cores. The concepts transfer, but the constants and the specific tradeoffs differ.

None of that is a criticism of the game. A resource that teaches warp divergence and memory coalescing well has provided real value, because those two concepts account for a large fraction of GPU performance problems in practice. Building on a solid SIMT mental model is far easier than starting from scratch when Nsight Compute tells you your warp efficiency is poor.

The project is available to play in the browser. For anyone writing GPU kernels, or planning to, spending an hour with it before opening the CUDA Programming Guide is a reasonable order of operations.
