Open-World Design on N64 Hardware Is a Systems Programming Problem

Someone built an open-world engine for the Nintendo 64, documented the process, and put it on YouTube. The feat is worth examining not as a curiosity but as a systems engineering case study. The N64’s hardware is specific enough that its constraints force particular solutions, and those solutions illuminate design principles that stay relevant well beyond this console.

The Hardware You Are Actually Fighting

The N64’s architecture splits rendering work between the CPU and the RCP (Reality Co-Processor). The CPU is a MIPS R4300i running at 93.75 MHz, with 16 KB instruction cache and 8 KB data cache. The RCP contains two distinct processors on one die: the RSP (Reality Signal Processor), which handles vertex transforms, lighting, and audio mixing, and the RDP (Reality Display Processor), which rasterizes triangles.

The RSP is the strangest part of the system for anyone coming from modern platforms. It has no cache, no virtual memory, and no floating-point unit; its vector unit works on 16-bit fixed-point lanes. It operates from a 4 KB data scratchpad (DMEM) and a 4 KB instruction scratchpad (IMEM), and reaches RDRAM only through explicit DMA. Whatever microcode is running, including the geometry transformation pipeline, must fit within those 4 KB of instruction memory at any one time. Nintendo shipped several official microcode variants, including F3D (Fast 3D) and the later F3DEX2, each making different trade-offs between feature set and throughput. The libdragon ecosystem has since produced open-source RSP microcode, including that of the Tiny3D renderer, which gives modern homebrew developers a viable alternative to Nintendo’s proprietary libraries.

The RDP’s constraint is the one that shapes texture decisions most directly. Its texture memory, TMEM, is 4 KB total. A single 32×32 RGBA16 texture occupies 2 KB, meaning you get exactly two of them before TMEM is full. The practical response is to use smaller tiles (16×16 or 8×8), indexed color formats (CI4, at 4 bits per texel with 16-color palettes, uses one-quarter the texel memory of RGBA16), or texture atlases with sub-tile addressing via the RDP’s tile descriptor system. Every texture decision in an N64 engine is downstream of this 4 KB ceiling.
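The TMEM arithmetic above can be sketched in a few lines of C. The names here (tmem_texel_bytes, tmem_capacity, tex_bpp_t) are illustrative and not part of any SDK. The sketch assumes each texture row is padded to a 64-bit TMEM word, and that color-indexed formats leave only the lower 2 KB for texels because the palette lives in the upper half of TMEM.

```c
#include <stdint.h>

#define TMEM_BYTES 4096u

/* Bits per texel for a few RDP texture formats (illustrative subset). */
typedef enum { FMT_CI4 = 4, FMT_CI8 = 8, FMT_RGBA16 = 16 } tex_bpp_t;

/* Bytes of TMEM needed for the texels of a w x h texture.
   Assumption: each row is padded up to a multiple of 8 bytes (one TMEM word). */
static uint32_t tmem_texel_bytes(uint32_t w, uint32_t h, tex_bpp_t bpp)
{
    uint32_t row_bytes  = (w * (uint32_t)bpp + 7u) / 8u;
    uint32_t row_padded = (row_bytes + 7u) & ~7u;
    return row_padded * h;
}

/* How many textures of this size fit in TMEM at once.
   Assumption: for CI formats the palette occupies the upper half of TMEM,
   so only 2 KB remains for texel data. */
static uint32_t tmem_capacity(uint32_t w, uint32_t h, tex_bpp_t bpp)
{
    uint32_t avail = (bpp == FMT_CI4 || bpp == FMT_CI8) ? TMEM_BYTES / 2u
                                                        : TMEM_BYTES;
    return avail / tmem_texel_bytes(w, h, bpp);
}
```

Calling tmem_capacity(32, 32, FMT_RGBA16) reproduces the two-texture limit, while tmem_texel_bytes(32, 32, FMT_CI4) comes to 512 bytes, a quarter of the RGBA16 footprint.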

Main RAM is 4 MB of RDRAM, expandable to 8 MB with the Expansion Pak. RDRAM has high latency for random access (roughly 640 ns) but good burst throughput for sequential reads. This makes spatial locality in memory layout much more important than on PC platforms, where cache hierarchies and prefetchers hide random access costs. Scattering small allocations across the address space on the N64 produces measurable performance penalties.
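One common way to keep related data contiguous is a bump (arena) allocator per world chunk, so everything a chunk owns sits in one sequential block of RDRAM. The sketch below is a generic pattern, not code from the engine discussed here; arena_t and its functions are hypothetical names, and it assumes the backing buffer itself is already suitably aligned.

```c
#include <stddef.h>
#include <stdint.h>

/* A bump allocator over one contiguous block of memory. */
typedef struct {
    uint8_t *base;  /* start of the backing block (assumed already aligned) */
    size_t   size;  /* total bytes in the block */
    size_t   used;  /* bytes handed out so far */
} arena_t;

static void arena_init(arena_t *a, void *mem, size_t size)
{
    a->base = (uint8_t *)mem;
    a->size = size;
    a->used = 0;
}

/* Returns the next aligned region, or NULL if the block is exhausted.
   'align' must be a power of two. */
static void *arena_alloc(arena_t *a, size_t bytes, size_t align)
{
    size_t start = (a->used + (align - 1)) & ~(align - 1);
    if (start > a->size || bytes > a->size - start)
        return NULL;
    a->used = start + bytes;
    return a->base + start;
}

/* Releases everything at once, e.g. when the owning chunk is evicted. */
static void arena_reset(arena_t *a)
{
    a->used = 0;
}
```

Allocating a chunk's vertices, collision data, and actor state from the same arena keeps them adjacent, and evicting the chunk becomes a single arena_reset rather than many individual frees.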

The Streaming Problem

An open world requires loading terrain, geometry, textures, and collision data as the player moves. On the N64, all of that data lives on the cartridge ROM. Nothing executes from the cartridge directly; everything must be DMA’d into RDRAM first via the Peripheral Interface (PI) bus, which provides roughly 20 to 40 MB per second sustained read throughput.

The latency for the first byte off a cartridge is around 1 microsecond, after which burst transfer proceeds at the PI bus rate. For sequential large reads this is workable. For many small random reads it is not, which means world data must be laid out in ROM so that everything needed to load a given chunk is contiguous.
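A rough cost model makes the contiguous-versus-scattered difference concrete. This is a back-of-the-envelope sketch: pi_model_t and pi_transfer_us are hypothetical names, and setup_us stands for whatever fixed cost each request carries on a given setup (first-byte latency plus any interrupt and queueing overhead), which has to be measured rather than assumed.

```c
#include <stdint.h>

/* Simple linear model of PI transfers: a fixed cost per request plus
   a per-byte cost at the sustained rate. */
typedef struct {
    double setup_us;     /* fixed cost per DMA request, in microseconds (assumed) */
    double bytes_per_us; /* sustained rate; 20 MB/s is 20 bytes per microsecond */
} pi_model_t;

/* Estimated time to move total_bytes split across 'requests' separate DMAs. */
static double pi_transfer_us(const pi_model_t *m, uint32_t total_bytes,
                             uint32_t requests)
{
    return (double)requests * m->setup_us
         + (double)total_bytes / m->bytes_per_us;
}
```

With the figures quoted above (1 microsecond per request, 20 bytes per microsecond), a 64 KB chunk read in one request costs about 3.3 ms, and the same bytes split into 512 requests add roughly half a millisecond of pure setup time. If the real per-request cost is higher once interrupt handling is included, the gap widens proportionally.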

The standard approach is to divide the world into fixed-size chunks, DMA them asynchronously (using the N64’s DMA interrupt mechanism rather than spinning the CPU in a wait loop), and evict the least recently used chunks when a new chunk must be loaded. The double-buffering pattern is common: one chunk loads in the background while the currently resident set is being rendered, so the game loop never stalls on cartridge reads. Prefetching along the player’s direction of travel gives the asynchronous load time to complete before the player actually reaches the new region.
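The chunk cache described above might look roughly like the following. This is a minimal sketch under stated assumptions, not the engine's actual code: every name is hypothetical, start_async_load is a stand-in that only counts calls where a real build would queue a PI DMA, and the slot count of four is arbitrary.

```c
#include <stdbool.h>
#include <stdint.h>

#define SLOT_COUNT 4
#define NO_SLOT   (-1)

typedef struct {
    int32_t  cx, cz;     /* chunk grid coordinates */
    uint32_t last_used;  /* frame number of the most recent request */
    bool     valid;      /* slot has been assigned to a chunk */
    bool     ready;      /* DMA finished; safe to render */
} chunk_slot_t;

typedef struct {
    chunk_slot_t slots[SLOT_COUNT];
    uint32_t     frame;
} chunk_cache_t;

/* Hypothetical hook: a real engine would queue an asynchronous DMA here. */
static int g_loads_started;
static void start_async_load(int slot, int32_t cx, int32_t cz)
{
    (void)slot; (void)cx; (void)cz;
    g_loads_started++;
}

/* Floor division of a world coordinate into a chunk index. */
static int32_t chunk_coord(float p, float chunk_size)
{
    float   q = p / chunk_size;
    int32_t i = (int32_t)q;          /* truncates toward zero */
    return ((float)i > q) ? i - 1 : i;
}

static int cache_find(const chunk_cache_t *c, int32_t cx, int32_t cz)
{
    for (int i = 0; i < SLOT_COUNT; i++)
        if (c->slots[i].valid && c->slots[i].cx == cx && c->slots[i].cz == cz)
            return i;
    return NO_SLOT;
}

/* Returns the slot for a chunk, starting a load into a free or
   least recently used slot if the chunk is not resident. */
static int cache_request(chunk_cache_t *c, int32_t cx, int32_t cz)
{
    int hit = cache_find(c, cx, cz);
    if (hit != NO_SLOT) {
        c->slots[hit].last_used = c->frame;
        return hit;
    }
    int victim = 0;
    for (int i = 0; i < SLOT_COUNT; i++) {
        if (!c->slots[i].valid) { victim = i; break; }
        if (c->slots[i].last_used < c->slots[victim].last_used) victim = i;
    }
    c->slots[victim] = (chunk_slot_t){ .cx = cx, .cz = cz,
                                       .last_used = c->frame,
                                       .valid = true, .ready = false };
    start_async_load(victim, cx, cz);
    return victim;
}

/* Called from the DMA-complete interrupt path in a real engine. */
static void cache_on_dma_complete(chunk_cache_t *c, int slot)
{
    c->slots[slot].ready = true;
}

/* Once per frame: keep the current chunk resident and prefetch the chunk
   the player will occupy 'lookahead_s' seconds from now. */
static void cache_update(chunk_cache_t *c, float px, float pz,
                         float vx, float vz,
                         float chunk_size, float lookahead_s)
{
    c->frame++;
    cache_request(c, chunk_coord(px, chunk_size), chunk_coord(pz, chunk_size));
    cache_request(c, chunk_coord(px + vx * lookahead_s, chunk_size),
                     chunk_coord(pz + vz * lookahead_s, chunk_size));
}
```

The renderer would skip any slot whose ready flag is still false, which is what keeps the frame loop from blocking on the cartridge. A production version would also need to avoid evicting a slot whose DMA is still in flight.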

With 4 MB total RDRAM, the effective budget after accounting for frame buffers, the Z-buffer, and system overhead is roughly 3 MB for all game data simultaneously resident. A frame buffer for 320×240 at 16 bits per pixel costs 150 KB (320 × 240 × 2 bytes); the Z-buffer costs another 150 KB, so a double-buffered setup reserves 450 KB and a triple-buffered one 600 KB. The OS, audio buffers, RSP task buffers, and stack eat another few hundred kilobytes. What remains must hold every piece of the world currently needed: visible geometry, textures, collision meshes, actor state, and code overlays.
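The buffer arithmetic is easy to check directly. In this sketch, mem_config_t and the function names are illustrative, and system_overhead_bytes is a placeholder for the OS, audio, task buffer, and stack costs, which vary by project.

```c
#include <stdint.h>

typedef struct {
    uint32_t width, height;
    uint32_t bytes_per_pixel;       /* 2 for a 16-bit color buffer */
    uint32_t framebuffer_count;     /* 2 for double buffering, 3 for triple */
    uint32_t system_overhead_bytes; /* OS, audio, RSP task buffers, stacks (assumed) */
} mem_config_t;

/* Size of one color buffer in bytes. */
static uint32_t framebuffer_bytes(const mem_config_t *c)
{
    return c->width * c->height * c->bytes_per_pixel;
}

/* RDRAM left for game data after color buffers, a 16-bit Z-buffer,
   and the assumed system overhead. Returns 0 if nothing is left. */
static uint32_t game_data_budget(const mem_config_t *c, uint32_t rdram_bytes)
{
    uint32_t color = framebuffer_bytes(c) * c->framebuffer_count;
    uint32_t depth = c->width * c->height * 2u;
    uint32_t fixed = color + depth + c->system_overhead_bytes;
    return (fixed >= rdram_bytes) ? 0u : rdram_bytes - fixed;
}
```

For 320×240 at 16 bits per pixel, framebuffer_bytes returns 153,600. With two color buffers, one Z-buffer, and an assumed 400 KB of overhead, game_data_budget leaves 3,323,904 bytes of a 4 MB system, and that figure still has to cover code as well as data.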

Commercial N64 games solved this with room-based architectures. Ocarina of Time uses a scene and room system: the world is divided into scenes, and each scene is subdivided into room files that are DMA’d on entry. Actor code loads as relocatable overlays, brought into RDRAM on demand and relocated at load time. Only the actors needed by the current room are resident. An open-world engine without discrete rooms must solve the same problem without that architectural boundary, which makes the streaming system the central engineering challenge.

Display Lists and the CPU/RSP/RDP Pipeline

Rendering on the N64 works by having the CPU generate a display list, a sequence of 64-bit command words in RDRAM, then submitting an RSP task that points to that display list. The RSP executes the microcode, which interprets the display list, transforms vertices, and emits RDP commands. The RDP rasterizes and writes pixels to the framebuffer.

The performance model requires keeping all three stages busy simultaneously. The CPU builds the next frame’s display list while the RSP processes the current one while the RDP rasterizes the previous one. Stalling any stage propagates through the pipeline.

For an open-world engine, the display list architecture has to handle a variable and potentially large number of visible objects. Pre-baking static geometry into ROM-stored display lists and calling them from a master display list is a common approach: the CPU DMAs a static display list into RDRAM once, then emits a call instruction referencing it each frame, rather than re-emitting every vertex command. This moves the bandwidth cost to the DMA operation and keeps per-frame CPU work proportional to the number of visible objects rather than the raw triangle count.
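The master-list pattern can be sketched as follows. The opcodes and bit layout here are purely illustrative and do not match F3DEX2 or any other real microcode; object_t and build_master_dl are hypothetical names. The point is only that the per-frame loop writes one 64-bit command per visible object.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint64_t dl_cmd_t;

/* Illustrative opcodes only; not the encoding of any real microcode. */
enum { OP_CALL = 0x01, OP_END = 0x02 };

static dl_cmd_t cmd_call(uint32_t addr)
{
    return ((dl_cmd_t)OP_CALL << 56) | addr;
}

static dl_cmd_t cmd_end(void)
{
    return (dl_cmd_t)OP_END << 56;
}

typedef struct {
    uint32_t static_dl_addr; /* RDRAM address of a pre-baked list, already DMA'd */
    int      visible;        /* result of this frame's culling */
} object_t;

/* Writes one call per visible object followed by an end command.
   Returns the number of commands written, or 0 if 'out' is too small. */
static size_t build_master_dl(dl_cmd_t *out, size_t cap,
                              const object_t *objs, size_t count)
{
    size_t n = 0;
    for (size_t i = 0; i < count; i++) {
        if (!objs[i].visible)
            continue;
        if (n + 1 >= cap)       /* keep room for the end command */
            return 0;
        out[n++] = cmd_call(objs[i].static_dl_addr);
    }
    if (n >= cap)
        return 0;
    out[n++] = cmd_end();
    return n;
}
```

An object with ten triangles and one with five hundred cost the CPU the same single command here; the difference shows up only in RSP and RDP time.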

The RSP’s vertex cache (32 vertices in F3DEX2) means display lists must load vertices in batches of at most 32 and draw as many triangles as possible from each batch. A mesh whose triangles jump around the vertex array forces the same vertices to be reloaded and retransformed, and throughput drops sharply. Indexed triangle strips help where the microcode supports them, but the 32-vertex limit still shapes how meshes are authored and packed into ROM.
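An offline mesh packer can respect the cache limit with a simple greedy split, sketched below. This is one possible approach, not the method used by any particular toolchain; the names are hypothetical, and it assumes triangles have already been ordered for locality, since a greedy pass does nothing to improve a badly ordered mesh.

```c
#include <stddef.h>
#include <stdint.h>

#define VTX_CACHE_SIZE 32

typedef struct {
    uint16_t verts[VTX_CACHE_SIZE]; /* mesh vertex indices, in cache slot order */
    int      vert_count;
    size_t   first_tri;             /* first triangle drawn from this load */
    size_t   tri_count;
} vtx_batch_t;

/* Cache slot holding mesh vertex v, or -1 if it is not loaded. */
static int batch_slot(const vtx_batch_t *b, uint16_t v)
{
    for (int i = 0; i < b->vert_count; i++)
        if (b->verts[i] == v)
            return i;
    return -1;
}

/* Adds a triangle if its missing vertices still fit; returns 1 on success. */
static int batch_try_add(vtx_batch_t *b, const uint16_t tri[3])
{
    int need = 0;
    for (int k = 0; k < 3; k++) {
        int dup = 0;
        for (int j = 0; j < k; j++)
            if (tri[j] == tri[k]) dup = 1;   /* repeated index in one triangle */
        if (!dup && batch_slot(b, tri[k]) < 0)
            need++;
    }
    if (b->vert_count + need > VTX_CACHE_SIZE)
        return 0;
    for (int k = 0; k < 3; k++)
        if (batch_slot(b, tri[k]) < 0)
            b->verts[b->vert_count++] = tri[k];
    b->tri_count++;
    return 1;
}

/* Greedy split: consume triangles in order and open a new batch whenever
   the next triangle would overflow the cache. Returns the batch count,
   or 0 if the mesh is empty or max_batches is too small. */
static size_t split_into_batches(const uint16_t *indices, size_t tri_count,
                                 vtx_batch_t *out, size_t max_batches)
{
    size_t nb = 0;
    for (size_t t = 0; t < tri_count; t++) {
        const uint16_t *tri = &indices[t * 3];
        if (nb > 0 && batch_try_add(&out[nb - 1], tri))
            continue;
        if (nb == max_batches)
            return 0;
        out[nb] = (vtx_batch_t){ .vert_count = 0, .first_tri = t, .tri_count = 0 };
        batch_try_add(&out[nb], tri);  /* always fits: 3 <= VTX_CACHE_SIZE */
        nb++;
    }
    return nb;
}
```

Each resulting batch corresponds to one vertex load followed by its triangle commands. Vertices shared across a batch boundary are loaded twice, which is the cost the authoring pipeline tries to minimize.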

What Homebrew Engines Expose That Commercial Games Did Not

Commercial N64 titles solved these problems implicitly, embedded in studio toolchains and proprietary formats that were never public. Reverse engineering projects like Zelda64 and the SM64 decompilation have reconstructed much of this, but the documentation is fragmented across wikis and Discord servers.

Building a homebrew open-world engine from scratch forces explicit answers to questions that Nintendo’s internal tools handled automatically. How large should each world chunk be, given the PI bus rate and the desired load time budget? How many texture bytes can be allocated per chunk before TMEM pressure causes visible pop-in as new tiles are scheduled? How large can a display list grow before the CPU time spent generating it exceeds one frame at 30 Hz?

These are not abstract questions. They have numerical answers that fall directly out of the hardware spec, and getting them wrong produces frame drops or visual artifacts on real hardware that emulators may not faithfully reproduce.
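As an example of how the chunk-size question reduces to arithmetic, the sketch below bounds chunk size two ways: by how many bytes can arrive before the player reaches the chunk, and by how much RAM each resident chunk can claim. All names and parameters are hypothetical, and pi_share (the fraction of cartridge bandwidth streaming is allowed to use) is an assumption to be tuned per project.

```c
#include <stdint.h>

typedef struct {
    double pi_bytes_per_sec;  /* sustained cartridge read rate */
    double pi_share;          /* fraction of that rate streaming may use, 0..1 (assumed) */
    double prefetch_distance; /* world units from prefetch trigger to chunk edge */
    double max_player_speed;  /* world units per second */
} stream_budget_t;

/* Seconds available between triggering a prefetch and needing the chunk. */
static double load_window_sec(const stream_budget_t *b)
{
    return b->prefetch_distance / b->max_player_speed;
}

/* Largest chunk that both loads in time and fits its share of RAM. */
static uint32_t max_chunk_bytes(const stream_budget_t *b,
                                uint32_t ram_budget_bytes,
                                uint32_t resident_chunks)
{
    double by_bandwidth = b->pi_bytes_per_sec * b->pi_share * load_window_sec(b);
    double by_memory    = (double)ram_budget_bytes / (double)resident_chunks;
    double limit        = (by_bandwidth < by_memory) ? by_bandwidth : by_memory;
    return (uint32_t)limit;
}
```

With a 20 MB/s bus, a quarter of it reserved for streaming, and a half-second load window, bandwidth alone would allow 2.5 MB per chunk. Dividing a 3 MB budget across nine resident chunks allows only about 341 KB each, so under these assumed numbers memory, not the bus, is the binding constraint.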

The libdragon ecosystem has made modern N64 homebrew substantially more tractable. Its rdpq API handles TMEM tile management and combiner configuration at a higher abstraction level than raw display list commands. The Tiny3D (t3d) engine built on top of libdragon provides a model format, matrix stack, and skinned animation pipeline, reducing the infrastructure work required to reach a rendered frame. USB-based flashcart loaders like UNFLoader allow running homebrew on real hardware with printf debugging over USB, which closes the gap between emulator behavior and physical hardware behavior significantly.

Why This Matters Outside N64

The N64 case is extreme, but the underlying problems are not unique to it. Streaming geometry from slow storage into limited memory while maintaining frame rate is a problem that modern open-world engines on consoles with fast NVMe SSDs still grapple with, just at different scales. The N64’s severity of constraint makes the solution structure visible in a way that is obscured when you have gigabytes to work with.

The scratchpad model of the RSP, where all working data must fit in 4 KB with explicit transfers, anticipates the architecture of modern GPU shared memory and SIMD units. The lesson that random access patterns are expensive relative to sequential burst transfers holds on every memory hierarchy that has ever shipped. The constraint that display lists must be structured to respect a small vertex cache maps directly to index buffer optimization for modern GPU vertex caches.

Building on hardware with no slack forces precision about what the machine is actually doing. The N64 homebrew scene produces that precision repeatedly, and projects like this open-world engine make the resulting knowledge concrete and navigable.