A developer recently documented building a fully functional open-world streaming engine for real N64 hardware. The project runs on actual cartridges at 30 fps with continuous world streaming via PI DMA, without load screens or room transitions. No commercial N64 title shipped anything like this. Understanding why not, and what it took to build one in 2024, requires working through the hardware constraints in enough detail that the architecture becomes self-evident.
What the Hardware Actually Allows
The N64’s hardware distributes its work across three processing units, and the exact numbers matter for every design decision downstream.
The CPU is an NEC VR4300, a MIPS R4300i derivative running at 93.75 MHz with 16 KB of instruction cache and 8 KB of data cache. Effective throughput under real workloads is 50 to 60 MIPS. The rendering work happens on the Reality Co-Processor (RCP), a custom SGI chip split into the RSP (Reality Signal Processor) and the RDP (Reality Display Processor).
The RSP executes microcode at 62.5 MHz and has exactly 4 KB of IMEM (instruction memory) and 4 KB of DMEM (data scratchpad), both on-die SRAM with no cache and no virtual memory. The RSP cannot address main RDRAM directly with loads and stores; everything it works on must first be moved into DMEM by DMA. The standard F3DEX2 microcode fits in 4 KB of IMEM and holds 32 transformed vertices in DMEM simultaneously. That 32-vertex ceiling is the fundamental unit of all N64 geometry submission, and every mesh authoring and display list ordering decision follows from it.
The RDP handles rasterization, perspective-correct texturing, Z-buffering, and alpha blending. Its texture memory, TMEM, is 4 KB total, not per-texture. A 32x32 RGBA16 texture consumes exactly 2 KB, so two textures fill the entire budget. Any texture change mid-frame triggers a TMEM reload and a SYNC_TILE command, stalling the RDP pipeline.
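The TMEM arithmetic above can be checked with a small helper. This is a minimal sketch in C; the function names are hypothetical, and it counts texel data only, ignoring the RDP's row-padding rules and any space a palette would occupy.

```c
#include <assert.h>
#include <stddef.h>

enum { TMEM_BYTES = 4096 };  /* total RDP texture memory, per the text */

/* Bytes of texel data for a width x height texture at a given bit depth.
 * Simplification: ignores per-row padding, which is a no-op for rows
 * that are already a multiple of 8 bytes (e.g. 32 texels at 4 bpp). */
static size_t texel_bytes(unsigned width, unsigned height, unsigned bits_per_texel)
{
    assert(bits_per_texel == 4 || bits_per_texel == 8 ||
           bits_per_texel == 16 || bits_per_texel == 32);
    return ((size_t)width * height * bits_per_texel) / 8;
}

/* How many same-sized textures fit in TMEM at once (texel data only). */
static size_t textures_that_fit(unsigned width, unsigned height, unsigned bits_per_texel)
{
    return TMEM_BYTES / texel_bytes(width, height, bits_per_texel);
}
```

Calling `textures_that_fit(32, 32, 16)` reproduces the two-texture limit for 32x32 RGBA16, and the same helper makes it easy to compare lower bit depths when choosing a texture format.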
Main memory is Rambus DRAM: 4 MB on the base system, expandable to 8 MB with the Expansion Pak. After double-buffered framebuffers (roughly 150 KB each at 320x240 16bpp), the Z-buffer (~150 KB), OS overhead, audio buffers, and display list working memory, a typical base system has roughly 2 to 2.8 MB available for game data. After allocating textures and game state, the streaming geometry budget can fall below 500 KB of RDRAM.
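The fixed video cost is simple to total: a 320x240 surface at 16 bpp is 153,600 bytes, and a double-buffered setup with a Z-buffer needs three of them. The sketch below uses hypothetical names and takes the RDRAM size as a parameter; OS, audio, and display list overhead are not modelled because the text gives no exact figures for them.

```c
#include <stddef.h>

enum {
    SCREEN_W = 320,
    SCREEN_H = 240,
    BYTES_PER_PIXEL = 2   /* 16 bpp color and 16-bit depth */
};

/* Size of one full-screen 16-bit surface. */
static size_t surface_bytes(void)
{
    return (size_t)SCREEN_W * SCREEN_H * BYTES_PER_PIXEL;
}

/* Two color buffers (double buffering) plus one Z-buffer. */
static size_t video_bytes(void)
{
    return 3 * surface_bytes();
}

/* RDRAM left after the video surfaces, before any other overhead. */
static size_t remaining_after_video(size_t total_rdram)
{
    size_t used = video_bytes();
    return (total_rdram > used) ? total_rdram - used : 0;
}
```

The three surfaces alone take 450 KB, which is why the remaining budget shrinks so quickly once audio buffers and display lists are added.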
Streaming Through a 15 MB/s Bus with No Filesystem
Cartridge data arrives over the Peripheral Interface (PI) bus at a practical sustained rate of 12 to 16 MB/s. PI DMA is asynchronous: the CPU queues a transfer and continues executing. There is no filesystem on a cartridge, just a flat ROM image with offsets established at compile time. Loading a chunk of world geometry means knowing its ROM address from a table generated by the build system and issuing a DMA request against that offset.
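A build-generated chunk table might look like the following sketch. The struct layout, grid size, and function names are assumptions for illustration, not the project's actual format; in a real build the table would be constant data emitted by the asset pipeline rather than filled in at run time.

```c
#include <stddef.h>
#include <stdint.h>

/* One entry per world chunk: where it lives in the flat ROM image. */
typedef struct {
    uint32_t rom_offset;  /* byte offset from the start of the ROM image */
    uint32_t size;        /* chunk size in bytes */
} ChunkEntry;

enum { GRID_W = 4, GRID_H = 4 };  /* hypothetical 4x4 world grid */

/* Stand-in for the table the build system would generate. */
static ChunkEntry chunk_table[GRID_W * GRID_H];

/* Fill the table with equally sized chunks laid out back to back,
 * row-major, starting at `base`. */
static void build_example_table(uint32_t base, uint32_t chunk_size)
{
    for (int i = 0; i < GRID_W * GRID_H; i++) {
        chunk_table[i].rom_offset = base + (uint32_t)i * chunk_size;
        chunk_table[i].size = chunk_size;
    }
}

/* Map grid coordinates to a table entry; NULL outside the world. */
static const ChunkEntry *chunk_lookup(int cx, int cy)
{
    if (cx < 0 || cy < 0 || cx >= GRID_W || cy >= GRID_H)
        return NULL;
    return &chunk_table[cy * GRID_W + cx];
}
```

The streaming code never searches for data: it converts the camera position to grid coordinates, looks up the entry, and passes the offset and size to the DMA request.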
At 15 MB/s, a 64 KB chunk takes roughly 4 milliseconds to transfer. A frame at 30 fps is 33 milliseconds. That budget fits, but only if the CPU does not wait on the DMA result before building the current frame’s display list. The entire streaming architecture depends on prefetch: DMA requests go out one or two frames before the geometry will render, the CPU assembles display lists from already-resident data, and a chunk lifecycle manager tracks which RDRAM regions are currently live.
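The timing estimate and the chunk lifecycle can be sketched together. The names below are hypothetical and the DMA itself is reduced to a comment; the point is only that a slot is never drawn from while its transfer is in flight, and never reused until it is safe to do so.

```c
#include <stdbool.h>
#include <stdint.h>

/* Transfer time in microseconds at a given sustained rate. */
static uint32_t transfer_us(uint32_t bytes, uint32_t bytes_per_sec)
{
    return (uint32_t)(((uint64_t)bytes * 1000000u) / bytes_per_sec);
}

typedef enum { SLOT_FREE, SLOT_LOADING, SLOT_RESIDENT } SlotState;

/* One RDRAM region that can hold a streamed chunk. */
typedef struct {
    SlotState state;
    int chunk_id;  /* meaningful unless state is SLOT_FREE */
} ChunkSlot;

/* Start a prefetch into a free slot. Returns false if the slot is busy. */
static bool slot_begin_load(ChunkSlot *s, int chunk_id)
{
    if (s->state != SLOT_FREE)
        return false;
    s->state = SLOT_LOADING;
    s->chunk_id = chunk_id;
    /* A real engine would queue the asynchronous PI DMA here. */
    return true;
}

/* Called when the DMA for this slot has finished. */
static void slot_dma_complete(ChunkSlot *s)
{
    if (s->state == SLOT_LOADING)
        s->state = SLOT_RESIDENT;
}

/* Display lists may only reference fully resident chunks. */
static bool slot_can_draw(const ChunkSlot *s, int chunk_id)
{
    return s->state == SLOT_RESIDENT && s->chunk_id == chunk_id;
}

/* Free a slot; refused while a transfer is still writing into it. */
static bool slot_release(ChunkSlot *s)
{
    if (s->state == SLOT_LOADING)
        return false;
    s->state = SLOT_FREE;
    return true;
}
```

`transfer_us(64 * 1024, 15000000)` gives 4369 microseconds, a little over 4 ms. Issuing `slot_begin_load` one or two frames before the chunk is needed means `slot_can_draw` is already true by the time the display list is built.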
ROM layout becomes a first-class build system concern. Chunks stored sequentially on ROM can be read at maximum PI bus speed in linear burst mode. Scattered ROM accesses fall well below that rate, because each DMA transaction carries per-request overhead regardless of transfer size. Factor 5 understood this when building Rogue Squadron; many commercial N64 teams using Nintendo’s standard SGI-based SDK had neither the documentation nor the mandate to care.
There is also a subtle cache coherency trap that catches N64 homebrew developers. The PI DMA controller does not invalidate the CPU’s data cache after a transfer completes. Code that reads DMA-loaded geometry without explicitly invalidating the cache reads stale data from before the transfer. The bug is silent: values look plausible and are simply wrong. This is the kind of detail absent from the commercial SDK documentation and documented in n64brew.dev only through accumulated community reverse-engineering.
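The failure mode is easy to reproduce in a toy model. The sketch below is not N64 code: it simulates a single 16-byte data cache line (the VR4300's D-cache line size) in front of a small RAM array, with hypothetical names throughout. On hardware, the invalidate step corresponds to a cache operation such as libdragon's `data_cache_hit_invalidate`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

enum { LINE_BYTES = 16, RAM_BYTES = 64 };

/* Toy memory system: RAM plus one cache line covering ram[0..15]. */
typedef struct {
    uint8_t ram[RAM_BYTES];
    uint8_t line[LINE_BYTES];
    bool line_valid;
} ToyMem;

/* CPU read: served from the cache line, filling it from RAM on a miss. */
static uint8_t cpu_read(ToyMem *m, unsigned addr)
{
    assert(addr < LINE_BYTES);
    if (!m->line_valid) {
        memcpy(m->line, m->ram, LINE_BYTES);
        m->line_valid = true;
    }
    return m->line[addr];
}

/* DMA write: goes straight to RAM and does not touch the cache. */
static void dma_write(ToyMem *m, const uint8_t *src, unsigned len)
{
    assert(len <= RAM_BYTES);
    memcpy(m->ram, src, len);
}

/* Explicit invalidate: the next CPU read refetches from RAM. */
static void invalidate(ToyMem *m)
{
    m->line_valid = false;
}
```

Reading an address, overwriting it by DMA, and reading it again returns the old value until `invalidate` is called. That is exactly the silent, plausible-looking wrong data described above.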
The 4 KB That Drives Every Texture Decision
TMEM is the binding constraint on N64 terrain rendering. A naive open-world approach loads unique textures per chunk, so as the camera crosses a chunk boundary, the display list for the incoming chunk references different textures, triggering TMEM reloads on every draw call. The stalls accumulate through the RSP and RDP pipeline stages, and frame time collapses.
The correct architecture keeps a fixed set of small tileable base textures permanently resident in TMEM throughout the scene. A 32x32 CI4 texture (4-bit indexed color, 16-color palette) costs 512 bytes plus 32 bytes for the palette. Four terrain types (grass, rock, dirt, sand) fit together in TMEM with room to spare. The RDP’s two-cycle combiner mode, which evaluates (A - B) * C + D in two passes, lets the shading pipeline blend between two surface types using vertex color alpha, giving smooth biome transitions without a single mid-scene TMEM reload.
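The blend falls out of the combiner equation by choosing A and D carefully: with A as the first texture, B and D as the second, and C as the vertex alpha, (A - B) * C + D is a linear interpolation. The sketch below models one 8-bit channel with hypothetical function names. It is a simplification of the RDP's fixed-point arithmetic, and the second-cycle setup shown (multiplying by the shade color) is one plausible configuration, not necessarily the one this engine uses.

```c
#include <stdint.h>

/* One color combiner pass on a single 8-bit channel:
 * (A - B) * C + D, with C treated as a 0..255 fraction. */
static uint8_t combine(uint8_t a, uint8_t b, uint8_t c, uint8_t d)
{
    int v = (((int)a - (int)b) * (int)c) / 255 + (int)d;
    if (v < 0) v = 0;
    if (v > 255) v = 255;
    return (uint8_t)v;
}

/* Cycle 1: interpolate between two terrain texels by vertex alpha.
 * alpha = 255 selects tex0, alpha = 0 selects tex1. */
static uint8_t blend_terrain(uint8_t tex0, uint8_t tex1, uint8_t vtx_alpha)
{
    return combine(tex0, tex1, vtx_alpha, tex1);
}

/* Cycle 2 (assumed setup): multiply the blended texel by the shade color. */
static uint8_t shade_two_cycle(uint8_t tex0, uint8_t tex1,
                               uint8_t vtx_alpha, uint8_t shade)
{
    uint8_t blended = blend_terrain(tex0, tex1, vtx_alpha);
    return combine(blended, 0, shade, 0);
}
```

Because the blend weight comes from per-vertex data, the transition between two surface types is authored in the mesh and costs no texture loads at draw time.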
In this arrangement, geometry streams per chunk while textures remain fixed in TMEM across the entire scene. Display lists for every chunk reference the same TMEM slots, and terrain draws without a pipeline stall between chunks. This is the design you arrive at by treating TMEM as a fixed global resource shared across all draw calls, rather than a per-draw-call buffer that can be freely reloaded.
The same logic applies to draw call ordering more broadly. Each RSP gsSPVertex command DMA-loads up to 32 vertices into DMEM and transforms them against the current matrix stack. Triangle commands then reference vertices by index within that 32-entry buffer. Geometry that exceeds 32 unique vertices per draw call requires multiple vertex loads, and each load flushes the buffer. Mesh authoring and display list construction have to account for this at asset export time, grouping triangles to maximize vertex reuse within each 32-vertex window.
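A simple way to respect the 32-vertex window at export time is a greedy pass over the triangle list. The sketch below is one possible approach with hypothetical names, not the exporter the project uses: it keeps adding triangles to the current batch until the next one would push the vertex count past 32, then starts a new batch.

```c
#include <stddef.h>
#include <stdint.h>

enum { VTX_WINDOW = 32 };  /* vertex buffer size under F3DEX2, per the text */

typedef struct {
    uint16_t verts[VTX_WINDOW];  /* global vertex indices in this batch */
    int count;
} Batch;

/* Local slot of global vertex v in the batch, or -1 if absent. */
static int batch_find(const Batch *b, uint16_t v)
{
    for (int i = 0; i < b->count; i++)
        if (b->verts[i] == v)
            return i;
    return -1;
}

/* Add v if absent and return its local slot. Caller guarantees room. */
static int batch_add(Batch *b, uint16_t v)
{
    int slot = batch_find(b, v);
    if (slot >= 0)
        return slot;
    b->verts[b->count] = v;
    return b->count++;
}

/* Assign each triangle (3 global indices) to a batch in input order.
 * Writes the batch number per triangle and returns the batch count. */
static int assign_batches(const uint16_t *tris, int tri_count, int *batch_of)
{
    Batch cur = { {0}, 0 };
    int batches = (tri_count > 0) ? 1 : 0;

    for (int t = 0; t < tri_count; t++) {
        const uint16_t *tri = &tris[3 * t];

        /* Count vertices the current window lacks, counting an index
         * repeated within the triangle only once. */
        int missing = 0;
        for (int k = 0; k < 3; k++) {
            int repeated = 0;
            for (int j = 0; j < k; j++)
                if (tri[j] == tri[k])
                    repeated = 1;
            if (!repeated && batch_find(&cur, tri[k]) < 0)
                missing++;
        }

        if (cur.count + missing > VTX_WINDOW) {
            cur.count = 0;  /* new vertex load flushes the window */
            batches++;
        }
        for (int k = 0; k < 3; k++)
            batch_add(&cur, tri[k]);
        batch_of[t] = batches - 1;
    }
    return batches;
}
```

The result depends on triangle order, so an exporter would typically sort triangles for spatial locality first; vertices shared across a batch boundary are simply loaded again in the next batch.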
Why Commercial Studios Chose Fog and Load Screens
The commercial N64 library produced no continuously streaming open world. Banjo-Tooie’s background DMA for adjacent room sections is the closest example; Turok uses a portal-and-fog cell renderer with hard distance limits. Ocarina of Time’s Hyrule Field, often cited as a large connected environment, loads the entire field mesh as a single scene. The fog distance is calibrated precisely so that geometry beyond the loaded region is never visible, which means the fog is doing real work as a rendering budget mechanism, not just as atmosphere.
The decision to use fog, room transitions, and discrete load screens was rational given the commercial context of 1996 to 2001. Development teams worked on SGI Indy and O2 workstations with proprietary SDK documentation under NDA. Much of what developers now know about N64 hardware was reverse-engineered after the console’s commercial lifetime ended. The engineering resources required to solve continuous streaming, when a discrete loading approach could ship a game on schedule, were not available on most projects.
Factor 5 is the exception because they had a product-level reason to solve it. Rogue Squadron required streaming terrain, so they built a quadtree LOD system, optimized ROM layout for sequential DMA, and shipped mid-level terrain streaming on the same hardware that ran Zelda OoT at 20 fps without any streaming. The hardware capability was identical; Factor 5 committed the engineering resources to solve the problem, and most commercial N64 studios did not have the margin to do the same.
What Modern Homebrew Tooling Changes
The developer building this engine had infrastructure that original N64 studios never did. Libdragon, the open-source N64 SDK maintained through 2025, provides dma_read_async() with interrupt-based completion signaling, an rdpq subsystem that manages display list construction and TMEM synchronization automatically, and a GCC 12+ MIPS cross-compilation pipeline with USB-based debug output via UNFLoader.
The rspq system in libdragon handles RSP microcode overlays, switching between geometry and audio microcode dynamically while sharing the same 4 KB IMEM. T3D (Tiny3D), built on top of libdragon, adds scene graph management, skeletal animation via RSP, LOD, and fog as first-class features rather than manual display list construction.
F3DEX3, a community microcode released in 2023, adds per-vertex ambient occlusion baked into vertex alpha at zero per-frame rendering cost. Commercial N64 developers had no equivalent; ambient lighting on N64 was a manual art process baked as static vertex colors in modeling tools. A developer building a homebrew engine in 2024 with F3DEX3 has a shading capability that Rare, Factor 5, and Nintendo’s first-party teams could not access.
The hardware documentation gap has closed substantially through projects like n64brew.dev, the SM64 decompilation, and reverse-engineering of Ocarina of Time’s scene and room formats documented at zelda64.dev. A developer building on the N64 today has more complete hardware documentation than Nintendo’s own licensees had in 1997.
Building With the Constraints
Building a streaming engine on the N64 makes every abstraction explicit. There is no allocation manager, no texture atlas generator, no LOD middleware that handles things transparently. The 32-vertex DMEM buffer, the 4 KB TMEM ceiling, the 2.5 MB working RAM budget, the 15 MB/s PI bus rate: each becomes a named term in the equations that determine chunk size, mesh density, texture format, and DMA prefetch window.
The result is an engine where every architectural decision traces directly to a hardware register or a memory boundary. This is what optimization means before it becomes a profiler annotation in a modern engine. The commercial N64 era did not produce a continuously streaming open-world engine partly because the schedule pressure of shipping games did not require it, and partly because the tooling and documentation required to build one cleanly did not exist. A developer working in 2024 with libdragon, F3DEX3, and decades of community reverse-engineering behind them is working in a categorically different environment, on hardware that ceased commercial production more than twenty years ago and is now thoroughly documented.