Streaming an Open World Through 4MB: What the N64 Homebrew Scene Finally Got Right

Source: lobsters

The constraints are not subtle. The Nintendo 64 ships with 4 megabytes of system RAM, a texture cache measuring 4 kilobytes, and a cartridge bus that delivers around 15 MB/s of sustained throughput under favorable conditions. Commercial studios in the late 1990s looked at those numbers and reached the same conclusion: you do not build a continuous streaming open world on this hardware; you build the illusion of one.

A recent project documented on YouTube does something different. A homebrew developer built a genuine open-world streaming engine targeting real N64 hardware, and the choices involved reveal a great deal about what the platform actually costs and why the modern development ecosystem changes the calculus for independent builders today.

The Hardware You Are Actually Working With

Before discussing solutions, the machine itself needs to be understood precisely.

The N64’s CPU is a NEC VR4300, a 64-bit MIPS derivative running at 93.75 MHz. The 64-bit ISA designation is misleading for performance estimates; the external system bus is 32 bits wide, and the practical throughput lands closer to 50 to 60 MIPS under real workload conditions. The CPU has 16KB of instruction cache and 8KB of data cache. Cache misses to RDRAM are expensive. DMA adds a coherency hazard on top of that, because the DMA controller does not automatically invalidate the CPU’s cache. Any code that reads DMA-loaded data without first calling a cache invalidation routine can silently read stale values. This is among the most common sources of bugs in N64 homebrew.
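A minimal sketch of the alignment discipline that usually accompanies that invalidation step. It assumes 16-byte data cache lines, as on the VR4300, and the helper names are hypothetical. The invalidation call itself is not shown; in libdragon it would be something like data_cache_hit_invalidate(), which this portable sketch does not call.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumption: 16-byte data cache lines, as on the VR4300. */
#define DCACHE_LINE 16u

/* Round a byte count up to a whole number of cache lines. */
size_t round_up_to_line(size_t len) {
    return (len + DCACHE_LINE - 1u) & ~(size_t)(DCACHE_LINE - 1u);
}

/* True when a buffer starts and ends on cache-line boundaries, so
   invalidating it cannot discard unrelated data that happens to
   share its first or last cache line. */
int dma_buffer_is_safe(uintptr_t addr, size_t len) {
    return (addr % DCACHE_LINE == 0u) && (len % DCACHE_LINE == 0u);
}
```

Padding every DMA target to whole cache lines costs a few bytes per buffer and removes a class of bugs where a neighboring variable shares a line with streamed data.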

The memory system uses Rambus DRAM, which was an unusual choice in 1996. RDRAM delivers high peak bandwidth, around 500 MB/s, but incurs high per-transaction latency. Sequential large-block transfers are efficient. Pointer-chasing through linked lists and trees is slow. This shapes every data structure decision in a streaming engine.

The graphics subsystem is the Reality Coprocessor, split into two processors.

The RSP (Reality Signal Processor) runs at 62.5 MHz and has 4KB of instruction memory and 4KB of data memory. Both are hard limits. The RSP executes microcode uploaded by the CPU via SP DMA; the official SGI microcode handles geometry transformation and lighting. Custom microcodes are possible and some homebrew projects exploit this for specialized decompression or procedural geometry generation on the co-processor.

The RDP (Reality Display Processor) is the rasterizer. It renders triangles with perspective-correct texturing, bilinear filtering, Z-buffering, and alpha blending. Its peak fill rate is around 30 million pixels per second, which sounds adequate until you consider that the figure is a best case. A 320x240 framebuffer holds 76,800 pixels, so 30 frames per second consumes about 2.3 million pixels per second before any overdraw, and every layer of overdraw multiplies that figure directly. The RDP’s texture memory (TMEM) is exactly 4 kilobytes. That is not a per-texture limit; it is the total capacity for all textures in use during a draw call. Switching textures mid-frame causes TMEM reloads, and each reload costs rasterization time. Minimizing TMEM churn is among the highest-leverage optimizations available on this hardware.
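The fill-rate arithmetic is easy to get wrong, so here is a small sketch of it. The 30 million pixels per second input is the peak figure quoted above, not a measurement, and the function names are hypothetical.

```c
#include <stdint.h>

/* Pixels in one frame at the given resolution. */
uint32_t pixels_per_frame(uint32_t width, uint32_t height) {
    return width * height;
}

/* Average overdraw the rasterizer could afford, in tenths
   (130 means 13.0x), for a fill rate in pixels per second
   and a target frame rate. */
uint32_t overdraw_budget_tenths(uint32_t fill_rate, uint32_t fps,
                                uint32_t width, uint32_t height) {
    uint32_t per_frame_budget = fill_rate / fps;
    return (per_frame_budget * 10u) / pixels_per_frame(width, height);
}
```

At the quoted peak, a 320x240 frame at 30 fps could in theory be covered about 13 times over. Real throughput with texturing and Z-buffering enabled is lower, which is why overdraw still has to be budgeted carefully.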

The cartridge interface runs through the PI (Peripheral Interface) bus at a peak of roughly 20 MB/s and a sustained practical rate closer to 15 MB/s. This is the only pipeline for streaming world data from ROM to RAM. The PI DMA is asynchronous: the CPU queues a transfer and continues executing. The RSP, RDP, and PI DMA all share the memory bus; contention between them is real and measurable in frame time.

What Commercial Studios Did Instead

The original N64 open-world titles are case studies in constraint navigation, and it is worth being precise about what they actually do.

Ocarina of Time (1998) divides the world into scenes and rooms. Hyrule Field is one scene, loaded entirely into RAM as a single mesh when you enter it. There are no additional loads while traversing the field; the entire geometry and texture set fits in memory simultaneously. Transitions to other areas trigger discrete loading sequences. The fog draw distance is set close enough to ensure that no geometry beyond the loaded region is ever visible. The terrain textures are small tiles, 32x32 pixels or smaller, repeated across large surfaces. This keeps the TMEM footprint low and allows the same texture data to cover enormous areas without straining the ROM budget. Extensive reverse-engineering work on the scene and room format confirms this architecture in detail.

Turok: Dinosaur Hunter (1997) used a portal and cell rendering system. The world is divided into cells connected by portals; only cells visible through the current chain of portals are submitted to the RDP. The fog terminates at an extremely short distance by modern standards, effectively hard-clipping geometry before it becomes a rendering concern. The game’s ROM is only 8 megabytes, reflecting how aggressively the geometry and texture data were compressed and how little unique content each cell contains.

GoldenEye 007 (1997) used BSP tree partitioning with precomputed potentially visible sets. Indoor geometry sorts well into this structure. The approach works for enclosed spaces; it becomes impractical for large outdoor areas without significant augmentation.

None of these represent open-world streaming in the sense of continuously loading and unloading chunks of a persistent world as the player moves. They are efficient partitioning schemes applied to finite, hand-crafted geometry sets that fit in RAM all at once.

What a Real Streaming Engine Requires

A genuine streaming open-world engine on N64 has to solve four problems simultaneously, and they interact with each other.

RAM budget. With 4MB of RDRAM total, a practical allocation might reserve 300KB for two framebuffers at 320x240x16bpp, 150KB for the Z-buffer, 64KB for audio buffers, 64KB for display list working memory, and around 32KB for the OS and stack. That comes to 610KB of fixed overhead and leaves roughly 3.4MB for world geometry, textures, and engine state. A 3x3 grid of world chunks around the player, each containing terrain geometry and texture data, needs to fit within this budget while leaving headroom for dynamic objects.
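The budget can be written down as constants so the numbers are checked rather than remembered. This is a sketch using the reservations listed above; the 1MB reserve for dynamic objects in the usage note is an assumption chosen for illustration.

```c
#include <stdint.h>

enum {
    KB            = 1024,
    TOTAL_RDRAM   = 4096 * KB,
    FRAMEBUFFER   = 320 * 240 * 2,  /* one 16bpp buffer, 150KB */
    ZBUFFER       = 320 * 240 * 2,  /* 16-bit depth, 150KB */
    AUDIO_BUFFERS = 64 * KB,
    DISPLAY_LISTS = 64 * KB,
    OS_AND_STACK  = 32 * KB
};

/* Memory that is spoken for before any world data is loaded. */
uint32_t fixed_overhead_bytes(void) {
    return 2u * FRAMEBUFFER + ZBUFFER + AUDIO_BUFFERS
         + DISPLAY_LISTS + OS_AND_STACK;
}

/* What remains for geometry, textures, and engine state. */
uint32_t world_budget_bytes(void) {
    return (uint32_t)TOTAL_RDRAM - fixed_overhead_bytes();
}

/* Even split of the remainder across resident chunks, after
   setting aside a reserve for dynamic objects. */
uint32_t per_chunk_budget_bytes(uint32_t chunks, uint32_t reserve) {
    return (world_budget_bytes() - reserve) / chunks;
}
```

With these figures the fixed overhead is 624,640 bytes (610KB), leaving 3,569,664 bytes. Holding back 1MB for dynamic objects gives each of nine resident chunks about 273KB.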

TMEM management. With only 4KB of texture memory on the RDP, large open-world terrain is essentially incompatible with unique per-tile texturing. The solution is texture atlasing: pack all terrain textures into a single atlas image and load it into TMEM in tiles using the RDP’s sub-rectangle load instructions. The RDP supports loading an arbitrary rectangle from a larger source texture into a specific region of TMEM. An atlas of 64x64 RGBA16 pixels occupies 8KB, which is two TMEM loads but allows the entire terrain to share one set of texture data with no per-draw-call TMEM reload. The libdragon rdpq documentation covers this tiling mechanism in the TMEM layout section.
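The atlas arithmetic above can be sketched as follows. The helper names are hypothetical, and the sketch ignores TMEM line alignment and the special layouts some texture formats use, so treat the results as upper bounds on what fits.

```c
#include <stdint.h>

#define TMEM_BYTES 4096u

/* Size of a texture in bytes for a given pixel depth. */
uint32_t texture_bytes(uint32_t w, uint32_t h, uint32_t bits_per_pixel) {
    return (w * h * bits_per_pixel) / 8u;
}

/* Number of full-TMEM loads needed to pass the texture through. */
uint32_t tmem_loads_needed(uint32_t bytes) {
    return (bytes + TMEM_BYTES - 1u) / TMEM_BYTES;
}

/* Rows of an atlas of the given width that fit in one load. */
uint32_t rows_per_load(uint32_t w, uint32_t bits_per_pixel) {
    return TMEM_BYTES / ((w * bits_per_pixel) / 8u);
}
```

For the 64x64 RGBA16 atlas this gives 8,192 bytes, two loads, and 32 rows per load, so the atlas splits cleanly into a top half and a bottom half.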

Cartridge streaming. Loading a new chunk from cartridge ROM takes measurable time. A 64x64 terrain chunk at 2 bytes per height sample, plus texture coordinate and normal data, might total 50 to 100KB. At 15 MB/s PI DMA throughput, that is 3 to 7 milliseconds per chunk load. At 30 fps each frame is 33 milliseconds, so loading more than a handful of chunks synchronously per frame is not viable. The PI DMA must run asynchronously while the RSP and RDP continue processing the current frame’s geometry. This requires double-buffering the DMA target regions: while the engine renders from one set of chunk buffers, the PI DMA fills a staging area with the next set. libdragon exposes this via dma_read_async(), with completion signaled through an interrupt handler.
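A rough sketch of the timing estimate and the double-buffer bookkeeping, kept free of SDK calls so it runs anywhere. The StagingPair type and function names are hypothetical; on hardware the fill side would be driven by the asynchronous PI DMA rather than by this code.

```c
#include <stdint.h>

/* Microseconds needed to move `bytes` at `bytes_per_sec`, rounded up. */
uint32_t dma_time_us(uint32_t bytes, uint32_t bytes_per_sec) {
    uint64_t scaled = (uint64_t)bytes * 1000000u + bytes_per_sec - 1u;
    return (uint32_t)(scaled / bytes_per_sec);
}

/* Two staging buffers: the DMA fills one while the renderer
   reads the other. Only the index is tracked here. */
typedef struct {
    uint8_t fill_index;
} StagingPair;

/* Index of the buffer the renderer may safely read. */
uint8_t staging_render_index(const StagingPair *s) {
    return (uint8_t)(s->fill_index ^ 1u);
}

/* Call once the pending transfer has completed. */
void staging_swap(StagingPair *s) {
    s->fill_index ^= 1u;
}
```

At 15 MB/s a 50KB chunk takes about 3.4 ms and a 100KB chunk about 6.8 ms, which matches the 3 to 7 millisecond range above and shows why only a few chunk loads fit inside one 33 ms frame.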

Frustum culling. Only the chunks within the camera’s view frustum need to be submitted to the RSP at any moment. Testing a chunk’s axis-aligned bounding box against the frustum planes is cheap on the CPU and eliminates most of the loaded chunks every frame. Beyond frustum culling, the short draw distance implied by the N64’s fill rate budget means the active rendering volume is small regardless. Setting fog to begin at 200 to 400 world units and terminate geometry at 500 to 600 units is not a creative failure; it is a technical requirement imposed by the rasterizer.
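The bounding-box test can be sketched with the common "positive vertex" method. The types and names here are hypothetical, and plane normals are assumed to point into the frustum. The test is conservative: it never rejects a visible box, but it can accept a box that sits just outside a frustum corner.

```c
#include <stdbool.h>

/* Points inside the frustum satisfy nx*x + ny*y + nz*z + d >= 0. */
typedef struct { float nx, ny, nz, d; } Plane;
typedef struct { float min[3], max[3]; } Aabb;

/* Pick the box corner farthest along the plane normal. If even
   that corner is behind the plane, the whole box is outside. */
bool aabb_outside_plane(const Aabb *b, const Plane *p) {
    float x = (p->nx >= 0.0f) ? b->max[0] : b->min[0];
    float y = (p->ny >= 0.0f) ? b->max[1] : b->min[1];
    float z = (p->nz >= 0.0f) ? b->max[2] : b->min[2];
    return (p->nx * x + p->ny * y + p->nz * z + p->d) < 0.0f;
}

/* A chunk is submitted only if no plane rejects its box. */
bool aabb_in_frustum(const Aabb *b, const Plane *planes, int count) {
    for (int i = 0; i < count; i++) {
        if (aabb_outside_plane(b, &planes[i])) {
            return false;
        }
    }
    return true;
}
```

With nine resident chunks and six planes this is at most 54 dot products per frame, which is negligible next to the cost of transforming a chunk that turns out to be off-screen.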

The Modern Toolchain That Makes This Tractable

The homebrew ecosystem has changed significantly since the N64’s commercial era.

libdragon is an open-source N64 SDK, released under the Unlicense and maintained actively through 2025. Its rdpq subsystem provides a high-level command queue abstraction over the RDP, handling display list construction, TMEM management, and synchronization primitives. This is a significant departure from writing raw display lists by hand, which requires precise knowledge of RDP state transitions and sync command placement to avoid pipeline stalls.

The t3d (Tiny3D) engine layer, built on libdragon, provides scene graph management, model loading in the model64 format, skeletal animation, and draw call batching. It handles the RSP microcode upload and execution cycle automatically. A developer building an open-world engine on top of this stack can focus on the world partitioning and streaming logic rather than the lower-level pipeline mechanics.

For development iteration, ares provides accurate-enough N64 emulation to catch most hardware-specific bugs without requiring physical hardware for every test cycle. The n64brew community and its associated Discord server maintain a wiki and accumulated devlog documentation that cover the kinds of problems a streaming engine encounters directly. The annual N64brew Game Jam has produced a range of 3D projects exploring terrain rendering, outdoor environments, and procedural geometry, with source code available.

The build toolchain is now a GCC cross-compiler for MIPS (mips64-elf-gcc) with Docker images available for reproducible setup. Developers who worked on commercial N64 titles in the 1990s used a proprietary SGI SDK under NDA with limited documentation. The reverse-engineered, community-documented equivalent available today is in many respects more accessible.

Building the Engine

The core architecture of a chunk-streaming open-world engine on N64 follows a pattern familiar from more modern hardware, compressed to fit the constraints.

The world is divided into a fixed grid of chunks. Each chunk stores heightmap data, precomputed triangle meshes for each LOD level, and texture coordinate data. On cartridge, chunks are stored at known ROM offsets so that the PI DMA address calculation is a simple multiply. The engine maintains a ring buffer of loaded chunks centered on the player’s position; as the player crosses a chunk boundary, the engine queues PI DMA transfers to load newly adjacent chunks into the ring buffer slots freed by now-distant chunks.
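One way to sketch the addressing described above. Every constant is an assumption for illustration (world width, record size, ROM base), and the modulo slot mapping is one possible reading of "ring buffer", not necessarily the project's layout.

```c
#include <stdint.h>

#define WORLD_CHUNKS_X  64u          /* hypothetical world width in chunks */
#define CHUNK_ROM_BYTES 0x10000u     /* hypothetical fixed 64KB record */
#define WORLD_ROM_BASE  0x00200000u  /* hypothetical start of chunk data */
#define RING_DIM        3u           /* 3x3 resident window */

/* Fixed-size records make the ROM address one multiply and one add. */
uint32_t chunk_rom_offset(uint32_t cx, uint32_t cz) {
    return WORLD_ROM_BASE + (cz * WORLD_CHUNKS_X + cx) * CHUNK_ROM_BYTES;
}

/* Wrap chunk coordinates into the resident window. Any 3x3 block
   of adjacent chunks maps to nine distinct slots, and a chunk
   entering the window lands in the slot of the one leaving it. */
uint32_t ring_slot(uint32_t cx, uint32_t cz) {
    return (cz % RING_DIM) * RING_DIM + (cx % RING_DIM);
}
```

When the player moves east and the window shifts from columns 4..6 to 5..7, column 7 maps to the same slots column 4 occupied, so no resident chunk has to be copied or moved.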

Terrain geometry can be generated from heightmaps either at build time or at runtime. Offline generation lowers runtime CPU cost at the expense of cartridge space; runtime generation saves ROM but spends CPU cycles each time a chunk is built. A 64x64 terrain chunk produces 8,192 triangles at full resolution, which is too many for the N64’s polygon budget. Reducing most chunks to 32x32 or 16x16, with higher resolution only in the chunk immediately surrounding the player, brings the triangle count into range for the RSP’s transform throughput.
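The triangle counts follow from two triangles per grid cell. The sketch below adds a hypothetical LOD policy (32x32 for the player's chunk, 16x16 for its eight neighbors) purely to show how the total for a 3x3 window comes out; the actual thresholds are a tuning decision.

```c
#include <stdint.h>

/* A regular grid of n x n cells yields two triangles per cell. */
uint32_t grid_triangles(uint32_t cells_per_side) {
    return 2u * cells_per_side * cells_per_side;
}

/* Hypothetical policy: ring 0 is the player's chunk, ring 1 the
   eight chunks around it. */
uint32_t lod_cells_for_ring(uint32_t ring) {
    return (ring == 0u) ? 32u : 16u;
}

/* Triangles resident in a 3x3 window before any culling. */
uint32_t window_triangles(void) {
    uint32_t total = 0u;
    for (int dz = -1; dz <= 1; dz++) {
        for (int dx = -1; dx <= 1; dx++) {
            uint32_t ring = (dx != 0 || dz != 0) ? 1u : 0u;
            total += grid_triangles(lod_cells_for_ring(ring));
        }
    }
    return total;
}
```

Under this policy the window holds 6,144 triangles before frustum culling, compared with 73,728 if all nine chunks were kept at 64x64.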

For the RDP pass, the key discipline is submitting geometry in texture-sorted order. All triangles sharing a TMEM tile should be grouped into a single display list segment, minimizing the number of TMEM reload commands. Combined with atlas-packed terrain textures, a well-structured display list for a frame of outdoor terrain might require only one or two TMEM loads for the entire terrain pass, with all other state changes confined to transform matrix updates and vertex buffer swaps.
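The effect of texture-sorted submission can be shown with a small sketch that sorts draw batches by tile and counts how many TMEM loads the resulting order implies. The Batch layout and function names are hypothetical.

```c
#include <stdint.h>
#include <stdlib.h>

typedef struct {
    uint16_t tile_id;    /* which TMEM tile the batch samples */
    uint16_t first_tri;  /* index of the first triangle */
    uint16_t tri_count;  /* number of triangles in the batch */
} Batch;

/* qsort comparator ordering batches by tile. */
static int by_tile(const void *a, const void *b) {
    const Batch *x = a;
    const Batch *y = b;
    return (x->tile_id > y->tile_id) - (x->tile_id < y->tile_id);
}

void sort_batches_by_tile(Batch *batches, size_t n) {
    qsort(batches, n, sizeof *batches, by_tile);
}

/* One load for the first batch, plus one per change of tile. */
uint32_t count_tmem_reloads(const Batch *batches, size_t n) {
    uint32_t reloads = 0u;
    for (size_t i = 0; i < n; i++) {
        if (i == 0 || batches[i].tile_id != batches[i - 1].tile_id) {
            reloads++;
        }
    }
    return reloads;
}
```

Six batches that alternate between two tiles need six loads in submission order and two after sorting, which is the "one or two TMEM loads" case described above.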

The Portal64 project, before it was taken down at Valve’s request, demonstrated a related culling technique: portal-based frustum narrowing, where each rendered portal refines the camera frustum for the next room. The same principle applies to open-world sector streaming: define coarse sectors larger than individual chunks, determine which sectors are potentially visible from the current one, and skip even the frustum test for geometry in non-visible sectors. This adds a precomputation step but reduces per-frame CPU work as world size grows.
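A precomputed sector visibility table can be as small as one bitmask per sector. The sketch below assumes at most 32 sectors and uses made-up table contents; in practice the table would be produced by an offline tool.

```c
#include <stdbool.h>
#include <stdint.h>

#define SECTOR_COUNT 4u

/* Bit s of pvs[c] is set when sector s may be visible from
   sector c. Illustrative values only. */
static const uint32_t pvs[SECTOR_COUNT] = { 0x3u, 0x7u, 0xEu, 0xCu };

/* One shift and one mask per sector, checked before any
   per-chunk frustum test. */
bool sector_potentially_visible(uint32_t from, uint32_t to) {
    if (from >= SECTOR_COUNT || to >= SECTOR_COUNT) {
        return false;
    }
    return ((pvs[from] >> to) & 1u) != 0u;
}
```

The lookup is constant time regardless of world size, which is where the per-frame saving comes from as the number of sectors grows.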

Why This Matters Now

There is a practical and a philosophical dimension to this kind of project.

The practical dimension is that libdragon and the surrounding toolchain have reached a maturity where a single developer, working with modern tooling, can build systems-level N64 software that would have required a full studio team in 1997. The documentation that studios kept proprietary is now reverse-engineered, written into community wikis, and embodied in the SDK itself. The n64.dev reference and the ultra64.ca archives contain hardware-level detail that no public document described during the N64’s commercial lifetime.

The philosophical dimension is about constraints as a clarifying force. The N64’s memory pressure, fill rate budget, and 4KB texture cache are not obstacles to work around; they are the problem domain. An open-world engine that fits in 4MB and streams from a 15 MB/s bus has made explicit every tradeoff that higher-level engines obscure. The decisions about chunk size, LOD thresholds, texture atlas layout, and DMA double-buffering strategy are not implementation details to defer; they are the engine. Working at this level forces a clarity about what rendering fundamentally costs, and that clarity is useful for understanding higher-level systems.

The N64 never got a real open-world streaming engine during its commercial lifetime. What it got were smart partitioned worlds dressed in fog and loading screens. The gap between those commercial solutions and a genuine streaming engine was not a matter of hardware capability; it was a matter of time, tooling, and the accumulated documentation that the homebrew community spent two decades producing. That gap is now closable by one developer with a laptop and enough patience to think carefully about kilobytes.
