Streaming a World Through 4 Megabytes: What N64 Open-World Engineering Actually Demands

Someone recently published a video documenting how they built a custom open-world engine for the Nintendo 64, and the Lobsters thread it generated is worth reading alongside the video itself. The demo is technically impressive. What the project illustrates more broadly is that the N64’s hardware forces every resource management problem into the open. You cannot delegate these decisions to a runtime or a middleware layer. The machine requires you to think about them explicitly, at the level of DMA registers and texture cache slots.

The Hardware That Sets the Terms

The N64 runs a NEC VR4300 (MIPS R4300i-compatible) at 93.75 MHz, backed by 4 MB of Rambus DRAM, expandable to 8 MB with the Expansion Pak. The Rambus architecture has high sequential bandwidth (about 562 MB/s theoretical peak) but substantial per-access latency. That profile strongly incentivizes bulk DMA transfers over scattered pointer reads, and the entire system is designed around it.

The rendering subsystem is the Reality Co-Processor (RCP), a custom 62.5 MHz ASIC split into two units. The RSP (Reality Signal Processor) is a MIPS-based scalar CPU with an 8-lane, 16-bit SIMD vector unit, 4 KB of on-chip instruction memory, and 4 KB of on-chip data memory. It runs uploaded microcode programs responsible for geometry transform, lighting, and audio mixing. The RDP (Reality Display Processor) is a fixed-function rasterizer that receives triangle commands from the RSP and writes pixels to the framebuffer in RDRAM.

The RDP has 4 KB of on-chip texture cache, called TMEM. That 4 KB is the constraint that shapes every texture-related decision in N64 development.

The Streaming Problem

An open world cannot fit in 4 MB. A single small zone might, but a continuous overworld with multiple terrain regions, their associated geometry, textures, and the working set for the currently running game code requires a streaming system. Content must move from cartridge ROM into RDRAM as the player traverses the world, replacing areas that are no longer needed.

The N64 Peripheral Interface (PI) bus handles cartridge reads via hardware DMA. The programming model is direct: write source and destination addresses and a transfer length into the PI registers, and the hardware performs the copy asynchronously while the CPU continues executing. An interrupt fires on completion.

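#include <stdint.h>

/* Initiate a DMA read from cartridge ROM to RDRAM.
   The caller should invalidate the data cache over dst before reading it. */
void pi_dma_read(void *dst, uint32_t cart_src, uint32_t len) {
    volatile uint32_t *pi = (volatile uint32_t *)0xA4600000;
    while (pi[4] & 0x3) { }                         /* PI_STATUS: wait while DMA or I/O is busy */
    pi[0] = (uint32_t)(uintptr_t)dst & 0x00FFFFFF;  /* PI_DRAM_ADDR (offset 0x00) */
    pi[1] = cart_src;                               /* PI_CART_ADDR (offset 0x04) */
    pi[3] = len - 1;                                /* PI_WR_LEN (offset 0x0C): writing starts the transfer */
}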

The PI bus is usually quoted at a peak of around 50 MB/s; sustained rates on real cartridges depend on the configured ROM timings and can be well below that. At the peak rate, a 512 KB world chunk transfers in roughly 10 ms. That is a workable budget if you pipeline the loads: begin fetching the next chunk before the player reaches its boundary, so the transfer completes before the data is needed. The standard approach divides the world into a grid of fixed-size cells and maintains a ring buffer of loaded cells centered on the player’s current position, scheduling prefetch DMA transfers whenever the player crosses a cell midpoint.
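A minimal sketch of the cell ring buffer, written as host-testable C rather than engine code. It assumes a 3x3 resident window and, for simplicity, triggers loads when the player crosses a cell boundary instead of a cell midpoint. The names CELL_SIZE, stream_update, and request_cell_load are hypothetical; a real engine would queue a PI DMA inside request_cell_load.

```c
#include <stdbool.h>
#include <stdint.h>

#define CELL_SIZE  1024                 /* world units per cell edge (assumed) */
#define RING_DIM   3                    /* 3x3 window of resident cells (assumed) */
#define RING_SLOTS (RING_DIM * RING_DIM)

typedef struct { int32_t cx, cz; bool valid; } CellSlot;

static CellSlot ring[RING_SLOTS];       /* zero-initialized: every slot starts empty */
static int loads_requested;             /* counter so the sketch can be observed */

/* Floor division, so negative world coordinates map to the correct cell. */
static int32_t floor_div(int32_t a, int32_t b) {
    int32_t q = a / b;
    return (a % b != 0 && (a < 0) != (b < 0)) ? q - 1 : q;
}

/* Wrap a cell coordinate into a ring index in [0, RING_DIM). */
static int ring_index(int32_t c) {
    int32_t m = c % RING_DIM;
    return (int)(m < 0 ? m + RING_DIM : m);
}

/* Hypothetical hook: a real engine would start a cartridge DMA here. */
static void request_cell_load(int32_t cx, int32_t cz) {
    (void)cx; (void)cz;
    loads_requested++;
}

/* Ensure every cell in the window around the player is resident or requested.
   Cells that leave the window are overwritten in place, because a departing
   cell and its replacement share the same ring index. Returns loads queued. */
int stream_update(int32_t px, int32_t pz) {
    int32_t ccx = floor_div(px, CELL_SIZE);
    int32_t ccz = floor_div(pz, CELL_SIZE);
    int queued = 0;
    for (int32_t dz = -1; dz <= 1; dz++) {
        for (int32_t dx = -1; dx <= 1; dx++) {
            int32_t cx = ccx + dx, cz = ccz + dz;
            CellSlot *s = &ring[ring_index(cz) * RING_DIM + ring_index(cx)];
            if (!s->valid || s->cx != cx || s->cz != cz) {
                request_cell_load(cx, cz);
                s->cx = cx; s->cz = cz; s->valid = true;
                queued++;
            }
        }
    }
    return queued;
}
```

Because the window extends one full cell past the player's own cell, a newly requested cell is still a cell-width away when its load begins, which is where the prefetch slack comes from in this variant.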

Cell sizing involves a real tradeoff. Larger cells reduce the frequency of loads and allow richer content per zone, but they consume more of the RDRAM budget during any given frame. Smaller cells keep memory pressure low but require more precise prefetch timing and can cause visible pop-in if the player moves faster than the PI pipeline can service.
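The tradeoff can be put into numbers with a few helper functions. This is a back-of-envelope sketch: the function names are hypothetical, the bandwidth figure is whatever sustained rate you measure on your own hardware, and the model ignores decompression and per-transfer overhead.

```c
#include <stddef.h>

/* Milliseconds needed to move `bytes` at a sustained rate of `bytes_per_sec`. */
double transfer_ms(size_t bytes, double bytes_per_sec) {
    return 1000.0 * (double)bytes / bytes_per_sec;
}

/* RDRAM held by a square window of resident cells, e.g. dim = 3 for 3x3. */
size_t resident_bytes(unsigned dim, size_t cell_bytes) {
    return (size_t)dim * dim * cell_bytes;
}

/* Highest player speed (world units per second) at which one cell still
   finishes loading in the time it takes the player to cross one cell. */
double max_speed(double cell_units, size_t cell_bytes, double bytes_per_sec) {
    return cell_units * bytes_per_sec / (double)cell_bytes;
}
```

With 512 KB cells, a 3x3 window needs 4.5 MB, which already exceeds the stock 4 MB before code or framebuffers are counted. Dropping to 128 KB cells brings the same window down to 1.125 MB.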

The 4 KB Texture Cache

The TMEM constraint deserves more attention than it usually gets in discussions of N64 hardware. Four kilobytes of on-chip cache is the entire texture budget visible to the RDP for any single draw call. A 64x64 RGBA16 texture occupies 8 KB, which overflows TMEM entirely. A 32x32 RGBA16 texture is 2 KB. A 32x32 CI4 (4-bit color index) texture is 512 bytes.
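The sizes above follow from width times height times bits per texel. A small sketch of that arithmetic, with hypothetical helper names; it deliberately ignores TMEM row alignment and the space palettes take up, so treat it as an upper bound on what fits rather than an exact check.

```c
#include <stdbool.h>
#include <stddef.h>

#define TMEM_BYTES 4096   /* total on-chip texture memory */

/* Raw texel storage for a w x h texture at `bpp` bits per texel
   (4 for CI4, 8 for CI8, 16 for RGBA16, 32 for RGBA32). */
size_t texture_bytes(unsigned w, unsigned h, unsigned bpp) {
    return ((size_t)w * h * bpp) / 8;
}

/* True if the raw texel data alone is no larger than TMEM. */
bool fits_tmem(unsigned w, unsigned h, unsigned bpp) {
    return texture_bytes(w, h, bpp) <= TMEM_BYTES;
}
```

For example, texture_bytes(64, 64, 16) is 8192, so fits_tmem returns false for it, while a 32x32 CI4 texture comes to 512 bytes and fits with room to spare.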

When a draw call references a texture not currently in TMEM, the RDP must load it from RDRAM with an explicit load command before it can rasterize anything that uses it. TMEM is not a transparent cache: nothing arrives there unless the display list asks for it, and every such load stalls the pipeline. The cost compounds quickly if draw calls are submitted in depth-sorted order, because geometry that alternates between different textures will reload TMEM on every triangle.

The mitigation is to sort draw calls by texture first rather than by depth, accepting some overdraw in exchange for fewer TMEM loads. Alternatively, a texture atlas small enough to fit entirely within 4 KB can eliminate mid-scene swaps for terrain tiling, at the cost of lower per-tile resolution. Commercial N64 games use textures that look aggressively small by modern standards: 16x16, 32x32, or 64x32 pixels are common, often in CI4 format with hand-authored 16-color palettes. Those palettes also live in TMEM, in its upper half, so a color-indexed texture has only 2 KB available for texel data.
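Sorting by texture can be sketched with the standard library's qsort. The DrawCall struct and its fields are hypothetical stand-ins for whatever an engine actually queues, and count_texture_loads is only a rough proxy for TMEM reload cost (it counts how often the bound texture changes).

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

typedef struct {
    uint16_t texture_id;   /* hypothetical texture handle */
    uint16_t mesh_id;      /* hypothetical mesh handle */
} DrawCall;

/* qsort comparator: order draw calls by texture handle. */
static int by_texture(const void *a, const void *b) {
    const DrawCall *da = a, *db = b;
    return (da->texture_id > db->texture_id) - (da->texture_id < db->texture_id);
}

/* Group draw calls so that calls sharing a texture are adjacent. */
void sort_by_texture(DrawCall *calls, size_t n) {
    qsort(calls, n, sizeof calls[0], by_texture);
}

/* Number of texture binds needed to submit the list in order:
   one for the first call, plus one for every change of texture. */
int count_texture_loads(const DrawCall *calls, size_t n) {
    int loads = 0;
    for (size_t i = 0; i < n; i++) {
        if (i == 0 || calls[i].texture_id != calls[i - 1].texture_id) {
            loads++;
        }
    }
    return loads;
}
```

A list whose textures alternate 1, 2, 1, 2, 1, 3 needs six loads as submitted and three after sorting. Note that qsort is not stable, so any secondary ordering (such as depth within a texture group) would need to be part of the comparator.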

How Commercial Studios Solved the Same Problems

The Legend of Zelda: Ocarina of Time decompilation project provides exact, compilable C source for how Nintendo approached these tradeoffs. The world is divided into scenes (discrete areas) and rooms (sub-areas within a scene). The Hyrule Field overworld is a single large scene that fits in RDRAM as a unit. Transitions between areas occur during controlled fade-outs; the game loads the next scene synchronously while the screen is dark. There is no background streaming in the conventional sense. Scene data is stored contiguously on the 32 MB ROM so that each scene loads in a single sequential PI DMA burst.

Within a scene, the engine maintains a fixed actor table with a hard ceiling around 100 entries. Actors beyond a distance threshold are evicted from the table entirely and recreated when the player approaches again. The texture management in OoT reflects the TMEM constraint throughout: most in-world textures are 32x32 or smaller, and draw call ordering within each display list groups geometry by shared texture where possible.
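To illustrate the general pattern (not OoT's actual code, which should be read in the decompilation itself), here is a sketch of a fixed-capacity actor table with distance-based eviction. The Actor struct, function names, and integer coordinates are all assumptions made for the example; the capacity of 100 simply mirrors the figure quoted above.

```c
#include <stdbool.h>
#include <stdint.h>

#define MAX_ACTORS 100   /* ceiling mirrored from the text; illustrative only */

typedef struct {
    bool     active;
    int32_t  x, z;       /* world position on the ground plane */
    uint16_t type;       /* hypothetical actor type id */
} Actor;

static Actor actors[MAX_ACTORS];

/* Place an actor in the first free slot. Returns the slot, or -1 if full. */
int actor_spawn(uint16_t type, int32_t x, int32_t z) {
    for (int i = 0; i < MAX_ACTORS; i++) {
        if (!actors[i].active) {
            actors[i] = (Actor){ .active = true, .x = x, .z = z, .type = type };
            return i;
        }
    }
    return -1;
}

/* Free every actor farther than `radius` from (px, pz). Returns the number
   evicted. Squared distances in 64-bit avoid both sqrt and overflow. */
int actor_evict_far(int32_t px, int32_t pz, int32_t radius) {
    int64_t r2 = (int64_t)radius * radius;
    int evicted = 0;
    for (int i = 0; i < MAX_ACTORS; i++) {
        if (!actors[i].active) continue;
        int64_t dx = (int64_t)actors[i].x - px;
        int64_t dz = (int64_t)actors[i].z - pz;
        if (dx * dx + dz * dz > r2) {
            actors[i].active = false;
            evicted++;
        }
    }
    return evicted;
}
```

Recreating an evicted actor when the player returns would additionally need a persistent spawn list to respawn from, which this sketch leaves out.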

Banjo-Tooie went further. Rare’s 2000 game features worlds substantially larger than any OoT area, with connections between zones that do not require full scene transitions. Rare partitioned worlds into sub-areas connected by portals and used PI DMA to prefetch the neighboring sub-area while the player was still in the current one. The game does not require the Expansion Pak, so all of this had to fit in the stock 4 MB, which means tight geometry budgets per sub-area.

The Super Mario 64 decompilation reveals a different tradeoff. SM64 uses entirely discrete courses with no streaming between them. Each course’s complete dataset fits in RDRAM as a unit, eliminating streaming complexity at the cost of strict per-course asset budgets. The engine’s simplicity is part of why the decompiled source has been ported to so many platforms.

The Modern Homebrew Stack

Building an N64 homebrew engine today means working with libdragon, the open-source SDK that provides a GCC-based toolchain, a display list API, controller input, and USB debug output via EverDrive or 64drive hardware. Recent libdragon versions include a GL 1.1-subset layer that maps OpenGL calls to the RSP/RDP pipeline, substantially lowering the entry barrier for developers without prior N64 experience.

The tiny3d library builds on libdragon and supplies its own RSP microcode (T3D) to add skinned mesh animation, texture atlases, and a full 3D rendering pipeline. It is the most complete publicly available N64 3D engine and gives open-world homebrew projects a foundation that would have been unavailable a few years ago.

The N64Brew community wiki documents the PI registers, the RSP instruction set, and audio subsystem mechanics in sufficient depth to implement DMA streaming from scratch. The annual N64Brew GameJam has produced open-source voxel renderers, raycasters, and 3D engines that collectively map the practical performance envelope of the hardware.

What This Teaches

The N64 hardware does not hide anything from you. Every texture cache miss has a measurable cost. Every DMA transfer has a latency you must account for. Every triangle submitted to the RSP occupies RSP cycle time. Open-world streaming on modern hardware involves the same fundamental mechanisms: async I/O, predictive prefetching, LOD selection, draw call batching, and texture atlas management. Modern engines implement these automatically or expose them through high-level configuration parameters.

On the N64, you implement them at the register level. The constraints are not abstractions; they are the actual numbers. Building for this hardware is a systems programming exercise that happens to produce a game. The video linked above is worth an hour of your time, and the decompilation projects for OoT, SM64, and Banjo are worth considerably more if you want to understand how the studios that shipped on this hardware thought about the same set of problems.