
Open-World Streaming on Four Megabytes: What the N64's Architecture Forces You to Understand

Source: lobsters

The N64 has 4MB of RAM. With the Expansion Pak it doubles to 8MB, though that was an optional purchase and most players never owned one. Before a single byte of game geometry or texture enters that budget, fixed costs claim a significant share: a 320x240 framebuffer in 16-bit color is 150KB, so the usual double-buffered pair takes 300KB, the 16-bit Z-buffer takes another 150KB, the OS and audio buffers consume roughly 512KB, and display lists and RSP task structures require additional overhead. On a base system, roughly 2.8MB is available for world data. That is the envelope within which this video on building an open-world N64 engine has to operate.
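The budget arithmetic is easy to check by hand. The following is a minimal sketch in Python (chosen for readability; it is not N64-side code), and the function names and defaults are illustrative assumptions. One 320x240 16-bit buffer is 150KB, so a 300KB figure corresponds to two color buffers.

```python
KB = 1024

def buffer_bytes(width, height, bytes_per_pixel):
    """Size of one full-screen buffer in bytes."""
    return width * height * bytes_per_pixel

def world_budget_bytes(total_ram, framebuffers=2, os_audio=512 * KB, misc=0):
    """RAM left after the fixed costs. Every default here is an assumption."""
    color = framebuffers * buffer_bytes(320, 240, 2)  # 16-bit color buffers
    depth = buffer_bytes(320, 240, 2)                 # 16-bit Z-buffer
    return total_ram - color - depth - os_audio - misc

one_buffer = buffer_bytes(320, 240, 2)       # 153,600 bytes = 150 KB
budget = world_budget_bytes(4 * 1024 * KB)   # before display lists and RSP tasks
print(one_buffer // KB, budget // KB)        # prints: 150 3134
```

With these assumptions about 3.06MB remains before display lists and task structures are counted, which is consistent with a working figure of roughly 2.8MB once a few hundred kilobytes of per-frame overhead are subtracted.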

What makes the project technically interesting is not the novelty of building on old hardware for its own sake. It is that the N64’s architecture exposes every problem that open-world streaming requires solving, without any abstraction layer to obscure them. Modern engines hide these problems behind large asset caches, high-speed NVMe pipelines, and GPU driver machinery. On N64, all of it is manual.

The RCP and Why It Shapes Everything

The N64’s Reality Coprocessor (RCP) runs at 62.5 MHz and contains two sub-units: the RSP (Reality Signal Processor) and the RDP (Reality Display Processor). Understanding their internal memory sizes is essential to understanding every architectural decision in an N64 engine.

The RSP is a MIPS-derived processor with a 128-bit vector unit providing eight lanes of 16-bit arithmetic. It handles geometry: vertex transformation, clipping, and lighting. Its instruction memory (IMEM) is 4KB of on-die SRAM. Its data memory (DMEM) is also 4KB. These are not caches with larger backing stores; they are the entirety of the RSP’s working memory. RSP code cannot address RDRAM with ordinary loads and stores. Data must be DMA-transferred into DMEM before the RSP can touch it.

The RDP handles rasterization. Its texture memory (TMEM) is 4KB of on-die SRAM. That is where textures live during rendering. A 64x32 RGBA16 texture fills TMEM completely. A 64x64 RGBA16 texture does not fit at all. Color-indexed textures store fewer bits per texel, but the palette lives in the upper half of TMEM, which leaves 2KB for texel data: enough for a 64x32 tile at 8 bits per texel or a 64x64 tile at 4 bits. Every texture change mid-frame requires a TMEM reload and a sync command, which stalls the RDP.
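A quick way to sanity-check texture choices is to compute the texel footprint offline. This is a rough sketch, not part of any SDK: the names are made up for illustration, row padding is ignored, and it assumes the commonly documented rule that color-indexed formats reserve the upper half of TMEM for the palette.

```python
TMEM_BYTES = 4096
CI_TEXEL_BYTES = 2048  # assumption: upper 2 KB of TMEM holds the palette for CI formats

def texture_bytes(width, height, bits_per_texel):
    """Raw texel storage in bytes (ignores row padding)."""
    return width * height * bits_per_texel // 8

def fits_in_tmem(width, height, bits_per_texel, color_indexed=False):
    """True if the texel data fits in the TMEM space available to its format."""
    limit = CI_TEXEL_BYTES if color_indexed else TMEM_BYTES
    return texture_bytes(width, height, bits_per_texel) <= limit

print(fits_in_tmem(64, 32, 16))                     # RGBA16, 4096 bytes: True
print(fits_in_tmem(64, 64, 16))                     # RGBA16, 8192 bytes: False
print(fits_in_tmem(64, 64, 4, color_indexed=True))  # CI4, 2048 bytes: True
```

An asset pipeline could run a check like this at build time and reject tiles that would need to be split across several TMEM loads.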

The CPU, a MIPS R4300i running at 93.75 MHz, builds sequences of 64-bit commands called display lists in RDRAM. When a frame is ready, it queues an RSP task: the RSP DMA-loads its microcode from RDRAM into IMEM and begins reading the display list. Each gsSPVertex command DMA-loads up to 32 vertices into DMEM, transforms them with the current matrix stack using the vector unit, and stores the results in a vertex buffer in DMEM. Triangle commands then reference those vertices by index. F3DEX2, Nintendo’s standard geometry microcode, holds 32 transformed vertices in DMEM simultaneously. The entire pipeline is batched around that window.

This is the fundamental unit of geometry submission on N64: 32 vertices at a time, each batch fitting inside 4KB of DMEM, with RDRAM as the backing store and cartridge ROM as the ultimate source.
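To make the 32-vertex window concrete, here is a small offline sketch of how an exporter might split an indexed triangle list into batches that each reference at most 32 distinct vertices. It is a greedy illustration in Python, not the algorithm of any particular tool, and `batch_triangles` is a hypothetical name.

```python
def batch_triangles(triangles, window=32):
    """Greedily group indexed triangles so each batch uses <= window vertices.

    Returns a list of (vertex_ids, local_triangles): vertex_ids is the load
    order for one vertex-load command, local_triangles index into that window.
    """
    batches = []
    local = {}  # global vertex id -> slot in the current window
    tris = []   # triangles of the current batch, as slot indices
    for tri in triangles:
        missing = [v for v in dict.fromkeys(tri) if v not in local]
        if tris and len(local) + len(missing) > window:
            # Window is full: close this batch and start a fresh one.
            batches.append((list(local), tris))
            local, tris = {}, []
            missing = list(dict.fromkeys(tri))
        for v in missing:
            local[v] = len(local)
        tris.append(tuple(local[v] for v in tri))
    if tris:
        batches.append((list(local), tris))
    return batches

# A 40-vertex triangle strip expressed as 38 indexed triangles.
strip = [(i, i + 1, i + 2) for i in range(38)]
print([len(verts) for verts, _ in batch_triangles(strip)])  # prints: [32, 10]
```

Vertices shared across a batch boundary are loaded twice (here vertices 30 and 31), which is why mesh ordering that keeps neighbouring triangles together reduces the number of vertex loads.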

What Streaming Actually Requires

An open-world engine needs to do three things concurrently: load new geometry and textures from storage as the player moves, discard data that is no longer needed, and maintain a consistent frame budget throughout. On PC or modern consoles, fast storage and large GPU memory make this manageable. On N64, cartridge read bandwidth is commonly quoted at 5 to 50 MB/s depending on ROM speed and bus timing, with sustained rates usually near the low end, and every transfer carries setup latency. Every transfer from ROM to RDRAM goes through the PI Manager thread (libultra’s cartridge DMA scheduler), which services one transfer at a time. The CPU queues a DMA request and continues; the transfer completes asynchronously and signals via interrupt. The bandwidth ceiling is fixed and shared with audio streaming.

Chunk size is therefore a careful calculation. A chunk large enough to prevent visible pop-in at normal player speeds must also be small enough to DMA-load in the time available before the player reaches it. Geometry and texture data in the range of 16KB to 64KB per chunk is practical for homebrew projects on N64. Larger chunks risk stalls where the player outpaces the loader. Smaller chunks reduce stall risk but increase the per-frame overhead of chunk management and DMA scheduling.
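The trade-off can be written down as a simple inequality: the time to load a chunk, including whatever is already queued ahead of it, must be shorter than the time the player needs to reach it. The sketch below is a back-of-the-envelope model in Python. The bandwidth is an input, and the 5 MB/s used in the example is an assumed conservative sustained rate, not a measurement.

```python
def load_time_s(chunk_bytes, bandwidth_bytes_per_s, queued_bytes=0, setup_latency_s=0.0):
    """Seconds until a chunk is resident, given transfers already queued ahead of it."""
    return setup_latency_s + (queued_bytes + chunk_bytes) / bandwidth_bytes_per_s

def can_stream_in_time(chunk_bytes, bandwidth_bytes_per_s, distance, player_speed,
                       queued_bytes=0, setup_latency_s=0.0):
    """True if the chunk finishes loading before the player covers `distance`."""
    time_available = distance / player_speed
    needed = load_time_s(chunk_bytes, bandwidth_bytes_per_s, queued_bytes, setup_latency_s)
    return needed <= time_available

ASSUMED_BANDWIDTH = 5_000_000  # bytes/s; an assumption for illustration only
t = load_time_s(64 * 1024, ASSUMED_BANDWIDTH)
print(round(t * 1000, 2), "ms for one 64 KB chunk")
```

Because transfers are serviced one at a time, the `queued_bytes` term matters as much as the chunk size: several chunks requested in the same frame, plus audio traffic, all delay the one the player is about to reach.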

The geometry within each chunk must be precompiled into display lists stored on the ROM. Generating display lists dynamically from raw vertex data on the CPU costs too many cycles per frame. Commercial N64 games stored level geometry as ready-to-execute display list data on the cartridge; once a chunk has been loaded into RDRAM, the CPU references it from the frame’s master display list with a gsSPDisplayList command. The same approach is necessary for any streaming open-world system. The ROM contains not meshes in the traditional sense, but display list programs that the RSP executes as-is once they are resident in RDRAM.

Texture Budget Under 4KB

TMEM forces every visual decision about terrain and environment. Tileable textures are not a stylistic choice on N64; they are a constraint. Zelda: Ocarina of Time’s Hyrule Field uses a small grass texture repeated across the entire ground plane, with vertex color variation providing the impression of detail. The fog, matching the sky color at the horizon, hides geometry pop-in by blending distant chunks into the background before they are visible as geometry. The fog is load-bearing infrastructure, not decoration.

An open-world engine on N64 requires the same approach. Terrain textures must fit within TMEM and be tileable at world scale. Texture atlases help where variety is needed: packing multiple terrain tile types into a single TMEM-sized atlas allows the RDP to render many surface types within a single batch before needing a TMEM reload. Draw call ordering matters significantly because each texture change requires a SYNC_TILE command and a fresh upload, both of which pause the RDP pipeline.
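The effect of draw ordering on texture reloads can be illustrated with a toy model. This is a sketch in Python with invented names: each draw is a (texture, mesh) pair, and a reload is counted whenever the texture differs from the previous draw. It assumes opaque, Z-buffered geometry, where reordering does not change the final image.

```python
def count_texture_loads(draws):
    """Number of TMEM reloads needed if draws are submitted in the given order."""
    loads, current = 0, None
    for texture, _mesh in draws:
        if texture != current:
            loads += 1
            current = texture
    return loads

def sort_by_texture(draws):
    """Group draws that share a texture; sorted() is stable, so relative order is kept."""
    return sorted(draws, key=lambda draw: draw[0])

draws = [("grass", "hill_a"), ("rock", "cliff_a"), ("grass", "hill_b"), ("rock", "cliff_b")]
print(count_texture_loads(draws))                   # prints: 4
print(count_texture_loads(sort_by_texture(draws)))  # prints: 2
```

In a streaming engine the same grouping can be done at export time within each chunk, so the runtime only has to order chunks rather than individual draws.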

Color-indexed (CI) textures are worth understanding here. An 8-bit CI texel costs half the storage of an RGBA16 texel, but the palette occupies the upper half of TMEM, so only 2KB remains for texel data: a 64x32 tile in CI8, or a 64x64 tile in 4-bit CI4 with a 16-color palette. CI8 therefore holds the same number of texels per load as RGBA16 while halving the ROM, RDRAM, and DMA cost of each tile, and CI4 doubles the texel count at the price of a much smaller palette. Terrain tiles using CI formats can be larger or more numerous for the same storage cost, within the palette limit per tile. Many N64 terrain systems use CI textures for exactly this reason.

CPU-Side Culling Is the Core Work

The RSP under F3DEX2 processes roughly 100,000 triangles per second at peak. At 30 fps that is around 3,300 triangles per frame. An open world with any meaningful geometry density exceeds this budget immediately, which means the CPU must aggressively reject invisible geometry before it ever reaches the RSP.

Frustum culling is the first pass. The CPU tests each chunk’s bounding volume against the view frustum’s six planes and skips submitting the chunk’s display list if it falls entirely outside the frustum. The N64’s FPU is slow, particularly for division and square roots, but plane tests using fixed-point arithmetic are feasible and sufficient for bounding sphere or AABB tests. A typical outdoor scene eliminates 70 to 90 percent of world chunks through frustum culling alone.
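The plane test itself is only a few multiply-adds per plane. The sketch below shows the bounding-sphere version in Python with floats for readability; on hardware the same test would more likely use fixed-point values. The plane convention (inward-pointing unit normals, inside where n·p + d >= 0) is an assumption of this example.

```python
def sphere_outside_frustum(center, radius, planes):
    """True if the sphere lies entirely outside at least one frustum plane.

    planes: iterable of (nx, ny, nz, d) with inward-pointing unit normals,
    so a point p is inside a plane when nx*px + ny*py + nz*pz + d >= 0.
    """
    cx, cy, cz = center
    for nx, ny, nz, d in planes:
        if nx * cx + ny * cy + nz * cz + d < -radius:
            return True   # fully behind this plane: cull the chunk
    return False          # inside or intersecting: submit the chunk

# A stand-in "frustum": an axis-aligned box from -10 to +10 on every axis.
box_planes = [
    (1, 0, 0, 10), (-1, 0, 0, 10),
    (0, 1, 0, 10), (0, -1, 0, 10),
    (0, 0, 1, 10), (0, 0, -1, 10),
]
print(sphere_outside_frustum((0, 0, 0), 1, box_planes))   # prints: False
print(sphere_outside_frustum((20, 0, 0), 5, box_planes))  # prints: True
```

The test is conservative: a sphere near a frustum corner can be reported as visible even though it is not, which costs a little RSP time but never drops visible geometry.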

Factor 5’s work on Rogue Squadron illustrates what aggressive culling buys. Their custom RSP microcode was more efficient than Nintendo’s stock geometry code, and their CPU-side culling was correspondingly strict. The result was 60 fps in some configurations and genuine mid-level terrain streaming via a quadtree LOD system, on the same hardware that ran Zelda: Ocarina of Time at a 20 fps target with no streaming at all. The difference was not the hardware; it was the engineering discipline applied to the CPU-side budget.

A coverage buffer adds a second culling pass. Maintain a coarse occlusion mask, perhaps 20x15 pixels of 1-bit coverage, updated by the CPU as near geometry is submitted. Test distant chunks against it and skip those fully occluded by nearer terrain. The RDP has no hardware occlusion query mechanism. The CPU cannot ask the RDP which pixels passed the depth test. Everything is CPU-side approximation, but even a low-resolution mask eliminates a meaningful fraction of geometry in terrain scenes where rolling hills occlude one another.
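One possible shape for such a mask is a row of bit fields, one integer per row. The following Python sketch is illustrative only; the class and method names are hypothetical, and rectangles are given in mask cells with inclusive bounds. To stay conservative, a caller should mark only cells it knows are fully covered by near opaque geometry and should test a rectangle that fully contains the distant chunk's screen bounds.

```python
class CoverageMask:
    """Coarse 1-bit occlusion mask, one integer bit field per row."""

    def __init__(self, width=20, height=15):
        self.width, self.height = width, height
        self.rows = [0] * height

    def _row_bits(self, x0, x1):
        """Bit pattern for columns x0..x1 (inclusive), clipped to the mask."""
        x0, x1 = max(x0, 0), min(x1, self.width - 1)
        if x0 > x1:
            return 0
        return ((1 << (x1 - x0 + 1)) - 1) << x0

    def _row_range(self, y0, y1):
        return range(max(y0, 0), min(y1, self.height - 1) + 1)

    def mark(self, x0, y0, x1, y1):
        """Record cells known to be fully covered by near geometry."""
        bits = self._row_bits(x0, x1)
        for y in self._row_range(y0, y1):
            self.rows[y] |= bits

    def is_occluded(self, x0, y0, x1, y1):
        """True only if every cell of the on-screen rectangle is already covered."""
        bits = self._row_bits(x0, x1)
        ys = self._row_range(y0, y1)
        if bits == 0 or len(ys) == 0:
            return False  # off-screen: leave the decision to frustum culling
        return all((self.rows[y] & bits) == bits for y in ys)

mask = CoverageMask()
mask.mark(0, 10, 19, 14)               # a near hill covers the bottom five rows
print(mask.is_occluded(5, 11, 8, 13))  # prints: True
print(mask.is_occluded(5, 9, 8, 13))   # prints: False
```

At 20x15 the whole mask is 15 small integers, so clearing and testing it each frame costs very little CPU time compared with the triangles it can save.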

LOD as a Budget Knob

Distance-based LOD on N64 takes the form of multiple precompiled display lists for the same world chunk: a high-detail version for nearby viewing and progressively simplified versions for greater distances. The CPU selects which display list to call based on player distance from the chunk center. The ROM must contain all LOD variants, which increases cartridge space requirements, but the frame budget savings are substantial.
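The selection step is a threshold lookup. A minimal sketch in Python follows, with hypothetical names and made-up distances; it compares squared distances so that no square root is needed, and returns None when the chunk is beyond the last threshold and should be replaced by something cheaper or skipped.

```python
def select_lod(player_xz, chunk_center_xz, max_distances):
    """Index of the display list variant to use, or None if beyond all thresholds.

    max_distances: ascending distance limits, one per LOD level (0 = most detailed).
    """
    dx = chunk_center_xz[0] - player_xz[0]
    dz = chunk_center_xz[1] - player_xz[1]
    dist_sq = dx * dx + dz * dz
    for level, max_distance in enumerate(max_distances):
        if dist_sq <= max_distance * max_distance:
            return level
    return None

LOD_DISTANCES = (100, 300, 600)  # illustrative world units, not from any real game
print(select_lod((0, 0), (50, 0), LOD_DISTANCES))   # prints: 0
print(select_lod((0, 0), (200, 0), LOD_DISTANCES))  # prints: 1
print(select_lod((0, 0), (700, 0), LOD_DISTANCES))  # prints: None
```

In practice a small hysteresis band around each threshold helps avoid a chunk flickering between two variants when the player hovers near a boundary.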

At extreme distances, billboards replace 3D geometry. A camera-facing quad textured with the chunk’s silhouette renders in two triangles regardless of the original geometry complexity. The RDP in 1-cycle mode produces at most one pixel per clock, a theoretical peak of 62.5 Mpixels per second that memory bandwidth keeps out of reach in practice, but a distant billboard covers only a small number of screen pixels, so its fill cost is low. The visual transition between billboard and geometry needs to happen behind fog or at a distance where the pop is imperceptible. Matching the fog distance to the LOD transition distance is the standard approach; the fog effectively defines the far boundary of the 3D rendering budget.

The Modern Toolchain

Building this today is meaningfully different from building it in 1998. Libdragon, the open-source N64 SDK, provides a legal and well-documented foundation. Its rspq subsystem manages RSP microcode overlays, allowing geometry and audio tasks to share the 4KB IMEM through managed switching rather than manual scheduling. T3D (Tiny3D), a 3D rendering library built on libdragon, provides LOD, fog, and skinned animation as first-class features rather than hand-rolled engine code.

F3DEX3, a community-developed microcode released in 2023, extends Nintendo’s F3DEX2 with per-vertex ambient occlusion baked into vertex alpha, improved triangle throughput, and better LOD handling. An open-world engine using F3DEX3 can have AO baked into terrain geometry with no per-frame rendering cost, something that was entirely unavailable to commercial N64 developers.

The tooling gap between 1998 and now is substantial. Commercial N64 developers worked with Silicon Graphics IRIX workstations, proprietary debuggers, and documentation that was not always complete even for licensed developers. Modern homebrew developers have GCC cross-compilation, debuggers over USB via EverDrive flash cartridges, high-accuracy emulators like ares and CEN64, and a community that has spent decades reverse-engineering the documented and undocumented behavior of the RCP.

What the Constraints Reveal

The hardware budget on N64 is not abstractly constrained; it is concretely constrained at every layer. The RSP DMEM holds 32 vertices. The TMEM holds one small texture. The RDRAM holds roughly 2.8MB of game data on a base system. The cartridge delivers somewhere between 5 and 50 MB/s depending on ROM speed, usually nearer the low end. Each of these numbers is a hard wall, not a guideline.

Building an open-world engine within these walls forces explicit accounting at every layer. There is no asset cache that silently evicts least-recently-used data; you implement the eviction policy yourself. There is no GPU driver that handles texture residency; you manage TMEM explicitly. There is no spatial query acceleration structure provided by the engine; you build the culling hierarchy from scratch.
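As an illustration of what implementing the eviction policy yourself involves, here is a minimal least-recently-used chunk cache with a byte budget. It is a Python sketch with hypothetical names, not a description of any particular engine. A real implementation would also have to manage where in RDRAM each chunk lives and make sure no in-flight display list still references an evicted chunk.

```python
from collections import OrderedDict

class ChunkCache:
    """Tracks resident chunks under a byte budget, evicting least recently used."""

    def __init__(self, budget_bytes):
        self.budget_bytes = budget_bytes
        self.used_bytes = 0
        self.chunks = OrderedDict()  # chunk_id -> size in bytes, oldest first

    def touch(self, chunk_id, size_bytes):
        """Mark a chunk as needed now, admitting it if absent.

        Returns the list of chunk ids evicted to make room.
        """
        if chunk_id in self.chunks:
            self.chunks.move_to_end(chunk_id)  # now the most recently used
            return []
        if size_bytes > self.budget_bytes:
            raise ValueError("chunk is larger than the whole budget")
        evicted = []
        while self.used_bytes + size_bytes > self.budget_bytes:
            old_id, old_size = self.chunks.popitem(last=False)
            self.used_bytes -= old_size
            evicted.append(old_id)
        self.chunks[chunk_id] = size_bytes
        self.used_bytes += size_bytes
        return evicted

cache = ChunkCache(budget_bytes=100)
cache.touch("a", 40)
cache.touch("b", 40)
cache.touch("a", 40)         # "a" becomes most recently used
print(cache.touch("c", 40))  # prints: ['b']
```

Pure LRU is only one possible policy; an engine could instead evict by distance from the player, which tends to match streaming access patterns more closely.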

The underlying problems are the same ones that Unreal Engine’s World Partition system and Unity’s Addressables pipeline solve at far larger scale and with considerably more infrastructure. Streaming, culling, LOD selection, texture budget management: the N64 version and the modern version are solving the same problem. Working at the N64 level, where there is nothing between your code and the silicon, provides a clearer view of why modern systems are designed the way they are. The constraints do not simplify the problem. They just make it impossible to ignore.
