
The Memory Budget Problem at the Heart of N64 Open-World Development

Source: lobsters

When someone publishes a video titled How I Built an Open-World Engine for the N64, the interesting part is not the open world itself. The interesting part is everything the N64’s hardware makes you give up to get there, and the engineering you have to do to get it back.

The Nintendo 64’s hardware profile is unusual by any modern standard. Its NEC VR4300 CPU runs at 93.75 MHz and is technically a 64-bit MIPS core, though almost all game code ran in 32-bit mode because the wider registers bought nothing against the tiny 8 KB data cache. The RCP, SGI’s Reality Co-Processor running at 62.5 MHz, splits into two distinct units: the RSP (Reality Signal Processor), which handles geometry and audio via swappable microcode, and the RDP (Reality Display Processor), which does rasterization. Base RAM is 4 MB of RDRAM, expandable to 8 MB with the Expansion Pak. A 320x240 framebuffer at 16bpp takes 150 KB, so a double-buffered pair eats about 300 KB, and a 16-bit Z-buffer takes another 150 KB. Once code, stacks, and audio buffers take their share, the actual working budget for game data on a stock machine lands somewhere between 2 and 2.5 MB.
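The buffer arithmetic is easy to check in code. This is a minimal sketch, assuming two color buffers and one Z-buffer, all at 16 bits per pixel; the function names are illustrative and do not come from any SDK.

```c
#include <stdint.h>

/* Bytes needed for one 16-bit-per-pixel buffer (color or depth). */
uint32_t buffer_bytes_16bpp(uint32_t width, uint32_t height)
{
    return width * height * 2u;
}

/* RDRAM left after two color buffers and one Z-buffer, before code,
   stacks, and audio buffers are subtracted (assumed layout). */
uint32_t rdram_after_buffers(uint32_t rdram_bytes,
                             uint32_t width, uint32_t height)
{
    return rdram_bytes - 3u * buffer_bytes_16bpp(width, height);
}
```

At 320x240 this gives 153,600 bytes per buffer and 460,800 bytes for all three, before any code or audio is counted.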

That 2 MB has to hold the current level geometry, all active object data, audio, and any streaming buffers. Every N64 game ever shipped commercially solved this by dividing the world into discrete chunks separated by load screens or door transitions. Super Mario 64 loads each course entirely into RAM and keeps it resident. Ocarina of Time’s Hyrule Field is geometrically simple enough to sit in memory all at once, a single room with roughly 300 polygons of terrain masked by distance fog. Banjo-Kazooie loads each world in full. GoldenEye uses portal-based visibility culling to manage large indoor spaces, but the entire level still lives in RAM before you start. No commercial N64 title streamed geometry continuously as the player moved through a seamless open world.

This is precisely what makes the homebrew project interesting. Building open-world behavior on the N64 requires solving a streaming problem the hardware designers treated as optional.

Cartridge DMA as a Streaming Mechanism

The N64’s cartridge connects via the PI (Peripheral Interface) bus, which supports direct memory access from ROM to RDRAM. Sustained transfer rates land around 12 to 16 MB per second in practice, with theoretical peaks near 23 MB/s. The latency to initiate a DMA transfer runs over a millisecond, which means you cannot issue transfers on-demand in response to player movement. You have to anticipate.

The standard approach for an open-world system is to divide the world into spatial cells and maintain a ring of loaded cells around the player’s current position. As the player moves, cells entering the near threshold get queued for DMA load while cells falling outside the far threshold get marked free. The DMA transfers run asynchronously, so the CPU has to manage a pending queue and double-buffer the cell slots so a transfer in progress never overwrites a cell still in use by the renderer.
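The bookkeeping for such a ring can be sketched in plain C. This is a sketch under stated assumptions, not the engine from the video: the slot pool size, the three-state slot model, and every function name here are hypothetical, and the point where a real engine would start a cartridge DMA is only marked with a comment.

```c
#include <stdbool.h>
#include <stdlib.h>

#define SLOT_COUNT 16   /* hypothetical pool: 9 resident cells plus spares */

typedef enum { SLOT_FREE, SLOT_LOADING, SLOT_READY } SlotState;

typedef struct {
    SlotState state;
    int cx, cz;          /* grid coordinates of the cell held in this slot */
} CellSlot;

/* Chebyshev distance in cells, so a radius of 1 covers a 3x3 block. */
static int cell_distance(int ax, int az, int bx, int bz)
{
    int dx = abs(ax - bx), dz = abs(az - bz);
    return dx > dz ? dx : dz;
}

static CellSlot *find_cell(CellSlot *slots, int cx, int cz)
{
    for (int i = 0; i < SLOT_COUNT; i++)
        if (slots[i].state != SLOT_FREE &&
            slots[i].cx == cx && slots[i].cz == cz)
            return &slots[i];
    return NULL;
}

/* Frees far cells, queues missing near cells; returns loads queued. */
int update_cell_ring(CellSlot *slots, int px, int pz, int near_r, int far_r)
{
    int queued = 0;

    /* Release cells beyond the far threshold. A slot that is still LOADING
       is left alone so an in-flight transfer never has its target reused. */
    for (int i = 0; i < SLOT_COUNT; i++)
        if (slots[i].state == SLOT_READY &&
            cell_distance(slots[i].cx, slots[i].cz, px, pz) > far_r)
            slots[i].state = SLOT_FREE;

    /* Queue every missing cell inside the near threshold into a free slot. */
    for (int cz = pz - near_r; cz <= pz + near_r; cz++) {
        for (int cx = px - near_r; cx <= px + near_r; cx++) {
            if (find_cell(slots, cx, cz))
                continue;
            for (int i = 0; i < SLOT_COUNT; i++) {
                if (slots[i].state == SLOT_FREE) {
                    slots[i] = (CellSlot){ SLOT_LOADING, cx, cz };
                    queued++;   /* a real engine would start the DMA here */
                    break;
                }
            }
        }
    }
    return queued;
}

/* Called when a transfer completes; only READY slots may be rendered. */
void mark_cell_ready(CellSlot *slots, int cx, int cz)
{
    CellSlot *s = find_cell(slots, cx, cz);
    if (s && s->state == SLOT_LOADING)
        s->state = SLOT_READY;
}
```

Keeping the far threshold larger than the near threshold gives hysteresis, so a player pacing across a cell boundary does not trigger a load and a free on every step.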

At 16 MB/s, you can stream roughly 800 KB per frame at 20 fps. That sounds generous until you account for the fact that most of that budget is already spoken for by audio streaming and object data. A realistic geometry streaming budget per frame is closer to 100 to 200 KB, which means each cell has to fit in that envelope or the engine falls behind player movement speed.
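The per-frame figures follow from simple division. A minimal sketch, using decimal megabytes and hypothetical helper names:

```c
#include <stdint.h>

/* Bytes the cartridge bus can deliver in one frame at a sustained rate. */
uint32_t bytes_per_frame(uint32_t bytes_per_second, uint32_t fps)
{
    return bytes_per_second / fps;
}

/* Frames needed to stream one cell within a per-frame geometry budget
   (rounded up). A result above 1 means the cell spans several frames. */
uint32_t frames_to_load(uint32_t cell_bytes, uint32_t budget_bytes)
{
    return (cell_bytes + budget_bytes - 1u) / budget_bytes;
}
```

With 16,000,000 bytes per second and 20 fps, `bytes_per_frame` returns 800,000. A 150 KB cell against a 100 KB geometry budget takes two frames, which is the "falling behind" case described above.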

Keeping cells small enough requires aggressive geometry compression. The N64’s standard vertex format stores positions as 16-bit integers in a 16-byte structure, so a terrain mesh with 100 triangles costs around 1.2 KB for vertex data plus display list commands. A cell covering a reasonable patch of terrain can fit in 8 to 16 KB without heroic effort, which means the streaming bandwidth is actually sufficient. The harder constraint is RDRAM space: if you keep eight cells loaded at once across a 3x3 grid minus the center, you’re consuming 64 to 128 KB for terrain geometry alone. That leaves room in the 2 MB budget, but not much.
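The vertex cost of a terrain patch can be estimated directly. This sketch assumes a regular grid of quads, each split into two triangles, and a 16-byte vertex; the function names are illustrative, and display list commands are not counted.

```c
#include <stdint.h>

#define VERTEX_BYTES 16u   /* assumed size of one vertex structure */

/* A grid of n x n quads has (n + 1)^2 shared vertices. */
uint32_t grid_vertex_count(uint32_t n)
{
    return (n + 1u) * (n + 1u);
}

/* Each quad is split into two triangles. */
uint32_t grid_triangle_count(uint32_t n)
{
    return 2u * n * n;
}

/* Vertex data only; display list commands come on top of this. */
uint32_t grid_vertex_bytes(uint32_t n)
{
    return grid_vertex_count(n) * VERTEX_BYTES;
}
```

A 7x7 grid yields 98 triangles from 64 vertices, or 1,024 bytes of vertex data, which is in line with the rough figure quoted for a 100-triangle mesh once commands are added.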

The TMEM Problem

RDP rasterization depends on 4 KB of on-chip texture memory called TMEM. This is the RDP’s only texture storage. Every texture used in a draw call must be loaded into TMEM before the call executes, and when color-indexed textures are in use the upper 2 KB is reserved for palettes, leaving 2 KB for texel data.

A 32x32 texture in RGBA16 format costs exactly 2 KB. A 64x32 texture fills all 4 KB. A 64x64 texture at 16bpp does not fit at all and has to be split into tiles loaded sequentially. For open-world terrain, where you might want varied ground surfaces across a large area, this constraint forces a particular design: a small vocabulary of tiling textures rather than unique surface textures per region.
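These footprints are straightforward to compute. A minimal sketch with hypothetical helper names; it counts raw texel bytes only and ignores row padding and palette storage, which a real loader would also have to budget for.

```c
#include <stdbool.h>
#include <stdint.h>

#define TMEM_BYTES 4096u

/* Raw texel storage for a texture, rounded up to whole bytes. */
uint32_t texture_bytes(uint32_t width, uint32_t height, uint32_t bits_per_pixel)
{
    return (width * height * bits_per_pixel + 7u) / 8u;
}

/* True if the texture's texels fit in TMEM in a single load. */
bool fits_in_tmem(uint32_t width, uint32_t height, uint32_t bits_per_pixel)
{
    return texture_bytes(width, height, bits_per_pixel) <= TMEM_BYTES;
}
```

A 64x64 texture at 16bpp comes to 8,192 bytes, so `fits_in_tmem(64, 64, 16)` is false and the texture has to be drawn as several separately loaded tiles.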

The practical solution most N64 terrain systems use involves three or four base textures (grass, rock, dirt, sand) sized at 32x32 in CI4 or CI8 palettized format. CI4 uses 4 bits per pixel, so 32x32 CI4 costs 512 bytes plus a 32-byte palette. That leaves room in TMEM for a second texture to blend with the first using the RDP’s two-cycle combiner mode. The combiner equation runs as (A - B) * C + D in two stages, which is enough to interpolate between two surface types based on a vertex color alpha channel baked into the terrain mesh at export time.
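The combiner equation reduces to a linear interpolation when D is set equal to B. The following is a CPU-side model of that arithmetic for one 8-bit channel, not RDP code; the function names are hypothetical, and the hardware's exact rounding may differ from the integer division used here.

```c
#include <stdint.h>

/* One combiner stage on a single 8-bit channel: (a - b) * c + d,
   with c treated as a fraction of 255. The result is clamped. */
uint8_t combine_channel(uint8_t a, uint8_t b, uint8_t c, uint8_t d)
{
    int v = ((int)a - (int)b) * (int)c / 255 + (int)d;
    if (v < 0) v = 0;
    if (v > 255) v = 255;
    return (uint8_t)v;
}

/* Blend two terrain texels by vertex alpha: A = tex0, B = tex1,
   C = alpha, D = tex1. Alpha 255 gives tex0, alpha 0 gives tex1. */
uint8_t blend_terrain(uint8_t tex0, uint8_t tex1, uint8_t alpha)
{
    return combine_channel(tex0, tex1, alpha, tex1);
}
```

Because the blend weight comes from per-vertex alpha, the transition between two surface types costs no extra texture memory; it is interpolated across each triangle.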

This approach produces reasonable terrain variety without any unique textures, which in turn means the texture streaming problem goes away. The full terrain texture set fits in TMEM permanently; only geometry streams from cartridge.

RSP Microcode and the Programmable Co-Processor Angle

The RSP has 4 KB of instruction memory (IMEM) and 4 KB of data memory (DMEM). The CPU loads microcode into this space via DMA, and then the RSP runs as an independent processor executing that code. Nintendo shipped several standard microcodes: Fast3D, F3DEX (with a 32-vertex cache instead of 16), and F3DEX2. Rare reportedly wrote custom microcode for GoldenEye and Perfect Dark, which is a large part of why those games achieved visual quality beyond what Nintendo’s standard toolchain could produce.

For an open-world engine, the RSP’s role in geometry processing matters a lot. F3DEX2’s 32-vertex cache means you can submit a 32-vertex batch, apply a matrix transform, and clip in a single RSP task. Larger terrain meshes that share many vertices benefit directly from this. Custom microcode can go further: the Tiny3D library, part of the modern libdragon ecosystem, ships RSP microcode specifically designed for skinned mesh animation, handling vertex transforms on the RSP to free CPU cycles for game logic.
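To see why shared vertices matter, it helps to count how many vertex-cache loads a mesh needs. This is a sketch of a simple greedy batcher, assuming a 32-entry cache and an indexed triangle list; it is an offline estimate with hypothetical names, not microcode and not an export tool's actual algorithm.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define VERTEX_CACHE_SIZE 32

static bool in_batch(const uint16_t *batch, int count, uint16_t index)
{
    for (int i = 0; i < count; i++)
        if (batch[i] == index)
            return true;
    return false;
}

/* Greedily walks an indexed triangle list and returns how many batches
   are needed so that each triangle's vertices are resident together. */
int count_vertex_batches(const uint16_t *indices, size_t triangle_count)
{
    uint16_t batch[VERTEX_CACHE_SIZE];
    int used = 0, batches = 0;

    for (size_t t = 0; t < triangle_count; t++) {
        const uint16_t *tri = &indices[t * 3];

        /* Count distinct vertices of this triangle not yet in the batch. */
        int missing = 0;
        for (int k = 0; k < 3; k++) {
            bool dup = false;
            for (int j = 0; j < k; j++)
                if (tri[j] == tri[k])
                    dup = true;
            if (!dup && !in_batch(batch, used, tri[k]))
                missing++;
        }

        /* Start a new batch when the cache would overflow. */
        if (batches == 0 || used + missing > VERTEX_CACHE_SIZE) {
            batches++;
            used = 0;
        }

        for (int k = 0; k < 3; k++)
            if (!in_batch(batch, used, tri[k]))
                batch[used++] = tri[k];
    }
    return batches;
}
```

Eleven triangles with no shared vertices need 33 cache entries and therefore two batches, while a grid that reuses its vertices packs far more triangles into each load.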

For open-world terrain specifically, a custom microcode could theoretically handle procedural detail pass generation, distance fog coefficient computation, or compressed mesh decoding directly on the RSP. The constraint is that the RSP and RDP work as a pipeline: the RSP generates RDP commands into a buffer, and the RDP consumes them. If the RSP is doing extra work per vertex, the RDP sits idle waiting for commands and its fill rate goes unused. Microcode optimization for open-world terrain is about finding the right balance between RSP work and RDP work, not simply adding more RSP-side processing.

The Modern Toolchain That Makes This Accessible

A project like this would have been extremely difficult outside Nintendo’s official SDK ten years ago. The current state of libdragon changes that picture significantly. The library ships a GCC 13 cross-compiler targeting the VR4300, a complete RDP command queue API (rdpq) with automatic synchronization and tile management, audio support, and most recently an OpenGL 1.1 subset implementation that translates standard GL calls into optimized rdpq commands. The n64brew community maintains a detailed hardware wiki covering DMA patterns, TMEM layout, display list encoding, and microcode behavior at a level that matches or exceeds what original Nintendo developers had in their internal documentation.

The annual n64brew Game Jam has produced dozens of working homebrew titles since 2020, which means there is a growing body of reference code for things like texture streaming, audio mixing, and scene management on real hardware. The EverDrive-64 X7 flash cart and the 64drive FPGA board with USB debug output mean that iteration on real hardware is now fast enough to be a normal part of development rather than an occasional sanity check.

What Constrained Hardware Forces You to Think About

The N64 open-world problem is a specific instance of a general class of engineering challenge: building a system whose working set exceeds available fast memory, where the fast memory is so small that every allocation decision has visible consequences. The techniques involved, spatial cell streaming, DMA double-buffering, texture vocabulary design, geometry batching for cache efficiency, are not specific to the N64. They appear in embedded game development, in mobile GPU programming where GMEM tile memory plays the same role TMEM does, and in any system where the gap between storage bandwidth and compute throughput has to be bridged by careful scheduling.

The N64 makes these problems concrete and measurable in a way that modern hardware, with its many layers of caching and asynchronous prefetching, tends to obscure. Working through how an open-world engine fits into 2 MB with 16 MB/s of streaming bandwidth is a useful exercise in resource accounting that transfers to other constrained environments. The hardware is nearly thirty years old. The engineering principles it forces you to apply are not.
