
Streaming an Open World Through 4 KB of Texture Memory

Source: lobsters

The N64 is famous for having just 4 KB of texture memory. That is not a typo. The Reality Display Processor’s on-chip texture cache is four kilobytes. A single 64x64 RGBA32 texture is 16 KB, four times what fits in TMEM. Every texture that appears on screen has to be tiled, atlased, or swapped in during the display list’s execution, one 4 KB slot at a time. That constraint shapes everything about N64 game design, and it is the first thing you run into when you start thinking about open-world streaming on this hardware.

This video by a homebrew developer building a working open-world engine for real N64 hardware is interesting precisely because it is not a demo or a proof of concept. It is a functional streaming engine targeting the same hardware that shipped in 1996. Understanding what went into it requires understanding why commercial studios mostly gave up on open worlds for this platform.

The Memory Budget

The N64 has 4 MB of RDRAM in its base configuration, expandable to 8 MB with the Expansion Pak. That sounds tight by modern standards, and the actual working budget is smaller still. A 320x240 16bpp framebuffer is 150 KB, so double buffering consumes about 300 KB. A 16-bit Z-buffer at the same resolution takes another 150 KB. The game executable, stack, and audio buffers consume roughly another 700 KB. On a base system, that leaves about 2.9 MB for everything else: geometry, textures, collision data, game state, AI, and anything you intend to stream.

With the Expansion Pak, the available pool grows to about 6.9 MB. Most open-world work on N64 relies on the Expansion Pak for exactly this reason. Majora’s Mask and Donkey Kong 64 required it, and Perfect Dark locked most of its content without it. A homebrew engine targeting any meaningful world size almost certainly does too.
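The budget can be recomputed from resolution and bit depth. The following is a minimal sketch in C; the helper names are illustrative, and the fixed overhead for executable, stack, and audio is the article's rough estimate passed in as a parameter, not a measured value.

```c
#include <assert.h>
#include <stddef.h>

/* Bytes for one width x height surface at the given bytes per pixel. */
static size_t surface_bytes(size_t width, size_t height, size_t bytes_per_pixel)
{
    return width * height * bytes_per_pixel;
}

/* RDRAM left after the fixed costs are subtracted; all inputs in bytes. */
static size_t free_rdram(size_t total, size_t color_buffers,
                         size_t z_buffer, size_t fixed_overhead)
{
    size_t used = color_buffers + z_buffer + fixed_overhead;
    assert(used <= total);
    return total - used;
}
```

A single 320x240 surface at 2 bytes per pixel comes to 153,600 bytes (150 KB); passing two of those as the color buffers and one as the Z-buffer gives the free pool for either RAM configuration.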

The cartridge DMA bandwidth is approximately 20 MB/s sustained, through the Peripheral Interface controller. At 30 fps (33ms per frame), you can transfer roughly 660 KB per frame via DMA. At 60 fps, that halves to about 330 KB. These numbers define your streaming budget directly: how large can a world chunk be, and how fast can you page the world in as the player moves?
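As a rough sketch of that arithmetic, the per-frame budget is just the sustained rate divided by the frame rate. The function names are hypothetical, and the 20 MB/s figure is the article's estimate; real throughput depends on the cartridge and PI timing configuration.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes the PI can move in one frame at a sustained rate (bytes/second). */
static uint32_t dma_bytes_per_frame(uint32_t bytes_per_second, uint32_t fps)
{
    assert(fps > 0);
    return bytes_per_second / fps;
}

/* Whole chunks of chunk_bytes that fit inside one frame's DMA budget. */
static uint32_t chunks_per_frame(uint32_t bytes_per_second, uint32_t fps,
                                 uint32_t chunk_bytes)
{
    assert(chunk_bytes > 0);
    return dma_bytes_per_frame(bytes_per_second, fps) / chunk_bytes;
}
```

With a hypothetical 128 KB chunk, this works out to five chunks per frame at 30 fps and two at 60 fps, before accounting for any other PI traffic.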

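/* Start an asynchronous PI DMA from cartridge ROM to RDRAM (libdragon) */
data_cache_hit_writeback_invalidate(rdram_dest, chunk_size); /* keep the CPU cache coherent */
dma_read_async(rdram_dest, rom_src, chunk_size);
/* The PI_WR_LEN_REG write inside the call starts the transfer */
/* ...run game logic here, then block until PI_STATUS_REG reports idle */
dma_wait();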

The PI DMA controller is fully asynchronous. While it is transferring a chunk from cartridge ROM into RDRAM, the CPU is free to run game logic. This is the mechanism that makes streaming possible at all: you queue a load, run physics and AI for the current frame, and by the time you are ready for the next frame, the next chunk is resident.
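The queue-then-poll pattern can be sketched as a small state machine. This is a hardware-independent illustration, not the engine's actual code: the type and function names are invented here, and the `dma_busy` flag stands in for whatever the PI status register reports on real hardware.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef enum { SLOT_EMPTY, SLOT_LOADING, SLOT_READY } slot_state_t;

typedef struct {
    slot_state_t state;
    int chunk_id;   /* chunk held or being loaded; -1 if none */
} stream_slot_t;

/* Try to begin loading chunk_id. Fails if the PI is busy or the slot is mid-load. */
static bool stream_request(stream_slot_t *slot, int chunk_id, bool dma_busy)
{
    assert(slot != NULL);
    if (dma_busy || slot->state == SLOT_LOADING)
        return false;
    slot->chunk_id = chunk_id;
    slot->state = SLOT_LOADING;
    /* On hardware, the asynchronous DMA would be started here. */
    return true;
}

/* Call once per frame before rendering: promotes a finished load to READY. */
static void stream_poll(stream_slot_t *slot, bool dma_busy)
{
    assert(slot != NULL);
    if (slot->state == SLOT_LOADING && !dma_busy)
        slot->state = SLOT_READY;
}
```

The game loop calls `stream_request` when it decides a chunk is needed, runs physics and AI, and calls `stream_poll` at the top of the next frame; the renderer only touches slots in the `SLOT_READY` state.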

Why Commercial Games Avoided This

Looking at the N64’s library, the pattern is consistent. Super Mario 64 loads each course as a unit, using level transitions to mask the latency. Ocarina of Time uses a room-based system where moving through a doorway triggers loading the next room, with fog set close enough that geometry beyond loaded rooms is never visible. Banjo-Kazooie follows the same discrete-world approach.

The famous N64 fog is not purely an aesthetic choice. It is a rendering budget management tool. If the fog distance is set to 200 units, you only need geometry resident within 200 units of the player. Rooms beyond that threshold can be unloaded without the player ever seeing a pop. The fog does double duty: it hides the streaming boundary, and it reduces the polygon count the RDP has to rasterize, which directly affects frame rate.
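The relationship between fog distance and resident geometry can be made concrete. The sketch below assumes the world is cut into square chunks of a fixed size, which is an assumption for illustration; the function names are hypothetical.

```c
#include <assert.h>

/* Chunks that must be resident in each direction from the player's chunk
 * so that nothing inside the fog distance is missing (rounds up). */
static int resident_radius_chunks(int fog_distance, int chunk_size)
{
    assert(fog_distance >= 0 && chunk_size > 0);
    return (fog_distance + chunk_size - 1) / chunk_size;
}

/* Side length of the square grid of resident chunks for a given radius. */
static int resident_grid_side(int radius)
{
    assert(radius >= 0);
    return 2 * radius + 1;
}
```

With a fog distance of 200 units, 256-unit chunks need a radius of one (a 3x3 grid), while 100-unit chunks need a radius of two (a 5x5 grid). Pulling the fog in shrinks the grid, which is the trade the paragraph above describes.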

Ocarina of Time’s Hyrule Field appears open, but it fits entirely in RAM. The field itself is small enough that Nintendo could keep it fully resident. The visual sense of scale comes from horizon geometry and skybox work, not from an actually large world. Majora’s Mask takes a similar approach with Clock Town: each district is a separate room loaded through transition corridors.

Banjo-Tooie pushed this further than almost any N64 title, using background DMA to pre-load adjacent world sections while the player occupied the current area. Adjacent sections of Spiral Mountain and connected worlds could be partially resident simultaneously, which is about as close to real streaming as any commercial N64 game got. But it was still room-based, not a continuous open world.

The honest reason commercial developers avoided true streaming open worlds is that the hardware makes it genuinely difficult, and the tools of the era made it harder still. Nintendo’s official SDK ran on SGI Indy workstations under IRIX. Iteration times were long. Building a streaming system under those conditions, on a production schedule, was not viable for most studios. You built what you could ship.

What a Modern Developer Brings

The developer behind this project has several advantages that 1996-era studios did not.

Complete hardware documentation is the first. In the 1990s, the N64’s hardware was largely undocumented outside the official SDK, and studios relied on SDK abstractions and Nintendo’s direct support. Decades of reverse engineering have since produced thorough documentation of the registers in the PI, VI, RDP, RSP, and audio interface. You can write directly to PI_CART_ADDR_REG without going through a middleware layer, because you know exactly what it does.

Libdragon, the modern open-source N64 SDK, is the other major factor. Libdragon ships with a GCC 12+ MIPS toolchain, an rdpq API for building RDP command queues with direct control over tile descriptors and texture loads, a full audio pipeline, and an OpenGL 1.1 subset that compiles down to N64 display lists. For a streaming engine, rdpq matters most: it gives you direct access to the texture loading machinery without opaque SDK abstractions getting in the way.

Modern compression algorithms also change the calculus. LZ4’s decompression throughput is fast enough that a 93.75 MHz MIPS CPU can decompress data in real time after a DMA transfer completes. Storing world chunks LZ4-compressed in ROM effectively extends the cartridge bandwidth: you transfer fewer bytes over the PI bus, then decompress in RDRAM. At typical compression ratios for terrain geometry and texture data, this can meaningfully expand the practical streaming budget without touching any hardware limits.
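Whether compression is a net win depends on the ratio and on how fast the CPU can decompress, so it is worth modelling both costs. The sketch below is a back-of-the-envelope model with hypothetical function names; the decompression rate is a parameter to be measured on hardware, not a known figure.

```c
#include <assert.h>
#include <stdint.h>

/* Compressed size for a ratio expressed in hundredths (200 means 2.00:1). */
static uint64_t compressed_bytes(uint64_t raw_bytes, uint32_t ratio_x100)
{
    assert(ratio_x100 >= 100);
    return raw_bytes * 100u / ratio_x100;
}

/* Microseconds to move `bytes` at `bytes_per_second`, rounded down. */
static uint64_t transfer_us(uint64_t bytes, uint64_t bytes_per_second)
{
    assert(bytes_per_second > 0);
    return bytes * 1000000u / bytes_per_second;
}

/* Total load time when decompression starts only after the DMA completes. */
static uint64_t load_us(uint64_t raw_bytes, uint32_t ratio_x100,
                        uint64_t dma_bytes_per_second,
                        uint64_t decompress_bytes_per_second)
{
    return transfer_us(compressed_bytes(raw_bytes, ratio_x100), dma_bytes_per_second)
         + transfer_us(raw_bytes, decompress_bytes_per_second);
}
```

At a 2:1 ratio, the bus time for a 128 KB chunk halves, but the decompression term is added back on the CPU side. The model makes the trade explicit: compression saves PI bandwidth and ROM space, and costs CPU time unless the decompression is overlapped with other work.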

The Architecture of a Chunk Streaming System

A functional N64 open-world engine needs to solve a small set of concrete problems. The world is divided into chunks arranged in a grid. The chunks resident in RAM at any time form a ring around the player, typically a 3x3 or 5x5 grid centered on the player’s current position. As the player moves, trailing-edge chunks are evicted and new leading-edge chunks are queued for loading.
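The eviction and load sets fall out of comparing the old and new grids. Below is a minimal sketch assuming integer world coordinates and square chunks; the types and function names are invented for illustration and are not taken from the project.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

typedef struct { int x, y; } chunk_coord_t;

/* Floor division, so negative world coordinates map to the correct chunk. */
static int floor_div(int a, int b)
{
    assert(b > 0);
    int q = a / b;
    return (a % b < 0) ? q - 1 : q;
}

static chunk_coord_t world_to_chunk(int wx, int wy, int chunk_size)
{
    chunk_coord_t c = { floor_div(wx, chunk_size), floor_div(wy, chunk_size) };
    return c;
}

/* True if c lies within the (2*radius+1)^2 grid centered on center. */
static bool in_ring(chunk_coord_t c, chunk_coord_t center, int radius)
{
    return abs(c.x - center.x) <= radius && abs(c.y - center.y) <= radius;
}

/* Writes the chunks that are resident around `to` but not around `from`.
 * Returns the number written (at most max_out). */
static int ring_difference(chunk_coord_t from, chunk_coord_t to, int radius,
                           chunk_coord_t *out, int max_out)
{
    int n = 0;
    for (int dy = -radius; dy <= radius; dy++) {
        for (int dx = -radius; dx <= radius; dx++) {
            chunk_coord_t c = { to.x + dx, to.y + dy };
            if (!in_ring(c, from, radius) && n < max_out)
                out[n++] = c;
        }
    }
    return n;
}
```

Calling `ring_difference(old_center, new_center, ...)` yields the leading-edge chunks to queue for loading; swapping the two centers yields the trailing-edge chunks to evict. For a one-chunk step with a 3x3 grid, each set contains three chunks.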

Each chunk contains its geometry display list, collision mesh, texture data, and object spawn table. A chunk at 128 KB is a reasonable target: small enough to load in about 6.5ms at 20 MB/s DMA throughput, which sits well within a 33ms frame budget at 30 fps, and large enough to hold meaningful terrain geometry and a texture set.
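One way to lay out such a chunk is a small header of offsets followed by the four sections. This is a hypothetical format sketched for illustration, not the project's actual layout; the magic value, field names, and the 128 KB cap are all assumptions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define CHUNK_MAGIC    0x43484E4Bu      /* ASCII "CHNK" */
#define CHUNK_MAX_SIZE (128u * 1024u)   /* target size from the text */

/* Header at the start of every chunk; offsets are relative to the chunk start.
 * The ROM builder would have to write these in the console's byte order. */
typedef struct {
    uint32_t magic;
    uint32_t total_size;        /* header plus all sections, in bytes */
    uint32_t dlist_offset;      /* geometry display list */
    uint32_t collision_offset;  /* collision mesh */
    uint32_t texture_offset;    /* texture data */
    uint32_t spawn_offset;      /* object spawn table */
} chunk_header_t;

static_assert(sizeof(chunk_header_t) == 24, "header layout must match the ROM builder");

/* Sanity-check a header before trusting any of its offsets. */
static bool chunk_header_valid(const chunk_header_t *h)
{
    return h->magic == CHUNK_MAGIC
        && h->total_size <= CHUNK_MAX_SIZE
        && h->dlist_offset >= sizeof *h
        && h->dlist_offset <= h->collision_offset
        && h->collision_offset <= h->texture_offset
        && h->texture_offset <= h->spawn_offset
        && h->spawn_offset <= h->total_size;
}
```

Because the sections are stored in a fixed order, each section's length is the difference between adjacent offsets, and a single DMA of `total_size` bytes brings in the whole chunk.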

TMEM management is where this gets precise. The 4 KB TMEM is not persistent between frames: it is a staging area that the display list populates during rendering. Every texture reference in a display list is a LoadBlock or LoadTile command that copies texture data from RDRAM into TMEM at render time. A well-constructed display list sequences these loads to avoid redundant transfers and to minimize RDP pipeline stalls between geometry batches.

For terrain specifically, textures are typically pre-baked into atlases during the ROM build process. A terrain chunk’s display list references specific TMEM regions by offset, and those regions are populated at the start of rendering that chunk. A 32x32 texture at 16bpp is 2 KB, so 4 KB of TMEM holds only two of them at once. With 4bpp CI (color-index) format and a 16-color palette, the same 32x32 texture shrinks to 512 bytes, so you can fit considerably more, even though the palettes themselves occupy part of TMEM. The palette constraint requires careful art pipeline decisions, but it is often the right trade on hardware this constrained.
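The TMEM arithmetic is easy to check with a short helper. This sketch uses hypothetical function names and assumes, as commonly documented for the RDP, that color-indexed textures leave only the lower half of TMEM for texel data because the upper half holds palettes. It ignores per-row padding, so treat the counts as upper bounds.

```c
#include <assert.h>
#include <stdbool.h>

#define TMEM_BYTES 4096u

/* Bytes of texel data for a w x h texture, rounded up to a whole byte. */
static unsigned texture_bytes(unsigned w, unsigned h, unsigned bits_per_pixel)
{
    return (w * h * bits_per_pixel + 7u) / 8u;
}

/* How many same-sized textures fit in TMEM at once.
 * Assumption: CI formats can use only half of TMEM for texels. */
static unsigned textures_that_fit(unsigned w, unsigned h,
                                  unsigned bits_per_pixel, bool color_indexed)
{
    unsigned available = color_indexed ? TMEM_BYTES / 2u : TMEM_BYTES;
    unsigned size = texture_bytes(w, h, bits_per_pixel);
    assert(size > 0);
    return available / size;
}
```

Under these assumptions, a 64x64 RGBA32 texture (16,384 bytes) does not fit at all, two 32x32 16bpp textures fit, and four 32x32 CI4 textures fit alongside their palettes.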

The RSP as a Coprocessor

The RSP’s 4 KB of DMEM and 4 KB of IMEM make it a limited but capable processor for tasks beyond geometry. The standard F3DEX2 microcode fills IMEM with the vertex transform and clipping pipeline, but custom microcode can repurpose the RSP for decompression or chunk preprocessing.

Libdragon’s custom microcode infrastructure, documented in the repository, makes writing RSP programs tractable without the SGI toolchain. The RSP’s 8-lane 16-bit SIMD vector unit is well-suited for parallel decompression: unpacking LZ-encoded data across multiple lanes simultaneously. Offloading decompression to the RSP frees the CPU entirely for game logic during chunk loads, which is the goal: the CPU should never be waiting on I/O.

This is the kind of RSP knowledge that was technically available in 1996 but rarely exploited outside of first-party Nintendo studios. Second and third-party developers mostly used the provided microcode blobs and never wrote RSP assembly. The documentation to do otherwise simply was not accessible.

The Development Loop

N64 homebrew development has matured considerably in the past few years. The combination of libdragon, flashcarts like the 64drive and EverDrive 64 X7, and cycle-accurate emulators like ares makes the development loop comparable to targeting a modern embedded platform. You write code, build a ROM with libdragon’s n64.mk, push it over USB to real hardware or load it in ares, and get feedback in seconds. Symbolicated crash logs over USB, courtesy of libdragon’s n64sym tool, mean debugging on real hardware is not the blind exercise it was in the 1990s.

An open-world streaming engine for the N64 is not a novelty project. It is a thorough exercise in memory-constrained systems programming: DMA scheduling, async I/O patterns, texture budget management, LOD systems, and SIMD-assisted decompression, all running on a machine with no operating system, no virtual memory, and 8 MB of RAM if you are lucky enough to target the Expansion Pak. The constraints are severe enough that every architectural decision has immediate, measurable consequences. There is no slack in the system to hide bad choices.

What commercial developers knew in 1996 is that this was possible, given enough time and the right hardware knowledge. They had neither. A homebrew developer in 2025, armed with complete documentation, a modern toolchain, and an unlimited development timeline, is working a different problem. The hardware is the same. The constraints are the same. The question is just whether the engineering is good enough to work within them.
