
Borrowing System RAM and NVMe for GPU VRAM: What Transparent Memory Extension Actually Requires

Source: lobsters

The VRAM wall is one of the defining constraints of modern AI workloads. A 70B parameter model in fp16 needs roughly 140 GB of GPU memory just to hold its weights, before accounting for KV cache, activations, or batch size. Consumer GPUs such as the RTX 4090 offer 24 GB. Even professional workstation cards with 48 GB fall well short of what many inference workloads demand. The gap between what fits on the GPU and what the job needs has spawned a cottage industry of partial solutions: model quantization, layer offloading, tensor parallelism across multiple GPUs, and explicit CPU offloading in frameworks like vLLM and llama.cpp.
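The 140 GB figure is plain arithmetic: parameter count times bytes per parameter. A minimal sketch follows; the helper name weight_bytes is made up for illustration, and the int8 and 4-bit rows ignore the scale and zero-point metadata that real quantization formats add.

```python
def weight_bytes(num_params: int, bits_per_param: int) -> int:
    """Bytes needed to store the weights alone (no KV cache, no activations)."""
    return num_params * bits_per_param // 8


PARAMS_70B = 70 * 10**9

for label, bits in [("fp16", 16), ("int8", 8), ("4-bit", 4)]:
    gb = weight_bytes(PARAMS_70B, bits) / 10**9
    print(f"{label:>5}: {gb:.0f} GB")
```

Under these assumptions the weights come to 140 GB, 70 GB, and 35 GB respectively, which is why even aggressive quantization leaves a 70B model above a 24 GB card.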

A project called nvidia_greenboost takes a different approach: transparently extend GPU VRAM using system RAM or NVMe storage, without requiring changes to the applications consuming that memory. The word “transparently” is doing a lot of work in that description, and understanding what it means in practice requires a look at how CUDA memory allocation works and where the seams are that a tool like this can intercept.

How CUDA Memory Allocation Works

When a CUDA application calls cudaMalloc, the CUDA runtime library (libcudart) calls into the CUDA driver library (libcuda), which works with the kernel-mode GPU driver to carve out a region of physical VRAM. The returned pointer is a device address, valid only within GPU kernels and device-side operations. When VRAM is exhausted, cudaMalloc returns cudaErrorMemoryAllocation (the driver API equivalent, cuMemAlloc, returns CUDA_ERROR_OUT_OF_MEMORY) and the application either crashes or falls back to a CPU path, if one exists.

NVIDIA has offered an alternative since CUDA 6.0: Unified Memory, allocated with cudaMallocManaged. Unified Memory presents a single pointer valid on both CPU and GPU. The driver handles page migration between host and device memory automatically; on Pascal-architecture and later GPUs it does so on demand, using hardware page faulting. When a GPU kernel accesses a page that currently lives in host memory, the driver migrates it to VRAM before the access completes. This sounds like exactly what a VRAM extension tool wants, but Unified Memory has significant practical limitations.

The page migration path adds substantial latency. Migration typically happens at 64 KB granularity, and each fault stalls the faulting GPU threads until the page arrives. For workloads with predictable access patterns, cudaMemPrefetchAsync can hide this latency, but that requires the application to anticipate what it will access next. On Linux, the host-side pages backing Unified Memory have historically been pinned (non-swappable), which constrains how much system RAM can back GPU allocations in practice.
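To get a feel for the granularity problem, the sketch below counts how many migration units it takes to move a buffer. It is an upper bound on fault count that assumes every block is faulted in individually with no prefetching; the helper name and the 2 MB comparison size are assumptions chosen for illustration, not driver parameters taken from the project.

```python
KIB = 1024
GIB = 1024**3


def blocks_to_migrate(nbytes: int, granularity: int = 64 * KIB) -> int:
    """Number of migration units needed to cover nbytes (ceiling division)."""
    return -(-nbytes // granularity)


print(blocks_to_migrate(1 * GIB))              # 16384 units at 64 KB
print(blocks_to_migrate(1 * GIB, 2048 * KIB))  # 512 units at a hypothetical 2 MB
```

Sixteen thousand separate stalls per gigabyte is the worst case that prefetching is meant to avoid.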

The Interception Approach

A transparent VRAM extension tool that does not modify applications has a few implementation paths available on Linux. The most practical for a user-space project is library interposition via LD_PRELOAD: loading a custom shared library before libcuda.so or libcudart.so, then providing replacement implementations of cudaMalloc, cuMemAlloc, cudaMallocManaged, and their variants.

The shim intercepts allocation requests and applies a tiering policy. If VRAM has sufficient space, the request passes through to the real CUDA allocator unchanged. When VRAM is exhausted, the shim allocates in pinned host memory using cudaMallocHost, or maps a file on NVMe into host memory and registers those pages with cudaHostRegister to make them DMA-accessible from the GPU. Device pointers into pinned host memory are valid inside GPU kernels; the GPU accesses the data over PCIe.
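The tiering decision itself is simple to model. The sketch below is bookkeeping only, written in Python for readability: the class, tier names, and capacities are all hypothetical, and a real shim would make this decision in C inside the interposed allocator and then call the real CUDA functions. Nothing here reflects how nvidia_greenboost actually implements its policy.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    VRAM = "vram"
    HOST = "pinned_host"
    NVME = "nvme_mmap"


@dataclass
class TieringPolicy:
    """Hypothetical capacity tracker; performs no real allocation."""

    free_bytes: dict  # remaining capacity per Tier

    def place(self, nbytes: int) -> Tier:
        # Try the fastest tier first and fall through to slower ones.
        for tier in (Tier.VRAM, Tier.HOST, Tier.NVME):
            if self.free_bytes[tier] >= nbytes:
                self.free_bytes[tier] -= nbytes
                return tier
        raise MemoryError(f"no tier can hold {nbytes} bytes")


GB = 10**9
policy = TieringPolicy({Tier.VRAM: 24 * GB, Tier.HOST: 64 * GB, Tier.NVME: 500 * GB})
placements = [policy.place(n * GB) for n in (20, 10, 60, 5)]
print([t.value for t in placements])
```

A first-fit policy like this one is the simplest possible choice. It never moves an allocation once placed, so a cold buffer allocated early can occupy VRAM while a hot one allocated later lands in host memory.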

For NVMe backing specifically, NVIDIA’s GPUDirect Storage (GDS) enables direct DMA between NVMe and GPU memory, bypassing the CPU and the system memory bus entirely. That path requires compatible NVMe hardware and specific driver support, so a software-only fallback via mmap plus cudaHostRegister is more universally applicable, if slower.

CUDA 10.2 introduced a lower-level virtual memory management API, cuMemCreate and cuMemMap, which allows applications to allocate virtual GPU address ranges and back them with physical memory on demand, including memory from different sources. This is the cleanest foundation for oversubscription, but using it correctly from a shim layer is considerably more complex than wrapping cudaMalloc.

The Performance Reality

The performance hierarchy matters enormously for understanding when VRAM extension is useful versus when it will cripple throughput.

GDDR6X VRAM on an RTX 4090 delivers around 1 TB/s of bandwidth internally. System RAM on DDR5 reaches perhaps 80-100 GB/s aggregate, but a GPU accessing it over PCIe 4.0 x16 is capped at roughly 32 GB/s in a single direction. PCIe 5.0 doubles that ceiling to around 64 GB/s. NVMe SSDs, even fast PCIe 5.0 drives, top out near 14 GB/s for sequential reads.

So a GPU reading system RAM over PCIe is roughly 15-30x slower than reading VRAM, and reading from NVMe is 70x slower or worse. For workloads where GPU kernels access the overflow region infrequently, this is tolerable. The canonical case is LLM inference: model weights for transformer layers that are not currently executing can live in system RAM or even on NVMe, with only the active layer’s weights in VRAM. llama.cpp implements a form of this with its -ngl flag, giving explicit control over how many transformer layers reside on the GPU. The advantage of a transparent approach is that the application does not need to know the tiering is happening.
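The ratios follow directly from the peak bandwidth figures quoted above. The sketch below reproduces them and adds one back-of-the-envelope case; the 10 GB overflow size and the function names are assumptions for illustration, and real transfers rarely sustain peak bandwidth.

```python
# Peak bandwidths in GB/s, taken from the figures quoted in the text.
BANDWIDTH_GBPS = {
    "vram_gddr6x": 1000.0,
    "pcie4_x16": 32.0,
    "pcie5_x16": 64.0,
    "nvme_pcie5": 14.0,
}


def slowdown_vs_vram(tier: str) -> float:
    """How many times slower a tier is than VRAM at peak bandwidth."""
    return BANDWIDTH_GBPS["vram_gddr6x"] / BANDWIDTH_GBPS[tier]


def stream_seconds(nbytes: float, tier: str) -> float:
    """Lower bound on the time to read nbytes once from the given tier."""
    return nbytes / (BANDWIDTH_GBPS[tier] * 1e9)


for tier in ("pcie4_x16", "pcie5_x16", "nvme_pcie5"):
    print(f"{tier}: {slowdown_vs_vram(tier):.1f}x slower than VRAM")

# Hypothetical: 10 GB of weights spill out of VRAM and are read once per token.
overflow = 10e9
print(f"{stream_seconds(overflow, 'pcie4_x16'):.3f} s per pass over PCIe 4.0")
```

If every generated token has to touch those 10 GB once, the PCIe 4.0 link alone caps throughput at about three tokens per second, no matter how fast the GPU is.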

For training workloads, or any inference path with fine-grained, unpredictable access patterns across a large working set, the overhead of traversing PCIe on every cache miss will dominate computation time. Transparent VRAM extension is not a general-purpose solution for those cases.

Prior Art and Ecosystem Context

DeepSpeed’s ZeRO-Infinity treats NVMe as a first-class tier for optimizer states and parameters during training, enabling models far larger than GPU or even system RAM can hold. vLLM implements CPU offloading at the framework level with explicit awareness of which tensors to move and when. These approaches are deliberate and tunable; the developer controls the tiering policy.

Apple Silicon’s unified memory architecture sidesteps the PCIe bottleneck entirely: CPU and GPU share the same physical memory pool with no transfer overhead, which is why a 192 GB M2 Ultra system can run very large models efficiently. On discrete GPU systems, the PCIe bus is unavoidable.

The idea of intercepting CUDA calls via LD_PRELOAD is not unprecedented either. Profiling and tracing shims have long used library interposition to observe what an application asks of the GPU. What makes the memory allocation case harder is that the shim is no longer read-only; it must correctly manage allocation lifetimes, handle concurrent requests from multiple threads and CUDA contexts, forward the right attributes to the underlying allocator, and deal correctly with async allocation variants introduced in newer CUDA releases.
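One piece of that lifetime management can be sketched in isolation: remembering which backend produced each pointer so that a later free is routed to the matching release routine. The registry below is a hypothetical Python model with invented names, not code from the project; a real shim would keep equivalent state in C and would also have to decide what to do with pointers it never saw allocated.

```python
import threading


class AllocationRegistry:
    """Hypothetical bookkeeping: remembers which backend owns each pointer."""

    def __init__(self):
        self._lock = threading.Lock()
        self._owner = {}  # pointer value -> backend name

    def record(self, ptr: int, backend: str) -> None:
        with self._lock:
            if ptr in self._owner:
                raise RuntimeError(f"pointer {ptr:#x} already tracked")
            self._owner[ptr] = backend

    def release(self, ptr: int) -> str:
        # Returns the backend so the caller can invoke the matching free routine.
        with self._lock:
            try:
                return self._owner.pop(ptr)
            except KeyError:
                raise RuntimeError(f"free of untracked pointer {ptr:#x}") from None

    def live(self) -> int:
        with self._lock:
            return len(self._owner)


registry = AllocationRegistry()


def worker(base: int) -> None:
    # Each thread uses a disjoint range of fake pointer values.
    for i in range(1000):
        registry.record(base + i, "vram" if i % 2 == 0 else "pinned_host")
    for i in range(1000):
        registry.release(base + i)


threads = [threading.Thread(target=worker, args=(t * 10_000,)) for t in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(registry.live())
```

Raising on an untracked pointer is a choice made here to surface bugs in the sketch. An actual interposer would more plausibly forward unknown pointers to the real free function, since allocations made before the shim was loaded are legitimate.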

The Vibecoding Question

The Lobsters submission flagged nvidia_greenboost for vibecoding based on the presence of a .continue/agents directory in the repository. Continue is an open-source AI coding assistant that supports agentic workflows, where the AI takes multi-step autonomous actions over the codebase. The .continue/agents path suggests the project was developed with heavy AI assistance in that mode.

This raised eyebrows specifically because of what the code does. Library interposition for CUDA memory allocation is not beginner territory. A correct shim must handle CUDA context management, thread safety for concurrent allocations, the distinction between the runtime API (cudaMalloc) and the lower-level driver API (cuMemAlloc), proper forwarding of allocation flags and attributes, and correct behavior when the upstream allocator is invoked recursively or from within a CUDA callback.

A subtle mistake in any of these areas does not produce a clear compile error or an obvious runtime failure. It produces silent data corruption, intermittent GPU kernel crashes, or memory leaks that appear only under specific driver versions, hardware configurations, or concurrency patterns. These are exactly the failure modes that are hardest to catch in review and hardest to trace in production.

AI coding assistants are capable of generating plausible-looking implementations of complex systems code. They are also capable of implementing the common path correctly while handling edge cases incorrectly, because edge case handling often requires precise knowledge of underdocumented driver behavior and failure modes that are underrepresented in training data.

None of this means the project is wrong. The concept is technically sound, the need is genuine, and the general approach has established precedent. What it means is that the bar for validating this kind of code is high, and “it works on my machine” against a single GPU and driver version is a low bar for code that intercepts fundamental memory primitives in a shared address space. A comprehensive test suite covering failure modes, driver version compatibility, and multi-threaded allocation scenarios would go a long way toward establishing confidence in the implementation regardless of how it was written.

The VRAM scarcity problem is real enough that a working, reliable solution in this space would be genuinely valuable. Whether this particular implementation clears that bar is what testing and code review will determine.
