
Flash Offloading at 397B: Why MoE Architecture Changes the Memory Math

Source: simonwillison

Running a 397-billion-parameter model on consumer hardware sounds like a category error. The numbers do not seem to line up: Qwen 397B at Q4 quantization sits at roughly 200 GB on disk, and no consumer GPU has anywhere near that in VRAM. The common assumption is that you need a rack of H100s to run it at all. Apple’s LLM-in-a-Flash paper (arXiv 2312.11514, December 2023) challenges that assumption, and Simon Willison’s recent writeup documents a practical workflow for applying those techniques to Qwen 397B, using LLM-assisted research to bridge the gap between theory and implementation.

The core claim of the Apple paper is that NVMe flash storage, combined with careful I/O scheduling, can serve model weights fast enough to run inference at usable speeds when DRAM is insufficient. The paper’s contribution is less the idea of reading weights from flash than the specific set of techniques that make doing so tractable.

The Bandwidth Problem

The reason flash offloading raises skepticism is bandwidth. DRAM bandwidth on a modern system sits somewhere between 50 and 400 GB/s depending on the memory configuration. NVMe over PCIe 4.0 tops out around 7 GB/s; PCIe 5.0 roughly doubles that to 13-14 GB/s. Even at PCIe 5.0, you are looking at a 4-30x bandwidth deficit relative to DRAM, depending on the system.

For a dense transformer, every forward pass requires reading essentially all the weights. At 200 GB and 7 GB/s sequential read speed, that is roughly 28 seconds per token, which puts the approach firmly outside any definition of usable inference. The Apple paper’s techniques are designed to close this gap by reducing how much you have to read on each forward pass.
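The arithmetic behind that estimate is simple enough to check directly. The following is a minimal sketch using only the approximate figures quoted above (a 200 GB model, 7 and 14 GB/s for NVMe, 50 to 400 GB/s for DRAM); the function name is made up for illustration, and the result is a lower bound that ignores compute time and I/O overhead.

```python
def seconds_per_token(bytes_read_gb: float, bandwidth_gb_s: float) -> float:
    """Lower bound on per-token latency if every byte read must stream from storage."""
    return bytes_read_gb / bandwidth_gb_s


MODEL_GB = 200.0  # approximate Q4 size quoted in the text

# Bandwidth figures are the rough ranges from the text, not measurements.
TIERS = [
    ("NVMe PCIe 4.0", 7.0),
    ("NVMe PCIe 5.0", 14.0),
    ("DRAM (low end)", 50.0),
    ("DRAM (high end)", 400.0),
]

for label, bandwidth in TIERS:
    print(f"{label:>16}: {seconds_per_token(MODEL_GB, bandwidth):6.2f} s/token")
```

Reading everything from a PCIe 4.0 drive lands in the high twenties of seconds per token, which is why the rest of the argument is about reading less rather than reading faster.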

Windowed Access and Row-Column Bundling

The paper introduces two primary techniques. The first is windowed access: the weights used for a sliding window of recent tokens stay cached in DRAM, so each new token only requires loading the weights that were not already needed within that window. This exploits the temporal locality of transformer computations. Attention patterns and feed-forward activations do not jump arbitrarily through weight space; there is structure in which weights get used when, and the window cache takes advantage of it.
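The windowing idea can be sketched in a few lines. This is a minimal illustration and not the paper's implementation: each cached unit is represented by an integer id (whether that unit is a neuron, a row, or a whole expert is left open here), and the class name and window size are assumptions made for the example.

```python
from collections import Counter, deque


class SlidingWindowCache:
    """Tracks which weight units stay in DRAM for the last `window` tokens."""

    def __init__(self, window: int):
        self.window = window
        self.history = deque()     # one set of active unit ids per recent token
        self.refcount = Counter()  # number of tokens in the window using each unit

    def step(self, active: set[int]) -> tuple[set[int], set[int]]:
        """Register a new token; return (units to load from flash, units evicted)."""
        # Only units that no token in the current window used need a flash read.
        to_load = {unit for unit in active if self.refcount[unit] == 0}
        self.history.append(set(active))
        self.refcount.update(active)

        evicted: set[int] = set()
        if len(self.history) > self.window:
            oldest = self.history.popleft()
            for unit in oldest:
                self.refcount[unit] -= 1
                if self.refcount[unit] == 0:
                    del self.refcount[unit]
                    evicted.add(unit)
        return to_load, evicted

    @property
    def resident(self) -> set[int]:
        """Units currently held in DRAM."""
        return set(self.refcount)


cache = SlidingWindowCache(window=2)
for token_units in [{1, 2, 3}, {2, 3, 4}, {3, 4, 5}]:
    loaded, evicted = cache.step(token_units)
    print(f"load {sorted(loaded)}, evict {sorted(evicted)}, resident {sorted(cache.resident)}")
```

When consecutive tokens overlap heavily in the units they activate, the number of flash reads per token is the size of the difference between tokens rather than the size of the active set.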

The second technique is row-column bundling. Sparse reads from flash are expensive not because of sequential bandwidth limitations, but because of per-operation I/O overhead. Each individual read carries a latency that dominates at small granularities. Row-column bundling groups sparse weight reads into larger, contiguous I/O operations, trading some unnecessary data transfer for dramatically fewer I/O round trips. The paper shows this substantially improves effective throughput even when the number of raw bytes transferred increases slightly.
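A toy cost model shows why fewer, larger reads win. The sketch below assumes a deliberately simple serial model in which each read costs a fixed latency plus transfer time; the 100 microsecond latency, the 7 GB/s bandwidth, and the 2 KiB row size are illustrative assumptions, not figures from the paper, and real drives overlap requests through queueing, which this model ignores.

```python
def read_time_s(num_ops: int, bytes_per_op: int, latency_s: float, bandwidth_b_s: float) -> float:
    """Serial cost model: every operation pays a fixed latency plus its transfer time."""
    return num_ops * (latency_s + bytes_per_op / bandwidth_b_s)


def compare_bundling(
    active_neurons: int,
    row_bytes: int,
    latency_s: float = 100e-6,    # assumed per-read latency
    bandwidth_b_s: float = 7e9,   # assumed sequential bandwidth
) -> tuple[float, float]:
    """Return (unbundled_s, bundled_s) for fetching the weights of active neurons.

    Unbundled: the up-projection column and down-projection row of each neuron
    live in different places, so each neuron costs two small reads.
    Bundled: both are stored next to each other and fetched with one read.
    """
    unbundled = read_time_s(2 * active_neurons, row_bytes, latency_s, bandwidth_b_s)
    bundled = read_time_s(active_neurons, 2 * row_bytes, latency_s, bandwidth_b_s)
    return unbundled, bundled


unbundled, bundled = compare_bundling(active_neurons=1000, row_bytes=2048)
print(f"unbundled: {unbundled * 1e3:.1f} ms, bundled: {bundled * 1e3:.1f} ms")
```

The bytes moved are identical in both cases; the difference comes from halving the number of operations while latency dominates the cost of each small read.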

These two techniques address the general transformer case; mixture-of-experts architectures make the approach considerably more practical.

Why MoE Changes the Calculus

Qwen 397B is a mixture-of-experts model. MoE architectures replace the standard dense feed-forward layers with a set of expert networks and a routing mechanism that selects a small subset of experts for each token. If the model has 64 experts per MoE layer and activates 4 of them per token, that is approximately 6% of the expert weight space per forward pass.

This sparsity pattern is structurally different from that of a dense model. In a dense 397B model, you are reading essentially all 200 GB for every token. In Qwen 397B, the expert weights constitute the bulk of the parameter count, and only a small fraction is needed at any given moment. The activated parameter count per token is much closer to that of a 30-40B dense model than to that of a 397B one.
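To make the difference concrete, here is a rough sketch of how much weight data a single token touches under the hypothetical 64-expert, 4-active configuration used above. The `expert_share` value of 0.9 is an illustrative assumption about how much of the model sits in expert weights, not a published figure for Qwen 397B.

```python
def active_expert_fraction(experts_per_layer: int, experts_per_token: int) -> float:
    """Fraction of expert weights that one token actually uses."""
    return experts_per_token / experts_per_layer


def gb_touched_per_token(
    model_gb: float,
    expert_share: float,      # assumed fraction of all weights that belong to experts
    experts_per_layer: int,
    experts_per_token: int,
) -> float:
    """Shared (non-expert) weights are always used; expert weights only in proportion."""
    expert_gb = model_gb * expert_share
    shared_gb = model_gb - expert_gb
    return shared_gb + expert_gb * active_expert_fraction(experts_per_layer, experts_per_token)


fraction = active_expert_fraction(64, 4)
touched = gb_touched_per_token(200.0, expert_share=0.9, experts_per_layer=64, experts_per_token=4)
print(f"active expert fraction: {fraction:.2%}")
print(f"weights touched per token: {touched:.2f} GB of 200 GB")
```

Under these assumptions a token touches a small fraction of the file, and whatever portion of that is already resident in DRAM does not have to come from flash at all.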

This is why flash offloading is more viable for MoE models than for dense models at equivalent parameter counts. The windowed access cache can hold the frequently used expert weights in DRAM, and row-column bundling can batch the reads for the selected experts into efficient I/O operations. The access pattern is sparse but structured, which is precisely what the Apple paper’s techniques are designed to exploit.

There is also a timing advantage. In an MoE model, the routing decision happens before the expert weights are loaded. You know which experts are needed before issuing any flash reads, which means the I/O can be scheduled precisely rather than speculatively. For a dense model with sparse neuron activation, you typically cannot predict which neurons will fire before running the previous layer, which makes speculative loading harder to get right.
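The scheduling advantage can be sketched as a small expert cache that consults the router output before touching storage. This is a hedged illustration and not how any particular runtime implements it: the class, the `run_layer` helper, and the `load_fn` callback are hypothetical names, least-recently-used eviction is an assumed policy, and the cache is assumed to hold at least as many experts as a single token activates.

```python
from collections import OrderedDict
from typing import Callable


class ExpertCache:
    """LRU cache of expert weights keyed by (layer, expert_id)."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.resident: OrderedDict = OrderedDict()  # insertion order doubles as LRU order

    def plan_reads(self, layer: int, routed: list[int]) -> list[int]:
        """Given the router's choice, list the experts that must come from flash."""
        return [e for e in routed if (layer, e) not in self.resident]

    def touch(self, layer: int, expert_id: int) -> None:
        """Mark a cached expert as most recently used."""
        self.resident.move_to_end((layer, expert_id))

    def admit(self, layer: int, expert_id: int, weights: object) -> None:
        """Insert freshly loaded weights, evicting the least recently used entries."""
        self.resident[(layer, expert_id)] = weights
        self.resident.move_to_end((layer, expert_id))
        while len(self.resident) > self.capacity:
            self.resident.popitem(last=False)


def run_layer(cache: ExpertCache, layer: int, routed: list[int],
              load_fn: Callable[[int, int], object]) -> list[int]:
    """Serve one MoE layer for one token; return the experts read from flash."""
    misses = cache.plan_reads(layer, routed)  # known before any I/O is issued
    for expert_id in routed:
        if expert_id not in misses:
            cache.touch(layer, expert_id)     # protect hits from eviction first
    for expert_id in misses:
        cache.admit(layer, expert_id, load_fn(layer, expert_id))
    return misses


cache = ExpertCache(capacity=3)
fake_load = lambda layer, expert_id: b""      # stand-in for a real flash read
for routed in ([1, 2], [2, 3], [4, 2], [1]):
    print(f"routed {routed} -> flash reads {run_layer(cache, 0, routed, fake_load)}")
```

Because the list of misses is complete before the first read is issued, a real implementation could batch or reorder those reads, which is the property that speculative loading in a dense model lacks.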

The practical consequence: a machine with 64-96 GB of unified memory and a fast NVMe drive is a plausible inference host for Qwen 397B, with the caveat that throughput will be slower than GPU-resident inference on a smaller model.

Hardware Configuration

Apple Silicon is a natural fit for this approach because of its unified memory architecture. On a standard desktop system, data moves from NVMe to system DRAM to GPU VRAM, with transfer overhead at each boundary. On Apple Silicon, the CPU, GPU, and Neural Engine share a single memory pool; memory bus bandwidth is available directly to all compute units without intermediate transfers.

A Mac Studio with 192 GB of unified memory can hold Qwen 397B’s non-expert weights plus a substantial expert cache entirely in memory, reading only the infrequently accessed experts from NVMe. A machine with 64 GB is more constrained but potentially still functional at reduced throughput, depending on the expert access pattern for the workload.
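A back-of-the-envelope budget makes the gap between the two machines visible. The sketch below estimates what fraction of expert weights could stay resident once the non-expert weights and a reserve for the OS, KV cache, and activations are set aside. The 16 GB reserve and the 0.9 expert share are assumptions chosen for illustration, not measurements of Qwen 397B.

```python
def cacheable_expert_fraction(
    unified_gb: float,
    reserved_gb: float,    # assumed headroom for OS, KV cache, activations
    model_gb: float,
    expert_share: float,   # assumed fraction of weights that belong to experts
) -> float:
    """Fraction of expert weights that fit in memory next to the non-expert weights."""
    expert_gb = model_gb * expert_share
    shared_gb = model_gb - expert_gb
    cache_gb = unified_gb - reserved_gb - shared_gb
    # Clamp to [0, 1]: no negative cache, and no more than all experts.
    return min(1.0, max(0.0, cache_gb / expert_gb))


for unified_gb in (64.0, 192.0):
    fraction = cacheable_expert_fraction(unified_gb, reserved_gb=16.0, model_gb=200.0, expert_share=0.9)
    print(f"{unified_gb:5.0f} GB unified memory: about {fraction:.0%} of expert weights resident")
```

How much the uncached remainder costs depends on how often the workload routes to those experts, which is why the smaller machine's throughput is workload-dependent.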

For tooling, llama.cpp supports layer offloading. The --n-gpu-layers flag controls how many layers are offloaded to the GPU; setting it below the full layer count leaves the remaining layers to run on the CPU. Because llama.cpp memory-maps the model file by default, weights that do not fit in RAM are paged in from disk by the operating system on demand, which is how disk ends up in the memory hierarchy alongside VRAM and DRAM. mlx-lm is the Apple-native path for Apple Silicon, with native support for the unified memory architecture and generally better performance on M-series hardware than the llama.cpp backend.
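As a small illustration of the llama.cpp path, the sketch below assembles a command line for the `llama-cli` binary with a partial `--n-gpu-layers` setting. It only builds and prints the command rather than running it. The model path, layer count, and prompt are hypothetical placeholders, and the helper function is not part of llama.cpp.

```python
import shlex


def llama_cli_command(model_path: str, gpu_layers: int, prompt: str, max_tokens: int = 128) -> list[str]:
    """Build an argv list for llama-cli with only some layers offloaded to the GPU."""
    return [
        "llama-cli",
        "-m", model_path,                   # GGUF model file (hypothetical path below)
        "--n-gpu-layers", str(gpu_layers),  # layers placed on the GPU; the rest stay on CPU
        "-n", str(max_tokens),              # number of tokens to generate
        "-p", prompt,
    ]


cmd = llama_cli_command("models/qwen-397b-q4.gguf", gpu_layers=20, prompt="Hello")
print(shlex.join(cmd))
```

The right value for the layer count depends on the model's layer total and available memory, so it is best treated as something to tune rather than a recommended setting.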

The Autoresearching Workflow

Willison’s post is notable for the methodology as much as the technical content. He used LLM tools to work through the paper before attempting implementation, a workflow he calls autoresearching. The process involves using a model to summarize sections of a dense technical paper, generate questions about unclear passages, and cross-reference implementation details against the paper’s claims.

This addresses a real gap in applied ML research. Papers like the Apple flash paper are written for an audience comfortable with the notation and background assumptions of the field. For an engineer trying to translate paper techniques into working code, the gap between a theorem about I/O efficiency and a specific flag in llama.cpp is not always obvious. The LLM acts as a translation layer, not to generate the implementation, but to make the paper’s claims legible enough to reason about the implementation independently.

The limitation is that this workflow depends on the model having relevant training data about both the paper and the tooling. For a paper from late 2023 and tooling that has evolved since, the model’s knowledge may be incomplete on specific API details. Treating LLM output as a starting point for investigation rather than ground truth is the correct posture, and Willison’s framing reflects that.

What This Requires in Practice

Running Qwen 397B locally via flash offloading is feasible given the right hardware, but the word “feasible” is doing a lot of work there. The performance gap relative to a GPU-resident smaller model is real. You will get coherent output, but at a throughput that reflects the NVMe bandwidth constraint rather than GPU memory bandwidth.

The more interesting implication is architectural. MoE models are not simply larger dense models with a different label; their sparsity patterns interact differently with memory hierarchies. The Apple paper’s techniques generalize from dense to MoE, but the structured sparsity of MoE routing makes that generalization more favorable. As MoE architectures become more common at frontier scale, the feasibility of flash-offloaded local inference improves alongside them. Their sparsity patterns ask progressively less of the memory hierarchy per token, and that trend favors local inference independent of hardware improvements.
