
The Storage Tier That Changes What 'Running Locally' Means for 397B Models

Source: simonwillison

The number that makes this interesting is 397 billion parameters. That is the scale of the Qwen model that Simon Willison recently got running locally by applying techniques from Apple’s research into flash-based LLM inference. The memory math on what that actually requires is worth working through carefully before getting to the technique itself.

At 4-bit quantization, a 397B parameter model compresses to roughly 200-220GB of weight data. Most desktop-class machines top out around 128GB of RAM, and going beyond that generally means expensive registered ECC memory on workstation or server platforms. Consumer GPU VRAM tops out at 24-32GB on the highest-end cards, and even datacenter accelerators commonly carry 80GB; stacking multiple cards gets you there eventually but at a cost that stretches the definition of “local.” The conventional assumption has been that models at this scale simply require cloud infrastructure. What Apple’s paper challenges is whether that is a hardware constraint or just an assumption about which hardware tier counts.
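The arithmetic behind that 200-220GB figure is simple enough to check by hand. The following is a minimal sketch; the 10% overhead for quantization scales and higher-precision tensors is an assumed round number for illustration, not a figure from the source.

```python
def weight_footprint_gb(params_billion: float, bits_per_weight: float,
                        overhead: float = 0.0) -> float:
    """Approximate on-disk weight size in decimal gigabytes."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes * (1 + overhead) / 1e9


# Pure 4-bit weights, no metadata.
base_gb = weight_footprint_gb(397, 4)

# Assumed 10% overhead for scales, zero points, and unquantized tensors.
with_overhead_gb = weight_footprint_gb(397, 4, overhead=0.10)

print(f"raw 4-bit weights: {base_gb:.1f} GB")
print(f"with assumed overhead: {with_overhead_gb:.1f} GB")
```

Either way the result is well beyond what a single consumer GPU or a typical desktop memory configuration can hold.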

Apple’s LLM in a Flash Paper

Apple published “LLM in a Flash: Efficient Large Language Model Inference with Limited Memory” in late 2023. The central insight is that modern NVMe drives are fast enough to act as a secondary memory tier for LLM inference, provided you are selective about what you read and when.

Naive flash-based inference breaks down quickly. A forward pass through a large language model touches weights across many layers in a pattern that looks essentially random to the storage controller. Random 4KB reads on even a fast NVMe drive are orders of magnitude slower than sequential reads, and random access patterns defeat hardware prefetching entirely. If you simply let the OS page weights in on demand, you get punishing latency per token.

The paper proposes two complementary techniques to address this:

Windowing keeps the weights for neurons that were active over the last few tokens resident in fast RAM, so each new token only requires loading the neurons that are not already in that window. LLM inference has genuine temporal locality: the set of feed-forward neurons that fire changes only incrementally from one token to the next. A properly sized and tuned window converts many flash reads into RAM hits, which removes a large fraction of the latency.
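The bookkeeping behind a sliding window of this kind can be sketched in a few lines. This is an illustrative toy, not the paper's implementation: the class name `NeuronWindowCache` and its interface are hypothetical, and real weights are replaced by integer neuron ids.

```python
from collections import deque
from typing import Deque, Set, Tuple


class NeuronWindowCache:
    """Tracks which neuron ids stay resident in RAM over a sliding token window."""

    def __init__(self, window: int) -> None:
        self.window = window
        self.history: Deque[Set[int]] = deque()  # active sets for recent tokens
        self.resident: Set[int] = set()          # union of the sets in history

    def step(self, active: Set[int]) -> Tuple[Set[int], Set[int]]:
        """Return (ids to load from flash, ids to evict) for the current token."""
        to_load = active - self.resident
        self.history.append(set(active))
        if len(self.history) > self.window:
            self.history.popleft()
        new_resident = set().union(*self.history)
        to_evict = self.resident - new_resident
        self.resident = new_resident
        return to_load, to_evict


cache = NeuronWindowCache(window=2)
first = cache.step({1, 2, 3})   # cold start: everything must be loaded
second = cache.step({2, 3, 4})  # only neuron 4 is new
third = cache.step({3, 4, 5})   # neuron 5 is new, neuron 1 falls out of the window
```

The larger the overlap between consecutive active sets, the smaller `to_load` becomes, which is the effect the technique relies on.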

Row-column bundling exploits activation sparsity in the feed-forward layers. In networks with ReLU-family activations, many neurons produce zero output for any given input, so their weights never need to be read. The paper pairs a lightweight predictor, which guesses which neurons will fire before their weights are fetched, with a storage layout that keeps each neuron’s up-projection column and down-projection row next to each other in flash, so one larger contiguous read replaces two smaller scattered ones. The prediction does not need to be perfect; even a moderately accurate sparsity predictor removes enough small random reads to produce significant throughput gains.
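To make the storage layout concrete, here is a small sketch that writes each feed-forward neuron's up-projection column and down-projection row as one contiguous record, then fetches a neuron with a single seek and read. The dimensions, the in-memory `io.BytesIO` stand-in for a flash file, and the helper names are all assumptions for illustration.

```python
import io
import random
import struct

D_MODEL, D_FF = 4, 16  # toy sizes, chosen only for the example
random.seed(0)

# w_up[r][i]: up projection, column i belongs to neuron i.
w_up = [[random.uniform(-1, 1) for _ in range(D_FF)] for _ in range(D_MODEL)]
# w_down[i][c]: down projection, row i belongs to neuron i.
w_down = [[random.uniform(-1, 1) for _ in range(D_MODEL)] for _ in range(D_FF)]

# One record = up-projection column followed by down-projection row.
RECORD = struct.Struct(f"<{2 * D_MODEL}d")


def build_bundled_file() -> io.BytesIO:
    """Write one contiguous record per neuron (stand-in for a flash file)."""
    buf = io.BytesIO()
    for i in range(D_FF):
        up_col = [w_up[r][i] for r in range(D_MODEL)]
        buf.write(RECORD.pack(*up_col, *w_down[i]))
    return buf


def read_neuron(buf: io.BytesIO, i: int):
    """Fetch both halves of neuron i with a single seek and a single read."""
    buf.seek(i * RECORD.size)
    values = RECORD.unpack(buf.read(RECORD.size))
    return list(values[:D_MODEL]), list(values[D_MODEL:])


flash = build_bundled_file()
up_col_5, down_row_5 = read_neuron(flash, 5)
```

Without bundling, the same neuron would need two reads at unrelated offsets, one per matrix.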

Combined, the paper reports running models up to twice the size of available DRAM, with a 4-5x speedup on CPU and a 20-25x speedup on GPU over naive flash loading. Those numbers come from the paper’s own benchmarks on much smaller models; a 397B model is a different scale entirely, and extrapolating requires care.

Qwen 397B and Mixture of Experts

The Qwen model family from Alibaba has scaled substantially across successive releases. At 397B parameters, there is a strong architectural reason to suspect this is a Mixture of Experts model rather than a dense transformer. MoE architectures at this scale have total parameter counts in the hundreds of billions while activating only a fraction of those parameters per token, often 30-50B active parameters out of the full parameter set.

This matters enormously for flash offloading. If only a fraction of the weights are active for any given token, the effective I/O load per token drops proportionally. A 397B MoE model with 10% activation density has roughly the same per-token weight-read requirement as a 40B dense model at the same quantization. The sparsity is built into the architecture through expert routing rather than something you are predicting probabilistically, which makes selective weight loading far more effective and far more predictable.
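The comparison with a 40B dense model can be checked directly. This sketch uses the 10% activation density from the paragraph above as a hypothetical value, assumes 4-bit weights, and ignores any caching, so it gives an upper bound on bytes read per token.

```python
def per_token_read_gb(total_params_billion: float, active_fraction: float,
                      bits_per_weight: float = 4) -> float:
    """Upper bound on weight bytes touched per token, in decimal GB."""
    active_params = total_params_billion * 1e9 * active_fraction
    return active_params * bits_per_weight / 8 / 1e9


moe_gb = per_token_read_gb(397, 0.10)  # hypothetical 10% activation density
dense_gb = per_token_read_gb(40, 1.0)  # dense model touches every weight

print(f"397B MoE at 10% density: {moe_gb:.2f} GB per token")
print(f"40B dense:               {dense_gb:.2f} GB per token")
```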

The specific activation function matters too. Modern large models favor SwiGLU activations, which multiply a Swish (SiLU) gate by a linear projection. Unlike ReLU, SwiGLU rarely produces exact zeros; many outputs are merely close to zero, so its sparsity is approximate and harder to predict. Activation prediction for row-column bundling therefore works with a noisier signal in SwiGLU-based models, which reduces gains compared to the paper’s best-case scenarios. MoE routing, by contrast, gives you a hard, predictable decision about which expert weights to load.
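A quick experiment shows the difference in sparsity between the two activation families. This sketch looks only at the gate nonlinearity (SiLU) rather than a full SwiGLU layer, feeds it standard-normal pre-activations as an assumed stand-in for real hidden states, and uses an arbitrary 0.05 threshold for "near zero".

```python
import math
import random


def relu(x: float) -> float:
    return max(x, 0.0)


def silu(x: float) -> float:
    """Swish / SiLU: x * sigmoid(x), the gate nonlinearity inside SwiGLU."""
    return x / (1.0 + math.exp(-x))


random.seed(0)
pre_activations = [random.gauss(0.0, 1.0) for _ in range(10_000)]
n = len(pre_activations)

# ReLU yields exact zeros, which can be skipped without changing the output.
relu_exact_zero = sum(relu(x) == 0.0 for x in pre_activations) / n

# SiLU yields small values, not zeros; skipping them is an approximation.
silu_exact_zero = sum(silu(x) == 0.0 for x in pre_activations) / n
silu_near_zero = sum(abs(silu(x)) < 0.05 for x in pre_activations) / n

print(f"ReLU exact zeros:       {relu_exact_zero:.1%}")
print(f"SiLU exact zeros:       {silu_exact_zero:.1%}")
print(f"SiLU |output| < 0.05:   {silu_near_zero:.1%}")
```

With ReLU, roughly half the neurons can be dropped losslessly on this input; with SiLU, any skipped neuron introduces some error, which is why the predictor's signal is noisier.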

The Autoresearching Method

What Willison’s post documents is as much a methodology as a result. He used LLMs, including his own llm command-line tool, to research the Apple paper itself: extracting implementation requirements, identifying which hardware assumptions the paper makes, and bridging the gap between “we achieved a 4-5x speedup over naive loading on our test hardware” and “here is how to set this up for Qwen 397B on an x86 workstation.”

This self-referential loop, using LLMs to understand how to run LLMs better, is becoming a practical engineering workflow. The Apple paper is dense, written for an audience that already understands memory hierarchy optimization and sparse linear algebra. The gap between the paper’s claims and a working implementation on specific hardware involves details the paper does not fully spell out: what quantization formats are compatible with selective layer loading, how the windowing cache interacts with context length, whether the activation predictor needs to be fine-tuned per model family.

Feed the paper to a capable model and ask it to enumerate implementation blockers for a specific hardware configuration, and you get a significantly faster path to a working prototype than working through the paper section by section. Willison has used this approach across multiple technical research projects, and the llm tool’s ability to chain prompts and accumulate extracted context makes it particularly well-suited to the task.

Runtime Support and Hardware

The flash offloading technique requires runtime support that does not yet exist uniformly across the local inference ecosystem. llama.cpp uses mmap for weight loading, which lets the operating system page weights in from disk on demand. This is related to the Apple technique but meaningfully different: OS paging does not perform activation prediction or row-column bundling. It loads pages only when they are accessed, producing exactly the random access pattern the Apple paper is designed to avoid.

Full implementations of the paper’s approach require custom memory management that predicts which weight chunks to prefetch before they are needed. This is non-trivial engineering. The gap between what the paper describes and what available runtimes implement is real, and part of what makes Willison’s result interesting is navigating that gap to something functional.
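One small piece of such a memory manager is turning a predicted set of weight chunks into as few large reads as possible. The following is a generic sketch of read coalescing, not code from the paper or from any existing runtime; `coalesce_reads` and the 4096-byte chunk size are hypothetical.

```python
from typing import Iterable, List, Tuple


def coalesce_reads(chunk_ids: Iterable[int], chunk_size: int) -> List[Tuple[int, int]]:
    """Merge adjacent chunk ids into (offset, length) ranges for contiguous reads."""
    ranges: List[Tuple[int, int]] = []
    for cid in sorted(set(chunk_ids)):
        offset = cid * chunk_size
        if ranges and ranges[-1][0] + ranges[-1][1] == offset:
            # Chunk directly follows the previous range: extend it.
            start, length = ranges[-1]
            ranges[-1] = (start, length + chunk_size)
        else:
            ranges.append((offset, chunk_size))
    return ranges


# Six predicted chunks arrive in arbitrary order but collapse into two reads.
predicted = [7, 3, 4, 5, 9, 8]
plan = coalesce_reads(predicted, chunk_size=4096)
```

Demand paging through mmap would service those six chunks as they are touched, in access order. A prefetcher that knows the set in advance can issue two larger reads instead.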

On the hardware side, NVMe drive selection matters more than it typically does for workloads that fit in RAM. PCIe 5.0 drives now achieve 12-14 GB/s sequential read speeds, roughly double what PCIe 4.0 drives offered. For flash-offloaded inference, sequential read throughput is the binding constraint once the windowing cache and activation prediction are working correctly. A faster drive translates directly to lower per-token latency. The drive being the performance bottleneck, rather than the CPU or GPU, is an unusual situation that requires rethinking the standard hardware upgrade calculus.

What the Practical Numbers Look Like

For a 397B model on a system with 64GB of RAM and a fast PCIe 5.0 NVMe drive, realistic throughput lands somewhere in the 0.5-2 tokens per second range, depending heavily on the model’s MoE routing density and how well the activation predictor is tuned. That is not conversational speed. It is functional for document processing, batch generation runs, overnight research tasks, or any workload where generation quality matters more than generation speed.
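A back-of-the-envelope model shows how estimates in that range arise when the drive is the bottleneck. Every input here is an assumption for illustration: 20 GB of weights touched per token (about 40B active parameters at 4 bits), drive bandwidth at the PCIe 5.0 figures quoted earlier, and a RAM cache hit rate picked by hand. Compute time and read overheads are ignored.

```python
def tokens_per_second(read_gb_per_s: float, per_token_gb: float,
                      cache_hit_rate: float) -> float:
    """Storage-bound throughput estimate; ignores compute and I/O overheads."""
    flash_gb = per_token_gb * (1.0 - cache_hit_rate)
    if flash_gb <= 0:
        raise ValueError("nothing is read from flash; the model is not storage-bound")
    return read_gb_per_s / flash_gb


# Assumed scenarios: (drive GB/s, GB touched per token, RAM cache hit rate).
cold = tokens_per_second(14.0, 20.0, 0.0)  # every weight comes from flash
warm = tokens_per_second(12.0, 20.0, 0.5)  # half the reads are served from RAM

print(f"no cache hits:  {cold:.2f} tokens/s")
print(f"50% cache hits: {warm:.2f} tokens/s")
```

Halving the bytes that have to come from flash doubles the estimate, which is why routing density and cache tuning dominate the outcome.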

The structural implication is worth taking seriously. If the binding constraint for local model size shifts from RAM or VRAM to storage capacity, the economics change substantially. You can add 4TB of fast NVMe storage for a few hundred dollars, and that storage is shared with other workloads and reusable as models change. Adding GPU VRAM is expensive, specialized, and non-fungible. A 397B model that runs locally at 1 token per second on commodity hardware occupies a different position in the local AI landscape than a 70B model that runs at 30 tokens per second; the quality ceiling has moved, and the cost to reach it has not scaled proportionally.

The original Apple paper is worth reading on its own terms regardless of whether you intend to implement the technique. The analysis of flash I/O behavior, the treatment of activation sparsity as a prefetching signal, and the careful reasoning about which memory hierarchy assumptions LLM inference actually relies on all generalize to other engineering problems. It is a good example of academic systems work that is immediately applicable without requiring significant adaptation.
