Running 397 Billion Parameters Off an SSD: What Apple's Flash Inference Technique Actually Delivers
Source: simonwillison
Simon Willison documented an experiment last week that has a pleasing recursive quality: he used a language model to help him research Apple’s “LLM in a Flash” paper, then applied the paper’s ideas to run Qwen 397B locally. The model assists with understanding the research; the research enables running the model. The technical substance behind this loop is worth unpacking carefully.
The Memory Bandwidth Wall
Autoregressive LLM inference is memory-bandwidth-bound, not compute-bound. Every token generation requires reading all active model weights once from wherever they live. For a 7B parameter model in float16, that is 14 GB of data read per token. Modern Apple Silicon (M3 Max) sustains around 400 GB/s of DRAM bandwidth, which puts the theoretical ceiling at roughly 28 tokens per second for a 7B model. Because this is a ceiling, observed decode speeds land at or somewhat below it.
NVMe flash on the same hardware sustains around 7 GB/s for sequential reads. Naive inference from flash alone would yield about 0.5 tokens per second for a 7B model. That gap of roughly 57x between DRAM and flash bandwidth is the problem Apple’s paper, “LLM in a Flash: Efficient Large Language Model Inference with Limited Memory” (Alizadeh et al., December 2023), was written to close.
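The ceilings quoted above follow from one division: bandwidth over bytes read per token. A minimal sketch using the round figures from the text (400 GB/s DRAM, 7 GB/s flash, 2 bytes per float16 parameter); the function name is illustrative, and the model ignores compute time, KV-cache traffic, and everything else that keeps real systems below the bound.

```python
def tokens_per_second_ceiling(params_billion: float,
                              bytes_per_param: float,
                              bandwidth_gb_s: float) -> float:
    """Upper bound on decode speed if every weight is read once per token."""
    # 1e9 parameters * bytes per parameter / 1e9 bytes per GB
    gb_per_token = params_billion * bytes_per_param
    return bandwidth_gb_s / gb_per_token


dram = tokens_per_second_ceiling(7, 2, 400)   # 7B float16 served from DRAM
flash = tokens_per_second_ceiling(7, 2, 7)    # same model served from NVMe

print(f"DRAM ceiling:  {dram:.1f} tok/s")
print(f"Flash ceiling: {flash:.1f} tok/s")
```

Swapping in a different parameter count, precision, or bandwidth figure gives the corresponding bound for other hardware.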
What the Paper Actually Proposes
The core insight is that LLM inference is sparse. In feed-forward networks with ReLU or SwiGLU activations, typically 70-90% of neurons produce negligible output for any given input token. You do not need all weights in DRAM simultaneously; you need the weights for the neurons that matter right now. The challenge is getting those weights from flash fast enough to matter.
The paper describes three interlocking techniques:
Windowing. Transformer inference is sequential, and the weights used for the previous token are likely relevant to the current one. Rather than evicting weight data from DRAM immediately after each forward pass, a sliding cache of recently-used weights persists in memory. Locality of reference in sequential text generation is high enough that even a modest DRAM window achieves substantial hit rates and eliminates a large fraction of redundant flash reads.
Row-column bundling. Flash storage is efficient for large sequential reads and expensive for small random ones. Individual neuron weights map to single rows or columns of weight matrices, which are small. The paper proposes grouping these into larger read units aligned to flash page sizes, typically 16 KB or more. This converts many small random reads into fewer large sequential reads, exploiting the flash device’s internal parallelism and read-ahead hardware. The paper reports 2-5x throughput improvement from this transformation alone.
Sparsity prediction. Before loading weights for an FFN layer, a small predictor network runs first to identify which neurons will produce non-negligible outputs. Only those weights are fetched from flash. At 90% activation sparsity, this reduces flash I/O by roughly 10x for the FFN layers that dominate model size.
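To see how sparsity prediction and windowing interact, the toy simulation below counts how many neurons would have to be fetched from flash with and without a sliding window over the last few tokens. It is a sketch under loud assumptions: there is no real predictor here, and the `carry_over` locality parameter (the fraction of active neurons reused from one token to the next) is an invented stand-in, not a measured value from the paper.

```python
import random
from collections import deque


def simulate_window(num_neurons=4096, active_per_token=400, window=5,
                    tokens=200, carry_over=0.8, seed=0):
    """Return (naive_loads, windowed_loads) in units of neurons fetched."""
    rng = random.Random(seed)
    prev = set(rng.sample(range(num_neurons), active_per_token))
    recent = deque(maxlen=window)   # active sets of the last `window` tokens
    naive_loads = windowed_loads = 0

    for _ in range(tokens):
        # Assumed locality model: reuse part of the previous active set...
        keep = set(rng.sample(sorted(prev), int(carry_over * active_per_token)))
        # ...and draw the remainder at random.
        fresh = set()
        while len(keep) + len(fresh) < active_per_token:
            n = rng.randrange(num_neurons)
            if n not in keep:
                fresh.add(n)
        active = keep | fresh       # what a predictor would flag as needed

        resident = set().union(*recent)            # neurons still in DRAM
        windowed_loads += len(active - resident)   # only misses touch flash
        naive_loads += len(active)                 # no cache: fetch them all

        recent.append(active)
        prev = active

    return naive_loads, windowed_loads


naive, windowed = simulate_window()
print(f"naive fetches: {naive}, windowed fetches: {windowed}")
```

The savings depend entirely on the assumed locality; with `carry_over=0.0` the window helps far less, which is the point of measuring real hit rates before sizing the cache.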
Combined, these techniques let the paper run models up to twice the size of available DRAM, with reported inference speedups of 4-5x on CPU and 20-25x on GPU over naive flash loading. The paper was built around Apple Silicon’s unified memory architecture and fast integrated NVMe, but the principles generalize to any system pairing fast flash storage with an intelligent caching layer.
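Row-column bundling is easiest to picture as a storage layout. The sketch below assumes the bundle for neuron i pairs that neuron's up-projection weights with its down-projection weights in one contiguous record, so a single read returns both halves. The toy sizes, the `bundle` and `read_neuron` helpers, and the in-memory byte buffer standing in for a flash file are all illustrative.

```python
import struct

D_MODEL = 8        # toy hidden size
NUM_NEURONS = 4    # toy FFN width
RECORD_BYTES = 2 * D_MODEL * struct.calcsize("f")   # up half + down half


def bundle(up, down):
    """Pack each neuron's up and down weights side by side in one buffer."""
    buf = bytearray()
    for u, d in zip(up, down):
        buf += struct.pack(f"{2 * D_MODEL}f", *u, *d)
    return bytes(buf)


def read_neuron(blob, i):
    """One contiguous slice yields both halves of neuron i."""
    start = i * RECORD_BYTES
    values = struct.unpack(f"{2 * D_MODEL}f", blob[start:start + RECORD_BYTES])
    return list(values[:D_MODEL]), list(values[D_MODEL:])


# Toy weights chosen so each value identifies its neuron and position.
up = [[float(n * 10 + k) for k in range(D_MODEL)] for n in range(NUM_NEURONS)]
down = [[float(-(n * 10 + k)) for k in range(D_MODEL)] for n in range(NUM_NEURONS)]

blob = bundle(up, down)
u2, d2 = read_neuron(blob, 2)
print(u2[:3], d2[:3])
```

Without bundling, the same lookup would need two reads at unrelated offsets, one per matrix; the bundled record doubles the size of each read and halves the number of them.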
Where Qwen 397B Changes the Calculation
Qwen 397B, released by Alibaba’s Qwen team in 2025, is not a dense model. It uses a Mixture-of-Experts (MoE) architecture, and that distinction changes the flash offloading story considerably.
In a dense transformer, every token passes through every feed-forward layer’s full weight matrix. In a MoE model, each FFN block contains many expert sub-networks, and a learned router selects only a small subset per token. For a model with 64 experts routing to 8 active per token, only 12.5% of the expert weights are read for any given token. The remaining expert blocks sit dormant.
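Top-k routing can be sketched in a few lines. The example uses the illustrative 64-expert, 8-active configuration from the paragraph above (not a statement about Qwen's actual router) and random numbers in place of learned router logits.

```python
import random


def route(router_logits, k):
    """Return the indices of the k highest-scoring experts for one token."""
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    return ranked[:k]


NUM_EXPERTS, TOP_K = 64, 8
rng = random.Random(0)
logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]  # stand-in scores

active = route(logits, TOP_K)
fraction_read = len(active) / NUM_EXPERTS
print(f"{len(active)} of {NUM_EXPERTS} experts active "
      f"({fraction_read:.1%} of expert weights read)")
```

Only the experts in `active` need their weights in fast memory for this token; the other 56 can stay wherever they are.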
At 16-bit precision (bfloat16), Qwen 397B occupies roughly 794 GB, which is infeasible on any single consumer device. At 4-bit quantization (Q4), this drops to around 200 GB. At Q2, around 100 GB. But the MoE architecture means the effective bytes read per token are a fraction of even the quantized total. A 397B MoE model reading only 12.5% of its expert weights per token has an effective per-token memory access profile closer to that of a 50-60B dense model, while retaining the capability of the full parameter count for the reasoning tasks those parameters support.
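These footprints are straightforward arithmetic on the parameter count. The sketch below uses nominal bits per weight and therefore slightly understates real quantized files, which also store per-block scale factors. The last line treats every parameter as an expert weight, a simplification, since attention and any shared layers are read for every token.

```python
def footprint_gb(params_billion: float, bits_per_weight: float) -> float:
    """Nominal weight storage in GB, ignoring quantization metadata."""
    return params_billion * bits_per_weight / 8


PARAMS_B = 397
for label, bits in [("bf16", 16), ("Q4", 4), ("Q2", 2)]:
    print(f"{label:>4}: {footprint_gb(PARAMS_B, bits):7.2f} GB")

# Illustrative routing ratio from the text: 8 of 64 experts per token.
active_params_b = PARAMS_B * 8 / 64
print(f"active parameters per token (lower bound): {active_params_b:.1f}B")
```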
This is precisely the kind of structured sparsity that Apple’s paper is designed to exploit. MoE sparsity is not a scattering of inactive neurons inside a dense FFN; it takes the form of entire expert blocks that are inactive for a given token, and whose weights can stay on flash until the router selects them. The windowing cache can hold the most recently routed experts in DRAM, since routing tends to be locally consistent across tokens in the same semantic region. The predictor network, in a MoE context, is essentially replaced by the router itself.
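At expert granularity, the windowing idea reduces to a capacity-bounded cache keyed by expert id. The sketch below uses least-recently-used eviction, which is one plausible policy rather than anything the paper or a particular runtime prescribes; `ExpertCache` and the `load_from_flash` callable are hypothetical names, and the loader here returns a placeholder string instead of real weights.

```python
from collections import OrderedDict


class ExpertCache:
    """LRU cache of expert weights in DRAM; a miss stands in for a flash read."""

    def __init__(self, capacity, load_from_flash):
        self.capacity = capacity                  # assumed to be at least 1
        self.load_from_flash = load_from_flash    # hypothetical loader callable
        self.resident = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, expert_id):
        if expert_id in self.resident:
            self.resident.move_to_end(expert_id)  # mark as most recently used
            self.hits += 1
            return self.resident[expert_id]
        self.misses += 1
        weights = self.load_from_flash(expert_id)
        self.resident[expert_id] = weights
        if len(self.resident) > self.capacity:
            self.resident.popitem(last=False)     # evict least recently used
        return weights


cache = ExpertCache(capacity=3, load_from_flash=lambda e: f"weights-{e}")
for expert_id in [1, 2, 3, 1, 2, 4, 1, 3]:       # made-up routing sequence
    cache.get(expert_id)

print(f"hits={cache.hits} misses={cache.misses} resident={list(cache.resident)}")
```

In a real system the routing sequence comes from the model's router, and the hit rate on that sequence determines how much flash traffic the cache actually removes.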
What Actually Runs This
Two tools are most relevant for running Qwen 397B locally on Apple Silicon. llama.cpp supports GGUF-format quantized models with mmap-based loading, letting the OS transparently page weights from SSD when DRAM is exhausted. The OS page cache acts as a natural sliding window in the spirit of the paper’s windowing technique, though without the predictor-network sparsity optimization.
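The mmap behavior is worth seeing in isolation. This is not llama.cpp's code, just a standard-library sketch of the same operating-system mechanism: mapping a file costs almost nothing up front, and bytes are read from disk only when a page is touched. The temporary file of one million float32 values is a stand-in for a weights file.

```python
import mmap
import os
import struct
import tempfile
from array import array

NUM_FLOATS = 1_000_000

# Write a stand-in "weights" file where value i is simply float(i).
with tempfile.NamedTemporaryFile(delete=False, suffix=".bin") as f:
    array("f", (float(i) for i in range(NUM_FLOATS))).tofile(f)
    path = f.name

with open(path, "rb") as f:
    # Length 0 maps the whole file; nothing is read from disk yet.
    weights = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    mapped_bytes = len(weights)

    # Touching one value faults in only the page that contains it.
    index = 750_000
    (value,) = struct.unpack_from("f", weights, index * struct.calcsize("f"))
    weights.close()

os.remove(path)
print(f"mapped {mapped_bytes} bytes, weights[{index}] = {value}")
```

Pages that are touched repeatedly stay in the OS page cache while memory allows, which is what makes the cache behave like an unmanaged version of the windowing technique.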
MLX-LM, the language-model package built on Apple’s MLX framework for Apple Silicon, operates on unified memory and handles quantized model loading with support for streaming weights from the filesystem. On a Mac Studio with an Ultra-class chip and 192 GB of unified memory, a Q4 Qwen 397B model fits almost entirely in DRAM with minimal SSD offloading required, and inference reaches 5-10 tokens per second. On a MacBook Pro with 64-128 GB, Q3 or Q2 quantization brings the model into range with more aggressive SSD offloading, at the cost of throughput falling to 1-3 tokens per second and some quality degradation.
Neither tool fully implements Apple’s bundling and sparsity-prediction optimizations as described in the paper. The potential gain from doing so for MoE models, with their structured and predictable sparsity, is substantial. This is an open engineering gap.
The Autoresearch Workflow
Willison used his llm CLI tool to feed Apple’s arXiv paper through a capable model, extract the key techniques, and build a working understanding before running the experiment. The pattern he calls autoresearching uses an LLM to help you understand the research that makes running LLMs better.
The practical value here extends beyond the recursive appeal. Most people experimenting with local inference are not reading systems papers. The gap between running commands from a README and understanding the memory access patterns behind the tool is significant, and LLM-assisted paper reading lowers the friction enough that more people end up with actionable understanding rather than just a running model.
There is also a secondary compounding effect: when capable models assist in synthesizing inference optimization research, the techniques in papers like LLM in a Flash get faster uptake in tooling and community practice. The Apple paper was published in late 2023. Its core ideas are only now, in 2026, beginning to influence consumer-facing inference tools. Shortening that lag matters for everyone trying to run large models on their own hardware.
Where the Ceiling Sits Now
The combination of flash inference techniques, MoE architectures, and aggressive quantization makes a credible case that models in the 200-400B parameter range can run on high-end consumer hardware. Not fast enough for real-time applications, but fast enough for document analysis, extended reasoning, or batch processing where latency is not the primary constraint.
The numbers that determine feasibility are not the headline parameter counts. They are the active parameters per token (shaped by MoE routing), the quantized weight footprint (shaped by quantization strategy), and the effective flash read volume per token (shaped by caching and sparsity). Qwen 397B looks substantially more tractable than a 397B dense model on all three of those dimensions.
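Those three quantities combine into a rough flash-bound estimate. The sketch below is a back-of-envelope model, not a benchmark: the 12.5% active fraction and 7 GB/s flash figure are the illustrative numbers used earlier, the 80% cache hit rate is an outright assumption, and DRAM read time, compute, and non-expert weights are ignored.

```python
def flash_bound_tokens_per_second(total_params_billion, active_fraction,
                                  bits_per_weight, cache_hit_rate,
                                  flash_gb_s):
    """Decode-speed bound when only cache misses are read from flash."""
    # Bytes touched per token, shaped by routing and quantization.
    active_gb = total_params_billion * active_fraction * bits_per_weight / 8
    # Portion of those bytes that misses DRAM and must come from flash.
    flash_gb_per_token = active_gb * (1.0 - cache_hit_rate)
    if flash_gb_per_token == 0:
        return float("inf")   # everything served from DRAM
    return flash_gb_s / flash_gb_per_token


estimate = flash_bound_tokens_per_second(
    total_params_billion=397,
    active_fraction=0.125,    # 8 of 64 experts, illustrative
    bits_per_weight=4,        # nominal Q4
    cache_hit_rate=0.8,       # assumed, not measured
    flash_gb_s=7,
)
print(f"flash-bound estimate: {estimate:.2f} tok/s")
```

Under these assumptions the bound comes out at about 1.4 tokens per second, and it moves sharply with the hit rate, which is why caching policy matters as much as raw SSD speed.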
Willison’s experiment is useful documentation of where that ceiling sits right now and which tools get you there. The techniques in Apple’s paper represent a clear roadmap for pushing it further, and the autoresearch workflow he describes is a practical approach to staying current with that roadmap as the inference tooling catches up with the research.