Flash Memory, 397B Parameters, and the Arithmetic of Local LLM Inference
Source: simonwillison
The setup is a little recursive. Simon Willison used his LLM-assisted research workflow to dig into Apple’s “LLM in a Flash” paper while exploring whether Qwen 397B could run on local hardware. His writeup is worth reading for the methodology as much as the conclusions: the autoresearch pattern he has refined over the past few years is genuinely useful for navigating dense machine learning literature. The technical questions underneath are more interesting still, because the paper was designed for 7B models on iPhones, and applying it to a 397B parameter model is a different class of problem with different constraints.
What the Paper Actually Does
Apple’s “LLM in a Flash: Efficient Large Language Model Inference with Limited Memory” (arXiv:2312.11514, December 2023) addresses a specific bottleneck: running a model whose parameters exceed available DRAM. The authors, a team from Apple including Keivan Alizadeh and Iman Mirzadeh, were thinking about iPhones and iPads with 6-8 GB of DRAM, not developer workstations.
The basic idea is that you store model weights on NAND flash (SSD or soldered storage) and load them into DRAM on demand during inference. The obstacle is bandwidth. DRAM on Apple Silicon does 100+ GB/s. NVMe flash on the same machines reads sequentially at around 7 GB/s and handles random small reads far worse, closer to 0.1 GB/s for 4KB blocks. If you naively load every weight for every forward pass, a 7B float16 model (14 GB of weights) that takes 0.14 seconds per token from DRAM takes about 2 seconds per token from flash even with ideal sequential reads, and far longer if the reads are small and random.
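The arithmetic behind those figures is just bytes divided by bandwidth. A minimal sketch, assuming the round bandwidth numbers quoted above and ignoring compute time entirely (the function name is illustrative):

```python
GB = 1e9

def seconds_per_token(n_params, bytes_per_param, bandwidth_gb_s):
    """Time to stream every weight once at the given bandwidth (no compute)."""
    return n_params * bytes_per_param / (bandwidth_gb_s * GB)

PARAMS_7B = 7e9
FP16_BYTES = 2

dram = seconds_per_token(PARAMS_7B, FP16_BYTES, 100)        # DRAM at ~100 GB/s
flash_seq = seconds_per_token(PARAMS_7B, FP16_BYTES, 7)     # sequential NVMe reads
flash_rand = seconds_per_token(PARAMS_7B, FP16_BYTES, 0.1)  # random 4KB reads

print(f"DRAM {dram:.2f} s, sequential flash {flash_seq:.1f} s, "
      f"random flash {flash_rand:.0f} s")
```

The gap between the last two numbers is why the paper cares as much about read size as about read volume.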
The paper’s solution has three main components.
First, they exploit sparsity in the FFN (feed-forward network) layers. Transformer architectures that use ReLU activations are naturally sparse: for any given input token, roughly 90-95% of neurons in each FFN layer produce zero output. The paper trains a small predictor network that fits in DRAM and estimates, before the main forward pass, which neurons will actually activate. The system then skips loading weights for dormant neurons entirely. The predictor adds around 1-2% overhead but avoids transferring 90%+ of the FFN weight data per token.
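To see why skipping dormant neurons is safe, here is a small pure-Python sketch of a ReLU FFN. It is an illustration under stated assumptions, not the paper's implementation: the "predictor" is an oracle that peeks at the true pre-activations (the paper trains a small network to guess them instead), and the sizes, bias value, and function names are all made up for the example.

```python
import random

random.seed(0)
D_MODEL, N_NEURONS = 16, 64

# One row of `up` and one row of `down` per FFN neuron.
up = [[random.gauss(0, 1) for _ in range(D_MODEL)] for _ in range(N_NEURONS)]
down = [[random.gauss(0, 1) for _ in range(D_MODEL)] for _ in range(N_NEURONS)]
bias = [-5.0] * N_NEURONS  # negative bias so that most neurons stay at zero
x = [random.gauss(0, 1) for _ in range(D_MODEL)]

def preact(n):
    return sum(w * xi for w, xi in zip(up[n], x)) + bias[n]

def ffn(neurons):
    """Compute the FFN output touching only the weights of `neurons`."""
    out = [0.0] * D_MODEL
    for n in neurons:
        h = max(0.0, preact(n))  # ReLU
        for j in range(D_MODEL):
            out[j] += h * down[n][j]
    return out

# Oracle stand-in for the predictor: neurons whose ReLU output is non-zero.
active = [n for n in range(N_NEURONS) if preact(n) > 0]

dense_out = ffn(range(N_NEURONS))
sparse_out = ffn(active)
max_diff = max(abs(a - b) for a, b in zip(dense_out, sparse_out))
print(f"{len(active)} of {N_NEURONS} neurons active, max difference {max_diff:.2e}")
```

With a perfect predictor the sparse result matches the dense one, because a neuron whose ReLU output is zero contributes nothing to the sum. A real predictor makes mistakes, so the trade is a small accuracy risk for a large cut in weight traffic.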
Second, they restructure how weights are laid out on flash to enable larger reads. Flash memory handles sequential reads of 512KB-1MB blocks efficiently; random 4KB reads at the same total data size can be 100x slower. The paper’s row-column bundling stores the up-projection column and the down-projection row that belong to the same FFN neuron next to each other, so each neuron the predictor flags as needed costs one larger read instead of two small ones. In some cases it is worth transferring slightly more data than strictly necessary to keep the requests few and large.
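A rough sketch of what that layout buys, assuming a hypothetical on-disk format in which each FFN neuron's up-projection and down-projection weights are stored as one contiguous record, ordered by neuron index. The function name and record format are illustrative, not taken from the paper; merging adjacent records into one request is an extra step added here to show the effect of contiguity.

```python
def bundled_read_ranges(active_neurons, d_model, bytes_per_weight=2):
    """Return (offset, length) reads for the active neurons, merging neighbours."""
    record = 2 * d_model * bytes_per_weight  # up weights + down weights together
    ranges = []
    for n in sorted(active_neurons):
        start = n * record
        if ranges and ranges[-1][0] + ranges[-1][1] == start:
            # Adjacent to the previous read: extend it instead of issuing a new one.
            ranges[-1] = (ranges[-1][0], ranges[-1][1] + record)
        else:
            ranges.append((start, record))
    return ranges

reads = bundled_read_ranges({3, 4, 5, 9, 20, 21}, d_model=4096)
for offset, length in reads:
    print(f"read {length // 1024} KB at offset {offset}")
```

Six active neurons become three requests of 16-48 KB. Without bundling, the same neurons would need twelve separate 8 KB reads, two per neuron.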
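Third, they use a sliding window over recent tokens to cut repeated loading. The paper’s windowing technique keeps the weights of neurons activated by the last few tokens resident in DRAM, loads only the neurons that are newly needed for the current token, and evicts those that fall out of the window. Because consecutive tokens tend to activate overlapping sets of neurons, this reduces flash traffic while keeping DRAM use bounded.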
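In the paper, the sliding window is applied to neuron activations: weights for neurons used by the last few tokens stay in DRAM, and only the difference is fetched from flash. A minimal sketch of that bookkeeping, with a hypothetical class name and toy neuron ids (real activation sets would come from the predictor):

```python
from collections import deque

class NeuronWindow:
    """Tracks which neurons are resident in DRAM for the last `window_size` tokens."""

    def __init__(self, window_size):
        self.recent = deque(maxlen=window_size)  # oldest set drops off automatically

    def resident(self):
        return set().union(*self.recent)

    def step(self, active):
        """Record this token's active neurons; return the ones to load from flash."""
        to_load = set(active) - self.resident()
        self.recent.append(set(active))
        return to_load

window = NeuronWindow(window_size=2)
loads = [window.step(a) for a in ({1, 2, 3}, {2, 3, 4}, {3, 4, 5}, {1, 5})]
print([sorted(l) for l in loads])
```

After the first token, each step loads a single neuron rather than the full active set. Neuron 1 has to be reloaded on the fourth token because it fell out of the two-token window.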
The combined result is a reported 4-5x speedup on CPU (and 20-25x on GPU) over naive flash loading, getting a 7B model to 5-10 tokens per second on M2-class hardware with limited DRAM.
The Bandwidth Math for 397B
Qwen 397B is a Mixture-of-Experts (MoE) model from Alibaba’s Qwen team, with 397 billion total parameters and approximately 30 billion active parameters per forward pass. That ratio is the key number, and it matters enormously for flash inference.
In float16, 397B parameters occupy around 794 GB. In 4-bit quantization, that drops to roughly 200 GB. If you had to load all 200 GB per token, no consumer hardware could handle it: even at 7 GB/s sequential bandwidth, that is 28 seconds per token before any compute.
MoE routing changes the picture significantly. Each token is routed to a small subset of expert FFN modules, typically 2-8 experts out of 64-128 total. With 30B active parameters per token, you are loading something closer to 15 GB in 4-bit per forward pass, not 200 GB. At 7 GB/s that is 2-3 seconds per token, which is still slow but approaches usable for non-interactive tasks.
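The numbers in this section can be reproduced in a few lines. This is a sketch under the text's own assumptions (397B total parameters, roughly 30B active, 7 GB/s sequential reads), treating every active weight as a fresh read from flash with no caching:

```python
def weight_gb(n_params, bits_per_param):
    """Storage needed for the weights, in GB (10^9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9

TOTAL_PARAMS = 397e9
ACTIVE_PARAMS = 30e9   # approximate, as stated in the text
NVME_GB_S = 7.0

fp16_gb = weight_gb(TOTAL_PARAMS, 16)
q4_gb = weight_gb(TOTAL_PARAMS, 4)
active_q4_gb = weight_gb(ACTIVE_PARAMS, 4)

print(f"float16: {fp16_gb:.0f} GB, 4-bit: {q4_gb:.1f} GB, "
      f"active per token at 4-bit: {active_q4_gb:.0f} GB")
print(f"all weights per token: {q4_gb / NVME_GB_S:.1f} s, "
      f"active only: {active_q4_gb / NVME_GB_S:.1f} s")
```

In practice some experts will already sit in DRAM from earlier tokens, so the active-only figure is an upper bound for this bandwidth rather than a prediction.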
The Apple paper’s predictor network adds a second layer of sparsity on top of the MoE routing, skipping dormant neurons within each active expert. If the within-expert sparsity from the predictor matches what the paper reports for dense models (90%+), then MoE routing plus neuron-level sparsity could reduce per-token data loading by another 10x. Whether that holds across MoE experts with their different weight distributions is an open question, and the paper does not address it.
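A quick way to see how the two kinds of sparsity would compound, if the within-expert sparsity does carry over. The `ffn_share` parameter is a hypothetical knob for how much of the active weight data sits in FFN layers where the predictor applies; the 0.8 value is illustrative, not measured.

```python
def per_token_gb(active_gb, ffn_share, neuron_keep):
    """GB read per token when only `neuron_keep` of the FFN weights are loaded."""
    return active_gb * ((1 - ffn_share) + ffn_share * neuron_keep)

ACTIVE_Q4_GB = 15.0   # 30B active parameters at 4 bits, from the text
NVME_GB_S = 7.0

# Best case: every active weight is FFN weight and 90% of neurons are skipped.
best_case_gb = per_token_gb(ACTIVE_Q4_GB, ffn_share=1.0, neuron_keep=0.10)
# Hedged case: 20% of active weights (attention and similar) are always loaded.
hedged_gb = per_token_gb(ACTIVE_Q4_GB, ffn_share=0.8, neuron_keep=0.10)

for label, gb in (("best case", best_case_gb), ("hedged", hedged_gb)):
    print(f"{label}: {gb:.1f} GB per token, {gb / NVME_GB_S:.2f} s at 7 GB/s")
```

The full 10x only shows up if nearly all of the active weights are sparsifiable; any always-loaded portion quickly dominates the per-token traffic.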
Apple Silicon’s flash controller also behaves differently from PC NVMe. Apple’s implementation achieves around 10-15 microseconds of latency for random reads, versus 50-100 microseconds on typical PC NVMe drives. The predictor-based approach generates many read requests per token (one per non-dormant neuron cluster), so latency compounds across them. The lower per-request latency on Apple’s controller matters more than the raw sequential bandwidth number for this workload pattern.
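To make the latency point concrete, here is a rough lower bound on per-token time from request latency alone. The request count is a hypothetical figure chosen for illustration, and the model ignores transfer time; only the latency values come from the ranges quoted above.

```python
def latency_floor_s(n_requests, latency_us, queue_depth=1):
    """Seconds spent waiting on request latency, with `queue_depth` reads in flight."""
    return n_requests * latency_us / 1e6 / queue_depth

REQUESTS_PER_TOKEN = 10_000  # hypothetical: one read per active neuron cluster

apple = latency_floor_s(REQUESTS_PER_TOKEN, latency_us=15)
pc = latency_floor_s(REQUESTS_PER_TOKEN, latency_us=100)
pc_qd32 = latency_floor_s(REQUESTS_PER_TOKEN, latency_us=100, queue_depth=32)

print(f"Apple controller: {apple:.2f} s, PC NVMe: {pc:.2f} s, "
      f"PC NVMe with 32 reads in flight: {pc_qd32:.3f} s")
```

Issuing reads in parallel hides much of the latency on either platform, so the advantage is largest for implementations that read serially or with shallow queues.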
The Tools Currently Available
Neither of the two dominant local inference tools, llama.cpp and Apple’s MLX framework, implements the predictor network approach from the paper.
llama.cpp handles oversized models by memory-mapping weights with mmap(), letting the OS page cache handle loading from disk. When model size exceeds DRAM, the kernel evicts cold pages and loads hot ones on demand. This works but is reactive: the OS has no knowledge of neuron activation patterns and cannot prefetch the right weights in advance. For a 397B model on a machine with 192 GB of unified memory (an M3 Ultra configuration), roughly 200 GB of 4-bit weights sit right at the edge of what fits, so only a small fraction, if any, has to be paged in from flash. A slightly more aggressive quantization fits outright and sidesteps the flash bottleneck for that hardware tier.
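The mmap behaviour can be seen in miniature with Python's standard mmap module. This is only an analogy for what llama.cpp does in C++: the 1 MiB temporary file stands in for a weight file, and the offsets are arbitrary.

```python
import mmap
import os
import tempfile

# Stand-in "weight file": 1 MiB of repeating bytes. Real model files are far larger.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(bytes(range(256)) * 4096)
    path = f.name

with open(path, "rb") as f:
    weights = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    # Mapping the file reads nothing by itself. Touching a slice makes the kernel
    # fault in only the pages that back it, and it may evict them again later.
    offset = 512 * 1024
    chunk = weights[offset:offset + 16]
    weights.close()

os.remove(path)
print(list(chunk))
```

The program never says which pages it will need next; the kernel only finds out when a page fault happens. That is the reactive behaviour the predictor-based approach is meant to replace.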
MLX takes a more structured approach. Its lazy evaluation model compiles operations into Metal compute graphs that fuse operations and minimize redundant memory bandwidth. The mlx-lm library handles quantized inference for Qwen and most major model families, with 4-bit conversion via mlx_lm.convert. Running a model that has already been converted looks like this:
pip install mlx-lm
mlx_lm.generate --model mlx-community/Qwen2.5-7B-Instruct-4bit --prompt "Explain flash memory inference"
On an M2 Pro with a 7B model fully in DRAM, MLX at 4-bit reaches around 55 tokens per second, compared to llama.cpp’s approximately 35 tokens per second on the same hardware. The gap comes from better fused Metal kernels and the unified memory architecture removing explicit CPU-GPU data transfer overhead.
The Apple paper’s predictor network technique remains unimplemented in public tooling. Apple published the research but the implementation went into their operating system software, not an open-source library. Building it from scratch requires training a predictor network for the specific model architecture and integrating it with a custom weight loading pipeline, which is a non-trivial engineering effort for any particular model family.
The Autoresearch Layer
Willison’s approach to reading this paper illustrates what his llm CLI tool is actually useful for. The workflow is roughly:
# Pull the paper, extract its text (pdftotext is part of poppler), and ask targeted questions
curl -sL https://arxiv.org/pdf/2312.11514 -o paper.pdf
pdftotext paper.pdf - | llm "what does the predictor network in this paper do and how is it trained"
# Chain lookups across related work
llm -s "you are helping me understand LLM inference optimization" "how does this compare to llama.cpp's mmap approach" < paper_notes.txt
The tool does not replace reading the paper. What it does is compress the orientation phase: you can locate the key technical claims, figure out which sections to read carefully, and identify the gaps between what the paper demonstrates and what your specific use case requires. For a 12-page ML paper with dense notation, getting the structure in 30 seconds versus 20 minutes is meaningful.
The recursive quality here is worth noting: Willison is using a language model to understand a paper about running language models with less memory. The autoresearch workflow surfaces the structure of the technical argument quickly enough that he can then identify where the paper’s assumptions (a 7B dense model, ReLU activations, a mobile device) diverge from the target scenario (a 397B MoE model, on a Mac).
Where the Practical Ceiling Is
Running Qwen 397B locally at conversational speeds today requires hardware where the model fits in memory, meaning the flash inference techniques are not yet the binding constraint. An M3 Ultra with 192 GB of unified memory can hold a 4-bit quantized 397B MoE model only if the full weight set comes in under that figure; at close to 200 GB, it is right at the edge. In that configuration, most inference happens in DRAM with little or no flash loading.
For flash-based inference to become relevant for a model this size, you need either hardware with less than 200 GB of memory (which describes most machines), or a model large enough that even 192 GB is insufficient. That gap is closing from both directions: flash bandwidth has roughly doubled every three to four years, quantization methods keep improving (recent 2-bit quantization schemes like IQ2_XXS reduce 397B to roughly 100 GB at significant quality cost), and MoE models keep growing while maintaining relatively fixed active parameter counts.
The combination the Apple paper does not explicitly explore is MoE routing plus predictor-based neuron sparsity within each expert. MoE routing already selects 5-15% of the model per token; the predictor network then selects 5-10% of each expert’s neurons. The compounding effect on required flash bandwidth is substantial, and it is the part of this research space that has the most room to run.