Running 397 Billion Parameters Locally: What Apple's Flash Inference Paper Actually Enables
Source: simonwillison
Simon Willison recently published a post about using an LLM to research Apple’s “LLM in a Flash” paper in the context of running Qwen’s 397B model locally. The autoresearch framing is interesting on its own, but the underlying technical question is the one worth unpacking: what does it actually take to run a 397-billion-parameter model on consumer hardware, and what does the Apple paper contribute to that problem?
The Memory Wall
Before anything else, the scale deserves a moment of context. A 397B-parameter model stored in 16-bit floating point takes roughly 794GB of memory just for the weights, before you account for the KV cache, activations, or anything else. A Mac Studio with an Ultra-class chip and 192GB of unified memory is close to the ceiling of what most individuals can reasonably purchase. Consumer GPUs typically offer 24GB to 32GB per card, so even four 24GB cards only get you to 96GB. At 16-bit precision, 397B is categorically out of reach for local inference without some technique to bridge the gap.
There are two well-established approaches to reducing that footprint: quantization and partial loading. Quantization compresses the weights themselves, trading precision for size. At 4-bit quantization, a 397B model drops to roughly 200GB, which remains beyond typical RAM but is no longer astronomically so. Partial loading is the other lever: instead of holding the entire model in memory simultaneously, you load the parts you need and evict the parts you don’t.
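The footprint figures above are simple arithmetic, and a minimal sketch makes them easy to re-run for other bit widths. The function name is illustrative, and the calculation assumes decimal gigabytes and ignores the per-block scale metadata that real 4-bit formats carry, so actual quantized files land somewhat above the 4-bit figure.

```python
def weight_footprint_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


N_PARAMS = 397e9  # total parameter count discussed in the article

fp16_gb = weight_footprint_gb(N_PARAMS, 16)  # 16-bit floating point
q4_gb = weight_footprint_gb(N_PARAMS, 4)     # idealized 4-bit quantization

print(f"16-bit weights: {fp16_gb:.1f} GB")
print(f" 4-bit weights: {q4_gb:.1f} GB")
```

Weights only: the KV cache and activations come on top of both numbers.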
The Apple paper is squarely in that second category, but with a specific and clever observation about where the sparsity actually comes from.
The Core Insight: FFN Sparsity
Apple’s “LLM in a Flash: Efficient Large Language Model Inference with Limited Memory” identified that feed-forward network layers in transformer models exhibit significant activation sparsity. For ReLU-activated FFN layers specifically, only a fraction of neurons fire for any given input token. In practice this can be as low as 5-10% of neurons activating at once, which means 90-95% of the FFN weights contribute nothing to that particular forward pass.
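To see why inactive neurons are free to skip, here is a toy pure-Python feed-forward layer. It is a sketch under stated assumptions, not the paper’s implementation: the sizes, the random weights, and the negative bias used to induce sparsity are all made up for illustration. It only skips the down-projection rows of neurons that did not fire; avoiding the up-projection work as well is what requires predicting activations ahead of time.

```python
import math
import random


def pre_activations(x, w_up, bias):
    """One dot product per neuron, shifted by a shared bias."""
    return [sum(w * xi for w, xi in zip(row, x)) + bias for row in w_up]


def ffn_dense(x, w_up, w_down, bias):
    """Reference version: every neuron's down-projection row is used."""
    hidden = [max(0.0, z) for z in pre_activations(x, w_up, bias)]  # ReLU
    return [sum(h * row[j] for h, row in zip(hidden, w_down))
            for j in range(len(w_down[0]))]


def ffn_sparse(x, w_up, w_down, bias):
    """Same output, but only touches w_down rows for neurons that fired."""
    z = pre_activations(x, w_up, bias)
    active = [i for i, zi in enumerate(z) if zi > 0.0]
    y = [sum(z[i] * w_down[i][j] for i in active)
         for j in range(len(w_down[0]))]
    return y, active


random.seed(0)
D_MODEL, N_NEURONS = 8, 64  # toy sizes, chosen arbitrarily
w_up = [[random.gauss(0, 1) for _ in range(D_MODEL)] for _ in range(N_NEURONS)]
w_down = [[random.gauss(0, 1) for _ in range(D_MODEL)] for _ in range(N_NEURONS)]
x = [random.gauss(0, 1) for _ in range(D_MODEL)]

dense_y = ffn_dense(x, w_up, w_down, bias=-3.0)
sparse_y, active = ffn_sparse(x, w_up, w_down, bias=-3.0)
print(f"{len(active)} of {N_NEURONS} neurons fired")
print("outputs match:", all(math.isclose(a, b, abs_tol=1e-12)
                            for a, b in zip(dense_y, sparse_y)))
```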
The naive version of exploiting this is simple: don’t load weights you won’t use. But flash storage has access patterns that make random reads expensive. If you try to load individual neuron weights on demand, the overhead of hundreds of tiny random reads to an SSD buries any savings from skipping inactive neurons. The paper’s real contribution is in how it structures access to avoid that problem.
Two techniques do most of the work. The first is a lightweight predictor that runs before the main FFN computation and estimates, neuron by neuron, which ones will activate. This lets the system know in advance which weights to fetch, turning random reactive reads into predictable prefetch operations. The second technique is bundling: rather than reading a neuron’s up-projection and down-projection weights from separate locations, the paper stores the matching row and column together so that a single SSD read retrieves both. More generally, it favors fewer, larger contiguous reads, accepting some unnecessary weight loading in exchange for dramatically better sequential access patterns, which is the regime where flash storage performs well.
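The bundling idea can be sketched with an ordinary file. This is an illustration of the storage layout only, with hypothetical names (write_bundled, read_neurons) and toy sizes; it assumes the set of predicted-active neurons is already known and does not model the predictor or real SSD behavior.

```python
import os
import struct
import tempfile

D_MODEL = 4  # toy hidden size (illustrative)
# One bundle = the up-projection weights followed by the down-projection
# weights for a single neuron, stored contiguously as float32.
BUNDLE = struct.Struct(f"<{2 * D_MODEL}f")


def write_bundled(path, w_up, w_down):
    """Write neuron i's up and down weights next to each other on disk."""
    with open(path, "wb") as f:
        for up, down in zip(w_up, w_down):
            f.write(BUNDLE.pack(*up, *down))


def read_neurons(path, neuron_ids):
    """Fetch only the requested neurons: one contiguous read per neuron."""
    fetched, reads = {}, 0
    with open(path, "rb") as f:
        for i in sorted(neuron_ids):  # ascending offsets
            f.seek(i * BUNDLE.size)
            values = BUNDLE.unpack(f.read(BUNDLE.size))
            reads += 1
            fetched[i] = (list(values[:D_MODEL]), list(values[D_MODEL:]))
    return fetched, reads


# Toy weights: neuron i has up weights all equal to i, down weights -i.
w_up = [[float(i)] * D_MODEL for i in range(16)]
w_down = [[float(-i)] * D_MODEL for i in range(16)]
predicted_active = {2, 5, 11}  # stand-in for the predictor's output

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "ffn_layer.bin")
    write_bundled(path, w_up, w_down)
    fetched, reads = read_neurons(path, predicted_active)

print(f"{reads} reads for {len(predicted_active)} neurons")
```

With the two matrices stored separately, the same fetch would need two reads per neuron, so bundling halves the number of read operations while doubling the size of each one.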
On Apple Silicon specifically, the unified memory architecture and the fast flash bandwidth of M-series chips make this particularly effective. The paper demonstrated running models roughly twice the size of available DRAM at usable speeds, which at the time was a meaningful result for on-device inference.
Why MoE Models Are the Natural Target
The LLM in a Flash technique was demonstrated on dense models, but the argument for applying it to mixture-of-experts architectures like Qwen 397B is even stronger.
MoE models have sparsity baked into the architecture by design. A model with 397B total parameters, like Qwen’s, uses far fewer of them on any single forward pass: the active count is a small fraction of the total, likely in the tens of billions, depending on how many experts are selected per token. The routing mechanism selects a small subset of expert networks for each token, and the rest of the experts sit dormant. For inference, this means that at any given moment you only need the weights of the currently selected experts, plus the shared attention and embedding weights, in memory.
This is structurally the same problem the Apple paper addresses, with the added benefit that MoE routing decisions are made by a lightweight gating network that produces a hard selection rather than requiring a separately trained predictor. Expert selection is still input-dependent, but once the router has run for a layer you know exactly which expert weights that layer needs, with no risk of a misprediction. That makes the fetch problem cleaner than the ReLU sparsity case, where activation has to be guessed neuron by neuron.
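A minimal sketch of that flow: route first, then fetch only the selected experts through a small cache. Everything here is hypothetical: the softmax-over-top-k weighting is one common routing variant rather than a confirmed detail of Qwen’s router, and ExpertCache with its load_fn callback merely stands in for reading expert weights from storage.

```python
import math
from collections import OrderedDict


def top_k_experts(gate_logits, k):
    """Return (expert_id, weight) pairs for the k highest-scoring experts."""
    top = sorted(range(len(gate_logits)),
                 key=lambda i: gate_logits[i], reverse=True)[:k]
    peak = max(gate_logits[i] for i in top)
    exps = [math.exp(gate_logits[i] - peak) for i in top]  # stable softmax
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


class ExpertCache:
    """LRU cache standing in for 'experts currently resident in RAM'."""

    def __init__(self, capacity, load_fn):
        assert capacity >= 1
        self.capacity = capacity
        self.load_fn = load_fn          # hypothetical: reads one expert from disk
        self.resident = OrderedDict()
        self.loads = 0                  # number of simulated disk loads

    def get(self, expert_id):
        if expert_id in self.resident:
            self.resident.move_to_end(expert_id)  # mark as recently used
            return self.resident[expert_id]
        weights = self.load_fn(expert_id)
        self.loads += 1
        self.resident[expert_id] = weights
        if len(self.resident) > self.capacity:
            self.resident.popitem(last=False)     # evict least recently used
        return weights


cache = ExpertCache(capacity=2, load_fn=lambda i: f"weights-of-expert-{i}")

routed = top_k_experts([0.1, 2.0, -1.0, 1.5], k=2)  # toy router output
for expert_id, weight in routed:
    cache.get(expert_id)
loads_after_first_token = cache.loads

for expert_id, weight in top_k_experts([0.0, 1.8, -0.5, 1.9], k=2):
    cache.get(expert_id)  # same two experts, so both are cache hits
print(routed, cache.loads)
```

The design choice worth noticing is that the cache never has to guess: the router output is the exact fetch list, so a miss costs latency but never correctness.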
With an NVMe SSD achieving sequential reads of 7-10GB/s on a modern machine, and with only a fraction of expert weights needed per forward pass, the arithmetic for local MoE inference becomes more tractable than it first appears. The bottleneck shifts from total model size to how many bytes have to be fetched from storage for each generated token.
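That arithmetic can be made concrete with a back-of-the-envelope bound. The numbers below are assumptions for illustration only: 30B active parameters is a guessed figure rather than a published spec, ram_hit_rate is a hypothetical fraction of the needed weights already resident in memory, and the result ignores compute time, so it is an upper bound on throughput rather than a prediction.

```python
def ssd_bound_tokens_per_second(active_params, bits_per_weight,
                                ssd_gb_per_s, ram_hit_rate):
    """Upper bound on tokens/s when generation is limited by SSD reads."""
    bytes_per_token = active_params * bits_per_weight / 8
    bytes_from_ssd = bytes_per_token * (1.0 - ram_hit_rate)
    if bytes_from_ssd == 0:
        return float("inf")  # everything needed is already in RAM
    return ssd_gb_per_s * 1e9 / bytes_from_ssd


# Assumed: 30B active parameters at 4 bits, 7 GB/s sequential reads.
cold = ssd_bound_tokens_per_second(30e9, 4, 7.0, ram_hit_rate=0.0)
warm = ssd_bound_tokens_per_second(30e9, 4, 7.0, ram_hit_rate=0.8)

print(f"nothing cached:    {cold:.2f} tokens/s")
print(f"80% served by RAM: {warm:.2f} tokens/s")
```

Under these assumptions, the difference between reading every active weight from flash and reading a fifth of them is the difference between seconds per token and a couple of tokens per second, which is why keeping hot experts in RAM matters so much.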
What the Tooling Looks Like
The llama.cpp project has supported partial GPU offloading for some time through its -ngl (number of GPU layers) flag, which lets you keep some transformer layers in VRAM and compute the rest on CPU from RAM. This is a coarser version of the same idea: keep the hot path in fast memory and tolerate slower access for the rest. It also supports GGUF-quantized models and handles the memory mapping and layer offloading mechanics.
For a 397B MoE model in Q4 quantization at around 200GB, a machine with 128GB RAM and a fast NVMe drive is a plausible setup. The model loads partially into RAM, and the SSD backs the rest. Tokens will come slowly, anywhere from a few per second in the best case down to seconds per token, depending on how much of each step has to be served from the SSD, but the model runs. For non-interactive tasks like batch analysis or summarization, that’s usable.
Ollama abstracts some of this with automatic detection of available hardware resources, though its handling of very large MoE models is still maturing. The community around GGUF quantization has been aggressive about producing and testing quantized versions of major model releases, so community-produced quantizations of Qwen 397B are likely to be available through Hugging Face soon after release, if they are not already.
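The "SSD backs the rest" behavior comes from memory mapping, which the operating system provides and llama.cpp relies on by default. The sketch below is a toy illustration of that OS mechanism using Python’s mmap module, not llama.cpp’s code: the file, its size, and the read_slice helper are all made up for the example.

```python
import mmap
import os
import tempfile

PAGE = 4096  # illustrative block size; real page sizes vary by platform


def read_slice(path, offset, length):
    """Map the whole file, but copy out only one slice.

    The OS pages in just the regions that are touched, so mapping a file
    larger than RAM is fine as long as the working set fits.
    """
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            return mm[offset:offset + length]


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "fake_model.bin")
    # Write 256 blocks; block n is filled with the byte value n.
    with open(path, "wb") as f:
        for n in range(256):
            f.write(bytes([n]) * PAGE)
    file_size = os.path.getsize(path)
    chunk = read_slice(path, offset=10 * PAGE, length=8)

print(f"file is {file_size} bytes, read {len(chunk)} bytes from block 10")
```

The same property is what makes a 200GB model file usable on a 128GB machine: pages that are rarely touched get evicted and re-read from the SSD on demand, at the cost of latency.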
The Autoresearch Angle
Willison’s framing is that he used an LLM to research this topic for him, reading the Apple paper and synthesizing the relevant implementation details. This is a workflow he has written about before: using language models as research accelerants, particularly for dense academic papers where the gap between abstract and practical understanding is high.
The irony of using a locally-running or API-accessed LLM to figure out how to run a larger LLM locally is not lost. But it points at something real about how the research-to-practice pipeline works now. The Apple paper is technically detailed and written for an ML research audience. The gap between reading it and knowing what to actually install and configure is substantial. Having a model that can answer “okay but how do I actually try this” questions against the paper content is genuinely useful, even if the answers require verification.
This is also a good stress test for the research model. Papers like LLM in a Flash are technically specific, with important distinctions between the idealized algorithm and what any given inference runtime actually implements. A model that can navigate that gap accurately is more useful than one that confidently blurs it.
What to Expect in Practice
Running Qwen 397B locally via flash-backed inference is a project, not a quick setup. The practical checklist looks something like this:
- A machine with at least 64GB of RAM, and ideally 128GB or more, to hold the most active portions of the model
- A fast NVMe SSD with at least 256GB free for the GGUF model file
- A quantized version of the model (Q4_K_M is a reasonable balance of quality and size for a model this large)
- llama.cpp built from recent source, which continues to improve MoE support
- Patience: at pure CPU inference speeds, generation will be measured in seconds per token for a model this size
For most practical tasks this means inference times measured in minutes for responses of any length. It is not the same experience as a 7B model running on a gaming laptop. What it is, though, is access to a model at a capability level that previously required cloud API access with its associated cost, latency, and data privacy considerations.
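The checklist and the timing estimate above can be turned into a rough preflight check. This is a sketch with hypothetical helper names and thresholds taken from the list: RAM is passed in by hand because the Python standard library has no portable way to query it, and the resident fraction ignores the memory taken by the OS, the KV cache, and activations.

```python
import shutil


def preflight(model_gb, ram_gb, free_disk_gb, min_ram_gb=64.0):
    """Return (problems, fraction of the model file that could sit in RAM)."""
    problems = []
    if free_disk_gb < model_gb:
        problems.append(
            f"need {model_gb:.0f} GB free on disk, have {free_disk_gb:.0f} GB")
    if ram_gb < min_ram_gb:
        problems.append(
            f"need at least {min_ram_gb:.0f} GB RAM, have {ram_gb:.0f} GB")
    resident_fraction = min(1.0, ram_gb / model_gb)
    return problems, resident_fraction


def response_minutes(n_tokens, seconds_per_token):
    """Wall-clock time for a response at a given generation speed."""
    return n_tokens * seconds_per_token / 60.0


# Real free space on the current drive; RAM and model size are assumed.
free_disk_gb = shutil.disk_usage(".").free / 1e9
problems, resident = preflight(model_gb=200.0, ram_gb=128.0,
                               free_disk_gb=free_disk_gb)

print("problems:", problems or "none")
print(f"up to {resident:.0%} of the model file could be resident in RAM")
print(f"300 tokens at 2 s/token: {response_minutes(300, 2.0):.0f} minutes")
```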
Three trends are steadily eroding the hardware ceiling for local inference: MoE architectures that inherently need only a fraction of their weights active at once, flash storage whose bandwidth keeps improving with each NVMe generation, and quantization that compresses weights to practical sizes. The Apple paper is one piece of that, and applying its ideas to a 397B model is a reasonable next step in seeing how far the technique scales.