When Your Laptop Beats the Cloud: Qwen3's MoE Architecture and the Fuzzy Edge of Frontier
Source: simonwillison
Simon Willison has been running an informal benchmark for a while now: ask a model to draw a pelican in SVG. It sounds trivial, but generating valid, aesthetically reasonable SVG requires a model to hold together geometry, coordinate systems, shape primitives, and something resembling spatial intuition, all at once. It is a surprisingly revealing test. His recent finding that Qwen3.6-35B-A3B, running locally on his laptop, produced a better result than Claude Opus 4.7 is worth unpacking carefully, because the story is less about pelicans and more about what mixture-of-experts architecture is doing to the local/cloud divide.
The Architecture Behind the Number
The model name encodes everything important: 35B-A3B. Thirty-five billion total parameters, but only approximately three billion active for any given token. This is a sparse mixture-of-experts design, the same class of architecture that powers Mixtral, DeepSeek-V2, and several others. Rather than routing every token through the full parameter space, the model uses a learned router to activate only a small subset of “expert” feed-forward blocks per token, and the outputs of those selected experts are combined according to the router’s weights.
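The routing step described above can be sketched in a few lines. This is a minimal illustration of generic top-k gating, not Qwen's actual implementation: the expert count, the value of k, the toy "experts" and the function names are all assumptions made for the example.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def route_token(router_logits, k):
    """Pick the top-k experts for one token and renormalise their gate weights."""
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return {i: probs[i] / norm for i in top}

def moe_layer(x, router_logits, experts, k=2):
    """Run only the selected experts and return their gate-weighted sum."""
    gates = route_token(router_logits, k)
    out = [0.0] * len(x)
    for idx, weight in gates.items():
        y = experts[idx](x)  # unselected experts are never evaluated
        out = [o + weight * v for o, v in zip(out, y)]
    return out, sorted(gates)

def make_expert(scale):
    """Hypothetical stand-in for a feed-forward expert: scales its input."""
    return lambda v: [scale * t for t in v]

experts = [make_expert(s) for s in (1.0, 2.0, 3.0, 4.0)]
# In a real model the logits come from a learned linear layer applied to x.
out, used = moe_layer([1.0, -1.0], [0.1, 2.0, -1.0, 1.5], experts, k=2)
print(used, out)
```

Only two of the four toy experts run for this token, which is the whole point: total capacity grows with the number of experts, while per-token work grows with k.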
The practical consequence is dramatic, though it is easy to misstate. Every expert still has to be stored, so the weight footprint scales with the full 35B parameters, not the active slice: roughly 70GB in fp16, which rules out most consumer hardware, and on the order of 18-20GB at 4-bit quantization. What scales with the 3B active parameters is the compute and memory bandwidth spent per token, so generation runs closer to the speed of a small dense model. That combination is why a quantized Qwen3 35B-A3B can run on hardware with around 20-24GB of VRAM, or on unified-memory Apple Silicon machines with enough RAM. The “on my laptop” part of Willison’s headline is not a quirk; it follows directly from the architecture.
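A back-of-the-envelope calculation makes the split between storage and compute concrete. The sketch below counts weights only; it ignores the KV cache, activations and quantization metadata, and the "two FLOPs per active weight" rule is a common rough approximation, not a measured figure.

```python
def weight_memory_gb(total_params, bits_per_weight):
    """Storage for the weights alone, in decimal gigabytes."""
    return total_params * bits_per_weight / 8 / 1e9

def flops_per_token(active_params):
    """Rough forward-pass cost: one multiply and one add per active weight."""
    return 2 * active_params

TOTAL_PARAMS = 35e9   # every expert must be stored
ACTIVE_PARAMS = 3e9   # only these participate in a given token

for bits in (16, 8, 4):
    print(f"{bits:>2}-bit weights: {weight_memory_gb(TOTAL_PARAMS, bits):5.1f} GB")

dense_cost = flops_per_token(TOTAL_PARAMS)
sparse_cost = flops_per_token(ACTIVE_PARAMS)
print(f"per-token compute vs dense 35B: {sparse_cost / dense_cost:.1%}")
```

Memory follows the 35B figure and compute follows the 3B figure, which is why quantization is what makes the model fit and sparsity is what makes it fast.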
Qwen’s MoE implementation follows a pattern established by their earlier models and refined across the Qwen3 series. The router is trained jointly with the experts, and the team applies auxiliary loss terms to prevent expert collapse, a failure mode where the router learns to send most tokens to a small subset of experts, negating the benefits of the design. Getting this balance right is what separates functional MoE from the theoretical promise of the architecture.
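To make the idea of a balancing term concrete, here is a sketch of one well-known formulation, the auxiliary loss from the Switch Transformer paper, using top-1 routing for simplicity. It is shown as an example of the technique, not as the loss Qwen uses; the function name and the toy inputs are assumptions.

```python
def load_balance_loss(router_probs, assignments, num_experts):
    """Switch-Transformer-style auxiliary loss: N * sum_i f_i * P_i.

    router_probs: per-token probability rows, one value per expert
    assignments:  index of the expert each token was actually sent to
    f_i is the fraction of tokens sent to expert i,
    P_i is the mean router probability given to expert i.
    """
    n_tokens = len(assignments)
    counts = [0] * num_experts
    for expert in assignments:
        counts[expert] += 1
    f = [c / n_tokens for c in counts]
    p = [sum(row[i] for row in router_probs) / n_tokens
         for i in range(num_experts)]
    return num_experts * sum(fi * pi for fi, pi in zip(f, p))

# Perfectly balanced routing: the loss sits at its minimum of 1.0.
balanced = load_balance_loss([[0.25] * 4] * 4, [0, 1, 2, 3], 4)

# Collapsed routing: every token goes to expert 0 with high confidence.
collapsed = load_balance_loss([[0.97, 0.01, 0.01, 0.01]] * 4, [0, 0, 0, 0], 4)

print(balanced, collapsed)
```

The collapsed case scores close to the number of experts, so adding this term to the training objective with a small coefficient pushes the router back toward even usage.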
Why SVG Generation Is a Meaningful Test
Generating SVG is not a standard benchmark category. There are no leaderboard numbers for it. But it tests something that standard benchmarks often miss: the ability to produce structured output where local correctness (valid XML, valid attribute values, numeric coordinates in range) is necessary but not sufficient. A model can generate syntactically valid SVG that looks like noise, or it can generate something that captures the shape of a bird.
The task requires the model to translate a concept into a coordinate system, reason about relative sizes and positions, choose appropriate primitives (ellipses, paths, polygons), and produce output that is parseable and renderable without any feedback loop. There is no retry. The model generates once and the result either looks like a pelican or it does not.
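The "necessary but not sufficient" part is easy to demonstrate, because the necessary checks can be automated and the rest cannot. The sketch below uses Python's standard XML parser to confirm that output is well formed, rooted at an svg element, and built from drawing primitives; the sample markup and the inspect_svg helper are made up for illustration. Nothing in it can tell whether the result looks like a pelican.

```python
import xml.etree.ElementTree as ET

SVG_NS = "{http://www.w3.org/2000/svg}"
PRIMITIVES = {"circle", "ellipse", "rect", "line", "polygon", "polyline", "path"}

def inspect_svg(text):
    """Report the mechanical properties of a candidate SVG string."""
    try:
        root = ET.fromstring(text)
    except ET.ParseError as err:
        return {"well_formed": False, "error": str(err)}
    # Strip the namespace prefix that ElementTree adds to tag names.
    tags = [el.tag.replace(SVG_NS, "") for el in root.iter()]
    return {
        "well_formed": True,
        "is_svg_root": tags[0] == "svg",
        "primitive_count": sum(t in PRIMITIVES for t in tags),
    }

sample = (
    '<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 100 100">'
    '<ellipse cx="50" cy="60" rx="30" ry="20"/>'
    '<polygon points="75,55 98,60 75,65"/>'
    "</svg>"
)
report = inspect_svg(sample)
broken = inspect_svg("<svg><ellipse></svg>")
print(report)
print(broken["well_formed"])
```

Both a recognizable bird and a pile of random ellipses pass these checks, which is why the final judgment in the pelican test is still a human looking at the rendering.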
This kind of single-shot structured generation under aesthetic constraints is exactly the domain where differences in model quality become visible in ways that multiple-choice benchmarks do not capture. A model that scores 85% on MMLU may still generate SVG that looks like abstract art. A model that scores lower may produce something recognizable. The pelican test is unscientific, but it is not uninformative.
Willison has been collecting these outputs across dozens of models for long enough that the comparison has genuine longitudinal value. He is not comparing one output against another in isolation; he is comparing against a baseline he has developed intuition for over many iterations.
The Closing Gap
The interesting question is not whether a local model can beat a frontier API model on a single task. That was always going to happen eventually on some tasks. The interesting question is what the gap looks like now, and how quickly it is closing.
A year ago, the conventional wisdom was that local models were useful for privacy-sensitive workloads, offline scenarios, and cost-sensitive applications, but that they trailed frontier models by enough that you would notice the difference in quality-sensitive tasks. That gap is clearly narrowing. Whether it has closed completely depends on the task, and SVG generation is one place where a well-tuned local model at a fraction of the compute cost can match or exceed what the API delivers.
Part of the reason is that frontier model providers are optimizing for breadth. A model like Opus 4.7 needs to perform well across coding, reasoning, tool use, long-context retrieval, multilingual tasks, and dozens of other categories. A local model that is fine-tuned or selected for a narrower profile can concentrate its capacity differently. Qwen3 35B-A3B was not specifically trained to draw pelicans, but the Qwen3 family’s training emphasis on instruction following and structured output generation aligns well with what SVG generation requires.
What MoE Changes for Local Inference
The shift from dense to sparse models matters practically for anyone building on local inference. Frameworks like Ollama, llama.cpp, and LM Studio all support GGUF-quantized MoE models, though with varying degrees of optimization for the routing overhead. Sparse models have a slightly different performance profile than dense models: the router adds a small amount of computation per token, and cache locality suffers somewhat because different tokens activate different experts. On well-optimized implementations, this overhead is modest.
For llama.cpp specifically, MoE support has matured considerably. The key settings to watch are --n-gpu-layers, which controls how many transformer layers are offloaded to the GPU, and memory-mapped loading, which lets the operating system page weights in on demand. MoE models tolerate partial offload better than dense ones because only a few experts are read per token, so weights left outside fast memory are touched less often. A properly configured Qwen3 35B-A3B instance on a machine with 24GB VRAM and 64GB system RAM can keep as many layers as fit on the GPU and serve the remainder from system RAM with acceptable latency.
Quantization interacts with MoE in ways that are still being studied. Standard Q4_K_M quantization applied uniformly across all experts works, but some work suggests that quantizing the router at higher precision than the expert weights improves output quality, since router errors compound across layers. The Qwen3 team has published GGUF variants with varying quantization profiles; Q5_K_M tends to hit a reasonable accuracy-size tradeoff for the 35B-A3B variant.
What This Means for How You Build
If you are building something that calls an API model for a task like structured output generation, SVG rendering, or other single-shot creative-but-constrained tasks, it is worth benchmarking against local alternatives before assuming the API is the right answer. The economics are different: API calls have per-token costs that compound at scale, while local inference has upfront hardware costs and operational overhead, but marginal cost per inference approaches zero.
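The trade-off can be framed as a break-even count: how many requests it takes before avoided API fees cover the hardware. The sketch below is only arithmetic, and every number in it is a placeholder, not a real price; substitute your own hardware cost, per-request API cost and per-request local cost (electricity, maintenance).

```python
import math

def breakeven_requests(hardware_cost, api_cost_per_request,
                       local_cost_per_request=0.0):
    """Requests needed before local inference is cheaper overall.

    Returns None when local inference never pays off, i.e. when each
    local request costs at least as much as the API call it replaces.
    """
    saving = api_cost_per_request - local_cost_per_request
    if saving <= 0:
        return None
    return math.ceil(hardware_cost / saving)

# Hypothetical figures, chosen only to show the shape of the calculation.
HARDWARE_COST = 2000.0        # one-off purchase
API_COST_PER_REQUEST = 0.05   # tokens per request times the provider's rate
LOCAL_COST_PER_REQUEST = 0.0  # treat marginal local cost as zero

n = breakeven_requests(HARDWARE_COST, API_COST_PER_REQUEST,
                       LOCAL_COST_PER_REQUEST)
print(f"break-even after about {n} requests")
```

A low-volume tool may never reach that count, while a batch pipeline can pass it in days, so the answer depends far more on request volume than on the model.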
The more interesting implication is for the definition of “frontier.” Frontier used to mean the best models from the largest labs, full stop. It is becoming something more like “the best models for tasks that require broad generalization or very long context.” For tasks with a narrower profile, a well-chosen local model running on commodity hardware is increasingly competitive. The pelican result is one data point, but it fits a pattern that has been developing across a range of benchmarks and informal tests.
Qwen3’s trajectory in particular is worth watching. The Qwen team at Alibaba has been releasing models at a pace and quality level that consistently surprises, and the 35B-A3B variant represents a real engineering achievement: a model that fits in a practical local inference budget while delivering results that hold up against frontier API offerings on specific tasks.
The gap between what runs on your laptop and what requires a data center is not gone, but it is getting specific enough to navigate. Knowing which tasks fall on which side of that line is becoming a genuinely useful engineering skill.