35 Billion Parameters, 3 Billion Active: What Qwen3's MoE Efficiency Actually Means
Source: hackernews
Simon Willison has been drawing pelicans with language models for a while now. The test is deceptively simple: ask a model to generate an SVG of a pelican. No reference image, no coordinate hints, just prose turned into vector geometry. The model has to understand what a pelican looks like, translate that into a spatial mental model, then emit a sequence of SVG path commands that reconstitutes that shape in pixels. It is a reasonable proxy for a cluster of capabilities that matter: spatial reasoning, procedural code generation, and the ability to maintain internal consistency across many interdependent numeric values.
When Willison reported that Qwen3.6-35B-A3B running on his laptop produced a better pelican than Claude Opus 4.7, the headline read as a curiosity. A local model beating a frontier one at a visual reasoning task. But the interesting part is not the pelican. It is the “A3B” in the model name.
What 35B-A3B Actually Describes
The naming convention maps directly to a Mixture of Experts architecture. The 35B figure is the total parameter count: every expert, plus the shared components such as attention and embeddings. The A3B figure, “Active 3 Billion,” is how many of those parameters participate in the forward pass for any single token. For every token the model processes, a learned routing network in each MoE layer selects a small subset of that layer's experts, and only those experts compute. The rest of the model sits in memory, unused, waiting to be selected for a different token or a different kind of problem.
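The routing step is easier to see in code. What follows is a toy sketch of top-k expert routing, not Qwen's actual implementation: the experts are scalar functions, the router scores are hard-coded, and renormalizing the weights over only the selected experts is an assumption (it is one common variant; the article does not say which scheme Qwen uses). The names `route_top_k` and `moe_forward` are illustrative.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route_top_k(router_logits, k):
    """Pick the k highest-scoring experts and renormalize their weights."""
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    chosen = ranked[:k]
    weights = softmax([router_logits[i] for i in chosen])
    return list(zip(chosen, weights))

def moe_forward(x, experts, router_logits, k=2):
    """Run only the selected experts and mix their outputs by router weight."""
    return sum(w * experts[i](x) for i, w in route_top_k(router_logits, k))

# Four toy "experts": each just scales its input by a different factor.
experts = [lambda x, s=s: s * x for s in (1.0, 2.0, 3.0, 4.0)]
# Hypothetical router scores for one token; a real router computes these from x.
router_logits = [0.1, 2.0, -1.0, 2.0]

print(route_top_k(router_logits, k=2))
print(moe_forward(1.0, experts, router_logits, k=2))
```

Only two of the four experts are ever called for this token. The other two still have to exist in memory, because a different token could produce different router scores and select them instead.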
This is not a new idea. The Mixtral 8x7B model released by Mistral in late 2023 used the same principle: eight expert feed-forward blocks per transformer layer, with two selected per token. The effective compute per token was roughly equivalent to a 13B dense model, despite the total parameter count being closer to 47B. DeepSeek-V2 pushed this further with 236B total parameters and 21B active, achieving inference costs comparable to much smaller models while retaining the representational capacity of something much larger.
Qwen3.6-35B-A3B occupies a similar position. At roughly 3B active parameters, inference cost per token is in the ballpark of a small-to-medium dense model. At 35B total parameters, the model has learned representations across a much wider distribution of tasks and domains than a 3B dense model could fit in its weights.
Why It Runs on a Laptop
The VRAM calculus for MoE models confuses people because total parameter count and inference cost come apart. You still need to load all 35 billion parameters into memory to run the model, since you cannot predict in advance which experts will be needed. At 4-bit quantization, 35B parameters occupy roughly 17-20GB of memory depending on the quantization scheme. That fits comfortably in a MacBook Pro with 36GB of unified memory, and more tightly in a 24GB machine once the operating system and the KV cache take their share, or on a machine with a high-end consumer GPU.
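The 17-20GB figure is simple arithmetic: parameters times bits per weight, divided by eight. The sketch below assumes decimal gigabytes and counts weights only (no KV cache or runtime overhead). The 4.5 bits-per-weight row is an illustrative assumption standing in for 4-bit schemes that also store per-block scale factors, not the figure for any specific format.

```python
def weight_memory_gb(n_params, bits_per_weight):
    """Approximate weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9

TOTAL_PARAMS = 35e9  # all experts must be resident, not just the active ones

for label, bits in [
    ("16-bit", 16),
    ("8-bit", 8),
    ("4-bit", 4),
    ("about 4.5-bit (4-bit plus scales, assumed)", 4.5),
]:
    print(f"{label}: {weight_memory_gb(TOTAL_PARAMS, bits):.1f} GB")
```

At 16 bits the same weights would need about 70GB, which is why quantization is what makes the laptop scenario possible at all.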
But compute during generation scales with active parameters, not total parameters. At 3B active, each token generation step costs roughly what a 3B dense model would cost. The same holds for memory traffic: each step reads only the weights of the experts it selected, not all 35B. On Apple Silicon, where memory bandwidth is the primary constraint for autoregressive inference, this translates to meaningfully faster generation speeds than a 35B dense model would produce. You get the knowledge surface of a large model at inference speeds closer to a small one.
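A rough way to see the speed difference is to treat decoding as purely bandwidth-bound: tokens per second is at most memory bandwidth divided by the bytes of weights read per token. The sketch below is a back-of-the-envelope upper bound under that assumption. The 400 GB/s bandwidth is a hypothetical placeholder, and the model ignores KV cache reads, attention cost, and router overhead, so real throughput will be lower for both cases.

```python
def decode_tokens_per_second(params_read_per_token, bits_per_weight, bandwidth_gb_s):
    """Upper bound on decode speed if reading weights is the only cost."""
    bytes_per_token = params_read_per_token * bits_per_weight / 8
    return bandwidth_gb_s * 1e9 / bytes_per_token

BANDWIDTH_GB_S = 400  # hypothetical memory bandwidth, not a measured value
BITS = 4              # 4-bit quantized weights

dense_35b = decode_tokens_per_second(35e9, BITS, BANDWIDTH_GB_S)
moe_a3b = decode_tokens_per_second(3e9, BITS, BANDWIDTH_GB_S)

print(f"35B dense upper bound:     {dense_35b:.0f} tokens/s")
print(f"35B-A3B (3B active) bound: {moe_a3b:.0f} tokens/s")
print(f"ratio: {moe_a3b / dense_35b:.1f}x")
```

The absolute numbers depend entirely on the assumed bandwidth. The ratio does not: under this model it is simply total parameters over active parameters, about 11.7x here.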
This is the practical upshot for anyone who wants to run models locally. The choice is no longer “small fast model” versus “large slow model.” Well-designed MoE models at the 30-40B total parameter scale can offer both reasonable speed and quality that was previously only available from cloud-served frontier models or from dense models too large for consumer hardware.
The Qwen Family’s Trajectory
Alibaba’s Qwen series has moved quickly. Qwen 1.5 established the baseline as a capable multilingual model family. Qwen2 brought stronger reasoning and coding performance, competing with models like Llama 3 at equivalent sizes. Qwen2.5 pushed further with the addition of dedicated math and coding variants, and the 72B dense version competed credibly against much larger models on standard benchmarks.
Qwen3 appears to represent a deliberate shift toward MoE architectures for mid-range models, following the same logic that motivated DeepSeek’s architectural choices: if you can separate parameter count from inference cost, you can train a model with the capacity of a large one while keeping deployment costs low. For a lab trying to compete with OpenAI and Anthropic on quality while maintaining viable inference economics, this tradeoff is attractive.
The “3.6” version identifier in Qwen3.6 likely refers to an iteration within the Qwen3 generation, possibly tuned further for instruction following or specific capability domains. The Qwen team has been consistent about releasing multiple variants per generation, including base models, instruction-tuned versions, and domain-specific fine-tunes.
SVG Generation as a Benchmark
Willison’s pelican test has a specific character that makes it useful for distinguishing models. SVG is a declarative coordinate-based format. A path like M 100 200 C 150 100 250 100 300 200 encodes a Bezier curve where every number has a spatial meaning. To draw a recognizable animal, a model needs to decompose a mental image into geometric primitives, assign plausible coordinates to each feature, and maintain spatial coherence across the entire composition.
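To make “every number has a spatial meaning” concrete, the sketch below evaluates the example path by hand. In `M 100 200 C 150 100 250 100 300 200`, the `M` command sets the start point and `C` supplies two control points and an end point. The function is the standard cubic Bezier formula; the name `cubic_bezier` is illustrative.

```python
def cubic_bezier(p0, p1, p2, p3, t):
    """Point on a cubic Bezier curve at parameter t in [0, 1]."""
    u = 1 - t
    x = u**3 * p0[0] + 3 * u**2 * t * p1[0] + 3 * u * t**2 * p2[0] + t**3 * p3[0]
    y = u**3 * p0[1] + 3 * u**2 * t * p1[1] + 3 * u * t**2 * p2[1] + t**3 * p3[1]
    return (x, y)

# M 100 200 C 150 100 250 100 300 200
start = (100, 200)      # from the M command
control_1 = (150, 100)  # first control point of C
control_2 = (250, 100)  # second control point of C
end = (300, 200)        # end point of C

for t in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(t, cubic_bezier(start, control_1, control_2, end, t))
```

The midpoint comes out at (200.0, 125.0). SVG's y axis points down, so the curve is an arch that rises 75 units above its endpoints. A model drawing a pelican's back or bill has to pick numbers like these so that dozens of such curves line up with each other.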
This differs from most text-generation benchmarks in that there is no right answer to look up and little in the training distribution to pattern-match against. The model is synthesizing geometry from semantic knowledge, and the output quality is immediately and visually apparent. A model that produces a recognizable pelican with a bill, a body, and legs in roughly the right proportions is doing something non-trivial. A model that produces a blob with misplaced features or coordinates that collide with each other reveals gaps in its spatial reasoning.
The benchmark is also difficult to game. It is hard to overfit to “draw a pelican” in a way that would inflate scores on a held-out test set. What you see is approximately what the model knows.
What the Benchmark Does Not Tell You
Beating a frontier model at pelican drawing is meaningful, but it is evidence about a narrow slice of capability. Open-weight models have closed the gap with proprietary ones faster on creative and spatial tasks than on, say, multi-step reasoning chains or tasks requiring broad factual grounding. Claude Opus 4.7 is almost certainly stronger than Qwen3.6-35B-A3B at extended reasoning, complex code generation with many interdependencies, or tasks that require synthesizing information from long documents.
The comparison also needs context about what “running on a laptop” costs. Willison’s machine is presumably something with 32-64GB of unified memory or a dedicated GPU with equivalent capacity. That is not an inexpensive laptop, and the inference speed for a 35B MoE model at 4-bit quantization is likely to be slower than what a well-provisioned API endpoint returns. For interactive use, the latency difference matters.
Still, the cost comparison is compelling. Cloud API access to a frontier model like Opus accumulates cost at scale. A local model has a one-time hardware cost and, electricity aside, close to zero marginal cost per token thereafter. For a developer running batch processing, generating synthetic data, or prototyping applications where Opus-level quality is not strictly required, a local model that approaches that quality on specific tasks changes the economics considerably.
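The break-even point can be sketched as a one-line calculation. Every number below is a made-up placeholder, not a real hardware price or API rate; the point is the shape of the comparison, and the function name `break_even_tokens` is illustrative. Substitute your own figures.

```python
def break_even_tokens(hardware_cost_usd, api_usd_per_million_tokens,
                      local_usd_per_million_tokens=0.0):
    """Tokens after which one-time hardware spend beats per-token API spend."""
    saving_per_million = api_usd_per_million_tokens - local_usd_per_million_tokens
    if saving_per_million <= 0:
        raise ValueError("local inference is not cheaper per token; no break-even")
    return hardware_cost_usd / saving_per_million * 1_000_000

# Placeholder inputs (assumptions for illustration only).
HARDWARE_COST_USD = 3000.0
API_USD_PER_MILLION_TOKENS = 15.0

tokens = break_even_tokens(HARDWARE_COST_USD, API_USD_PER_MILLION_TOKENS)
print(f"break-even after about {tokens / 1e6:.0f} million tokens")
```

With these placeholders the hardware pays for itself after roughly 200 million tokens, a volume that batch jobs and synthetic data generation can plausibly reach and casual interactive use may not.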
The Gap Is Narrower Than the Benchmarks Admit
The broader pattern behind Willison’s observation is that the distance between locally-runnable open-weight models and proprietary frontier models has been compressing steadily for two years. DeepSeek R1, released in January 2025, was an earlier data point, with reported reasoning performance comparable to OpenAI’s o1 at a fraction of the inference cost. Qwen3.6-35B-A3B winning a pelican contest against Opus 4.7 is a later data point in the same trend.
The trend is not uniform across all task types. Frontier models still lead on the most demanding multi-step reasoning tasks, on tasks requiring very long context, and on benchmarks that require broad coverage of specialized knowledge. But for the kinds of tasks that most applications actually need, the gap has become task-dependent rather than categorical.
For someone building Discord bots, running document processing pipelines, or writing local development tools, this matters practically. The question shifts from “can a local model do this at all” to “is a local model good enough at this specific task.” Willison’s pelican is a small but concrete reminder that the second question is worth asking more often than it used to be.