35B Total, 3B Active: How Qwen3.6's MoE Architecture Reached the Laptop
Source: hackernews
Simon Willison ran Qwen3.6-35B-A3B on his laptop and asked it to draw a pelican in SVG. The output compared favorably to Claude Opus 4.7. His writeup is brief, but the architectural story behind why this is even possible deserves more attention than the headline comparison gets.
The model name contains the key detail: 35 billion total parameters, 3 billion active. This distinction between parameters stored in memory and parameters activated per token is what separates mixture-of-experts models from their dense counterparts, and it is what makes a 35-billion-parameter model runnable on a laptop in 2026.
MoE: The Load vs. Compute Split
In a standard dense transformer, every token passes through every parameter in the feedforward layers on every forward pass. A 35B dense model activates all 35B parameters for every single token, requiring hardware capable of both holding those weights and performing that compute.
Mixture-of-experts models split each feedforward layer into N separate sub-networks called experts, governed by a small learned router that decides which K experts to activate for each token. The attention layers and embeddings are still shared by every token. In Qwen3.6-35B-A3B, most of the roughly 35 billion total parameters sit in those expert networks, but only about 3 billion are activated per token. The remaining 32 billion stay loaded in memory, addressable but idle, in case the router selects them for a different token.
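The routing step described above can be sketched in a few lines of plain Python. This is a toy illustration, not Qwen's implementation: the expert count, the value of K, the hard-coded router logits, and the names route_token and moe_layer are all assumptions made for the example. In a real model the logits come from a learned linear layer applied to the token's hidden state.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route_token(router_logits, k):
    """Pick the top-k experts for one token and renormalize their gate weights."""
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]

def moe_layer(x, experts, router_logits, k):
    """Gate-weighted sum of the selected experts' outputs; unselected experts never run."""
    out = [0.0] * len(x)
    for idx, gate in route_token(router_logits, k):
        y = experts[idx](x)
        out = [o + gate * yi for o, yi in zip(out, y)]
    return out

# Illustrative values only, not Qwen3.6's actual configuration.
NUM_EXPERTS, TOP_K = 8, 2
# Toy experts: expert i simply scales its input by (i + 1).
experts = [lambda x, s=s: [s * v for v in x] for s in range(1, NUM_EXPERTS + 1)]
# Hypothetical router scores for a single token.
logits = [0.1, 2.0, -1.0, 0.5, 3.0, 0.0, -0.5, 1.0]

selected = route_token(logits, TOP_K)
output = moe_layer([1.0, 2.0], experts, logits, TOP_K)
print("selected experts and gates:", selected)
print("layer output:", output)
```

Only two of the eight toy experts execute for this token, while all eight must exist in memory so the router can choose among them. That is the load-versus-compute split in miniature.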
This creates an unusual hardware profile: you need enough memory to hold all 35 billion parameters so the router can address them, but your compute budget per token is proportional to the 3 billion active ones. On an Apple M-series Mac with 64 or 96GB of unified memory, or a PC with a modern GPU plus system RAM offloading, the full model fits. Once loaded, the per-token compute cost is close to that of a 3B dense model, not a 35B one, so generation speed is much nearer to what a small model delivers.
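To put a rough number on that compute gap, here is a back-of-envelope sketch. It assumes the common rule of thumb that a forward pass costs about 2 floating-point operations per parameter touched per token (one multiply, one add), and it ignores attention over the context, so treat the results as order-of-magnitude estimates. The function name is made up for this example.

```python
# Parameter counts taken from the model name; treat them as rounded figures.
TOTAL_PARAMS = 35e9
ACTIVE_PARAMS = 3e9

def forward_flops_per_token(params_used):
    """Rule-of-thumb estimate: one multiply and one add per weight touched."""
    return 2 * params_used

dense_flops = forward_flops_per_token(TOTAL_PARAMS)   # hypothetical dense 35B model
moe_flops = forward_flops_per_token(ACTIVE_PARAMS)    # 3B active parameters per token
compute_ratio = dense_flops / moe_flops

print(f"dense 35B : {dense_flops:.1e} FLOPs per token")
print(f"MoE A3B   : {moe_flops:.1e} FLOPs per token")
print(f"ratio     : {compute_ratio:.1f}x less compute per token")
```

Under these assumptions the MoE model does a bit under one twelfth of the arithmetic per token that an equally sized dense model would.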
At 4-bit quantization using the Q4_K_M GGUF format, the 35B model loads into approximately 20-22GB. On an M4 Max or M2 Ultra, this is comfortable. On a high-end PC with a 24GB RTX 4090 plus CPU offloading, it works. The practical entry point is an ollama pull command on hardware most developers already own.
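The 20-22GB figure can be sanity-checked with simple arithmetic. The sketch below assumes Q4_K_M averages roughly 4.8 bits per weight, because the format mixes block types rather than storing a flat 4 bits; that average is an approximation that varies by model. It counts weights only, leaving out the KV cache and runtime overhead.

```python
def weight_gigabytes(total_params, bits_per_weight):
    """Approximate size of the weights alone, in decimal gigabytes."""
    return total_params * bits_per_weight / 8 / 1e9

TOTAL_PARAMS = 35e9  # rounded figure from the model name

# Bits-per-weight values; the Q4_K_M average is an assumed approximation.
precisions = [
    ("FP16", 16.0),
    ("8-bit", 8.0),
    ("Q4_K_M (assumed ~4.8 bits/weight)", 4.8),
]

sizes = {label: weight_gigabytes(TOTAL_PARAMS, bpw) for label, bpw in precisions}
for label, gb in sizes.items():
    print(f"{label:<36} {gb:5.1f} GB")
```

At full FP16 precision the same weights would need about 70GB, which is why quantization matters as much as the MoE design for fitting the model onto a laptop.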
Expert specialization is not random. Through training, different experts develop preferences for different kinds of input, and the router learns to send tokens to the experts best suited for them. The intuitive picture is of experts that favor code tokens, mathematical notation, or natural language prose, although in practice the division of labor is often finer-grained and less tidy than those labels suggest. This learned specialization is part of why MoE models can match much larger dense models in output quality while activating far fewer parameters per token, and it is why the effective capability of the model often exceeds what the active parameter count alone would suggest.
The Qwen3 Family
Qwen3 is Alibaba’s third-generation open-weight model series, first released in April 2025. The original family spanned dense models from 0.6B to 32B parameters and two MoE variants: a laptop-scale 30B-A3B and a larger 235B-A22B flagship designed for server-grade deployments. Qwen3.6-35B-A3B, the model Willison ran, is a later release in the same line and keeps the small-active-parameter MoE design. Qwen3 models include a toggleable thinking mode that activates extended chain-of-thought reasoning, similar in spirit to DeepSeek-R1’s approach. You can enable it for reasoning-heavy tasks and disable it for faster, lower-cost completions when the task doesn’t require deep deliberation.
All Qwen3 models are available through Hugging Face and through ollama. The open-weight release means you can run it, fine-tune it, inspect its weights, and deploy it without API rate limits or per-token costs beyond your own hardware.
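For readers who want to script against a local copy, here is a minimal sketch that talks to Ollama's local HTTP API using only the Python standard library. It assumes an Ollama server is running on its default port and that you have already pulled a model. MODEL_TAG is a placeholder, not a real tag; substitute whatever name `ollama list` shows on your machine. The helper names build_request and generate are made up for this example.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint
MODEL_TAG = "your-qwen-model-tag"  # placeholder: replace with the tag you actually pulled

def build_request(prompt, model=MODEL_TAG, url=OLLAMA_URL):
    """Build a non-streaming generate request without sending it."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def generate(prompt, model=MODEL_TAG):
    """Send the request to the local server and return the generated text."""
    with urllib.request.urlopen(build_request(prompt, model)) as resp:
        return json.loads(resp.read())["response"]

# Example call (requires a running Ollama server and a pulled model):
# print(generate("Generate an SVG of a pelican"))
```

Nothing leaves your machine in this flow, which is the practical meaning of "no API rate limits or per-token costs" for a local model.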
Why SVG Drawing Is a Useful Test
Formal benchmarks like MMLU, HumanEval, and MATH are useful for tracking aggregate capability, but they carry well-documented contamination risks and can be inadvertently optimized against during training. SVG generation is harder to game.
Drawing an animal in SVG requires the model to reason about 2D coordinate space, plan shape layering and z-order, approximate curves using cubic Bezier paths, and produce well-formed XML that renders correctly in a browser. These are four distinct capabilities that must succeed together. A model that hallucinates path syntax, misplaces the eye relative to the beak, or produces technically valid XML that renders as an undifferentiated blob is not passing the test, regardless of what its MMLU score shows.
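The mechanical part of that test, whether the output is well-formed XML with an SVG root and some actual shapes, is easy to check automatically. The sketch below uses Python's standard XML parser on a hand-written sample; the sample drawing and the inspect_svg helper are illustrative assumptions, not output from any model.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"
SHAPE_TAGS = {"path", "circle", "ellipse", "rect", "polygon", "polyline", "line"}

def inspect_svg(text):
    """Report whether text parses as XML, has an <svg> root, and which shapes it draws."""
    try:
        root = ET.fromstring(text)
    except ET.ParseError as err:
        return {"well_formed": False, "error": str(err)}
    # Strip the "{namespace}" prefix ElementTree adds to tag names.
    names = [el.tag.split("}")[-1] for el in root.iter()]
    return {
        "well_formed": True,
        "is_svg_root": root.tag == f"{{{SVG_NS}}}svg",
        # Document order is paint order: later shapes are drawn on top.
        "shapes": [n for n in names if n in SHAPE_TAGS],
    }

# Hand-written sample: a body, a beak drawn with one cubic Bezier segment, and an eye.
sample = """<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 100 100">
  <ellipse cx="50" cy="60" rx="25" ry="18" fill="white" stroke="black"/>
  <path d="M 70 50 C 85 45, 95 50, 98 58" fill="none" stroke="orange"/>
  <circle cx="66" cy="46" r="2" fill="black"/>
</svg>"""

report = inspect_svg(sample)
broken = inspect_svg("<svg><path d='M 0 0'></svg>")  # unclosed <path> element
print(report)
print(broken["well_formed"])
```

A check like this only catches the syntactic failures. Whether the eye sits in the right place relative to the beak still takes a human looking at the rendered image, which is exactly why the test resists automated gaming.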
Training corpora are unlikely to contain many SVG renditions of pelicans annotated for aesthetic quality. The model largely has to construct the image itself, which means the output reflects spatial reasoning and code generation more than retrieval from a training example. The test also has a natural floor and ceiling that make quality differences visible: two SVG pelicans side by side are easy to compare, and the differences are legible to anyone who looks at the output.
Willison has used pelican-drawing prompts as an ongoing qualitative check across many model releases. When he says Qwen3.6-35B-A3B drew a better pelican than Opus 4.7, he is drawing on accumulated intuition about what good SVG generation looks like across dozens of model comparisons, not making an arbitrary one-time judgment.
The Trend This Fits
The pattern is consistent with roughly 18 months of open model development. DeepSeek-R1, released in January 2025, was reported to perform on par with OpenAI’s o1 on reasoning benchmarks while being fully open-weight, and was reportedly trained at a fraction of the cost of comparable US frontier models. Qwen2.5-72B was broadly considered GPT-4-class for practical coding and language tasks. Meta’s Llama 3.1 405B was the first open-weight model to explicitly claim GPT-4-level performance across major benchmarks.
Each of those releases compressed the gap between open models and closed frontier models. Qwen3.6-35B-A3B matching Claude Opus 4.7 on a specific task is not a sudden discontinuity; it is where the trend was heading.
What distinguishes this particular result from earlier milestones is the hardware profile. DeepSeek-R1 at 671B parameters and Llama 3.1 405B are runnable locally in a technical sense, but you need serious, expensive hardware to get practical throughput. A 35B MoE that runs on a modern laptop with the per-token compute of a 3B model changes what “locally runnable” means for most developers. The capability is now accessible without dedicated inference infrastructure.
The Caveats
One task, one evaluator. Willison’s comparison is a single prompt judged qualitatively by the person running it. Claude Opus 4.7 almost certainly outperforms Qwen3.6-35B-A3B across many tasks, particularly those involving nuanced long-context reasoning, reliable instruction following across edge cases, or domains where Anthropic’s extensive alignment training adds meaningful behavioral consistency. A better pelican is a data point about SVG spatial reasoning on a specific prompt, not a general capability ranking.
Inference speed and context window also matter in practice. A cloud API call to Opus returns quickly with no hardware startup cost and no memory overhead on your machine. Running the model locally means waiting for tens of gigabytes of weights to load, keeping them resident in memory, and accepting whatever token rate your laptop's memory bandwidth allows. It may also mean a shorter effective context window, depending on your configuration and available RAM. For throughput-sensitive or long-context workflows, the API remains the practical choice.
The comparison also does not address alignment differences. Anthropic has invested substantially in Constitutional AI training and RLHF refinement designed to make Opus predictable, safe, and consistent across a wide range of deployment contexts. Open-weight models have their own alignment approaches, but the behavioral profiles differ in ways that matter for production systems at scale.
The Practical Upshot
The capability tier previously reserved for frontier cloud APIs has reached consumer hardware, at least for specific creative and coding tasks. A developer with a modern MacBook or a well-specced PC running Qwen3.6-35B-A3B now has a tool that, on some tasks, operates at the same quality level as the best cloud models available. That changes the build-vs-buy calculus for AI capabilities in a way that was not possible six months ago.
Where latency, privacy, or cost are constraints, a locally running Qwen3.6-35B-A3B is now a credible option. Where breadth of capability, reliable instruction following, and long context are the requirements, cloud APIs remain ahead. The gap between those two use cases is smaller now than at any prior point, and the open model side of that gap is still narrowing.