· 5 min read ·

When 3 Billion Active Parameters Outdraws a Frontier Flagship

Source: simonwillison

Simon Willison ran Qwen3.6-35B-A3B on his laptop and asked it to draw a pelican in SVG. The output was better than Claude Opus 4.7’s. That sentence deserves more unpacking than the headline gives it.

This is not a story about one informal test. It’s a story about an architectural shift in open-weight models that has been building for over a year, and about what that shift means now that it’s running at 40-plus tokens per second on consumer hardware.

The Architecture Behind the Result

Qwen3.6-35B-A3B is a Mixture-of-Experts model. The naming encodes the key fact: 35 billion total parameters, approximately 3 billion active per forward pass. The “A3B” suffix is the MoE convention for “active 3 billion.”

The way MoE works is that the feedforward layers (which make up the bulk of a transformer’s parameters) are divided into many independent expert sub-networks. A learned routing function selects a small subset of these experts for each token. The unselected experts sit in memory but contribute nothing to the forward pass. You load the full weight set once, but each inference only routes through a fraction of it.

The practical consequence is that Qwen3.6-35B-A3B reasons with the knowledge stored in 35 billion parameters while paying the compute cost of a roughly 3-billion-parameter model per token. On Apple Silicon with mlx-lm, that translates to around 40-60 tokens per second, which is roughly twice what a comparable dense 30B model achieves on the same hardware. At Q4_K_M quantization the model occupies 18-22 GB, which puts it within reach of any MacBook Pro with 24 GB of unified memory.

The Qwen3 family, released in April 2026, includes two MoE variants: this laptop-class 35B-A3B and a server-grade 235B-A22B. The dense models in the family range from 0.6B to 32B. All carry an Apache 2.0 license, which means unrestricted commercial use. Ollama exposes the model as qwen3:30b-a3b (the slight naming difference is Ollama’s convention, not a different model).

This is the third generation of Qwen. Qwen2.5-72B in late 2024 was already broadly GPT-4-class on coding benchmarks. QwQ-32B, the dedicated chain-of-thought model from the same family, matched or exceeded frontier models on math and logic. The trajectory has been consistent: each generation closes the gap, and Qwen3 is the first to cross it on at least some tasks.

What SVG Drawing Actually Tests

Willison has used the pelican SVG prompt as an informal benchmark across model generations for some time. It’s worth understanding why this is a more demanding probe than it appears.

SVG is coordinate-based. Every number in the output has a direct spatial meaning, and the model has to commit to all of it in a single pass with no visual feedback loop. Drawing a recognizable pelican requires the model to hold a mental model of the bird’s anatomy, translate it into 2D coordinates that make geometric sense relative to each other, and produce well-formed SVG markup simultaneously. There’s no runtime interpreter catching errors or adjusting values. The coordinate relationships are either internally consistent or they’re not, and any failure is immediately visible in the rendered output.

This makes it resistant to the kind of benchmark contamination that plagues formal evals. Training corpora don’t contain examples labeled “SVG pelican, scored for quality.” The test also doesn’t reduce to a single number, which makes it harder to overfit to but easier to interpret when you can see the actual output.

For spatial and coordinate reasoning specifically, it’s a more relevant probe than most multiple-choice benchmarks. A model that can produce a coherent SVG pelican has demonstrated something about how it reasons over constrained 2D space.

The Dual-Mode Design

Qwen3 models include something worth noting separately: a built-in thinking mode and a non-thinking mode, switchable per request. In thinking mode the model engages extended chain-of-thought reasoning before producing output. In non-thinking mode it responds quickly for lower-latency tasks.

This is similar to how Claude’s extended thinking API works, where you enable a thinking budget in the request:

response = anthropic.messages.create(
    model="claude-opus-4-7-20260401",
    max_tokens=16000,
    thinking={"type": "enabled", "budget_tokens": 10000},
    messages=[...]
)

But Qwen3’s version runs locally, offline, with zero token cost beyond electricity. For tasks where you want reasoning depth, you enable thinking. For tasks where you want fast responses, you disable it. That flexibility, available at zero marginal cost, is a different kind of capability than the same feature behind an API.

Where Opus 4.7 Still Has Ground

Being direct about this: Claude Opus 4.7 still leads meaningfully on specific things. Its 200,000-token context window with coherent long-range reasoning is not what you get from a laptop model. Long-horizon agentic tasks, multi-file code refactors with deep interdependencies, extended reasoning over large corpora, and behavioral consistency from Anthropic’s alignment training are all areas where Opus 4.7 earns its cost.

For my own work building Discord bots and wiring up agentic workflows, the tasks that have pushed me toward Opus rather than Sonnet have consistently been the ones involving many sequential tool calls where the model needs to maintain goal state across a long context. Qwen3.6-35B-A3B running on my laptop does not help me there, at least not yet.

But the SVG pelican result is pointing at something real: on bounded, self-contained tasks that require a specific kind of spatial or structured reasoning, a local MoE model is now competitive with frontier API models. That’s a narrower claim than “local beats cloud,” but it’s an accurate one.

What This Changes for Developers

The practical implication is that the decision tree for choosing a model has gained a new viable branch. Before Qwen3, the local model option was defensible for cost reasons but required accepting a meaningful capability penalty on complex tasks. That penalty is shrinking.

For any developer running classification, structured extraction, code generation for moderate complexity, or generative tasks like SVG or structured document output, running a local Qwen3 MoE model now means: no API latency, no per-token cost, no rate limits, no data leaving the machine, and inference speeds that are usable for interactive workflows. The total parameter count stored in memory is large; the compute per token is small.

The Qwen family has accumulated over 113,000 derivative fine-tuned models on Hugging Face as of early 2026, more than Meta’s Llama derivatives by a wide margin. That ecosystem breadth matters: whatever your task domain, there is likely a fine-tune of this architecture targeted at it.

Willison’s pelican result is one data point. The architecture behind it is the more durable story. MoE is not a research curiosity anymore; it’s how the leading open-weight models are built, and it’s why those models are now fast enough and capable enough to run as practical tools on developer hardware. The gap between what you can run locally and what you can call via API is narrower in April 2026 than it was at any prior point, and the direction of travel is clear.

Was this interesting?