The 35B/3B Split: What Qwen3's MoE Architecture Actually Changes About Local Inference
Source: simonwillison
The pelican test is simple to describe and hard to fake. Simon Willison has been running it across model releases for years as a consistent qualitative snapshot: ask the model to draw a pelican in SVG, then look at what comes out. The output either looks like a bird or it doesn’t.
His article from April 16 records the result of running it on Qwen3.6-35B-A3B, a locally-runnable open-weight model from Alibaba’s Qwen team. The local model produced better output than Claude Opus 4.7. That result is worth examining in detail.
What SVG Drawing Actually Tests
SVG generation works as a benchmark because it requires several capabilities to converge at once. The model has to hold a working concept of what a pelican looks like, translate that knowledge into abstract coordinates and path data, produce structurally valid XML, and maintain visual coherence across the full output. A failure in any of those areas produces output that either won’t render at all or renders as something geometrically incoherent.
A typical prompt looks like this:
Draw a pelican as an SVG image. Output only valid SVG code.
What comes back tells you about code generation quality, spatial reasoning, and whether the model has a stable internal representation of “pelican” as a visual concept. Stronger models produce recognizable birds: the distinctive pouched bill, the thick neck, proportionate wings and feet, possibly a waterline or background context. Weaker models produce a cluster of ellipses and rectangles arranged vaguely in the shape of something.
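The structural half of that judgment can be checked mechanically before anyone looks at the picture. The sketch below is a minimal example using only Python's standard library; `check_svg` is a hypothetical helper, not part of Simon's workflow. It only tells you whether the response parses as XML with an `<svg>` root and how many basic shape elements it contains. Whether the result looks like a pelican still takes a human eye.

```python
import xml.etree.ElementTree as ET

SVG_NS = "{http://www.w3.org/2000/svg}"
SHAPES = {"path", "circle", "ellipse", "rect", "polygon", "polyline", "line"}


def check_svg(text):
    """Return (ok, detail) for a model's SVG response."""
    try:
        root = ET.fromstring(text.strip())
    except ET.ParseError as exc:
        # Unclosed tags, stray prose around the markup, and similar failures
        return False, f"not well-formed XML: {exc}"
    # ElementTree reports namespaced tags as "{namespace}svg"
    if root.tag not in ("svg", SVG_NS + "svg"):
        return False, f"root element is {root.tag!r}, not <svg>"
    # Count drawable elements anywhere in the tree, ignoring the namespace
    count = sum(1 for el in root.iter() if el.tag.split("}")[-1] in SHAPES)
    return True, f"{count} shape elements"
```

A response that fails the first check will not render at all; one that passes with two or three shapes is probably the cluster of ellipses described above.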
The test resists benchmark contamination in a way that MMLU or HumanEval do not. There is no fixed correct answer to memorize, no training set of labeled pelican SVGs to overfit against. The quality of the output reflects generalization, not recall.
The Architecture: Why 35B Doesn’t Mean What It Used To
The model name encodes two numbers that tell different stories. 35B is the total parameter count. A3B means approximately 3 billion parameters are active during any given forward pass through the network.
This is the Mixture of Experts architecture. In a standard dense transformer, every token that passes through a feed-forward layer engages every weight in that layer. In a MoE model, each feed-forward layer is replaced by a pool of N expert sub-networks plus a learned router. The router takes the current token’s representation and selects the top K experts to activate, routing computation only through those paths.
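import numpy as np

# Dense FFN: all weights participate every time
def dense_ffn(x, W1, W2):
    return W2 @ np.maximum(W1 @ x, 0.0)       # ReLU between the two projections

# MoE FFN: router selects K of N experts per token
def moe_ffn(x, experts, router, k=2):
    logits = router(x)                        # shape: [num_experts]
    scores = np.exp(logits - logits.max())
    scores /= scores.sum()                    # softmax over experts
    top_k_idx = np.argsort(scores)[-k:]       # indices of the k highest-scoring experts
    # Only the selected experts run; each output is weighted by its router score
    return sum(experts[i](x) * scores[i] for i in top_k_idx)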
For Qwen3.6-35B-A3B, the ratio of total to active parameters works out to roughly 10:1. The compute cost per token scales with the active parameter count, not the total. Each token therefore costs approximately as much to process as it would in a 3B dense model, while the model as a whole has 35B parameters’ worth of representational capacity, with knowledge distributed across many specialized experts over the course of training.
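As a back-of-the-envelope check on that claim, here is a minimal sketch using the common rule of thumb that a forward pass costs about two floating-point operations per active weight per token. The rule is an approximation that ignores attention over the context, and the 35B and 3B figures are simply read off the model name, so treat the output as an order-of-magnitude estimate.

```python
def flops_per_token(active_params):
    # Rule of thumb: one multiply and one add per active weight per token
    return 2 * active_params


TOTAL_PARAMS = 35e9   # from the "35B" in the model name
ACTIVE_PARAMS = 3e9   # from the "A3B" in the model name

moe_cost = flops_per_token(ACTIVE_PARAMS)
dense_cost = flops_per_token(TOTAL_PARAMS)   # a dense model of the same total size
ratio = dense_cost / moe_cost

print(f"MoE:   {moe_cost:.1e} FLOPs per token")
print(f"Dense: {dense_cost:.1e} FLOPs per token")
print(f"Ratio: {ratio:.1f}x")
```

Taken at face value the name gives a ratio a little under 12:1, in line with the rough 10:1 figure.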
The practical consequence is that you get much of the generalization of a large model at close to the inference cost of a small one. Benchmark results from the Qwen team are consistent with this pattern: Qwen3.6-35B-A3B scores in ranges comparable to dense models with far more than 3B parameters.
The Local Inference Picture
The important caveat is that compute and memory are separate concerns. The 10:1 compute reduction from MoE does not reduce the memory footprint. You still need to hold all 35B parameters in RAM or VRAM, because the router selects different experts for different tokens and you cannot know in advance which experts will be needed for any given sequence.
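A toy simulation makes the residency argument concrete. This is a sketch under loud assumptions: the router is a random matrix standing in for a learned one, and the sizes (8 experts, top 2, 64 tokens) are illustrative rather than Qwen's actual configuration. Each token touches only a couple of experts, but the set touched across a whole sequence is much larger.

```python
import random

random.seed(0)
NUM_EXPERTS, TOP_K, D_MODEL, NUM_TOKENS = 8, 2, 16, 64   # toy sizes, not Qwen's

# Random stand-in for a learned router: one weight row per expert
router = [[random.gauss(0, 1) for _ in range(D_MODEL)] for _ in range(NUM_EXPERTS)]


def top_k_experts(token):
    # Score every expert for this token, then keep the k highest-scoring ids
    logits = [sum(w * t for w, t in zip(row, token)) for row in router]
    return sorted(range(NUM_EXPERTS), key=lambda i: logits[i], reverse=True)[:TOP_K]


tokens = [[random.gauss(0, 1) for _ in range(D_MODEL)] for _ in range(NUM_TOKENS)]
chosen = [top_k_experts(t) for t in tokens]
experts_touched = {i for row in chosen for i in row}

print(f"each token runs {TOP_K} of {NUM_EXPERTS} experts")
print(f"experts needed across {NUM_TOKENS} tokens: {len(experts_touched)} of {NUM_EXPERTS}")
```

Because the choice depends on each token's content, there is no static subset of experts you could safely leave on disk.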
With 4-bit quantization in the GGUF format, 35B parameters compress to approximately 18-20 GB of weights. That fits within the unified memory of a 24GB MacBook Pro or Mac Studio, or within the 24GB of VRAM on an RTX 4090, though in both cases with only a few gigabytes left over for the context cache and everything else running on the machine. For developers whose Apple Silicon hardware has 24GB or more of unified memory, this model is accessible without any additional hardware investment.
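The 18-20 GB range falls out of simple arithmetic. The sketch below assumes an effective 4.5 bits per weight for a typical 4-bit format, since such formats store per-block scale factors alongside the 4-bit values; that figure is an illustrative assumption, not a measured property of any specific Qwen file. It counts weights only and leaves out the context cache and runtime buffers.

```python
def weight_gigabytes(params, bits_per_weight):
    """Size of the weights alone, in GB (10**9 bytes)."""
    return params * bits_per_weight / 8 / 1e9


PARAMS = 35e9

fp16 = weight_gigabytes(PARAMS, 16)               # unquantized half precision
pure_4bit = weight_gigabytes(PARAMS, 4.0)         # floor: exactly 4 bits per weight
with_overhead = weight_gigabytes(PARAMS, 4.5)     # assumed effective rate with block scales

print(f"fp16:            {fp16:.1f} GB")
print(f"4-bit floor:     {pure_4bit:.1f} GB")
print(f"4-bit effective: {with_overhead:.1f} GB")
```

The floor and the assumed effective rate bracket the quoted range, and the fp16 line shows why quantization is what makes this a laptop-sized model at all.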
The comparison with a dense model clarifies why this matters. A dense 70B model at 4-bit quantization requires around 40 GB, which forces either a multi-GPU setup or significant offloading to system RAM. Offloading degrades throughput substantially; the bottleneck shifts from the GPU to system memory and PCIe bandwidth. If Qwen3.6-35B-A3B delivers comparable capability, it does so in half the memory and generates tokens faster on the same hardware, because each token only traverses a 3B-equivalent compute path.
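One rough way to see the speed gap: during token-by-token generation, the weights a token uses have to be read from memory once per token, so memory bandwidth divided by bytes read per token gives a ceiling on tokens per second. The sketch below applies that ceiling to both models. The 400 GB/s bandwidth and 4.5 bits per weight are hypothetical round numbers chosen for illustration, and the estimate ignores attention, context-cache reads, and routing overhead, so real throughput will be lower and the real gap narrower.

```python
def decode_ceiling_tokens_per_s(active_params, bits_per_weight, bandwidth_gb_s):
    # Bytes of weights that must be read from memory to produce one token
    bytes_per_token = active_params * bits_per_weight / 8
    return bandwidth_gb_s * 1e9 / bytes_per_token


BANDWIDTH_GB_S = 400.0   # hypothetical memory bandwidth, not a measured figure
BITS_PER_WEIGHT = 4.5    # assumed effective rate for a 4-bit quantization

dense_70b = decode_ceiling_tokens_per_s(70e9, BITS_PER_WEIGHT, BANDWIDTH_GB_S)
moe_a3b = decode_ceiling_tokens_per_s(3e9, BITS_PER_WEIGHT, BANDWIDTH_GB_S)

print(f"dense 70B ceiling: {dense_70b:.0f} tokens/s")
print(f"35B-A3B ceiling:   {moe_a3b:.0f} tokens/s")
print(f"ratio:             {moe_a3b / dense_70b:.1f}x")
```

The ratio depends only on the active parameter counts, so the bandwidth and quantization assumptions cancel out of it; they only affect the absolute numbers.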
On Apple Silicon hardware with 48GB unified memory, the throughput difference between a dense 70B and a 35B MoE is significant in practice. The MoE model generates tokens faster not because of any software optimization, but because the arithmetic in each forward pass is genuinely reduced. Ollama and llama.cpp both support the GGUF format through which Qwen models are distributed, so the setup path is straightforward.
Contextualizing the Pelican Result
Simon Willison has run the pelican test across dozens of model releases over several years. He has a calibrated sense of what good output looks like, built from longitudinal comparison rather than a fixed reference point. When he says a local model produced a better pelican than Claude Opus 4.7, the comparison reflects genuine familiarity with both ends of that range.
Claude Opus 4.7 sits at the top of Anthropic’s current Claude 4 API catalog. It is not a lightweight or cost-reduced offering; it is the most capable model Anthropic publishes for general use. Accessing it requires an API key, incurs per-token costs, and routes prompts through Anthropic’s infrastructure. Qwen3.6-35B-A3B runs on hardware you own, produces output that never leaves your machine, and costs nothing per query beyond electricity.
The Qwen team at Alibaba has been releasing capable open-weight models consistently since Qwen2, with each generation pushing further into territory previously held only by closed frontier APIs. The MoE architecture has been central to that progress; it allows them to publish models with high capability ceilings that remain practical on consumer hardware. DeepSeek’s work through 2024 and 2025 demonstrated a similar pattern in reasoning tasks, and Qwen3 extends it into creative and generative domains.
The pelican result is one qualitative data point from one evaluator on one task, not a comprehensive capability benchmark. But SVG generation is not a narrow synthetic test either: it requires visual knowledge, spatial reasoning, code generation, and multi-step structural coherence. A model that outperforms Opus 4.7 on that task, even in a single run, has shown real capability across a meaningful range of abilities.
What the Shift Actually Means
The practical implications change when locally-runnable models become hard to distinguish from expensive API-accessed ones. Use cases that were previously gated behind per-token cost or rate limits become things you can run in tight loops, on private data, with no network latency and no usage meter.
For workflows involving code generation, document processing, or any task where you want to run many queries without tracking cost per call, a locally-runnable model that approaches frontier quality changes the economics entirely. The SVG pelican is a narrow demonstration, but the architecture that makes it possible applies equally to code completion, data extraction, structured generation, and local agent workflows where you want to call a capable model hundreds of times without thinking about the bill.
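For a sense of what "hundreds of calls without a bill" looks like in practice, here is a minimal sketch that sends prompts to a locally running Ollama server over its HTTP API using only the standard library. It assumes Ollama is running on its default port and that the model has already been pulled. The `MODEL_TAG` value is a hypothetical placeholder, so substitute whatever name `ollama list` reports on your machine; `build_request`, `generate`, and `run_batch` are illustrative helper names.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"   # Ollama's default local endpoint
MODEL_TAG = "qwen3.6:35b-a3b"   # hypothetical tag; check `ollama list` for the real one


def build_request(prompt, model=MODEL_TAG):
    # stream=False asks for one JSON object instead of a stream of chunks
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt, model=MODEL_TAG, timeout=300):
    # Blocks until the local model has finished generating
    with urllib.request.urlopen(build_request(prompt, model), timeout=timeout) as resp:
        return json.loads(resp.read())["response"]


def run_batch(prompts, model=MODEL_TAG):
    # A plain loop: no rate limits and no per-token cost to track
    return [generate(p, model) for p in prompts]
```

Calling `run_batch(["Draw a pelican as an SVG image. Output only valid SVG code."] * 20)` would produce twenty pelicans to compare, and none of the prompts or outputs leave the machine.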
The hardware requirement for Qwen3.6-35B-A3B covers most modern developer workstations and a substantial share of the current MacBook Pro and Mac Studio lineup. This is not an aspirational capability for future hardware; it is something you can set up today with Ollama in a few minutes if you have the right machine.
The broader trajectory here is one where the open-weight ecosystem has moved from “competitive on benchmarks” to “better than Opus on the pelican test on my laptop.” That progression has been faster than most estimates from two years ago suggested, and Simon’s longitudinal documentation of individual test results, accumulated with consistent methodology across many releases, remains one of the cleaner ways to actually track it.