Why Qwen3 Can Beat a Frontier API Model at Spatial Tasks While Running on Your Laptop
Source: simonwillison
Simon Willison posted something worth pausing on: Qwen3.6-35B-A3B, running locally on his laptop, produced better SVG output for a pelican drawing prompt than Claude Opus 4.7. Opus 4.7 is billed per token and runs on Anthropic’s infrastructure. Qwen3.6-35B-A3B ran on consumer hardware, with no per-token cost once the weights are downloaded.
That comparison would have registered as implausible two years ago. Today it requires an explanation, because the architecture that makes it possible is genuinely interesting and has specific implications for anyone who builds tools on top of language models.
The Pelican Test
Willison has used the pelican SVG prompt across many model evaluations. The task is simple to state: generate SVG markup that draws a recognizable pelican. It is harder than it looks because SVG requires the model to reason about coordinates, shapes, proportions, and spatial relationships in two dimensions, then express all of that as declarative XML rather than pixels or raster output.
There are no lookup tables for pelican geometry. The model has to construct something coherent from whatever spatial representations it developed during training. A model that produces a plausible pelican is showing that representations learned mostly from text transfer to a visual construction task, one that requires holding shape, position, and hierarchy in mind simultaneously.
That is why this test appears repeatedly in informal model comparisons. It sits at an intersection of geometry, artistic judgment, and code generation that standard benchmarks rarely probe. MMLU and GSM8K tell you something about knowledge retrieval and mathematical reasoning. The pelican prompt tells you something different: whether the model can assemble coherent structured output from an underspecified creative brief.
What A3B Actually Means
The model name contains the critical technical fact: 35B-A3B. The 35B is the total parameter count; A3B means approximately 3 billion parameters are active during any given forward pass. This is a Mixture of Experts architecture.
In a standard dense transformer, every parameter participates in processing every token. A 35B dense model uses all 35 billion parameters to compute each output. A 35B MoE model routes each token through a small subset of specialized “expert” feedforward layers, leaving the rest dormant for that token; the attention layers and embeddings are shared and still run for every token. The routing function is learned during training, so different input patterns activate different subsets of experts. The total parameter count reflects the model’s knowledge capacity; the active parameter count reflects the per-token compute cost.
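The routing step described above can be sketched in a few lines of plain Python. This is a toy illustration of generic top-k expert routing, not Qwen’s implementation: the number of experts, the value of top_k, the router weights, and the stand-in “experts” (simple scaling functions) are all hypothetical choices made for the example.

```python
import math
import random

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def route_token(router_logits, top_k):
    """Pick the top_k experts for one token and renormalize their gate weights."""
    probs = softmax(router_logits)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    norm = sum(probs[i] for i in chosen)
    return {i: probs[i] / norm for i in chosen}

def moe_layer(token, experts, router_weights, top_k=2):
    """Gate-weighted sum of the selected experts' outputs.

    Experts that are not selected are never called, which is where the
    per-token compute saving comes from.
    """
    logits = [sum(w * x for w, x in zip(row, token)) for row in router_weights]
    gates = route_token(logits, top_k)
    out = [0.0] * len(token)
    for idx, gate in gates.items():
        expert_out = experts[idx](token)
        out = [o + gate * e for o, e in zip(out, expert_out)]
    return out, sorted(gates)

# Hypothetical toy setup: 8 experts, hidden size 4, 2 experts active per token.
NUM_EXPERTS, HIDDEN = 8, 4
random.seed(0)
router_weights = [[random.uniform(-1, 1) for _ in range(HIDDEN)]
                  for _ in range(NUM_EXPERTS)]
# Each stand-in "expert" just scales its input by a different constant.
experts = [lambda v, s=s: [s * x for x in v] for s in range(1, NUM_EXPERTS + 1)]

output, used = moe_layer([0.5, -1.0, 2.0, 0.1], experts, router_weights, top_k=2)
print("experts used for this token:", used)
```

In a real model the router is a learned linear layer and each expert is a full feedforward block, but the control flow is the same: score all experts, run only a few.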
For Qwen3.6-35B-A3B, that ratio is roughly 12 to 1. You get something with the representational breadth of a 35B model at a compute cost closer to a 3B dense model. That ratio is why this model can run at interactive speeds on a laptop while producing output that competes with frontier API models on certain tasks.
The MoE design in Qwen3 follows the approach that Mistral’s Mixtral 8x7B established as viable for open-weight models in late 2023. DeepSeek’s V2 and V3 releases then pushed MoE efficiency further by improving routing stability and reducing the communication overhead between experts. Alibaba’s Qwen team has iterated on top of that lineage. The 35B-A3B configuration appears to be a deliberate sweet spot between memory footprint and generation quality, following the same logic that made DeepSeek-V2’s 236B-A21B design compelling: large total capacity, controlled inference cost.
Running This on Consumer Hardware
When someone says a 35B-parameter model ran on their laptop, the implied context is quantization and Apple Silicon’s unified memory architecture.
Quantization reduces the numerical precision used to store model weights. A 35B model in 16-bit floating point requires roughly 70 GB of memory for the weights alone (35 billion parameters at 2 bytes each). In 4-bit quantization, that drops to around 18 to 22 GB depending on the scheme, which fits within the unified memory pool of a high-end M-series Mac. Apple Silicon’s memory architecture, which shares DRAM between CPU and GPU without a PCIe bus in between, turns out to suit LLM inference well. Memory bandwidth is the binding constraint for generation speed, and the M-series chips deliver enough of it to sustain reasonable token rates for large quantized models.
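The memory figures above are simple arithmetic, sketched here for reference. The 4.5 bits-per-weight value is an assumption standing in for a typical 4-bit scheme once per-block scale factors are counted; real quantization formats vary, and none of these numbers include the KV cache or runtime overhead.

```python
def weight_memory_gb(num_params, bits_per_weight):
    """Memory needed to hold the weights alone, in decimal gigabytes."""
    return num_params * bits_per_weight / 8 / 1e9

TOTAL_PARAMS = 35e9  # total parameter count from the model name

fp16_gb = weight_memory_gb(TOTAL_PARAMS, 16)        # 16-bit floating point
q4_raw_gb = weight_memory_gb(TOTAL_PARAMS, 4)       # pure 4-bit, no metadata
q4_blocked_gb = weight_memory_gb(TOTAL_PARAMS, 4.5) # assumed effective rate

print(f"fp16 weights:          {fp16_gb:.1f} GB")
print(f"4-bit weights (raw):   {q4_raw_gb:.1f} GB")
print(f"4-bit weights (~4.5b): {q4_blocked_gb:.1f} GB")
```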
The active parameter count matters here too. All of the weights still have to be resident in memory, but a MoE model only reads a fraction of them for each token, so far fewer bytes cross the memory bus per generated token. Because generation is bandwidth-bound, that translates fairly directly into higher token rates. The 35B-A3B combination lands in a range where llama.cpp and Ollama can sustain usable generation rates on M3 or M4 hardware.
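A back-of-envelope estimate makes the bandwidth argument concrete. If each generated token requires reading every active weight once, the decode rate cannot exceed memory bandwidth divided by bytes read per token. The 400 GB/s bandwidth below is a hypothetical placeholder, not a measurement of any particular machine, and the 4.5 bits per weight is an assumed quantization rate. The result is a ceiling only: it ignores the KV cache, attention compute, routing overhead, and everything else a real inference engine does.

```python
def decode_ceiling_tokens_per_s(active_params, bits_per_weight, bandwidth_bytes_per_s):
    """Upper bound on tokens/s if decoding is purely memory-bandwidth-bound."""
    bytes_per_token = active_params * bits_per_weight / 8
    return bandwidth_bytes_per_s / bytes_per_token

BANDWIDTH = 400e9       # hypothetical memory bandwidth, bytes per second
BITS_PER_WEIGHT = 4.5   # assumed effective rate for a 4-bit quantization

# MoE: only about 3B parameters are read per token.
moe_ceiling = decode_ceiling_tokens_per_s(3e9, BITS_PER_WEIGHT, BANDWIDTH)
# Dense model of the same total size: all 35B parameters are read per token.
dense_ceiling = decode_ceiling_tokens_per_s(35e9, BITS_PER_WEIGHT, BANDWIDTH)

print(f"MoE ceiling:   {moe_ceiling:.0f} tokens/s")
print(f"dense ceiling: {dense_ceiling:.0f} tokens/s")
print(f"ratio:         {moe_ceiling / dense_ceiling:.1f}x")
```

Whatever bandwidth figure is plugged in, the ratio between the two ceilings is just the ratio of total to active parameters, which is the roughly 12-to-1 figure from the model name.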
Willison likely ran this through his own llm CLI tool or Ollama, both of which support the Qwen model family and abstract away the quantization details. The tooling has matured to the point where getting a 35B MoE model running locally is closer to a package install than a research project.
The Trajectory This Represents
The comparison that matters is not a static snapshot of Qwen3.6-35B-A3B versus Claude Opus 4.7. It is how quickly that gap has moved.
Eighteen months ago the question was whether any locally runnable model could produce coherent, usable code. The ceiling was 7B or 13B models that were useful for simple tasks but clearly limited on anything requiring nuanced judgment. Qwen2.5’s late 2024 release shifted that ceiling considerably; its larger coder variants were reported to be competitive with GPT-4o on coding benchmarks while fitting on consumer hardware. The Qwen3 family continues that trajectory, and the pelican result is a data point suggesting that frontier-quality output on creative and spatial tasks is now reachable from a laptop.
This matters in a concrete way for people who build things on top of models. Local inference removes the per-token cost, the latency of an API round trip, the privacy considerations involved in sending data to a third-party service, and the dependency on uptime and rate limits you do not control. For prototyping, for tools that process sensitive data, for applications where API costs at scale do not pencil out, a locally runnable model that matches or exceeds a frontier API model on specific tasks changes the build calculus in a meaningful way.
What the SVG Result Says About Capability Density
The pelican result is specific enough that it would be a mistake to generalize it too broadly. Opus 4.7 is not worse than Qwen3.6-35B-A3B across the board; on many tasks it likely remains ahead. But the fact that a locally run MoE model came out ahead on a spatial construction task is worth noting separately from coding or reasoning benchmarks, because SVG generation is underrepresented in standard evals.
Most benchmarks measure output that can be scored mechanically: math proofs, code that can be executed and tested against expected outputs, multiple-choice questions with known correct answers. SVG generation requires the model to produce structured output where correctness is simultaneously aesthetic, geometric, and syntactic. A model that handles this well has developed representations that satisfy all three kinds of constraint at once.
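Only the syntactic layer, and a thin slice of the geometric one, can be checked mechanically, which is part of why this task resists standard benchmarking. A minimal sketch of such a check using Python’s standard library follows. The sample SVG is a hand-written toy, not output from any model, and the function name and the specific checks are illustrative choices rather than an established evaluation method.

```python
import xml.etree.ElementTree as ET

SVG_NS = "{http://www.w3.org/2000/svg}"
SHAPE_TAGS = {"circle", "ellipse", "rect", "path", "polygon", "polyline", "line"}

def check_svg(markup):
    """Return basic syntactic and geometric facts about an SVG string."""
    try:
        root = ET.fromstring(markup)
    except ET.ParseError:
        return {"well_formed": False, "is_svg": False, "shapes": 0,
                "circles_in_view": None}

    shapes = [el for el in root.iter() if el.tag.replace(SVG_NS, "") in SHAPE_TAGS]

    # Geometric sanity check: are all circle centers inside the viewBox?
    # None means the check could not be performed (no usable viewBox).
    circles_in_view = None
    parts = root.get("viewBox", "").replace(",", " ").split()
    if len(parts) == 4:
        try:
            min_x, min_y, width, height = (float(p) for p in parts)
            circles_in_view = all(
                min_x <= float(c.get("cx", "0")) <= min_x + width
                and min_y <= float(c.get("cy", "0")) <= min_y + height
                for c in root.iter(SVG_NS + "circle")
            )
        except ValueError:
            circles_in_view = None  # non-numeric coordinates, e.g. percentages

    return {"well_formed": True, "is_svg": root.tag == SVG_NS + "svg",
            "shapes": len(shapes), "circles_in_view": circles_in_view}

# Hand-written toy drawing: a body, a head, and a beak.
SAMPLE = """<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 200 200">
  <ellipse cx="100" cy="120" rx="50" ry="35" fill="white" stroke="black"/>
  <circle cx="140" cy="70" r="18" fill="white" stroke="black"/>
  <polygon points="155,70 195,80 155,85" fill="orange"/>
</svg>"""

report = check_svg(SAMPLE)
print(report)
```

Nothing here can tell whether the shapes add up to a recognizable pelican. That judgment still needs a human, or another model, looking at the rendered image.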
That Qwen3 handles it at 3 billion active parameters per forward pass is a point in favor of what MoE routing contributes to capability density. A single informal comparison cannot show what the experts have learned, but one plausible reading is that the experts activated during a spatial construction task are specialized in a way that produces better output than a dense model spending the same compute budget would.
For anyone building tools that generate diagrams, flowcharts, UI layouts, or any structured visual output from natural language, Qwen3.6-35B-A3B is worth benchmarking. The inference infrastructure is in place, the hardware requirements are within reach of a well-equipped developer workstation, and Willison’s result suggests the output quality can clear a high bar on tasks that matter for those use cases.