What a Pelican Drawing Reveals About Local Model Quality in 2026
Source: simonwillison
Simon Willison has been running the same informal benchmark across LLMs for years: ask the model to draw a pelican in SVG. No reference image, no coordinate hints, no feedback loop. The model generates raw markup in a single pass and either produces something recognizable or produces garbage. He documented the latest result on April 16, 2026: Qwen3.6-35B-A3B, running entirely on his laptop, produced a better pelican than Claude Opus 4.7.
This is worth sitting with for a moment. Opus 4.7 is Anthropic’s flagship frontier model, running on server infrastructure at an estimated trillion-parameter scale. The Qwen model has 35 billion total parameters, with approximately 3 billion active during any given forward pass. It fits in around 20 GB of RAM at 4-bit quantization. It runs on a MacBook Pro.
Why the Pelican Test Is a Reasonable Benchmark
The pelican drawing prompt is not a gimmick. Willison has been refining and running it long enough that it functions as a consistent quality signal across model generations. The core prompt asks for valid SVG markup representing a pelican, sometimes with the added constraint of a bicycle. No tools, no image generation, no runtime feedback.
What makes this useful is the intersection of skills it probes. Generating correct SVG requires knowing the specification well enough to produce valid nested elements: <svg>, <g> groups, <path> with properly formed d attributes, <ellipse>, <circle>, <rect>. Getting the markup right is the minimum bar. Producing something recognizable requires something harder: translating a spatial model of pelican anatomy into 2D coordinate values. The bill has to be proportioned relative to the head, the head attached to the neck at the right offset, the body below the neck, the legs below the body. None of these positions can be looked up; they have to be synthesized from geometric reasoning.
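To make the "minimum bar" concrete, here is a minimal Python sketch that assembles a crude pelican from the elements named above and checks that the markup is well-formed. The coordinates are hand-picked for illustration and `summarize_svg` is a hypothetical helper, not part of Willison's setup. Note that each coordinate only makes sense relative to the others, which is the part a model has to reason out.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"

# Illustrative, hand-picked coordinates: body, neck, head, bill, two legs.
PELICAN_SVG = f"""<svg xmlns="{SVG_NS}" viewBox="0 0 200 200">
  <g id="pelican">
    <ellipse cx="100" cy="130" rx="45" ry="30" fill="white" stroke="black"/>
    <path d="M 120 110 Q 135 80 130 55" fill="none" stroke="black" stroke-width="8"/>
    <circle cx="130" cy="50" r="12" fill="white" stroke="black"/>
    <path d="M 140 48 L 185 58 L 140 60 Z" fill="orange"/>
    <rect x="90" y="158" width="4" height="25" fill="orange"/>
    <rect x="108" y="158" width="4" height="25" fill="orange"/>
  </g>
</svg>"""


def summarize_svg(markup: str) -> dict[str, int]:
    """Parse SVG markup and count elements by local tag name.

    Raises xml.etree.ElementTree.ParseError if the markup is not well-formed
    XML, and ValueError if the root is not an <svg> element in the SVG namespace.
    """
    root = ET.fromstring(markup)
    if root.tag != f"{{{SVG_NS}}}svg":
        raise ValueError("root element is not <svg> in the SVG namespace")
    counts: dict[str, int] = {}
    for element in root.iter():
        local_name = element.tag.split("}", 1)[-1]  # strip the "{namespace}" prefix
        counts[local_name] = counts.get(local_name, 0) + 1
    return counts
```

A check like this only confirms the markup parses. Whether the result looks like a pelican is still a judgment call, which is the point of the test.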
The bicycle variant adds a second object and a spatial relationship: the pelican is riding, so its legs connect to the pedals at a location that makes sense given where the bike wheels and seat are. Models that produce plausible individual objects but place them in spatially incoherent relationships fail this test in an immediately visible way.
Structured code generation benchmarks like HumanEval test whether a model can write functions that pass unit tests. The pelican test has no test suite. It probes geometric reasoning and structured output quality in a domain where errors are immediately visible but not mechanically scorable, which is roughly the situation you encounter in real creative and technical writing tasks.
Willison maintains a running history of results at his site, making it possible to compare across models over time. The Qwen3.6-35B-A3B result stands out because it represents the first time a locally runnable model has, in his assessment, beaten a current frontier model on this specific test.
What Qwen3.6-35B-A3B Actually Is
The model name follows the Mixture-of-Experts (MoE) naming convention: 35 billion total parameters across all expert networks, with roughly 3 billion active for any single token during inference. The “35B-A3B” suffix encodes this directly. The Qwen team at Alibaba released this as part of the broader Qwen3 generation in early 2026.
MoE architecture partitions the feedforward layers of a transformer into many independent expert networks. A learned routing function selects a small subset of those experts per token, typically 2 to 8 out of dozens. The feedforward layers account for roughly two-thirds of transformer parameters, so the MoE design means you store many parameters in memory but compute across far fewer on each forward pass. For the 35B-A3B model, the ratio of total to active parameters is approximately 12:1.
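The routing step can be sketched in a few lines of plain Python. This is a toy illustration, not Qwen's implementation: in a real model the router logits come from a learned layer applied to the token's hidden state, whereas here they are passed in directly, and renormalizing the softmax over only the selected experts is one common variant that I am assuming. The names `top_k_route`, `moe_layer`, and `EXPERTS` are hypothetical.

```python
import math


def top_k_route(router_logits: list[float], k: int) -> list[tuple[int, float]]:
    """Pick the k highest-scoring experts and return (index, weight) pairs.

    Weights are a softmax over the selected logits only, so they sum to 1.
    """
    ranked = sorted(range(len(router_logits)), key=lambda i: router_logits[i], reverse=True)
    selected = ranked[:k]
    peak = max(router_logits[i] for i in selected)  # subtract max for numerical stability
    exps = [math.exp(router_logits[i] - peak) for i in selected]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(selected, exps)]


def moe_layer(x: float, experts, router_logits: list[float], k: int) -> float:
    """Weighted sum of the selected experts' outputs; unselected experts never run."""
    return sum(weight * experts[i](x) for i, weight in top_k_route(router_logits, k))


# Toy "experts": expert i simply scales its input by i (a stand-in for a feedforward block).
EXPERTS = [lambda x, scale=i: scale * x for i in range(4)]
```

With logits `[0.0, 2.0, 1.0, -1.0]` and `k=2`, only experts 1 and 2 are evaluated. All four sets of "weights" still have to be stored, which is the store-many, compute-few trade described above.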
This matters for local inference because single-stream token generation is typically bound by memory bandwidth, not raw FLOP count. Autoregressive generation has to read every weight it uses from memory for each token. With only about 3 billion parameters active per forward pass, the MoE model reads far less per token than a dense 35B model would, so it generates tokens at a rate closer to that of a 3B dense model than a 35B one. In practice, reported speeds on Apple Silicon using mlx-lm are in the range of 40 to 60 tokens per second, roughly double what a comparable dense 30B model achieves on the same hardware.
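A back-of-envelope version of the bandwidth argument, as a Python sketch. It assumes weight reads are the only cost, which makes it a loose upper bound: attention, KV cache traffic, and framework overhead are ignored, so real throughput lands well below it. The 400 GB/s figure in the usage note is a placeholder I picked for illustration, not a measured number for any particular machine.

```python
def max_tokens_per_second(active_params: float, bits_per_weight: float, bandwidth_gb_s: float) -> float:
    """Upper bound on decode speed if every active weight is read once per token.

    active_params   -- parameters touched per forward pass
    bits_per_weight -- storage cost per parameter after quantization
    bandwidth_gb_s  -- memory bandwidth in GB/s (1 GB = 1e9 bytes)
    """
    bytes_per_token = active_params * bits_per_weight / 8
    return bandwidth_gb_s * 1e9 / bytes_per_token
```

Calling `max_tokens_per_second(3e9, 4, 400)` and `max_tokens_per_second(35e9, 4, 400)` gives bounds that differ by a factor of 35/3, or about 11.7. Measured gaps are smaller than that, which is consistent with the bound leaving out everything except weight reads.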
Apple Silicon’s unified memory architecture is particularly suited to this workload. The M-series chips share one pool of memory between CPU and GPU, so model weights do not have to be copied across a PCIe bus before the GPU can read them. At 4-bit quantization using GGUF or MLX’s native format, the 35B-A3B model occupies roughly 18 to 22 GB, fitting on a 24 GB MacBook Pro with some headroom for the KV cache.
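The footprint range follows from simple arithmetic, sketched here in Python. The effective bits per weight of a "4-bit" format is usually somewhat above 4 because quantization scales are stored alongside the weights. The 4.5 and 5.0 values mentioned below are illustrative assumptions, not figures taken from any specific GGUF or MLX file.

```python
def weight_footprint_gb(total_params: float, bits_per_weight: float) -> float:
    """Approximate size of the stored weights in GB (1 GB = 1e9 bytes)."""
    return total_params * bits_per_weight / 8 / 1e9


def headroom_gb(total_memory_gb: float, footprint_gb: float) -> float:
    """Memory left over for the KV cache, the OS, and everything else."""
    return total_memory_gb - footprint_gb
```

For 35 billion parameters, exactly 4 bits per weight gives 17.5 GB, and 4.5 to 5.0 effective bits gives roughly 19.7 to 21.9 GB. On a 24 GB machine that leaves only a few GB for everything else, so the headroom is real but not generous.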
The Qwen3 family also ships with a dual-mode design: a “thinking” mode that enables extended chain-of-thought reasoning, suited for math and logic problems, and a “non-thinking” mode for direct, low-latency responses. For SVG generation, non-thinking mode is appropriate, and that is almost certainly what Willison used.
Running It
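If you want to reproduce this, there are three practical paths. The commands below use the tags for the earlier Qwen3-30B-A3B release; substitute the Qwen3.6-35B-A3B equivalent where your backend provides one. Ollama is the lowest-friction option: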
ollama run qwen3:30b-a3b
Ollama handles quantization and serving automatically. For better raw throughput on Apple Silicon, mlx-lm uses native Metal GPU execution:
pip install mlx-lm
mlx_lm.generate --model mlx-community/Qwen3-30B-A3B-4bit \
--prompt "Draw a pelican riding a bicycle in SVG"
Simon Willison’s own llm CLI integrates with Ollama and other backends, providing a consistent interface for running and comparing prompts across models, which is likely what he used for the comparison. The hardware requirement is the binding constraint: you need at least 24 GB of unified memory on Apple Silicon or equivalent VRAM. A 32 GB MacBook Pro M3 or M4 handles this comfortably; a 16 GB machine will struggle with the 35B variant and is better served by the Qwen3-8B or Qwen3-14B dense models.
The Efficiency Trend Behind This Result
The Qwen3.6-35B-A3B result is a data point in a trend that has been building for two years. In January 2025, DeepSeek released R1 under an MIT license. It matched OpenAI o1 on multiple reasoning benchmarks, and the V3 base model it builds on had a reported training cost of roughly $5.6 million, a fraction of the estimated cost for comparable models from US labs. The efficiency gap was attributed partly to architectural choices and partly to the constraints of operating under US chip export controls, which forced optimization that labs with unlimited compute access had less incentive to pursue.
The Qwen series traces a similar arc. Qwen2.5, released in late 2024, was competitive with Llama 3.1 70B on coding benchmarks while fitting in smaller memory footprints. QwQ-32B, the dedicated chain-of-thought reasoning variant, matched or exceeded several frontier models on math and logic benchmarks. By early 2026, the Qwen family had accumulated over 113,000 derivative fine-tuned models on Hugging Face, more than Meta’s Llama and DeepSeek combined, reflecting how deeply it had been adopted as a foundation for further work. The Apache 2.0 license throughout the Qwen series has been a significant driver of this adoption; commercial use is unrestricted.
The broader dynamic is that hardware scarcity imposed by export controls created an engineering pressure that open-weight Chinese labs responded to with architectural efficiency. MoE was not invented for this purpose: Google published its sparsely-gated Mixture-of-Experts paper in 2017 and later built on the idea in Switch Transformer. The aggressive deployment of MoE at the 30-35B weight class for local inference is a more recent development, and one that is yielding real quality results.
Where Frontier Models Still Lead
The pelican result should be read precisely, not broadly. Frontier models like Opus 4.7 retain meaningful advantages on tasks that depend on long-horizon coherence: agentic workflows involving dozens of sequential tool calls, complex multi-file code refactors with many interdependencies, extended reasoning chains over large context windows, and tasks requiring broad factual grounding across diverse domains.
Claude Opus 4.7, released in April 2026, specifically improved extended thinking efficiency and multi-step tool-call coherence compared to Opus 4.6. Its 200,000-token context window and the improvements to error recovery in long agentic sessions represent capabilities that a locally running 35B model cannot match in the same way. For tasks that require sustained coherent reasoning across very long contexts, the scale difference still shows.
What has changed is the threshold below which local models are the better choice. SVG generation, structured document creation, code explanation, short-horizon reasoning, and many text transformation tasks now sit below that threshold. The pelican test is useful precisely because it sits at a point where you would not have expected a laptop model to win as recently as a year ago.
The practical implication for developers is that the decision of “local vs. cloud” is increasingly task-specific rather than a blanket quality tier. Running Qwen3.6-35B-A3B locally for structured generation tasks that fit its strengths, while routing to a frontier model for extended agentic sessions or tasks requiring deep factual grounding, is a reasonable architecture. The tooling to support this, via Ollama, mlx-lm, or Willison’s llm CLI, is mature enough to make it straightforward.
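One way to picture that architecture is a small routing function, sketched here in Python. The task categories mirror the ones listed in this article. The thresholds, the `Task` fields, and the `"local"` and `"frontier"` labels are hypothetical placeholders you would tune for your own workload, not recommendations.

```python
from dataclasses import dataclass

# Task kinds the article places below the local-model threshold.
LOCAL_TASKS = frozenset({
    "svg_generation",
    "structured_document",
    "code_explanation",
    "short_reasoning",
    "text_transformation",
})


@dataclass(frozen=True)
class Task:
    kind: str
    expected_tool_calls: int = 0  # rough size of the agentic loop, if any
    context_tokens: int = 0       # how much context the task needs at once


def choose_backend(task: Task, max_local_tool_calls: int = 3, max_local_context: int = 32_000) -> str:
    """Return "local" or "frontier" for a task; both thresholds are illustrative guesses."""
    if task.kind not in LOCAL_TASKS:
        return "frontier"
    if task.expected_tool_calls > max_local_tool_calls:
        return "frontier"
    if task.context_tokens > max_local_context:
        return "frontier"
    return "local"
```

For example, `choose_backend(Task("svg_generation"))` stays local, while the same kind of task with fifty expected tool calls is sent to the frontier model. The point is that the decision keys on the task, not on a fixed ranking of models.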
The pelican drawing is a small result. It is also a clear signal that the line between “local” and “frontier” is no longer a fixed quality boundary; it is a moving and increasingly task-dependent threshold, and it is moving faster than most people expected.