There is a specific kind of test that cuts through the noise of aggregate benchmarks: ask a model to draw something. Not describe it, not list its properties, but produce working code that renders a recognizable image. Simon Willison has been using SVG generation as an informal capability probe for some time, and when he asked Qwen3.6-35B-A3B and Claude Opus 4.7 to draw a pelican, the model running locally on his laptop came out ahead.
The result is worth examining as a case study in what mixture-of-experts architecture has done to the relationship between model size and inference cost, rather than as a simple open-weights-versus-proprietary story.
Why SVG Generation Tests More Than Syntax
SVG is an XML-based vector format where shapes are described through coordinate geometry. A circle needs a center point and radius. A path needs a sequence of drawing commands (lines, Bézier curves, arcs) with precise control points and arc parameters. A bird requires grouped elements, transforms that compose correctly, and enough anatomical plausibility that the rendered output reads as a bird.
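A minimal sketch of what that coordinate geometry looks like in practice, built with Python's standard xml.etree.ElementTree. The shapes, coordinates, and colors are illustrative assumptions chosen for this example, not output from either model.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"


def build_bird_svg() -> str:
    """Return a tiny SVG document: a body, a head, and a bill inside one group."""
    ET.register_namespace("", SVG_NS)  # serialize with a default xmlns, no prefix
    svg = ET.Element(f"{{{SVG_NS}}}svg", {"viewBox": "0 0 200 200"})

    # One group carries a transform, so every child shifts together.
    bird = ET.SubElement(svg, f"{{{SVG_NS}}}g", {"transform": "translate(20 30)"})

    # Body: an ellipse needs a center and two radii.
    ET.SubElement(bird, f"{{{SVG_NS}}}ellipse",
                  {"cx": "70", "cy": "90", "rx": "50", "ry": "35",
                   "fill": "white", "stroke": "black"})

    # Head: a circle needs a center point and a radius.
    ET.SubElement(bird, f"{{{SVG_NS}}}circle",
                  {"cx": "120", "cy": "50", "r": "18",
                   "fill": "white", "stroke": "black"})

    # Bill: move to the head, curve out to the tip with a quadratic
    # Bezier (Q), draw a line back (L), and close the shape (Z).
    ET.SubElement(bird, f"{{{SVG_NS}}}path",
                  {"d": "M 135 45 Q 165 50 180 70 L 135 60 Z",
                   "fill": "orange", "stroke": "black"})

    return ET.tostring(svg, encoding="unicode")


svg_markup = build_bird_svg()
```

Every number in the sketch is a commitment: move the bill's starting point by 40 units and it detaches from the head, even though the markup stays perfectly valid.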
When you ask a model to draw a pelican in SVG, you are testing several things at once. The model must know the SVG specification well enough to produce valid markup. It must translate a mental model of pelican anatomy into 2D coordinate space. It must manage a hierarchical document structure where elements relate to each other spatially, with the bill proportioned correctly relative to the body, the legs positioned below the torso, the wings articulated at the right attachment points.
This differs meaningfully from code generation tasks evaluated against unit tests. There is no runtime feedback loop, no linter catching coordinate errors, no test suite that flags a wing placed at the wrong offset. The model commits to specific numeric values in a single generation pass, and those values either compose into something recognizable or they do not. Models diverge sharply on this kind of task in ways that aggregate scores on MMLU or HumanEval do not predict well.
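To make the missing feedback loop concrete, here is a hedged sketch of roughly the strongest automatic check available without rendering: whether the markup parses as XML with an svg root. The function name and the sample documents are hypothetical, written for this illustration.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"


def is_well_formed_svg(markup: str) -> bool:
    """Return True if the markup parses as XML and its root element is <svg>."""
    try:
        root = ET.fromstring(markup)
    except ET.ParseError:
        return False
    return root.tag == f"{{{SVG_NS}}}svg"


# A body with a wing drawn on top of it.
plausible = (
    '<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 100 100">'
    '<ellipse cx="50" cy="50" rx="30" ry="20"/>'
    '<path d="M 40 45 Q 50 30 60 45"/>'
    '</svg>'
)

# Same document, but the wing sits far outside the 0..100 viewBox.
misplaced = plausible.replace("M 40 45 Q 50 30 60 45", "M 940 945 Q 950 930 960 945")

# Syntactically broken: the <circle> element is never closed.
broken = '<svg xmlns="http://www.w3.org/2000/svg"><circle cx="5" cy="5" r="2"></svg>'

results = [is_well_formed_svg(doc) for doc in (plausible, misplaced, broken)]
```

The parser rejects only the third document. The second passes even though its wing would never appear on screen, which is the gap between valid SVG and a recognizable drawing.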
The 35B-A3B Naming and What It Means at Inference
The designation “35B-A3B” follows the total-and-active naming convention that Qwen uses for its mixture-of-experts (MoE) releases: 35 billion total parameters, of which approximately 3 billion are active for any given token. Mistral’s Mixtral models popularized the open-weights MoE approach, though under a different naming scheme (8x7B).
In a dense transformer, every parameter participates in processing every token. A 35B dense model requires roughly 70GB of memory in float16 precision, and inference throughput scales with the full parameter count. That is a server-class workload.
MoE models partition the feedforward layers, which typically account for roughly two-thirds of a transformer’s total parameters, into many independent expert networks. A learned routing function, trained alongside the rest of the model, selects a small subset of these experts for each token, typically two to eight out of dozens or hundreds. The computation for that token only touches the selected experts.
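The routing step described above can be sketched in a few lines of plain Python. This is a toy under stated assumptions, not Qwen's implementation: the router is a single linear layer, the gate weights are a softmax over the selected experts' scores (one common variant), and the names route_top_k and moe_layer are made up for this example.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]
Expert = Callable[[Vector], List[float]]


def softmax(scores: Sequence[float]) -> List[float]:
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(router_logits: Sequence[float], k: int) -> List[Tuple[int, float]]:
    """Pick the k highest-scoring experts; their gate weights sum to 1."""
    order = sorted(range(len(router_logits)),
                   key=lambda i: router_logits[i], reverse=True)
    top = order[:k]
    weights = softmax([router_logits[i] for i in top])
    return list(zip(top, weights))


def moe_layer(token: Vector, router: Sequence[Vector],
              experts: Sequence[Expert], k: int = 2) -> List[float]:
    """Run one token through only the k experts the router selects."""
    # One dot product per expert gives that expert's logit for this token.
    logits = [sum(w * x for w, x in zip(row, token)) for row in router]
    output = [0.0] * len(token)
    for index, weight in route_top_k(logits, k):
        expert_out = experts[index](token)  # unselected experts never run
        output = [acc + weight * val for acc, val in zip(output, expert_out)]
    return output


# Toy usage: four "experts" that just scale their input; two run per token.
experts = [lambda v, s=s: [s * x for x in v] for s in (1.0, 2.0, 3.0, 4.0)]
router = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
mixed = moe_layer([2.0, 1.0], router, experts, k=2)
```

In a real model the experts are feedforward networks and the router weights are learned, but the control flow is the same: score every expert cheaply, then pay the full cost for only a few.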
The practical consequence: per-token compute, and the volume of weights that memory bandwidth has to move per token, scale with active parameters, not total parameters. A 35B-A3B model generates tokens at a speed closer to that of a 3B dense model while encoding the knowledge of 35B parameters across its full expert set. Memory capacity is the exception: all 35B parameters must stay resident, because any expert may be selected for the next token. Each expert specializes in different domains or token patterns through the training process; the routing function learns to dispatch tokens to whichever experts are most relevant.
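A rough back-of-the-envelope comparison of a dense 35B model and a 35B-A3B model, as a sketch. It assumes float16 weights (2 bytes per parameter) and the common rule of thumb of about 2 FLOPs per active parameter per token, which ignores attention over the context. The ModelShape class is hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelShape:
    name: str
    total_params: float   # parameters that must be stored
    active_params: float  # parameters touched for a single token

    def resident_weight_gb(self, bytes_per_param: float = 2.0) -> float:
        """Memory needed to hold every weight (float16 = 2 bytes each)."""
        return self.total_params * bytes_per_param / 1e9

    def weight_gb_read_per_token(self, bytes_per_param: float = 2.0) -> float:
        """Weight bytes streamed from memory to produce one token."""
        return self.active_params * bytes_per_param / 1e9

    def gflops_per_token(self) -> float:
        """Rule of thumb: one multiply and one add per active parameter."""
        return 2.0 * self.active_params / 1e9


dense = ModelShape("dense 35B", total_params=35e9, active_params=35e9)
moe = ModelShape("35B-A3B", total_params=35e9, active_params=3e9)

for model in (dense, moe):
    print(f"{model.name}: "
          f"{model.resident_weight_gb():.0f} GB resident, "
          f"{model.weight_gb_read_per_token():.0f} GB read per token, "
          f"{model.gflops_per_token():.0f} GFLOPs per token")
```

Under these assumptions both models need about 70 GB to hold their float16 weights, but the MoE model reads and computes over roughly 6 GB and 6 GFLOPs per token instead of 70.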
Quantized to 4-bit with a format like GGUF Q4_K_M, the weight file for a 35B-parameter model runs around 20 to 22GB. On a MacBook Pro with 32GB or 36GB unified memory, that leaves headroom for the KV cache at moderate context lengths. Inference speed for a 3B-active model at 4-bit quantization on modern Apple Silicon should be in the range of 20 to 40 tokens per second, which is comfortable for interactive use. llama.cpp runs this well on CPU and, through its Metal backend, on the Apple Silicon GPU; mlx-lm runs on Apple’s MLX framework, which targets the Metal GPU natively; Ollama hides most of the quantization and serving details.
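The file-size figure can be checked with simple arithmetic. The sketch below assumes an effective rate of about 4.85 bits per weight for Q4_K_M, which is an approximation introduced here rather than a number from the article (the format mixes quantization levels, so the average is above 4 bits). Both function names are hypothetical.

```python
def quantized_weight_gb(total_params: float, bits_per_weight: float) -> float:
    """Approximate weight-file size in GB (1 GB = 1e9 bytes)."""
    return total_params * bits_per_weight / 8 / 1e9


def headroom_gb(unified_memory_gb: float, weights_gb: float) -> float:
    """Memory left after loading the weights.

    The KV cache, the operating system, and other applications all
    have to share this remainder.
    """
    return unified_memory_gb - weights_gb


Q4_K_M_BITS = 4.85  # assumed effective bits per weight, approximate

weights = quantized_weight_gb(35e9, Q4_K_M_BITS)
for memory in (32, 36):
    print(f"{memory} GB machine: {weights:.1f} GB of weights, "
          f"{headroom_gb(memory, weights):.1f} GB left over")
```

With that assumed rate the weights come to a little over 21 GB, inside the 20 to 22GB range quoted above, leaving on the order of 11 to 15 GB on the two machine sizes mentioned.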
The architecture is not new. Mixtral 8x7B, released in late 2023, demonstrated that a model with 46.7B total parameters and roughly 13B active could match or outperform LLaMA 2 70B on most benchmarks at roughly the per-token compute of a 13B dense model. DeepSeek’s MoE variants, Qwen’s own earlier MoE models, and a series of refinements from other labs have improved training stability, expert utilization, and the quality of the routing mechanism over the two years since. By the time Qwen3 arrived, the recipe was considerably more mature.
The Qwen Family’s Trajectory
Alibaba’s Qwen series has followed a consistent pattern since the original Qwen-7B in 2023: launch competitive, iterate quickly, and invest disproportionately in coding and multilingual capability. Qwen2-72B was genuinely competitive with LLaMA 3.1-70B across most standard benchmarks. Qwen2.5 added substantial improvements in structured output and code generation, with the separate Qwen2.5-Coder models reaching near-GPT-4 level performance on coding benchmarks. The QwQ reasoning model showed that Alibaba was willing to invest in chain-of-thought infrastructure, not just base capability.
The Qwen3 MoE variants build on this foundation with the additional efficiency benefits of the architecture. But there is a less quantifiable factor worth noting: training data composition.
Alibaba’s training data access differs from Anthropic’s or OpenAI’s. The Chinese web, Chinese technical documentation, and Chinese design and engineering resources represent a different distribution than English-language corpora. Whether Qwen3’s SVG performance specifically reflects training on more SVG examples, more graphics programming documentation, or something else in the data pipeline is not something that can be determined from the outside. But capability differences at specific tasks between models with similar aggregate scores frequently trace back to training data composition rather than architecture alone.
What the Pelican Result Represents
When Willison reports that the local model drew a better pelican, the most likely meaning is that the Qwen3 output produced a more anatomically plausible bird: better proportioned, with cleaner path definitions, more appropriate use of SVG structural features like <g> grouping and transform attributes, and coordinate values that compose into something a viewer recognizes as the intended subject.
This is not a general capability reversal. Claude Opus remains a very capable model for complex reasoning, long-context analysis, nuanced instruction following, and tasks that require extensive world knowledge integrated with careful judgment. The SVG result does not change that.
What it does demonstrate is that the capabilities of frontier API models and capable local MoE models are no longer cleanly separable by tier. There are tasks where the local model wins, and those tasks are not always the ones you would predict from headline benchmarks. SVG generation is one example.
The operational comparison has also shifted. An API call to a frontier model carries per-token cost, network latency, data leaving your environment, and dependency on upstream availability. A local model, once it is running, avoids all four; what it costs instead is hardware you have to own and power. For tasks where quality is comparable or where the local model is stronger, the cost-benefit analysis favors running locally.
The Broader Pattern
The default assumption for serious AI workloads over the past few years has been that frontier API models are unambiguously more capable and local models are compromises you accept for cost or privacy reasons. That framing is increasingly inaccurate.
The efficiency gains from MoE architecture mean that a model with 35B stored parameters but 3B active parameters at inference can deliver quality that tracks much closer to its total parameter count than its active parameter count. Training recipe improvements, better data curation, and improved quantization methods have narrowed the gap between a 4-bit quantized local model and its full-precision equivalent. The tooling for local deployment, through llama.cpp, Ollama, and mlx-lm, has matured to the point where running a large MoE model locally is a straightforward operation for anyone with adequate hardware.
The result is a capability distribution that is less hierarchical than it was two years ago. Frontier models lead on some tasks. Local models lead on others. The productive approach is task-by-task evaluation rather than defaulting to the highest-tier available option.
A pelican drawn well on a laptop is a small observation. But the architecture and training trajectory behind that pelican are worth understanding, because they describe where the efficiency frontier is, and the direction it has been moving.