
35 Billion Parameters, 3 Billion Active: The Architecture Behind a Local Model Beating Frontier Output

Source: simonwillison

Simon Willison published something worth paying attention to this week: he ran Qwen3-30B-A3B on his laptop and it produced a better SVG pelican than Claude Opus 4.7. That headline sounds like model benchmarking discourse, but the technical story underneath it is more interesting than a leaderboard ranking.

The key is in the model name itself. That A3B suffix means 3 billion active parameters per forward pass. The full weight count is 30-35 billion parameters, but during inference the model only routes each token through a 3-billion-parameter slice of the network. This is the Mixture-of-Experts architecture, and at this ratio it changes the economics of local inference completely.

What Mixture-of-Experts Actually Does

A dense transformer model, like the original Llama or GPT-2 style architectures, passes every token through every parameter in the network on every forward pass. If you have a 30B dense model, you’re computing through all 30 billion weights for each token generated. Memory bandwidth is the bottleneck on consumer hardware, so a 30B dense model is limited by how fast you can stream 30 billion parameters from memory to the compute units.

A MoE model splits the feed-forward layers into multiple “expert” sub-networks and uses a learned router to select a small subset of those experts for each token. In Qwen3-30B-A3B, there are enough experts to total 30-35 billion parameters, but only 3 billion of them activate per token. The rest sit in memory, loaded but unused during any given forward pass.
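The routing step described above can be sketched in a few lines of plain Python. This is a toy illustration, not Qwen's implementation: the expert count, the top-k value, and the scalar "experts" are all hypothetical, and in a real model the router logits come from a learned linear layer over the token's hidden state rather than from a random generator.

```python
import math
import random

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route_token(router_logits, top_k):
    """Return [(expert_index, weight)] for the top_k highest-scoring experts."""
    probs = softmax(router_logits)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    norm = sum(probs[i] for i in chosen)  # renormalise over the chosen experts
    return [(i, probs[i] / norm) for i in chosen]

def moe_layer(token, experts, router_logits, top_k):
    """Weighted sum of the selected experts' outputs; the other experts never run."""
    out = 0.0
    for idx, weight in route_token(router_logits, top_k):
        out += weight * experts[idx](token)
    return out

# Toy setup (hypothetical sizes): 8 scalar "experts", 2 active per token.
NUM_EXPERTS, TOP_K = 8, 2
experts = [lambda x, scale=i + 1: scale * x for i in range(NUM_EXPERTS)]
rng = random.Random(0)
router_logits = [rng.uniform(-1, 1) for _ in range(NUM_EXPERTS)]  # stand-in for a learned router

selected = route_token(router_logits, TOP_K)
output = moe_layer(1.0, experts, router_logits, TOP_K)
```

Only the experts returned by `route_token` are ever called, which is the whole point: the remaining experts cost memory but no compute for that token.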

The practical effect: the weights read per token, and with them the memory bandwidth needed, drop to roughly those of a 3B dense model, while model quality benefits from having been trained with 30-35 billion parameters’ worth of capacity. Memory capacity does not shrink, since every expert still has to be resident, but you get a much larger network at close to the per-token inference cost of a small one.
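A rough way to see the bandwidth argument is to divide memory bandwidth by the bytes of weights read per token. The sketch below does that arithmetic. The 400 GB/s bandwidth figure and the 4.5 bits per weight are assumptions chosen for illustration, and the result is a ceiling: it ignores attention weights shared across experts, the KV cache, and compute time, so measured speeds will be lower.

```python
def tokens_per_second_upper_bound(active_params, bits_per_weight, bandwidth_gb_s):
    """Ceiling on decode speed if streaming the active weights were the only cost."""
    bytes_per_token = active_params * bits_per_weight / 8
    return bandwidth_gb_s * 1e9 / bytes_per_token

BANDWIDTH_GB_S = 400    # assumed memory bandwidth, illustrative only
BITS_PER_WEIGHT = 4.5   # assumed effective size of a 4-bit quantization

dense_30b = tokens_per_second_upper_bound(30e9, BITS_PER_WEIGHT, BANDWIDTH_GB_S)
moe_3b_active = tokens_per_second_upper_bound(3e9, BITS_PER_WEIGHT, BANDWIDTH_GB_S)

print(f"dense 30B ceiling:      {dense_30b:.1f} tokens/s")
print(f"3B-active MoE ceiling:  {moe_3b_active:.1f} tokens/s")
print(f"ratio:                  {moe_3b_active / dense_30b:.1f}x")
```

Under these assumptions the ceilings differ by a factor of ten, which is simply the ratio of weights read per token. Real-world gaps are smaller because the costs this model leaves out do not shrink with the active-parameter count.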

Qwen3 is not the first to use this design. DeepSeek-R1 uses a 671B total / 37B active configuration, which is why its per-token compute is closer to that of a 37B model than its nominally enormous parameter count suggests, although all 671B weights still have to be held in memory. Mixtral 8x7B, released in late 2023, was the model that brought wide attention to MoE at the open-weights level: 46.7B total parameters, 12.9B active, running at usable speeds on hardware that could not comfortably run a dense 47B model. Qwen3-30B-A3B takes a far more aggressive ratio than Mixtral, cutting the active count to about 10% of total against Mixtral’s roughly 28%, though DeepSeek-R1 is sparser still at about 5.5%.
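The ratios behind these three models are easy to compute from the figures quoted above. The sketch below takes the Qwen3-30B-A3B numbers at face value from the model name (30B total, 3B active), which is an assumption, since the name rounds both.

```python
# (total, active) parameter counts in billions, as quoted in the text.
MODELS = {
    "DeepSeek-R1": (671.0, 37.0),
    "Mixtral 8x7B": (46.7, 12.9),
    "Qwen3-30B-A3B": (30.0, 3.0),  # assumed from the model name
}

def active_fraction(total_b, active_b):
    """Share of the weights that participate in a single token's forward pass."""
    return active_b / total_b

fractions = {name: active_fraction(*sizes) for name, sizes in MODELS.items()}

for name, frac in fractions.items():
    print(f"{name}: {frac:.1%} of weights active per token")
```

On these figures DeepSeek-R1 activates about 5.5% of its weights per token, Mixtral 8x7B about 27.6%, and Qwen3-30B-A3B 10%.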

The Pelican Test

Willison’s SVG pelican benchmark is a deliberately awkward task. The prompt asks a model to draw a pelican riding a bicycle, as SVG, entirely from scratch. No tools, no image generation, just raw SVG markup produced by the model.

The task is designed to be uncomfortable for models in specific ways. A pelican on a bicycle requires the model to reason about the relative geometry of two things: a bird with a beak, wings, body, and legs, and a bicycle with handlebars, a frame, pedals, wheels, and a seat. The pelican’s legs must connect to the pedals in a physically plausible location. The beak must be at the front of the bird. The bicycle wheels must be on the ground if the image has a ground plane. None of this is explicitly instructed; the model has to infer what a plausible arrangement looks like and then express it in valid SVG coordinate space.

This is harder than it sounds. SVG is a coordinate-based format where small errors in geometry produce visibly broken results. A model that hallucinates attribute names, forgets to close path elements, or places the pelican inside one of the wheels will produce either invalid SVG or something visually nonsensical. The test cannot be gamed by retrieving a memorized answer, because the specific combination of animal and vehicle is chosen to be unlikely to appear verbatim in training data.
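The first failure mode, markup that is not even valid, can be caught mechanically. The sketch below uses Python's standard XML parser to check that a model's output is well-formed and rooted in an `<svg>` element. It is only a hypothetical pre-check, not part of Willison's process, and it says nothing about whether the geometry makes sense; that still takes a human looking at the rendered image.

```python
import xml.etree.ElementTree as ET

SVG_NS = "{http://www.w3.org/2000/svg}"

def check_svg(markup):
    """Return (ok, detail). Checks XML well-formedness and the root tag only."""
    try:
        root = ET.fromstring(markup)
    except ET.ParseError as err:
        return False, f"not well-formed XML: {err}"
    if root.tag not in (SVG_NS + "svg", "svg"):
        return False, f"root element is {root.tag!r}, expected <svg>"
    descendants = sum(1 for _ in root.iter()) - 1  # iter() includes the root itself
    return True, f"{descendants} descendant elements"

# Two wheels, properly closed.
good = (
    '<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 100 100">'
    '<circle cx="30" cy="80" r="15"/><circle cx="70" cy="80" r="15"/></svg>'
)
# A <path> that is never closed.
bad = '<svg xmlns="http://www.w3.org/2000/svg"><path d="M0 0 L10 10"></svg>'

print(check_svg(good))
print(check_svg(bad))
```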

Willison has run this test across many models, publishing the rendered outputs. The differences are visible and immediate. Some models produce a rough but recognizable pelican on a vaguely bicycle-like structure. Others produce a smear of misaligned circles and lines. The test gives a qualitative signal that’s faster to interpret than reading through numeric benchmark tables.

The fact that Qwen3-30B-A3B produced a better result than Claude Opus 4.7 on this test is notable for what it implies about spatial reasoning and SVG fluency, not just raw capability on saturated benchmarks.

Running It on Consumer Hardware

Apple Silicon’s unified memory architecture is what makes a model like this practical on a laptop. The M-series chips share a single memory pool between CPU and GPU, and that pool has extremely high bandwidth compared to the DDR5 in a typical x86 laptop. An M3 Max with 128GB of unified memory can load the full 30-35B parameters of Qwen3-30B-A3B in Q4 quantization, which requires roughly 20-24GB, and serve inference at speeds that would not have been possible on consumer hardware two years ago.
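The memory figure follows from simple arithmetic: parameters times bits per weight, divided by eight. The sketch below computes the weights-only footprint for both ends of the 30-35B range. The 4.5 bits per weight is an assumption standing in for a typical 4-bit quantization with its scaling metadata. The KV cache and runtime buffers come on top, which is consistent with the 20-24GB quoted above being somewhat higher than the raw weight size.

```python
def weight_footprint_gb(total_params, bits_per_weight):
    """Size of the quantized weights alone, in gigabytes (10^9 bytes)."""
    return total_params * bits_per_weight / 8 / 1e9

BITS_PER_WEIGHT = 4.5  # assumed effective bits per weight for a Q4-style quantization

footprints = {total: weight_footprint_gb(total, BITS_PER_WEIGHT) for total in (30e9, 35e9)}

for total, gb in footprints.items():
    print(f"{total / 1e9:.0f}B parameters -> about {gb:.1f} GB of weights")
```

Note that this calculation uses the total parameter count, not the active one. MoE reduces the work per token, but every expert still has to fit in memory.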

Because only 3B parameters activate per token, the inference speed is much closer to what you’d expect from a 3B dense model than a 30B one. Users running Qwen3-30B-A3B via MLX on Apple Silicon have reported 40-60 tokens per second, which is fast enough for interactive use without the usual latency of larger open-weight models.

For comparison, a dense model such as Llama-3.3-70B (at Q4) runs at roughly 15-20 tokens per second on the same hardware, and every one of its weights is read for each token. Against the 40-60 tokens per second reported above, the MoE design delivers roughly two to three times the generation speed, which in practice is the difference between a model that feels sluggish and one that feels responsive.

llama.cpp runs GGUF-quantized versions of the model on Windows, Linux, and macOS, and Ollama makes installation straightforward. The main practical barrier to running Qwen3-30B-A3B is now having hardware with enough unified memory or VRAM to hold the quantized weights.
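Once a model is pulled, Ollama serves it over a local HTTP API, so a script can request the SVG directly. The sketch below uses only the standard library. The model tag `qwen3:30b-a3b` is an assumption, so check the tag your Ollama install actually lists. The script also assumes Ollama is running on its default port.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint
MODEL_TAG = "qwen3:30b-a3b"  # assumed tag; substitute whatever `ollama list` shows

def build_request(prompt, model=MODEL_TAG):
    """Build a non-streaming generate request for the local Ollama server."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def generate(prompt, model=MODEL_TAG):
    """Send the prompt and return the model's full text response."""
    with urllib.request.urlopen(build_request(prompt, model)) as resp:
        return json.loads(resp.read())["response"]

# Example call (requires a running Ollama server with the model pulled):
#   svg = generate("Generate an SVG of a pelican riding a bicycle")
```

Setting `"stream": False` makes the server return one JSON object instead of a stream of partial tokens, which keeps the client code short at the cost of waiting for the whole response.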

The Efficiency Gap Is the Story

Cloud frontier models like Claude Opus operate at an entirely different scale. Opus-class models are estimated to be in the trillion-parameter range, running on Anthropic’s server infrastructure with hardware and energy costs far beyond what any laptop handles. The fact that a 30B total / 3B active local model can match or exceed Opus on specific benchmarks does not mean the local model is better on average. It means the efficiency frontier has moved.

Qwen3-30B-A3B is not a general Opus replacement. On long reasoning chains, complex multi-step coding tasks, and tasks requiring broad world knowledge, a trillion-parameter model retains advantages that a 30B-total model cannot fully replicate. But on specific qualitative tasks, including the kind of creative-and-structural task that SVG generation represents, the gap has closed to the point where the local model can win on a given example.

The Qwen family’s trajectory makes this more pointed. As of early 2026, Qwen models account for over 113,000 derivative fine-tuned models on Hugging Face, more than any other open-weight family. The ecosystem around Qwen is large, which means quantization tooling, fine-tuned variants, and community benchmarks are abundant. Alibaba’s continued investment in releasing competitive open weights has compressed the gap between what you can run locally and what requires a cloud API in ways that were not predictable even a year ago.

Qwen3 also ships with a dual-mode design: a thinking mode that performs extended chain-of-thought reasoning before responding, and a non-thinking mode for fast direct responses. The thinking mode follows the approach of QwQ-32B and the DeepSeek reasoning models. It is where Qwen3 performs best on logic and math tasks; the non-thinking mode is what you’d use for SVG generation or interactive conversation where latency matters.

What This Means for Using These Tools

For anyone who runs LLMs locally, the Qwen3-30B-A3B result should shift the mental model of what local inference is good for. It’s no longer the case that you run local models only when privacy, cost, or latency demand it and accept worse output quality as the price. For a meaningful subset of tasks, the local model is the better tool.

SVG generation is one example. Willison’s pelican test is a proxy for a broader class of task: structured creative output that requires geometric or spatial reasoning. Logo generation, diagram sketching, and simple UI layout all fall into it. These are tasks where the model’s spatial intuition and SVG fluency matter more than the breadth of knowledge a trillion-parameter model brings.

The active parameter architecture is what enables this situation. The design decision to activate only 3 billion parameters during inference, while maintaining a 30-35 billion parameter capacity during training, is the engineering choice that makes a laptop-scale deployment feasible. Understanding that distinction changes how you reason about which models are practical for which deployment contexts.

The benchmark that matters is the one that matches your actual use case. Willison’s pelican test is not a comprehensive evaluation. It is a concrete, visual, reproducible task that reveals something specific about model capability. The fact that an open-weight model running locally can now clear that bar against a frontier closed model is a useful data point, whatever you conclude from it.
