
Why the A3B in Qwen3.6-35B-A3B Matters More Than the 35B

Source: hackernews

Simon Willison has a standing benchmark: ask a language model to draw a pelican in SVG. The prompt is simple; the results differ widely across models. His latest run put Qwen3.6-35B-A3B ahead of Claude Opus 4.7. The part that warrants attention is that Qwen ran locally, on his laptop.

That result is interesting not because one model beat another on a quirky test, but because of what it implies about how far architecture has narrowed the gap between local and frontier models.

Why Pelicans

SVG generation is a reliable qualitative differentiator for language models because it tests spatial reasoning in a domain where the model receives no visual feedback. HTML has forgiving rendering semantics; a typo in a class name degrades gracefully. SVG path commands do not forgive. The M, L, C, and A commands require precise coordinate arithmetic, and an error in a Bezier control point produces something that looks nothing like the intended shape. A model generating SVG has to maintain a coherent spatial representation in latent space and project it onto a coordinate system entirely in text.
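To make the sensitivity of path commands concrete, here is a minimal sketch that evaluates one cubic Bezier segment (the SVG C command) at its midpoint, once as intended and once with a single control-point coordinate mistyped. The path and its coordinates are invented for illustration and do not come from any model's output.

```python
def cubic_bezier(p0, p1, p2, p3, t):
    """Evaluate a cubic Bezier curve at parameter t in [0, 1]."""
    u = 1.0 - t
    b0, b1, b2, b3 = u**3, 3 * u**2 * t, 3 * u * t**2, t**3
    x = b0 * p0[0] + b1 * p1[0] + b2 * p2[0] + b3 * p3[0]
    y = b0 * p0[1] + b1 * p1[1] + b2 * p2[1] + b3 * p3[1]
    return (x, y)

# Hypothetical path: M 10 80 C 40 10, 65 10, 95 80 (an arch shape)
intended = cubic_bezier((10, 80), (40, 10), (65, 10), (95, 80), 0.5)

# Same path, but the first control point's y is written as 100 instead of 10
mistyped = cubic_bezier((10, 80), (40, 100), (65, 10), (95, 80), 0.5)

# Vertical displacement of the curve midpoint caused by one wrong number
midpoint_shift = mistyped[1] - intended[1]
print(intended, mistyped, midpoint_shift)
```

One wrong digit moves the middle of the curve by roughly a third of the drawing's height, and nothing in the text stream signals that anything went wrong.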

Pelicans specifically are non-trivial: a long beak with a distinctive pouch, a large body with particular proportions, recognizable enough that a bad result is immediately obvious. A model that draws a pelican that looks like a pelican has demonstrated that its geometry representations transfer to novel output domains. That is not guaranteed for most models, which is why the test has circulated as an informal benchmark since at least 2023.

When Qwen3.6-35B-A3B produces a more recognizable pelican than Claude Opus 4.7, it does not establish that Qwen is categorically better. It establishes that on this particular spatial-geometric task, a laptop-runnable model is competitive with a frontier API. That is already a meaningful finding.

What the Model Name Tells You

The naming convention encodes the architecture. Qwen3.6-35B-A3B has 35 billion total parameters and approximately 3 billion active parameters per token. This is a Mixture of Experts model, a transformer variant where the feed-forward layers in each block are replaced by a set of specialized sub-networks, called experts, with a learned router that selects a small subset to activate for each token.

In a standard dense transformer, every parameter participates in every forward pass. Llama 3.3 70B uses all 70 billion parameters for every token it generates. In a MoE model, the router takes each token’s hidden state, scores all available experts, and selects the top K. The rest remain dormant. For Qwen3.6-35B-A3B, that means approximately 3 billion parameters are doing computational work per token regardless of the 35 billion sitting in memory.

This has a direct and concrete consequence for inference performance. The number of floating-point operations required to generate a token scales with active parameters, not total parameters. A 35B-A3B model generates tokens at roughly the computational cost of a dense 3B model. On an M-series MacBook with 64 or 128GB unified memory, you load the full 35 billion parameters into RAM, quantized to 4-bit for around 20-22GB, but your token throughput reflects 3 billion active parameters. On an M3 Max, that puts inference somewhere around 40-70 tokens per second depending on quantization and context length, which is fast enough to feel interactive for most tasks.
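The arithmetic behind those figures can be sketched in a few lines. The 2-FLOPs-per-parameter rule of thumb and the 4.8 effective bits per weight for a Q4_K_M-style quantization are assumptions used for illustration, not measured values for this model.

```python
TOTAL_PARAMS = 35e9   # parameters that must sit in memory
ACTIVE_PARAMS = 3e9   # parameters that do work for each token

def weight_memory_gb(total_params, bits_per_weight):
    """Approximate size of the weights alone, in gigabytes."""
    return total_params * bits_per_weight / 8 / 1e9

def forward_flops_per_token(active_params):
    """Rule of thumb: one multiply and one add per active parameter."""
    return 2 * active_params

pure_4bit_gb = weight_memory_gb(TOTAL_PARAMS, 4.0)
q4_like_gb = weight_memory_gb(TOTAL_PARAMS, 4.8)  # assumed effective bits per weight

# How much more compute a dense 35B model would need per token
dense_to_moe_ratio = (
    forward_flops_per_token(TOTAL_PARAMS) / forward_flops_per_token(ACTIVE_PARAMS)
)

print(f"4.0 bits/weight: {pure_4bit_gb:.1f} GB")
print(f"4.8 bits/weight: {q4_like_gb:.1f} GB")
print(f"dense 35B vs 3B active, compute per token: {dense_to_moe_ratio:.1f}x")
```

Memory is set by the total parameter count, compute per token by the active count. The two numbers in the model name map directly onto those two budgets.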

The Routing Mechanism

The MoE router is a learned linear layer that maps each token’s hidden representation to a score distribution over N experts. The model selects the top-K scoring experts and computes a weighted combination of their outputs, with the weights derived from the softmax of the router scores over that top-K subset.
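The routing step described above can be sketched in pure Python. This is a toy version under stated assumptions: the router weights, the four scaling "experts", and the names route and moe_forward are all invented for illustration and do not reflect Qwen's actual implementation.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def route(hidden, router_weights, k):
    """Score every expert, keep the top k, softmax over just those k scores."""
    scores = [sum(w * h for w, h in zip(row, hidden)) for row in router_weights]
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    gates = softmax([scores[i] for i in top])
    return list(zip(top, gates))

def moe_forward(hidden, router_weights, experts, k):
    """Weighted sum of the selected experts' outputs; the rest never run."""
    out = [0.0] * len(hidden)
    for idx, gate in route(hidden, router_weights, k):
        expert_out = experts[idx](hidden)
        out = [o + gate * v for o, v in zip(out, expert_out)]
    return out

# Toy setup: 4 experts that each scale the input by a different factor
experts = [lambda h, s=s: [s * v for v in h] for s in (1.0, 2.0, 3.0, 4.0)]
router_weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.5, 0.5]]
hidden = [1.0, 2.0]

selected = route(hidden, router_weights, k=2)
output = moe_forward(hidden, router_weights, experts, k=2)
print(selected, output)
```

With these toy weights the router scores the experts 1.0, 2.0, -1.0, and 1.5, so experts 1 and 3 are selected and experts 0 and 2 contribute no computation for this token.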

The central challenge during training is expert collapse: without constraints, the router tends to route most tokens to the same few high-performing experts, and the rest become dead weight. Modern MoE implementations address this with an auxiliary load-balancing loss that penalizes uneven expert utilization. Some architectures also include a small set of always-active shared experts that provide consistent baseline computation independent of routing.
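One common formulation of that auxiliary loss, popularized by the Switch Transformer, multiplies the fraction of tokens sent to each expert by the mean router probability for that expert and sums over experts. The sketch below assumes top-1 routing for simplicity. Whether Qwen uses this exact form is not stated in the source, so treat it as an illustration of the idea only.

```python
def load_balance_loss(router_probs, assignments, num_experts):
    """Switch-style auxiliary loss: num_experts * sum_i(f_i * p_i).

    router_probs: per-token probability distribution over experts.
    assignments:  per-token index of the expert actually chosen (top-1).
    """
    n_tokens = len(assignments)
    # f_i: fraction of tokens dispatched to expert i
    f = [0.0] * num_experts
    for a in assignments:
        f[a] += 1.0 / n_tokens
    # p_i: mean router probability assigned to expert i
    p = [sum(tok[i] for tok in router_probs) / n_tokens for i in range(num_experts)]
    return num_experts * sum(fi * pi for fi, pi in zip(f, p))

# Balanced routing: each of 4 tokens goes to a different expert
balanced = load_balance_loss([[0.25] * 4] * 4, [0, 1, 2, 3], num_experts=4)

# Collapsed routing: every token goes to expert 0 with high confidence
collapsed = load_balance_loss([[0.97, 0.01, 0.01, 0.01]] * 4, [0, 0, 0, 0], num_experts=4)

print(balanced, collapsed)
```

The loss sits at 1.0 when utilization is uniform and grows toward the number of experts as routing collapses, which gives the optimizer a gradient that pushes traffic back toward idle experts.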

Qwen’s MoE lineage, including Qwen2-57B-A14B, uses shared experts alongside routed experts. DeepSeek-V3 uses the same pattern with 256 routed experts and 1 shared expert per MoE layer across 671B total / 37B active parameters. The shared expert design ensures the model can rely on consistent behavior for common patterns while routed experts specialize for specific token types, domains, or syntactic constructs.
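As a rough sketch of how a shared expert combines with routed ones: the shared expert's output is always included, and the gated outputs of the selected routed experts are added on top. The function name, the gate values, and the toy experts here are hypothetical, and real implementations differ in details such as gating or scaling of the shared path.

```python
def shared_plus_routed(hidden, shared_expert, routed_experts, gates):
    """gates maps expert index -> weight for the experts the router picked."""
    out = shared_expert(hidden)  # always runs, no routing decision involved
    for idx, gate in gates.items():
        expert_out = routed_experts[idx](hidden)
        out = [o + gate * v for o, v in zip(out, expert_out)]
    return out

# Toy experts: the shared expert is the identity, routed experts scale the input
shared = lambda h: list(h)
routed = [lambda h, s=s: [s * v for v in h] for s in (10.0, 20.0, 30.0)]

# Assume the router picked experts 0 and 2 with these weights
combined = shared_plus_routed([1.0, -1.0], shared, routed, {0: 0.75, 2: 0.25})
print(combined)
```

Expert 1 is never evaluated for this token, while the shared expert contributes regardless of what the router decides.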

The practical effect of well-balanced MoE training is that knowledge distributed across 35 billion parameters becomes accessible at the inference cost of 3 billion. This is why a well-trained MoE at 35B-A3B can outperform a dense model at comparable active-parameter counts on specific tasks: the total parameter budget enables richer specialization, and the router selects the right experts for each token.

Running It

The model is available through Ollama and as GGUF weights for llama.cpp. For Apple Silicon, the mlx-lm library provides fast inference on top of MLX, which is designed around Apple Silicon’s unified memory. Because the CPU and GPU share one memory pool, the model avoids the VRAM bottleneck that makes large models awkward on discrete GPU setups.

With Ollama:

ollama pull qwen3.6:35b-a3b
ollama run qwen3.6:35b-a3b

With llama.cpp, Q4_K_M quantization gives the best size-to-quality tradeoff at around 20-22GB. Q5_K_M at roughly 25GB is worth it if you have the headroom. For MLX on an M-series Mac:

pip install mlx-lm
python -m mlx_lm.generate \
  --model mlx-community/Qwen3.6-35B-A3B-4bit \
  --prompt "Draw a pelican in SVG"

(Exact model tags vary by community upload; check Hugging Face for the current mlx-community quantizations.)

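The memory requirement is the main hardware constraint: you need around 22-26GB available to load the quantized weights, plus headroom for your context window. On an M3 Max or M4 Max with 128GB unified memory, this is trivial. On a machine with 32GB, it fits, but the operating system and a long context window leave little room to spare. On a 16GB laptop, the 4-bit weights alone exceed physical memory, so the model is not practical without much more aggressive quantization or offloading.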
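A back-of-the-envelope fit check makes the three cases easy to compare. The 4GB context headroom and 4GB operating-system reserve are placeholder assumptions. Actual overhead depends on context length, KV-cache precision, and whatever else is running.

```python
def memory_margin_gb(total_ram_gb, weights_gb, context_gb=4.0, os_reserve_gb=4.0):
    """Memory left over after weights, context headroom, and an OS reserve.

    A negative result means the model does not fit under these assumptions.
    """
    return total_ram_gb - (weights_gb + context_gb + os_reserve_gb)

WEIGHTS_GB = 22.0  # low end of the 22-26GB range quoted above

margins = {ram: memory_margin_gb(ram, WEIGHTS_GB) for ram in (16, 32, 128)}
for ram, margin in margins.items():
    status = "fits" if margin >= 0 else "does not fit"
    print(f"{ram:>3} GB machine: {status}, margin {margin:+.0f} GB")
```

Swapping in 26.0 for the Q5-class weights pushes the 32GB machine negative under the same assumptions, so the smaller quantization is the safer choice on mid-range hardware.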

The Broader Pattern

Qwen3.6-35B-A3B beating Opus 4.7 on this particular test is one data point in a pattern that has been building for two years. The gap between the best local models and frontier cloud APIs has been narrowing, and MoE architectures are a significant part of the mechanism. They decouple parameter count from inference cost, which means a well-trained MoE can pack the representational capacity of a large model into the operational budget of a small one.

DeepSeek-V3 demonstrated this at scale: 671B total parameters, 37B active, benchmark performance approaching GPT-4 class, inference cost of a 37B dense model. Qwen’s MoE series extends the same principle to progressively smaller active-parameter counts, and 35B-A3B sits in a tier where laptop-scale hardware is genuinely viable.

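For developers deciding where to run inference, the active parameter count is the right number to reason about for speed, while the total still sets the memory footprint. A dense 35B model and Qwen3.6-35B-A3B share a headline parameter count and need about the same memory at a given quantization, but they have completely different runtime profiles. The dense model does more than ten times the compute per token, so it needs a 48GB GPU or a significant quantization compromise to feel interactive. Qwen3.6-35B-A3B runs fast on a MacBook.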

Willison’s pelican is a small qualitative signal, but it points in the same direction that the quantitative benchmarks have been pointing for the past eighteen months: the capability threshold at which local inference becomes a reasonable choice, rather than a compromise, keeps moving downward.
