Three Billion Active Parameters, Frontier-Class Results: The Design Choices Inside Qwen3.6-35B-A3B
Source: hackernews
The Qwen3.6-35B-A3B release landed on Hacker News with over 1000 points this week, and the benchmark table is what caught most people’s attention. A 73.4% score on SWE-Bench Verified and 92.7% on AIME 2026 from a model with 3 billion active parameters. Those numbers are worth unpacking, but so is the architecture that produces them.
What 35B-A3B Actually Means
The naming convention follows what has become standard shorthand for mixture-of-experts models: 35 billion total parameters, 3 billion activated per token. At inference time, a routing mechanism picks 8 of 256 available expert networks plus one shared expert that always activates, giving you 9 active expert paths per token.
This creates a real asymmetry between memory and compute. You need to hold the full 35 billion parameters in VRAM or system memory, because any expert might be called at any point. In fp16, that is around 70GB. Quantized to 4-bit, you get to roughly 17-18GB, which fits on a single RTX 4090 or comfortably in unified memory on Apple Silicon. But the forward pass computation is equivalent to running a dense 3B model. Token throughput is fast; the cost is memory bandwidth, not FLOPs.
The expert design itself is unusually fine-grained. Each expert has an intermediate dimension of only 512. A typical dense transformer FFN at this hidden dimension (2048) would use an intermediate dimension of 8192. Qwen instead spreads that capacity across 256 tiny specialists, activating a set of 8 each time. The shared expert with intermediate dim 512 always contributes. Total effective intermediate capacity per token is 9 × 512 = 4608, which is modest, but the model appears to compensate through the sheer diversity of specialist combinations available: 256 choose 8 is a very large routing space.
The Hybrid Attention Architecture
The parameter count story is familiar territory for MoE models. What is less familiar is the layer architecture, and this is where Qwen3.6 makes a more unusual bet.
The 40 layers follow a repeating block pattern:
10 × [
Gated DeltaNet → MoE
Gated DeltaNet → MoE
Gated DeltaNet → MoE
Gated Attention → MoE
]
So 30 of 40 attention layers use Gated DeltaNet, and only 10 use standard Gated Attention. The model is predominantly a linear attention architecture with periodic full attention checkpoints every fourth layer.
DeltaNet is a linear attention variant based on the delta rule, a form of memory update derived from associative memory research. The key property is that it processes sequences in linear time with respect to sequence length, avoiding the quadratic cost of standard self-attention. The gating mechanism adds learned input-dependent suppression of memory updates, similar to what gating does in GRUs or the Gated Linear Attention family of models.
The 3:1 ratio of linear to full attention layers reflects a specific hypothesis: most of what a language model does does not require quadratic attention. Smooth pattern continuation, syntactic structure tracking, general next-token prediction, these are tasks linear attention handles adequately. Full attention is expensive and most useful for precise in-context retrieval, accurate copying, and tasks requiring sharp positional discrimination. Placing it every fourth layer is a bet that you only need the expensive mechanism periodically.
This hybrid approach has precedent. Models like Jamba and Zamba interleave Mamba SSM layers with full attention layers. Qwen3.6 does something similar but using Gated DeltaNet rather than Mamba. The architectural difference matters: DeltaNet operates in attention-head space with QK/V projections rather than using separate convolutional state updates, which may allow better weight sharing and initialization strategies from pretrained full-attention models.
The GatedDeltaNet layers use 32 V-heads and 16 QK-heads with head dimension 128. The Gated Attention layers use 16 Q-heads and 2 KV-heads with head dimension 256, the latter being grouped-query attention. These are not symmetric configurations, which suggests the two attention types are not playing interchangeable roles in the residual stream.
What the Benchmarks Actually Show
The SWE-bench Verified score of 73.4% is strong, but the full comparison table reveals something more interesting when you look across benchmarks:
| Model | SWE-bench Verified | Terminal-Bench 2.0 |
|---|---|---|
| Qwen3.5-27B (dense) | 75.0 | 41.6 |
| Qwen3.5-35B-A3B (prior MoE) | 70.0 | 40.5 |
| Gemma4-31B | 52.0 | 42.9 |
| Qwen3.6-35B-A3B | 73.4 | 51.5 |
Qwen3.6 does not top SWE-bench in this table; the denser Qwen3.5-27B does. What Qwen3.6 does is jump Terminal-Bench 2.0 from 40.5% to 51.5%, a 27% relative improvement over the previous MoE generation. Terminal-Bench evaluates multi-step terminal tasks: writing scripts, navigating filesystems, chaining commands, debugging shell errors. It is a more honest proxy for what an agentic coding assistant actually does in practice than SWE-bench, which tests isolated PR-level patch generation.
The model also scores 49.5% on SWE-bench Pro, a harder evaluation that adds tests around multi-file coordination and ambiguous specifications. For an open-weight model, this is competitive.
The Agentic API: preserve_thinking
The most practically interesting addition for anyone building on this model is the preserve_thinking parameter in the API:
response = client.chat.completions.create(
model="Qwen/Qwen3.6-35B-A3B",
messages=messages,
extra_body={
"chat_template_kwargs": {"preserve_thinking": True},
},
)
In a standard multi-turn agentic loop, when the model produces a reasoning trace inside <think> tags before its final response, that trace is stripped before being stored in the conversation history. The next turn sees only the model’s output, not its reasoning process.
preserve_thinking=True retains those traces in context across turns. For a coding agent running tool calls over multiple steps, this means the model’s current reasoning turn can reference what it was weighing in earlier turns. It can notice when a plan it formed three tool calls ago no longer applies to what it has since learned from tool outputs.
This addresses a real failure mode in long agent loops: models that appear to forget their own reasoning and reverse course on decisions they already made. Whether the performance gain justifies the extra tokens in context depends on the task, but having the option per-request is the right design.
The model also supports toggling thinking mode off entirely for latency-sensitive tasks:
extra_body={
"chat_template_kwargs": {"enable_thinking": False},
}
Sampling parameters differ between modes. Thinking mode for coding uses temperature=0.6, top_p=0.95, presence_penalty=0.0. Non-thinking instruct mode uses temperature=0.7, top_p=0.8, presence_penalty=1.5. The higher presence penalty in instruct mode discourages repetition without a reasoning chain to self-correct.
The Context Window
Native context is 262,144 tokens. Extended to 1,010,000 tokens with YaRN scaling, a technique that adjusts rotary position embeddings to generalize beyond training length. For agentic coding work, 262K is already substantial: you can fit a moderately large codebase, a full tool call trace, and substantial intermediate reasoning in a single context.
The multimodal capability is less prominent in the release framing but present: the model processes images and video alongside text, which matters for tasks involving UI screenshots, diagrams, or visual debugging context.
Running It
The recommended serving setup uses SGLang with tensor parallelism across 8 GPUs:
python -m sglang.launch_server \
--model-path Qwen/Qwen3.6-35B-A3B \
--port 8000 \
--tp-size 8 \
--context-length 262144 \
--reasoning-parser qwen3
vLLM works as well with equivalent flags. Both serve an OpenAI-compatible API, so any existing code using the OpenAI Python SDK connects without changes.
For single-node deployment, a 4-bit quantized version at roughly 17-18GB fits in 24GB VRAM with reduced context length, or in full precision on hardware with larger memory pools. The model is on Hugging Face under a license that permits commercial use.
Where This Sits
The open-source coding model landscape has been moving fast. DeepSeek-Coder-V2, Codestral, and successive Qwen-Coder releases have each narrowed the gap with proprietary frontier models. Qwen3.6 is the clearest expression yet of the MoE-plus-hybrid-attention strategy for getting frontier-comparable outputs at sub-frontier compute costs.
The interesting technical choice is not the MoE itself; that is established. It is the commitment to Gated DeltaNet as the dominant attention mechanism, with standard full attention playing a supporting role in 25% of layers. If that architectural bet holds up under real workloads, it has implications beyond this model: it suggests that linear attention variants are ready to carry production reasoning loads, not just serve as a compute reduction trick on the margins.