
After DeepSeek's Blueprint: How China's AI Labs Made Different Architectural Bets

Source: huggingface

One year after DeepSeek-R1 rewrote expectations for what open-source AI could do, a retrospective from Hugging Face, originally published January 27, 2026, traces how China’s broader ecosystem has evolved. The piece identifies three major trends: MoE architectures becoming default infrastructure, multimodal capabilities moving from differentiator to table stakes, and persistent momentum behind smaller, deployment-oriented models.

What rewards closer examination is the degree of architectural divergence beneath those trends. DeepSeek published detailed technical reports covering MLA (Multi-head Latent Attention), fine-grained expert routing, FP8 training, and auxiliary-loss-free load balancing. These were available to every other lab. One year later, the rest of China’s open-source ecosystem has not converged on those designs. Understanding why clarifies more about the ecosystem than cataloging which trends nominally won.

MoE Became the Default, but Not the Same MoE

The spread of MoE architectures across Chinese open-source releases is real, but the specific design choices vary enough that “MoE is the default” understates the disagreement still embedded in those releases.

DeepSeek-V3, documented in their December 2024 technical report, runs 671B total parameters with 37B active per forward pass, using 256 experts per MoE layer with top-8 routing. The large expert count is deliberate. The DeepSeek-V2 paper validated that finer-grained experts, smaller individually but more numerous, produce better routing specialization than fewer larger experts as in standard Mixtral-style designs. Tencent’s Hunyuan-Large, described in arXiv:2411.02265, moved in a similar direction with 16 experts and a shared always-active expert per layer, echoing the DeepSeek design pattern at a different scale.
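To make the fine-grained point concrete, here is a minimal sketch of top-k routing over 256 experts. The softmax gate over the selected scores is an assumption made for illustration, not DeepSeek's published gating function, and the router logits are random stand-ins. The comparison at the end uses Mixtral's 8-expert, top-2 configuration to show how many more distinct expert combinations a fine-grained layer can express.

```python
import math
import random


def top_k_route(scores, k):
    """Return (expert_index, gate_weight) pairs for the k highest-scoring experts."""
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    # Softmax over the selected scores only, so the gate weights sum to 1.
    # (Assumption for illustration; real routers may normalize differently.)
    peak = max(scores[i] for i in top)
    exps = [math.exp(scores[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


rng = random.Random(0)
scores = [rng.gauss(0.0, 1.0) for _ in range(256)]  # stand-in router logits
routed = top_k_route(scores, k=8)

active_fraction = 8 / 256                # share of routed experts used per token
fine_grained_combos = math.comb(256, 8)  # distinct expert sets: 256 experts, top-8
coarse_combos = math.comb(8, 2)          # distinct expert sets: 8 experts, top-2
```

Only about 3% of the routed experts fire per token, yet the number of possible expert sets is many orders of magnitude larger than in the coarse layout, which is the room for specialization the fine-grained argument relies on.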

Alibaba runs both tracks in parallel. Qwen1.5-MoE used 64 experts with top-4 routing; the reported Qwen2.5-Max scales further with a similar MoE structure, though full architectural details remain unpublished. In parallel, Qwen2.5-72B dense continues as a maintained and benchmarked model. That dual commitment is telling. MoE architectures improve inference throughput at scale but introduce real operational complexity: load balancing, routing overhead, and batching strategies that behave differently than dense model serving. For a lab distributing weights to a community running heterogeneous infrastructure, the engineering overhead lands on users rather than the lab’s serving team.

DeepSeek’s solution to the load-balancing problem in V3, adjusting per-expert bias terms dynamically based on recent load rather than embedding balance constraints in the training objective, is documented in their technical report. Whether other labs are adopting this specific mechanism is not confirmed in public releases, but the fact that DeepSeek felt the need to invent it signals how real the problem is in production MoE deployment.
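The mechanism is easier to see in a toy simulation. The sketch below is a simplified reading of the bias-adjustment idea, not DeepSeek's implementation: the expert count, the skew toward expert 0, the step size, and all function names are hypothetical. The bias is added to the routing scores only when choosing which experts fire, and after each batch it is nudged down for overloaded experts and up for underloaded ones.

```python
import random

NUM_EXPERTS, TOP_K = 8, 2
SKEW = [2.0] + [0.0] * (NUM_EXPERTS - 1)  # hypothetical: the router over-prefers expert 0


def sample_scores(rng):
    """Stand-in router scores for one token."""
    return [rng.gauss(0.0, 1.0) + s for s in SKEW]


def select_experts(scores, bias, k=TOP_K):
    """Pick the top-k experts by biased score; the bias only affects selection."""
    biased = [s + b for s, b in zip(scores, bias)]
    return sorted(range(len(biased)), key=lambda i: biased[i], reverse=True)[:k]


def expert_load(bias, rng, num_tokens):
    """Count how many tokens each expert receives in one batch."""
    load = [0] * NUM_EXPERTS
    for _ in range(num_tokens):
        for e in select_experts(sample_scores(rng), bias):
            load[e] += 1
    return load


def update_bias(bias, load, step_size=0.05):
    """Lower the bias of overloaded experts, raise it for underloaded ones."""
    mean_load = sum(load) / len(load)
    new_bias = []
    for b, l in zip(bias, load):
        if l > mean_load:
            new_bias.append(b - step_size)
        elif l < mean_load:
            new_bias.append(b + step_size)
        else:
            new_bias.append(b)
    return new_bias


def max_load_ratio(load):
    """Busiest expert's load divided by the mean load (1.0 is perfectly balanced)."""
    return max(load) / (sum(load) / len(load))


rng = random.Random(0)
bias = [0.0] * NUM_EXPERTS
ratio_before = max_load_ratio(expert_load(bias, rng, 512))
for _ in range(200):
    bias = update_bias(bias, expert_load(bias, rng, 256))
ratio_after = max_load_ratio(expert_load(bias, rng, 512))
```

No balance term appears in any loss here. The correction happens entirely through the selection bias, which is the property that distinguishes this approach from auxiliary-loss balancing.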

The Innovation That Didn’t Propagate

Multi-head Latent Attention is arguably DeepSeek’s most consequential architectural contribution, and its limited adoption outside DeepSeek is worth examining carefully.

In standard multi-head attention, the KV cache grows as 2 × n_heads × d_head per token per layer. At 128 attention heads with 128-dimensional heads, that is 32,768 values per token per layer. MLA compresses this into a latent of dimension 512, from which K and V are reconstructed at inference via up-projection matrices. Against that full multi-head baseline, the latent alone is a reduction of roughly 98%. The DeepSeek-V2 paper reports a 93.3% KV cache reduction and 5.76x higher maximum generation throughput relative to their earlier 67B dense model, with MLA as a primary contributor.
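The per-token arithmetic can be checked directly. This is a back-of-the-envelope sketch: the function names are made up for illustration, and the 64-value decoupled RoPE key is the figure given in the DeepSeek-V2 paper rather than anything stated in this article. Note that the idealized ratio against full multi-head attention is not the same quantity as DeepSeek's published reduction figure, which is measured against a different baseline model.

```python
def mha_cache_values(n_heads, d_head):
    """Cached values per token per layer for standard multi-head attention (K and V)."""
    return 2 * n_heads * d_head


def mla_cache_values(d_latent, d_rope_key=0):
    """Cached values per token per layer for MLA: shared latent plus decoupled RoPE key."""
    return d_latent + d_rope_key


mha = mha_cache_values(n_heads=128, d_head=128)
mla_latent_only = mla_cache_values(d_latent=512)
# d_rope_key=64 is taken from the DeepSeek-V2 paper, not from the text above.
mla_with_rope = mla_cache_values(d_latent=512, d_rope_key=64)

reduction_latent_only = 1 - mla_latent_only / mha
reduction_with_rope = 1 - mla_with_rope / mha
print(mha, mla_with_rope, round(reduction_with_rope, 4))  # 32768 576 0.9824
```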

Yet Qwen2.5 uses grouped query attention throughout, as does InternLM3. GQA shares key-value heads across groups of query heads rather than compressing into a low-rank latent. It does not match MLA on KV cache compression, but it is straightforward to implement and well-supported in inference frameworks like vLLM and TensorRT-LLM. MLA requires special handling for positional encoding: RoPE cannot be absorbed into the compressed latent because it is position-dependent, so DeepSeek maintains a separate set of decoupled RoPE keys alongside the latent cache. Implementing this efficiently requires custom kernel support that needs to propagate through every downstream inference optimization.
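For comparison, GQA's head sharing is simple enough to express in a few lines, which is part of its appeal. The sketch below uses an illustrative configuration of 64 query heads sharing 8 key-value heads at head dimension 128; these numbers and the function names are assumptions chosen for the example, not a statement of any specific model's config.

```python
def kv_head_for_query(q_head, n_q_heads, n_kv_heads):
    """Index of the shared key-value head that a given query head reads from."""
    group_size = n_q_heads // n_kv_heads
    return q_head // group_size


def gqa_cache_values(n_kv_heads, d_head):
    """Cached values per token per layer: K and V for each key-value head."""
    return 2 * n_kv_heads * d_head


N_Q_HEADS, N_KV_HEADS, D_HEAD = 64, 8, 128  # illustrative configuration

mapping = [kv_head_for_query(q, N_Q_HEADS, N_KV_HEADS) for q in range(N_Q_HEADS)]
gqa = gqa_cache_values(N_KV_HEADS, D_HEAD)
mha = gqa_cache_values(N_Q_HEADS, D_HEAD)  # MHA is the case n_kv_heads == n_q_heads
print(gqa, mha, mha // gqa)  # 2048 16384 8
```

In this configuration the cache shrinks by the group size, 8x, with no new projection matrices and no change to how RoPE is applied. That is a smaller saving than a 512-value latent, but it requires nothing beyond an index mapping in the attention kernel.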

For DeepSeek, operating their own serving infrastructure end-to-end, absorbing that engineering cost is feasible. For labs distributing model weights to a community of users who will run them on off-the-shelf frameworks, an innovation with a wide integration surface is harder to justify, particularly when GQA already delivers meaningful cache reduction over standard MHA. The gap between MLA’s theoretical appeal and its adoption rate is a common pattern in systems software: an innovation can be technically superior but practically non-default when its integration surface exceeds the benefit for most deployment contexts.

Small Models and What They Are Actually Optimizing For

The Hugging Face retrospective frames China’s small model preference partly as an accessibility priority. The technical picture is more specific: these labs have been pursuing systematically overtrained small models, an economically motivated strategy with solid empirical grounding.

The Chinchilla scaling laws recommend scaling model size and token count proportionally for training compute efficiency, at roughly 20 tokens per parameter. A 7B model trained to Chinchilla optimality would use roughly 140 billion tokens. Qwen2.5-7B was trained on 18 trillion tokens, approximately 2,571 tokens per parameter, roughly 130x the Chinchilla-optimal ratio. MiniCPM’s 2.4B model used over a trillion tokens, more than 20x the Chinchilla recommendation.
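The ratios are worth working through explicitly, since the size of the multiple is the whole point. This sketch assumes the commonly cited rule of thumb of about 20 training tokens per parameter for Chinchilla optimality and uses nominal parameter counts; the MiniCPM line uses exactly one trillion tokens as a lower bound.

```python
CHINCHILLA_TOKENS_PER_PARAM = 20  # common rule of thumb, assumed here


def tokens_per_param(params, tokens):
    """Training tokens seen per model parameter."""
    return tokens / params


def chinchilla_multiple(params, tokens):
    """How many times over the rule-of-thumb token budget a run went."""
    return tokens_per_param(params, tokens) / CHINCHILLA_TOKENS_PER_PARAM


chinchilla_tokens_7b = 7e9 * CHINCHILLA_TOKENS_PER_PARAM  # 1.4e11, i.e. 140 billion
qwen_ratio = tokens_per_param(7e9, 18e12)                 # about 2571
qwen_multiple = chinchilla_multiple(7e9, 18e12)           # about 129
minicpm_multiple = chinchilla_multiple(2.4e9, 1e12)       # about 21 at exactly 1T tokens
```

With these inputs the Qwen2.5-7B run comes out at roughly 129 times the rule-of-thumb budget, and even the lower-bound MiniCPM figure is about 21 times over.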

The motivation is deployment economics. If inference cost dominates over training cost in a model’s operational lifetime, which is true for any model widely deployed outside the training lab, then the training compute spent pushing a 4B model to 7B-class quality is well spent. MiniCPM 3.0 at 4B scores 67.3 on MMLU; Qwen2.5-3B scores 65.6. Those numbers were competitive with 7B models from a year and a half earlier.
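A rough compute model shows where the crossover sits. The sketch below uses the standard approximations of about 6 FLOPs per parameter per training token and about 2 FLOPs per parameter per generated token. The two training budgets are hypothetical, and the comparison assumes the overtrained small model really does match the larger one in quality, which is the premise being illustrated rather than something the code establishes.

```python
def training_flops(params, train_tokens):
    """Approximate training compute: ~6 FLOPs per parameter per token."""
    return 6 * params * train_tokens


def inference_flops(params, served_tokens):
    """Approximate inference compute: ~2 FLOPs per parameter per token."""
    return 2 * params * served_tokens


def lifetime_flops(params, train_tokens, served_tokens):
    return training_flops(params, train_tokens) + inference_flops(params, served_tokens)


def break_even_served_tokens(small, large):
    """Served-token count at which the small model's extra training is paid back."""
    (n_s, d_s), (n_l, d_l) = small, large
    extra_training = training_flops(n_s, d_s) - training_flops(n_l, d_l)
    savings_per_token = 2 * (n_l - n_s)
    return extra_training / savings_per_token


SMALL = (4e9, 10e12)  # hypothetical: 4B parameters, 10T training tokens
LARGE = (7e9, 2e12)   # hypothetical: 7B parameters, 2T training tokens

break_even = break_even_served_tokens(SMALL, LARGE)  # about 2.6e13 tokens
```

Under these assumptions the small model becomes cheaper overall after about 26 trillion served tokens. That is a large number for a single lab but a plausible one for open weights run by many independent deployments.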

MiniCPM’s team at Tsinghua published the scaling research behind this strategy in arXiv:2404.06395. Their WSD (Warmup-Stable-Decay) learning rate schedule keeps the learning rate constant during a “stable” phase rather than decaying it continuously, which preserves the ability to adjust data curriculum partway through training without restarting. This is a training infrastructure innovation rather than an architecture one, but it makes aggressive overtraining practical by keeping curriculum flexibility open throughout the run.
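The shape of the schedule is simple to write down. This is a minimal sketch with hypothetical parameter names; the linear decay in the final phase is an assumption chosen for brevity, and the paper's own decay function may differ.

```python
def wsd_lr(step, peak_lr, warmup_steps, stable_steps, decay_steps, min_lr=0.0):
    """Warmup-Stable-Decay learning rate for a given optimizer step (0-indexed)."""
    if step < warmup_steps:
        # Linear warmup up to the peak learning rate.
        return peak_lr * (step + 1) / warmup_steps
    if step < warmup_steps + stable_steps:
        # Stable phase: constant learning rate, so the run can be extended
        # or the data mix changed without invalidating the schedule.
        return peak_lr
    decay_step = step - warmup_steps - stable_steps
    if decay_step >= decay_steps:
        return min_lr
    # Linear decay from peak_lr to min_lr (assumed shape for this sketch).
    return peak_lr + (min_lr - peak_lr) * (decay_step / decay_steps)


schedule = [wsd_lr(s, peak_lr=1.0, warmup_steps=10, stable_steps=100, decay_steps=20)
            for s in range(140)]
```

Because the stable phase has no dependence on the total step count, a checkpoint taken anywhere inside it can be branched: one copy decays immediately to produce a usable model, while another continues at the peak rate on new data.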

DeepSeek’s small model strategy took a different route: distillation from R1. The R1 distillates, including R1-Distill-Qwen-1.5B, R1-Distill-Qwen-7B, and R1-Distill-Llama-8B, transfer reasoning capability downward through chain-of-thought supervision from the 671B parent model. According to the DeepSeek-R1 paper, R1-Distill-Qwen-7B reaches 55.5 on AIME 2024, a score that would have been frontier-level less than a year prior. The optimization target here differs from MiniCPM’s efficiency agenda: less concerned with general capability per parameter at small scale, more focused on propagating a specific trained skill into a deployable size class.

Multimodal: Three Different Architectural Bets

The multimodal work from Chinese labs over the past year produced some of the most technically differentiated releases, with distinct architectural philosophies that are not simply variations on a shared approach.

InternVL from Shanghai AI Lab is built around high-resolution document understanding. Their 6B InternViT encoder processes images through dynamic tiling, allocating variable numbers of patches to inputs at effective resolutions up to 3360x3360 pixels, as described in InternVL 1.5. InternVL2-76B scored 94.1 on DocVQA, surpassing GPT-4V at the time of its release. The tiling approach maximizes OCR and chart interpretation quality because fine spatial detail in documents requires high effective resolution. The cost is that image preprocessing complexity grows with resolution and visual token counts vary significantly across inputs.
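A simplified version of the tiling decision shows why visual token counts vary so much between inputs. This sketch is not InternVL's exact procedure: the tie-breaking rule, the 12-tile budget, and the 256 tokens per tile are assumptions chosen for illustration, and a real pipeline may also add a thumbnail tile. It picks the tile grid whose aspect ratio best matches the image, preferring the smallest grid that avoids downscaling.

```python
def choose_tile_grid(width, height, max_tiles=12, tile_size=448):
    """Pick (cols, rows) with cols * rows <= max_tiles that best fits the image."""
    target = width / height
    grids = [(c, r)
             for r in range(1, max_tiles + 1)
             for c in range(1, max_tiles // r + 1)]

    def key(grid):
        c, r = grid
        aspect_error = abs(c / r - target)
        covers = c * tile_size >= width and r * tile_size >= height
        # 1) closest aspect ratio; 2) grids that avoid downscaling;
        # 3) fewest tiles if the image is covered, otherwise the most tiles.
        return (aspect_error, not covers, c * r if covers else -(c * r))

    return min(grids, key=key)


def visual_tokens(width, height, tokens_per_tile=256, **kwargs):
    """Visual token count under the assumed fixed token budget per tile."""
    cols, rows = choose_tile_grid(width, height, **kwargs)
    return cols * rows * tokens_per_tile


print(choose_tile_grid(448, 448))    # (1, 1)
print(choose_tile_grid(1792, 448))   # (4, 1)
print(choose_tile_grid(4000, 4000))  # (3, 3)
```

A small square thumbnail and a large scanned page differ by 9x in visual tokens under this rule, which is the batching and latency variability the paragraph above refers to.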

Qwen2-VL, documented in arXiv:2409.12191, takes a different approach to the resolution problem. Rather than tiling with a fixed-resolution encoder, it accepts images at native resolution through a ViT with variable input dimensions. M-RoPE applies separate rotary position components to height, width, and time dimensions, which makes video and variable-frame-rate inputs architecturally native rather than special-cased. Qwen2-VL-72B scored 96.5 on DocVQA and 71.2 on Video-MME; the video score in particular reflects the architectural benefit of native temporal position encoding, since frames at arbitrary frame rates do not require special preprocessing.
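The position scheme can be sketched as a function that assigns each token a (time, height, width) triple. This is a simplified illustration of the idea as described in the Qwen2-VL paper, not the model's actual implementation: the segment format and function name are hypothetical. Text tokens get the same index in all three components, visual tokens get indices from their frame and patch coordinates, and text after a visual block resumes from the largest index used so far plus one.

```python
def mrope_positions(segments):
    """Assign (t, h, w) position ids to a mixed text/vision token sequence.

    segments: list of ("text", n_tokens) or ("vision", frames, rows, cols).
    """
    positions = []
    next_pos = 0
    for seg in segments:
        if seg[0] == "text":
            _, n_tokens = seg
            for i in range(n_tokens):
                p = next_pos + i
                positions.append((p, p, p))  # text: all three components equal
            next_pos += n_tokens
        else:
            _, frames, rows, cols = seg
            for t in range(frames):
                for y in range(rows):
                    for x in range(cols):
                        positions.append((next_pos + t, next_pos + y, next_pos + x))
            # Resume after the largest id used along any axis.
            next_pos += max(frames, rows, cols)
    return positions


# Two text tokens, one 2x2-patch image, then one more text token.
example = mrope_positions([("text", 2), ("vision", 1, 2, 2), ("text", 1)])
```

A single image is just the one-frame case of a video, so the same code path handles both. That is what "architecturally native" means in practice: no separate frame-index embedding or video-specific preprocessing is needed.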

MiniCPM-V operates under an explicit constraint the other two do not prioritize: the model must run usefully on mobile hardware. MiniCPM-V 2.6 runs on an iPhone 15 Pro at practical speeds. This required not just parameter reduction but tight control over visual token count. Their any-resolution training allocates patches based on actual image aspect ratio and content density rather than always consuming maximum patch budgets. The tradeoff is reduced peak resolution handling compared to InternVL, but the deployment target makes that tradeoff coherent.

These three are genuinely different architectural bets. They reflect different views on where multimodal AI will primarily run (data centers versus devices), which tasks matter most (video understanding, document parsing, or on-device inference), and which component to treat as fixed versus variable.

The Pattern in the Divergence

Reading the Hugging Face retrospective alongside the technical details of these releases, the three macro trends it identifies are all real. What they do not surface is that each trend contains genuine architectural disagreement, and that disagreement maps clearly onto different deployment targets.

DeepSeek’s innovations make most sense for a lab running large-scale inference infrastructure in-house and willing to invest in custom systems work. Qwen’s choices prioritize broad adoptability and compatibility with existing tooling. MiniCPM’s choices reflect a research agenda centered on capable models running on constrained hardware. InternVL’s choices reflect a target use case in document-heavy enterprise tasks requiring high-resolution vision.

None of those targets is in conflict with the others, and the ecosystem is more useful because the bets are distributed across them. What the year since the DeepSeek moment demonstrated is that the path from “efficient training is achievable” to “here is the canonical architecture” is not singular. The technical reports were public. The design choices diverged anyway, because the deployment contexts are genuinely different. That divergence, more than the surface-level trend toward MoE or multimodal, is what makes China’s open-source AI ecosystem worth watching in detail.
