
Open Source as Infrastructure: What the Qwen Numbers Actually Tell Us

Source: Hugging Face

The Number That Matters More Than the $5.6 Million

In January 2025, DeepSeek-R1 dropped onto Hugging Face with an MIT license and benchmark scores that matched or exceeded OpenAI’s o1 on reasoning tasks. AIME 2024: 79.8% pass@1 versus o1’s 79.2%. MATH-500: 97.3% versus 96.4%. A 2029 Elo rating on Codeforces, placing it in the 96th percentile globally. Nvidia lost approximately $600 billion in market capitalization in a single day, January 27, 2025, the largest single-session loss for any company in US stock market history. The financial press called it a Sputnik moment.

That framing was understandable. DeepSeek-V3, the base model underneath R1, was reportedly trained on 2,048 H800 GPUs over roughly 55 days at a cost of approximately $5.576 million, against industry estimates of $100 million or more for comparable Western frontier models. The training data ran to 14.8 trillion tokens. The architecture, a 671 billion parameter Mixture-of-Experts model with only 37 billion parameters active per forward pass, used innovations like Multi-head Latent Attention and FP8 mixed-precision training to work around the lower interconnect bandwidth of the H800, the reduced-bandwidth variant Nvidia built for the Chinese market after export controls blocked the H100 and A100.
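To make the sparse-activation idea concrete, here is a minimal sketch of top-k expert routing in PyTorch. It is illustrative only: the layer sizes, expert count, and gating details are placeholder assumptions, not DeepSeek-V3's actual configuration, which adds shared experts, Multi-head Latent Attention, and FP8 kernels on top of this basic mechanism.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    """Minimal Mixture-of-Experts layer: only k of n experts run per token."""

    def __init__(self, d_model=512, d_ff=2048, n_experts=8, k=2):
        # Illustrative sizes; DeepSeek-V3 routes each token to a small
        # subset of experts so ~37B of 671B parameters are active per pass.
        super().__init__()
        self.k = k
        self.gate = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):  # x: (tokens, d_model)
        scores = self.gate(x)                        # (tokens, n_experts)
        weights, idx = scores.topk(self.k, dim=-1)   # route each token to k experts
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e in idx[:, slot].unique():
                mask = idx[:, slot] == e
                # Only the selected expert's parameters touch these tokens.
                out[mask] += weights[mask, slot, None] * self.experts[int(e)](x[mask])
        return out

layer = TopKMoE()
tokens = torch.randn(4, 512)
print(layer(tokens).shape)  # torch.Size([4, 512])
```

The compute saving is the point: the gate runs for every token, but each token pays for only k expert forward passes, which is how a 671B-parameter model can cost roughly what a 37B dense model costs per token.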

The cost story is real and it mattered. But one year on, a retrospective from Hugging Face published in February 2026 surfaces a number that tells a different kind of story: Alibaba’s Qwen models had, by mid-2025, generated over 113,000 derivative models on Hugging Face. Meta’s Llama, which had a substantial head start and the full weight of a company spending billions on AI, had around 27,000 derivatives. DeepSeek itself had around 6,000. Qwen has more derivatives than Google and Meta combined.

What Derivatives Actually Represent

When researchers, developers, or companies fine-tune a model, they pick a base. That choice is usually driven by license compatibility, architecture documentation quality, available tooling, benchmark performance at the target size, and familiarity within their team. Over time the choice becomes self-reinforcing: more tutorials, more LoRA adapters, more GGUF quantizations, more community support.
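As a concrete example of what producing a "derivative" looks like in practice, here is a short sketch using the Hugging Face transformers and peft libraries to attach a LoRA adapter to a small Qwen2.5 base. The rank, alpha, and target module names are illustrative defaults, not a recommended recipe.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base_id = "Qwen/Qwen2.5-0.5B"  # any Qwen2.5 size works the same way

tokenizer = AutoTokenizer.from_pretrained(base_id)
model = AutoModelForCausalLM.from_pretrained(base_id)

# LoRA trains small low-rank adapter matrices instead of the full weights.
config = LoraConfig(
    r=16,                                 # adapter rank (illustrative)
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],  # attention projections in Qwen's blocks
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, config)
model.print_trainable_parameters()  # typically well under 1% of base parameters

# ...train with a standard Trainer loop, then:
# model.save_pretrained("my-qwen-derivative")  # the uploadable "derivative"
```

Each adapter published this way counts as one more derivative in the ecosystem tally, which is why permissive licensing and good tooling compound so quickly.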

Derivative counts are therefore a reasonable proxy for ecosystem capture. They tell you which model architecture has become infrastructure, the thing people build on rather than the thing people evaluate and move on from.
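Counts like these can be approximated directly from the Hub. The sketch below uses huggingface_hub's list_models with a tag filter; the base_model tag scheme shown is an assumption based on how model cards declare their lineage, and since authors set it voluntarily, treat any result as a lower bound.

```python
from huggingface_hub import HfApi

api = HfApi()

def count_derivatives(base_id: str, relation: str = "finetune") -> int:
    """Count Hub models declaring base_id as their base.

    Assumes the `base_model:<relation>:<repo>` tag convention that
    model cards use to declare lineage -- an opt-in signal, so this
    undercounts the true derivative population.
    """
    tag = f"base_model:{relation}:{base_id}"
    return sum(1 for _ in api.list_models(filter=tag))

for relation in ("finetune", "adapter", "quantized"):
    n = count_derivatives("Qwen/Qwen2.5-7B-Instruct", relation)
    print(f"{relation}: {n}")
```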

Qwen’s numbers are striking in that context. The Qwen2.5 series covers 0.5B through 72B parameters, plus dedicated variants for code (Qwen2.5-Coder), mathematics (Qwen2.5-Math), and vision-language tasks (Qwen2.5-VL). The QwQ-32B reasoning model followed. This is not a single flagship release strategy. It is a continuous expansion across sizes, modalities, and task domains, with Apache 2.0 licensing throughout, designed to make Qwen the obvious base regardless of what someone is building.

The Hugging Face blog describes this as an “ecosystem and infrastructure play.” That phrasing is accurate and worth sitting with. When your architecture has 113,000 derivatives, you are no longer just a model provider. You are a platform.

How Export Controls Produced an Efficiency Arms Race

The hardware constraint story is important for understanding why DeepSeek-V3’s architecture looks the way it does. The US Bureau of Industry and Security progressively tightened export controls on advanced semiconductors to China starting in October 2022 and continuing through 2024, blocking access to Nvidia’s H100, A100, and later even the A800 and H800 that Nvidia had created as export-compliant alternatives.