Open Source as Infrastructure: What the Qwen Numbers Actually Tell Us
Source: huggingface
The Number That Matters More Than the $5.6 Million
In January 2025, DeepSeek-R1 dropped onto Hugging Face with an MIT license and benchmark scores that matched or exceeded OpenAI’s o1 on reasoning tasks. AIME 2024: 79.8% pass@1 versus o1’s 79.2%. MATH-500: 97.3% versus 96.4%. A 2029 Elo rating on Codeforces, placing it in the 96th percentile globally. Nvidia lost approximately $600 billion in market capitalization in a single day, January 27, 2025, the largest single-session loss for any company in US stock market history. The financial press called it a Sputnik moment.
That framing was understandable. DeepSeek-V3, the base model underneath R1, was reportedly trained on 2,048 H800 GPUs over roughly 55 days at a cost of approximately $5.576 million, against industry estimates of $100 million or more for comparable Western frontier models. The training data ran to 14.8 trillion tokens. The architecture is a 671 billion parameter Mixture-of-Experts model with only 37 billion parameters active per token. It used innovations like Multi-head Latent Attention and FP8 mixed-precision training to work around the lower interconnect bandwidth of the H800, the export-compliant part that remained available to Chinese labs after controls had blocked the H100 and A100.
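As a quick sanity check, the figures quoted above can be turned into GPU-hours, an implied hourly rate, and the share of parameters that are active at once. This is a minimal sketch that uses only the numbers in the text; the hourly rate is derived here, not reported, and the variable names are illustrative.

```python
# Back-of-envelope check using only the figures quoted above.
GPUS = 2_048              # H800 GPUs
DAYS = 55                 # approximate wall-clock training time
COST_USD = 5_576_000      # reported training cost
TOTAL_PARAMS_B = 671      # total parameters, in billions
ACTIVE_PARAMS_B = 37      # active parameters, in billions

gpu_hours = GPUS * DAYS * 24
implied_rate = COST_USD / gpu_hours                  # USD per GPU-hour (derived)
active_fraction = ACTIVE_PARAMS_B / TOTAL_PARAMS_B   # share of weights in use

print(f"GPU-hours:            {gpu_hours:,}")
print(f"Implied USD/GPU-hour: {implied_rate:.2f}")
print(f"Active parameters:    {active_fraction:.1%}")
```

On these inputs the run comes to roughly 2.7 million GPU-hours at about $2 per GPU-hour, with around 5.5% of the weights active at any one time. The headline cost is therefore mostly a statement about how few GPU-hours were needed, not about unusually cheap hardware.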
The cost story is real and it mattered. But one year on, a retrospective from Hugging Face published in February 2026 surfaces a number that tells a different kind of story: Alibaba’s Qwen models had, by mid-2025, generated over 113,000 derivative models on Hugging Face. Meta’s Llama, which had a substantial head start and the full weight of a company spending billions on AI, had around 27,000 derivatives. DeepSeek itself had around 6,000. Qwen has more derivatives than Google and Meta combined.
What Derivatives Actually Represent
When researchers, developers, or companies fine-tune a model, they pick a base. That choice is usually driven by license compatibility, architecture documentation quality, available tooling, benchmark performance at the target size, and familiarity within their team. Over time the choice becomes self-reinforcing: more tutorials, more LoRA adapters, more GGUF quantizations, more community support.
Derivative counts are therefore a reasonable proxy for ecosystem capture. They tell you which model architecture has become infrastructure, the thing people build on rather than the thing people evaluate and move on from.
Qwen’s numbers are striking in that context. The Qwen2.5 series covers 0.5B through 72B parameters, plus dedicated variants for code (Qwen2.5-Coder), mathematics (Qwen2.5-Math), and vision-language tasks (Qwen2.5-VL). The QwQ-32B reasoning model followed. This is not a single flagship release strategy. It is a continuous expansion across sizes, modalities, and task domains, with Apache 2.0 licensing throughout, designed to make Qwen the obvious base regardless of what someone is building.
The Hugging Face blog describes this as an “ecosystem and infrastructure play.” That phrasing is accurate and worth sitting with. When your architecture has 113,000 derivatives, you are no longer just a model provider. You are a platform.
How Export Controls Produced an Efficiency Arms Race
The hardware constraint story is important for understanding why DeepSeek-V3’s architecture looks the way it does. The US Bureau of Industry and Security progressively tightened export controls on advanced semiconductors to China starting in October 2022 and continuing through 2024, blocking access to Nvidia’s H100, A100, and later even the A800 and H800 that Nvidia had designed to slip under earlier thresholds.
This forced Chinese AI labs to develop on hardware with significantly reduced chip-to-chip interconnect bandwidth, which is the bottleneck for large-scale distributed training. The response was architectural: more efficient attention mechanisms (MLA reduces KV cache memory requirements substantially), MoE routing that minimizes communication overhead, and custom CUDA kernels that squeeze more throughput from available hardware. The result is a model that trains faster and at lower cost not because DeepSeek found a free lunch, but because the engineering team had to solve problems that labs with unconstrained H100 access could simply buy their way out of.
This is worth noting because it complicates the “export controls are working” narrative. The constraints may have delayed capability development, but they also generated architectural innovations that subsequently transferred to open-source releases. DeepSeek-R1 is MIT licensed. Anyone can use it, train on it, or distill from it. The distilled variants, including DeepSeek-R1-Distill-Qwen-32B, benchmark competitively against OpenAI’s o1-mini at a fraction of the size and cost. The efficiency research done under hardware scarcity became a gift to the global open-source community.
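To make the MLA point concrete, the sketch below compares how many values must be cached per token under standard multi-head attention and under an MLA-style scheme that stores one compressed latent plus a small shared positional key. The dimensions are hypothetical and chosen only for illustration; they are not taken from the article, and the function names are invented for this example.

```python
def kv_elems_per_token_mha(n_layers: int, n_heads: int, head_dim: int) -> int:
    """Standard multi-head attention caches a full key and value for every head."""
    return n_layers * 2 * n_heads * head_dim


def kv_elems_per_token_mla(n_layers: int, latent_dim: int, rope_dim: int) -> int:
    """MLA-style caching: one compressed latent plus one shared positional key."""
    return n_layers * (latent_dim + rope_dim)


# Hypothetical dimensions, for illustration only.
N_LAYERS, N_HEADS, HEAD_DIM = 60, 128, 128
LATENT_DIM, ROPE_DIM = 512, 64

mha = kv_elems_per_token_mha(N_LAYERS, N_HEADS, HEAD_DIM)
mla = kv_elems_per_token_mla(N_LAYERS, LATENT_DIM, ROPE_DIM)
ratio = mha / mla

print(f"MHA: {mha:,} cached values per token")
print(f"MLA: {mla:,} cached values per token")
print(f"Reduction: {ratio:.1f}x")
```

Under these assumed dimensions the cache shrinks by well over an order of magnitude. The exact factor depends entirely on the chosen sizes, but the mechanism is the same: the cache grows with the latent width instead of with the number of heads.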
The RL Angle on DeepSeek-R1
The DeepSeek-R1 paper is worth reading closely if you have not already. The model was trained primarily via reinforcement learning using Group Relative Policy Optimization (GRPO), a PPO-style algorithm that replaces the separate value/critic model with a baseline computed from a group of sampled outputs for the same prompt, which reduces compute overhead. The training used rule-based rewards for math and code: format compliance, answer correctness, and consistency checks.
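The group-relative idea can be sketched in a few lines: score several sampled completions for one prompt with a rule-based reward, then normalize each reward against the group mean and standard deviation to get an advantage, with no learned critic involved. This is a simplified illustration, not the training code from the paper. The reward weights, the exact-match answer check, and the use of the population standard deviation are all assumptions made for this example.

```python
import re
from statistics import mean, pstdev

# Reasoning is expected inside <think>...</think>, followed by the final answer.
THINK_PATTERN = re.compile(r"<think>.*?</think>", re.DOTALL)


def rule_based_reward(completion: str, reference_answer: str) -> float:
    """Toy reward: format compliance plus exact-match correctness.

    The 0.5 / 1.0 weights are hypothetical, chosen only for illustration.
    """
    format_ok = THINK_PATTERN.search(completion) is not None
    final_answer = THINK_PATTERN.sub("", completion).strip()
    correct = final_answer == reference_answer.strip()
    return 0.5 * float(format_ok) + 1.0 * float(correct)


def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Normalize each reward against its own sampling group (no critic model)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled completions for the same prompt, with reference answer "4".
group = ["<think>2+2</think>4", "4", "<think>hmm</think>5", "5"]
rewards = [rule_based_reward(c, "4") for c in group]
advantages = group_relative_advantages(rewards)
print(rewards)
print([round(a, 3) for a in advantages])
```

Completions that beat their own group average get a positive advantage and the rest get a negative one, so the baseline comes for free from sampling. If every completion in a group earns the same reward, all advantages are zero and that prompt contributes no learning signal.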
The interesting finding was that extended chain-of-thought reasoning, self-verification, backtracking, and re-evaluation of assumptions emerged from this process without being explicitly programmed. These behaviors appeared as the model scaled up RL training. This connects to the broader literature on inference-time compute scaling, the idea that allowing a model more tokens to “think” before answering can substitute for additional training compute. DeepSeek-R1-Zero, a version trained purely on RL without any supervised fine-tuning, showed these behaviors even more starkly, at the cost of some output formatting coherence.
For someone building AI applications, this matters because it suggests that capable reasoning models can be produced at substantially lower cost than was previously assumed, and that the frontier is no longer a function of raw training compute alone.
“AI+” and the Infrastructure Endgame
The “AI+” framing in the Hugging Face blog traces back to a Chinese government strategic initiative, broadly analogous to the “Internet+” policy from around 2015, which aimed to integrate internet infrastructure into every industrial sector. The AI+ version has the same shape: use AI as a horizontal layer that plugs into manufacturing, logistics, finance, healthcare, and consumer services at scale.
For this strategy to work, you need a model that is cheap to deploy, well-documented, available in sizes that run on commodity hardware, and widely understood by developers across many domains. Open-source is not incidental to this goal. It is the mechanism. If Qwen’s architecture is embedded in 113,000 derivative models, deployed across thousands of companies, fine-tuned for hundreds of specific tasks, then Alibaba’s design choices, training methodology, and tokenizer are de facto standards in a large part of the global AI stack.
This is a different kind of competition than model benchmarks suggest. The question is not only which model scores highest on MMLU or LiveCodeBench at a given point in time. It is which architecture becomes the default substrate that the rest of the ecosystem builds on.
Mistral AI has been pursuing a version of this from France, with Mistral 7B and the Mixtral MoE models establishing a strong European presence in open-source LLM development. Meta’s Llama 3 series, particularly the 405B and 70B variants, still command significant attention. But the derivative model data suggests Qwen has moved faster and more broadly in terms of actual ecosystem integration.
What This Means for Anyone Building on Open Models
The practical upshot for developers is that the pool of capable open-source base models has become genuinely competitive with closed API providers for many tasks, and the geographic and organizational diversity of that pool is wider than it was two years ago. If you are building a coding assistant, a document processor, a reasoning pipeline, or a fine-tuned domain-specific model, the decision of which base to use now involves real trade-offs rather than a default choice of whatever OpenAI offers.
Qwen’s Apache 2.0 license makes it usable commercially without the same friction as some other licenses. The model size range, from 0.5B to 72B, means you can run variants locally, on a single GPU, or in constrained environments while staying within the same architectural family. The 113,000 derivative count means that community tooling, quantizations in GGUF and GPTQ formats, and fine-tuning guides are widely available.
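A rough way to see what the size range means in practice is to estimate the memory needed just to hold the weights at different precisions. The sketch below assumes weights only, ignoring the KV cache, activations, and the small overhead that quantization formats add for scales and metadata, so real requirements will be higher. The function name and the chosen sizes are illustrative.

```python
def weight_memory_gb(params_billions: float, bits_per_param: int) -> float:
    """Approximate weight storage in decimal gigabytes (weights only)."""
    total_bytes = params_billions * 1e9 * bits_per_param / 8
    return total_bytes / 1e9


# Smallest and largest sizes mentioned above, plus a mid-range size.
for size_b in (0.5, 7, 72):
    fp16 = weight_memory_gb(size_b, 16)
    int4 = weight_memory_gb(size_b, 4)
    print(f"{size_b:>5}B params: ~{fp16:.1f} GB at 16-bit, ~{int4:.1f} GB at 4-bit")
```

By this estimate the 0.5B model needs about 1 GB at 16-bit precision, while the 72B model needs about 36 GB even at 4-bit. That gap is why the smaller variants are the ones that fit comfortably on a laptop or a single consumer GPU.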
DeepSeek-R1’s MIT license and strong reasoning benchmarks make it a serious option for tasks where chain-of-thought quality matters, particularly math and code. The distilled variants bring that capability down to 7B and 14B parameter models that run on consumer hardware.
The Hugging Face platform has crossed one million hosted models, and the composition of who is contributing the most-used foundation models has shifted considerably in the past year. That shift is not just a curiosity for AI researchers tracking geopolitics. It is the environment in which every developer building AI applications now operates.