
Why MoEs Make Large Models Cheaper to Run Than They Look

Source: huggingface

There’s a sleight of hand happening inside some of the most capable models being deployed today. A model advertised as having hundreds of billions of parameters doesn’t actually use all of them for any given token. That’s the core idea behind Mixture of Experts, and once you understand it, the economics of large language models start making a lot more sense.

The HuggingFace deep-dive on MoEs in transformers is one of the better technical explanations I’ve come across, so I want to unpack what makes this architecture worth paying attention to.

The Basic Idea

In a standard dense transformer, every token passes through every parameter on every forward pass. That’s clean and simple, but it’s expensive. With MoE, you replace the feed-forward layers with a collection of “expert” networks — each one a smaller FFN — plus a router that picks which experts handle each token.

Typically only the top-2 experts out of maybe 8 or 64 are activated per token. So a model whose experts are each sized like a 7B dense model's FFN can have the total parameter count of something far larger, while its per-token compute stays close to that of the small dense baseline. You get capacity without paying full price at inference.
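The mechanism is easy to sketch. A minimal top-2 MoE layer, written here in numpy with tiny toy dimensions (all sizes and initializations are illustrative, not Mixtral's actual configuration):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff, n_experts, top_k = 16, 64, 8, 2

# One weight pair per expert: each expert is a small two-layer FFN.
W_in = rng.normal(size=(n_experts, d_model, d_ff)) * 0.02
W_out = rng.normal(size=(n_experts, d_ff, d_model)) * 0.02
W_router = rng.normal(size=(d_model, n_experts)) * 0.02

def moe_layer(x):
    """x: (tokens, d_model). Route each token to its top-k experts."""
    logits = x @ W_router                           # (tokens, n_experts)
    top = np.argsort(logits, axis=-1)[:, -top_k:]   # indices of top-k experts
    # Renormalize the gate over just the selected experts' logits.
    sel = np.take_along_axis(logits, top, axis=-1)
    gate = np.exp(sel - sel.max(-1, keepdims=True))
    gate /= gate.sum(-1, keepdims=True)
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        for j in range(top_k):
            e = top[t, j]
            h = np.maximum(x[t] @ W_in[e], 0)       # ReLU FFN for simplicity
            out[t] += gate[t, j] * (h @ W_out[e])
    return out

tokens = rng.normal(size=(4, d_model))
y = moe_layer(tokens)
print(y.shape)  # (4, 16)
```

Note that although all eight expert weight matrices exist in memory, each token only multiplies through two of them; that gap between resident parameters and touched parameters is the whole trick.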

Mixtral 8x7B is the most visible example right now. It has 46.7B total parameters but only routes through ~12.9B per token. That's roughly the compute cost of a ~13B dense model with the knowledge capacity of something far bigger.

The Routing Problem

The interesting engineering challenge is the router itself. It’s a small learned linear layer that outputs a probability distribution over experts, and you take the top-k. Simple enough in theory.

In practice, training MoEs is tricky because of collapse: the router tends to strongly prefer a few experts and ignore the rest. You end up with load imbalance — some experts become overloaded specialists while others atrophy. Fixing this requires auxiliary loss terms that encourage balanced routing, and getting that balance right without hurting model quality is genuinely fiddly.

There’s also the infrastructure problem. At inference, you need all experts loaded in memory even though you’re only using a fraction of them per token. For single-GPU setups this hurts badly. MoEs shine on multi-GPU inference where you can distribute experts across devices — which is why they’re more common in data-center deployments than on-device scenarios.
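The gap is easy to quantify with back-of-envelope numbers for a Mixtral-class model (assuming fp16/bf16 weights and ignoring KV cache and runtime overhead):

```python
# Rough inference memory math for a Mixtral-8x7B-class MoE.
total_params = 46.7e9    # all experts must be resident in memory
active_params = 12.9e9   # parameters actually touched per token
bytes_per_param = 2      # fp16 / bf16

resident_gb = total_params * bytes_per_param / 1e9
active_gb = active_params * bytes_per_param / 1e9
print(f"weights resident in memory: {resident_gb:.1f} GB")  # ~93.4 GB
print(f"weights read per token:     {active_gb:.1f} GB")    # ~25.8 GB
```

You pay for ~93 GB of memory to get ~26 GB worth of per-token weight reads, which is why sharding experts across several GPUs recovers the economics while a single consumer GPU cannot hold the model at all.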

Why This Architecture Is Winning

The scaling laws argument is compelling. If you have a fixed compute budget for training, a sparse MoE model can reach better perplexity than a dense model trained for the same FLOPs. The Switch Transformer paper demonstrated this clearly, and subsequent work has reinforced it.

The intuition makes sense: different experts can specialize on different types of content or syntax. You’re essentially letting the model develop internal structure rather than forcing every token through the same computation.

What I find most interesting is how this changes the tradeoff conversation. We tend to think about models in terms of parameter count as a proxy for capability, but MoEs break that heuristic. A 46B parameter MoE is not equivalent to a 46B dense model in either compute cost or behavior — it’s a different point in the design space entirely.

The Takeaway for Developers

If you’re running inference on models like Mixtral, the practical implications are:

  • Memory bandwidth matters more than raw compute
  • Multi-GPU setups with expert parallelism outperform naive tensor parallelism
  • Batch size matters a lot — MoEs amortize routing overhead better under load

The HuggingFace post gets into the implementation details with concrete code examples, which is worth reading if you’re thinking about fine-tuning or deploying these models. The routing mechanism in particular has implications for how you think about PEFT methods — you need to decide whether to train the experts, the router, or both.

MoEs aren’t new — they’ve been around in various forms for decades — but the transformer era has made them practical at scale. Expect to see more models built this way.
