
Mapping the Design Space: What the LLM Architecture Gallery Actually Reveals

Source: hackernews

The architecture of a large language model is a set of engineering decisions. Each choice — attention mechanism, normalization strategy, positional encoding scheme, feed-forward activation — has a history, tradeoffs, and a traceable lineage across model families. Sebastian Raschka’s LLM Architecture Gallery collects these choices into one place, letting you compare dozens of models side by side in a way that makes the patterns immediately obvious.

Raschka, best known for his book Build a Large Language Model (From Scratch) and his prolific ML writing, built the gallery as a reference for practitioners who want to understand what actually distinguishes one model’s architecture from another. The Hacker News thread linking to it cleared 270 points, a reasonable signal that people find it useful. But the gallery is more interesting as a lens than as a lookup table. Reading it as a whole, a few patterns become clear.

The Convergence Story

The most striking thing about modern open-weight LLMs is how much they agree on. If you look at models released in 2023 and 2024 — LLaMA 2, LLaMA 3, Mistral 7B, Gemma, Phi-3, Qwen2, DeepSeek V2 — nearly all of them share a cluster of choices:

  • Pre-normalization with RMSNorm instead of the original post-LN LayerNorm
  • SwiGLU or GeGLU activations in the feed-forward layers, not ReLU or plain GeLU
  • Rotary positional encoding (RoPE) instead of learned absolute positions
  • Grouped-query attention (GQA) instead of full multi-head attention

This was not always the case. The original transformer used learned absolute positional embeddings and post-normalization. GPT-2 switched to pre-LN. BLOOM and MPT used ALiBi, which drops positional embeddings entirely in favor of linear distance-based biases added to the attention scores. The PaLM paper popularized SwiGLU. LLaMA 1 brought RoPE and RMSNorm into the open-source mainstream. Each change had evidence behind it, and as that evidence accumulated, the field converged.

RMSNorm removes the mean-centering step from LayerNorm, keeping only the root-mean-square rescaling. This makes it cheaper to compute while remaining equally stable during training. The original RMSNorm paper by Zhang and Sennrich demonstrated matching LayerNorm performance at lower computational cost; by 2023 it had become the default for most open-weight releases.
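The difference is easy to see side by side. A minimal NumPy sketch of both normalizations (learned gain and bias parameters omitted for brevity):

```python
import numpy as np

def layernorm(x, eps=1e-6):
    # LayerNorm: subtract the mean, then rescale by the standard deviation.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def rmsnorm(x, eps=1e-6):
    # RMSNorm: skip mean-centering, rescale by the root mean square alone.
    rms = np.sqrt((x * x).mean(axis=-1, keepdims=True) + eps)
    return x / rms
```

Dropping the mean computation removes one reduction over the hidden dimension per call, which adds up across every layer of a deep network.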

RoPE encodes position by rotating query and key vectors in the embedding space before attention is computed. The rotation is applied pairwise to dimensions, and the angle scales with position, so the dot product between a query at position m and a key at position n depends only on their relative offset. This relative-position property makes RoPE more generalizable at inference time than absolute positional embeddings and simpler to extend than ALiBi. Extensions like YaRN and LongRoPE have pushed context lengths into the hundreds of thousands of tokens by scaling the base frequency parameter or applying NTK-aware interpolation.
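A minimal sketch of the rotation, using the common half-split pairing of dimensions (the pairing convention varies by implementation). The relative-offset property falls out of the rotation algebra: the dot product of two rotated vectors depends only on the angle difference, hence only on the position offset.

```python
import numpy as np

def rope(x, pos, base=10000.0):
    # Rotate pairs of dimensions of x by position-dependent angles.
    # x: (d,) vector with even d; pos: integer position in the sequence.
    d = x.shape[-1]
    half = d // 2
    freqs = base ** (-np.arange(half) / half)  # per-pair rotation frequency
    angles = pos * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:half], x[half:]
    # 2-D rotation applied independently to each (x1[i], x2[i]) pair
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos])
```

Context-extension schemes like the ones mentioned above work by rescaling `base` or interpolating `angles`, which is why they can be retrofitted to a trained model.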

SwiGLU replaces the two-layer FFN in the original transformer with a gated unit. The hidden activation is SiLU(xW1) * (xW2), where SiLU is the sigmoid-weighted linear unit and * is elementwise multiplication, followed by a third matrix that projects back down to the model dimension. This gating mechanism gives the network more expressive capacity; because it uses three weight matrices instead of two, implementations typically shrink the hidden dimension by about a third to hold the parameter count level. The PaLM paper reported consistent gains from SwiGLU over standard FFN variants. Nearly every major model released since 2023 uses SwiGLU or GeGLU, which substitutes GELU for SiLU in the gate branch.
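A sketch of the full gated block, with hypothetical weight names (`W_gate`, `W_up`, `W_down` are illustrative, not any model's actual parameter names):

```python
import numpy as np

def silu(x):
    # Sigmoid-weighted linear unit: x * sigmoid(x)
    return x / (1.0 + np.exp(-x))

def swiglu_ffn(x, W_gate, W_up, W_down):
    # The SiLU(x @ W_gate) branch gates the linear x @ W_up branch elementwise;
    # W_down then projects the gated hidden state back to the model dimension.
    return (silu(x @ W_gate) * (x @ W_up)) @ W_down
```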

Grouped-Query Attention and the KV Cache Problem

Full multi-head attention creates separate key and value projections for each attention head. At inference time, these must be stored in the KV cache for every token in the context. For large models running at long contexts and high batch sizes, the KV cache becomes the dominant memory cost, not the model weights themselves.
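A back-of-the-envelope calculator makes the point concrete, using roughly LLaMA-2-70B-shaped numbers (80 layers, 64 query heads, head dimension 128, fp16) as an illustrative assumption:

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len, batch, dtype_bytes=2):
    # Keys + values (factor 2), per layer, per cached token, per sequence in the batch.
    return 2 * n_layers * n_kv_heads * head_dim * context_len * batch * dtype_bytes

# Batch 8, 4k context, fp16:
full_mha = kv_cache_bytes(80, 64, 128, 4096, 8)  # 64 KV heads (one per query head)
gqa_8    = kv_cache_bytes(80, 8, 128, 4096, 8)   # 8 shared KV groups
```

With full MHA that works out to about 80 GiB of cache, which is in the same ballpark as the fp16 weights of a 70B-parameter model; sharing KV heads across groups of eight cuts it to roughly 10 GiB.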

Multi-query attention (MQA), introduced in a 2019 paper by Shazeer, collapses all heads to a single shared KV head. This dramatically reduces cache size but can degrade quality on tasks requiring fine-grained attention across positions. Grouped-query attention (GQA), described in a 2023 paper from Google, finds a middle ground: heads are divided into groups, and heads within a group share a KV projection.

LLaMA 2 70B was among the first widely adopted models to use GQA. Mistral 7B used it alongside sliding window attention. LLaMA 3 applied it across all model sizes. The group count matters: fewer groups reduce memory but can hurt quality on long-context tasks. Most models land between 4 and 8 groups, calibrated against the target context length and batch size.
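Mechanically, GQA is a small change to the attention computation: the cached KV heads are broadcast so each one serves a group of query heads. A minimal sketch with illustrative shapes:

```python
import numpy as np

def gqa_scores(Q, K):
    # Q: (n_heads, seq, d); K: (n_kv_heads, seq, d) with n_kv_heads dividing n_heads.
    n_heads, _, d = Q.shape
    group = n_heads // K.shape[0]
    # Each cached KV head is reused by `group` consecutive query heads.
    K_exp = np.repeat(K, group, axis=0)
    return Q @ K_exp.transpose(0, 2, 1) / np.sqrt(d)
```

Setting `n_kv_heads` equal to `n_heads` recovers full MHA; setting it to 1 recovers MQA, so GQA subsumes both extremes.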

Where Things Diverge: DeepSeek’s MLA

The clearest architectural divergence from the GQA consensus in recent models is DeepSeek’s Multi-head Latent Attention (MLA), introduced in DeepSeek-V2. MLA compresses the KV cache using low-rank decomposition. Instead of caching full key and value tensors per head, the model caches a compressed latent vector of much smaller dimension, then reconstructs keys and values at attention time using learned up-projection matrices.

This is structurally different from GQA. GQA reduces KV cache by sharing heads; MLA reduces it by compressing what each head stores. In DeepSeek-V2, the compression ratio reaches roughly 32x compared to standard MHA. The tradeoff is additional compute at inference for the up-projections, but with matrix fusion this overhead is largely absorbed. DeepSeek-V3 carried MLA forward alongside a highly fine-grained MoE design, and the combination enabled larger effective batch sizes at comparable memory budgets.
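A toy sketch of the caching scheme, with illustrative dimensions. This simplifies the real design, which also includes a small decoupled RoPE component and query-side compression:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_heads, head_dim, d_latent = 1024, 16, 64, 128

W_down = rng.standard_normal((d_model, d_latent)) * 0.02             # compression
W_up_k = rng.standard_normal((d_latent, n_heads * head_dim)) * 0.02  # learned K up-projection
W_up_v = rng.standard_normal((d_latent, n_heads * head_dim)) * 0.02  # learned V up-projection

def cache_token(h):
    # Only the small latent goes into the KV cache, not per-head keys/values.
    return h @ W_down                 # (d_latent,)

def expand(c):
    # Reconstruct full keys and values from the cached latent at attention time.
    return c @ W_up_k, c @ W_up_v

h = rng.standard_normal(d_model)      # one token's hidden state
c = cache_token(h)
k_full, v_full = expand(c)
# Cached floats per token: 128, vs 2 * 16 * 64 = 2048 for standard MHA here (16x).
```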

MLA is novel enough that it does not appear in most architecture surveys. Seeing it in Raschka’s gallery alongside GQA variants makes the design space legible in a way that reading individual papers in isolation does not.

Mixture of Experts as a Structural Fork

Dense and MoE models now coexist openly, and they represent a genuine philosophical split about how to allocate a parameter budget. Mixtral 8x7B, Qwen-MoE, and DeepSeek-V3 all share the same premise: grow total parameter count to improve quality while keeping per-token FLOPs roughly constant, by activating only a fraction of the parameters on each forward pass.

The gallery makes clear that MoE is not a single design. Expert granularity varies substantially. Mixtral uses 8 large experts and routes to 2 per token. DeepSeek-V3 uses 256 small experts with 8 activated per token, plus a set of shared experts that always activate regardless of routing decisions. These are not minor variations; they affect training stability, throughput, whether the model develops specialization across experts, and how gracefully it degrades under unbalanced routing.
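The routing step itself is simple. A minimal top-k router for a single token's gate logits (illustrative; production routers operate batched and often add noise, bias terms, or capacity limits):

```python
import numpy as np

def topk_route(logits, k):
    # Pick the k highest-scoring experts and renormalize with a softmax
    # over just those k, so the combination weights sum to 1.
    idx = np.argsort(logits)[-k:]
    w = np.exp(logits[idx] - logits[idx].max())
    return idx, w / w.sum()
```

The granularity question above is visible here as a choice of the logits length (number of experts) and `k` (experts activated): 8-choose-2 for Mixtral versus 256-choose-8 for DeepSeek-V3.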

The routing auxiliary loss, used to prevent expert collapse, also differs across implementations. Some models penalize imbalance softly; others use hard routing constraints. The interaction between routing strategy and the granularity of expert design is one of the least-settled areas in the open-weight model landscape.
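As one concrete instance of a soft penalty, here is a Switch-Transformer-style load-balancing loss; this is one common formulation, not what every model in the gallery uses:

```python
import numpy as np

def load_balance_loss(router_probs, expert_mask):
    # router_probs: (tokens, experts) softmax outputs of the router.
    # expert_mask:  (tokens, experts) one-hot of the selected expert(s).
    n_experts = router_probs.shape[1]
    frac_tokens = expert_mask.mean(axis=0)  # fraction of tokens routed to each expert
    frac_prob = router_probs.mean(axis=0)   # mean router probability per expert
    # Minimized (value 1.0) when both distributions are uniform across experts.
    return n_experts * float((frac_tokens * frac_prob).sum())
```

Any collapse toward a few favored experts pushes both factors up for those experts, so the loss penalizes imbalance smoothly rather than forbidding it outright.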

Architecture tables tend to be treated as reference material: look up whether LLaMA 3 uses GQA, confirm Mistral uses RoPE, close the tab. The gallery supports that use case well. But the value is in scanning the full picture and noticing what varies and what does not.

Vocabulary sizes span from 32,000 tokens in early LLaMA models to around 152,000 in some recent releases. Base context lengths range from 4,096 to 128,000 tokens. FFN dimension multipliers vary, as do decisions about whether to tie the embedding weights between input and output layers. These choices have measurable effects on memory, throughput, and downstream fine-tuning behavior.

The convergence on RMSNorm, RoPE, SwiGLU, and GQA is real, and it tells you something: those choices have cleared the bar of “strictly better in practice.” There is now enough empirical evidence and enough independent replication that implementing something different carries a burden of justification. The divergence on MoE topology, context extension strategy, and KV compression tells you something different: those spaces are still actively contested, and the winning design is not yet obvious.

For anyone building on top of these models, fine-tuning them, or trying to understand why a base model behaves differently from another, that distinction is practically useful. The gallery gives you a map of where the field has made up its mind and where it has not.
