
Reading the Diff: How Modern LLM Architectures Converged and Where They Still Diverge

Source: hackernews

Sebastian Raschka recently published an LLM Architecture Gallery collecting annotated diagrams of the major open-weight language models in one place. The gallery has been making the rounds on Hacker News, and the surface reaction is "useful reference." But spend time with it and a sharper pattern emerges: almost every frontier model published in the last two years is built from the same set of components, with divergences concentrated in a surprisingly small number of design choices.

That convergence is worth understanding on its own terms, and the exceptions matter even more.

How Decoder-Only Won

The 2017 Transformer paper introduced an encoder-decoder architecture for machine translation. Within a year, GPT-1 showed that a decoder-only variant trained on raw text with next-token prediction could learn useful representations. BERT went the other direction, proposing encoder-only bidirectional masking. T5 tried to unify everything into a seq2seq text-to-text framework.

By 2023, the competition was effectively settled. Every significant open-weight general-purpose model released since LLaMA 1 is decoder-only. At large scale, the unidirectional language modeling objective produces models that generalize remarkably well to diverse tasks through prompting and fine-tuning. Encoder-only models are still used for specialized retrieval and classification tasks, but they are not where the frontier moves.

The Modern Standard Block

Across LLaMA 3, Mistral, Gemma, Qwen, Phi, and DeepSeek, the transformer block has converged on a specific configuration that looks nothing like the original paper.

Pre-normalization with RMSNorm. The 2017 Transformer applied layer normalization after the attention and feed-forward sublayers (Post-LN). GPT-2 moved normalization before each sublayer (Pre-LN), which significantly stabilized training at larger scales. RMSNorm then replaced LayerNorm by dropping the mean-centering step, keeping only the scale by root mean square. It runs about 10-15% faster and performs equivalently or better.
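The difference is small enough to fit in a few lines. A minimal sketch in plain Python (shapes and the epsilon value are illustrative, not taken from any particular model):

```python
import math

def rms_norm(x, weight, eps=1e-6):
    """RMSNorm sketch: scale by the root mean square and a learned weight.

    Unlike LayerNorm there is no mean subtraction and no bias term,
    which is where the speedup comes from.
    """
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [w * v / rms for w, v in zip(weight, x)]
```

Dropping the mean statistic also removes one reduction pass over the activations, which matters when normalization runs twice per block across dozens of layers.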

Rotary Positional Embeddings (RoPE). The original sinusoidal positional encodings were fixed and did not generalize beyond the training context length. Learned absolute embeddings had the same problem. RoPE, introduced in 2021 by Su et al., encodes position by rotating query and key vectors in the complex plane, so that the dot product between query and key implicitly encodes relative position. It generalizes better to longer sequences, and techniques like YaRN and LongRoPE can extend it further. LLaMA 3.1 uses RoPE scaling to reach 128K context from a base trained at shorter lengths.
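A toy version of the rotation makes the relative-position property concrete (a sketch assuming an even head dimension; the base of 10000 follows the original RoPE paper):

```python
import math

def rope(vec, pos, base=10000.0):
    """Rotate consecutive pairs of a query/key vector by position-dependent
    angles. Pair starting at index i is rotated by pos * base**(-i/d), so the
    dot product of a rotated query and key depends only on relative position."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)
        x, y = vec[i], vec[i + 1]
        out += [x * math.cos(theta) - y * math.sin(theta),
                x * math.sin(theta) + y * math.cos(theta)]
    return out
```

Shifting both positions by the same offset leaves the query-key dot product unchanged, which is exactly the property that makes context-length extension tricks like scaling the base workable.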

SwiGLU feed-forward networks. Noam Shazeer’s 2020 paper on gated linear units showed that replacing the standard two-matrix ReLU or GELU feed-forward sublayer with a gated variant consistently improves perplexity. SwiGLU computes Swish(xW1) ⊙ (xW2) and projects the result, using three weight matrices instead of two. To keep parameter counts equal, the hidden dimension is scaled to roughly two-thirds of the usual 4x expansion. PaLM, LLaMA, Mistral, DeepSeek, Gemma, and Qwen all use this.
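In code, the gated variant looks like this (a list-based sketch with hypothetical weight names: W1 is the gate projection, W2 the up projection, W3 the down projection):

```python
import math

def swish(x):
    # Swish / SiLU activation: x * sigmoid(x)
    return x / (1.0 + math.exp(-x))

def matvec(W, v):
    return [sum(w_ij * v_j for w_ij, v_j in zip(row, v)) for row in W]

def swiglu_ffn(x, W1, W2, W3):
    """SwiGLU feed-forward sketch: down-project(Swish(x W1) * (x W2))."""
    gate = [swish(g) for g in matvec(W1, x)]       # gated path
    up = matvec(W2, x)                             # linear path
    hidden = [g * u for g, u in zip(gate, up)]     # elementwise product
    return matvec(W3, hidden)
```

The parameter parity works out because a standard FFN spends 2 · (4d) · d = 8d² parameters, while the three-matrix version spends 3 · h · d, which is equal when h = 8d/3, i.e. two-thirds of 4d.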

Grouped-Query Attention (GQA). This is where inference economics started shaping architecture. The original multi-head attention (MHA) maintains one key and value head per query head, which means the KV cache grows linearly with both context length and number of heads. For a 70B model serving long contexts, this becomes the dominant memory cost. Multi-Query Attention (MQA), proposed by Shazeer in 2019, collapsed K and V to a single shared head, reducing cache size by the number of heads but degrading quality. GQA from Google (2023) struck a middle ground: group query heads and share one K/V pair per group. LLaMA 3 8B uses 32 query heads and 8 K/V heads; LLaMA 3 70B uses 64 query heads and 8 K/V heads. The cache savings are meaningful and the quality hit is minimal.
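The cache arithmetic behind this is easy to check. A sketch assuming an fp16 cache and LLaMA-3-70B-like shapes (80 layers, head dimension 128):

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    # Two cached tensors (K and V) per layer, one entry per token position.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

# MHA would need one KV head per query head (64); GQA shares 8 across them.
mha_bytes = kv_cache_bytes(80, 64, 128, 128_000)
gqa_bytes = kv_cache_bytes(80, 8, 128, 128_000)
```

At 128K context this works out to roughly 335 GB for full MHA versus about 42 GB with GQA, an 8x saving that tracks the query-to-KV head ratio exactly.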

Put these together and you get the canonical 2024 decoder block: Pre-LN RMSNorm, GQA with RoPE, SwiGLU FFN. If you are implementing a new LLM and have no specific reason to deviate, this is what you build.
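Schematically, with the sublayers passed in as callables, the block reduces to two pre-normalized residual updates (a structural sketch only: in a real model `attn` would be GQA with RoPE applied, `ffn` the SwiGLU block, and both norms RMSNorm):

```python
def decoder_block(x, attn, ffn, norm1, norm2):
    """Pre-norm residual layout: normalize, transform, add back to the stream."""
    x = [xi + ai for xi, ai in zip(x, attn(norm1(x)))]  # attention sublayer
    x = [xi + fi for xi, fi in zip(x, ffn(norm2(x)))]   # feed-forward sublayer
    return x
```

The pre-norm arrangement is what keeps the residual stream itself un-normalized end to end, which is the property credited with stabilizing training at depth.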

Where the Divergences Actually Are

The architecture gallery is most valuable not for the consensus, but for making visible the specific places where models choose to do something different.

Multi-head Latent Attention (MLA). DeepSeek v2 and v3 pushed KV cache compression further than GQA. MLA compresses the key and value projections into a low-rank latent vector, then up-projects per head at attention time. The cache stores the compressed latent rather than the full K/V tensors, reducing memory by roughly 5-13x compared to MHA. The trade-off is a somewhat more complex forward pass, but DeepSeek v3’s results suggest the quality matches GQA while the inference efficiency improves. MLA is the most significant attention-mechanism innovation of the last two years and currently underused outside DeepSeek’s own models.
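A sketch of the compression step, with hypothetical projection matrices (`W_down` produces the latent that actually gets cached; the per-head K and V up-projections run at attention time):

```python
def matvec(W, v):
    return [sum(a * b for a, b in zip(row, v)) for row in W]

def mla_project(h, W_down, W_up_k, W_up_v):
    """MLA sketch: compress the hidden state to a low-rank latent and cache
    only that; full K and V are re-materialized at attention time."""
    latent = matvec(W_down, h)   # this small vector is all the KV cache stores
    k = matvec(W_up_k, latent)   # up-projection deferred until attention
    v = matvec(W_up_v, latent)
    return latent, k, v
```

The memory win comes from the latent dimension being much smaller than the concatenated K/V dimensions; the up-projection cost is paid in compute rather than cache.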

Mixture of Experts (MoE). Sparse MoE architectures replace each dense FFN with N expert sub-networks and a learned router that sends each token to the top-K experts. Only K of N experts activate per token, so the active parameter count stays manageable while total capacity scales. Mixtral 8x7B from Mistral AI (December 2023) demonstrated this at useful scale: 46.7B total parameters, 12.9B active, outperforming LLaMA 2 70B on most benchmarks. DeepSeek v3 extended this to 671B total parameters with 37B active, using 256 fine-grained routed experts plus 1 shared expert per layer, trained with FP8 mixed precision for roughly $5.5M in compute.

The open questions in MoE design are not settled. Coarse experts (Mixtral’s 8) versus fine-grained experts (DeepSeek’s 256) is a genuine trade-off: fine-grained routing gives the model more specialization opportunities but requires careful load balancing and expert-parallelism infrastructure. Adding shared “always-on” experts, as DeepSeek does, helps maintain general capabilities across all tokens regardless of routing decisions.
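The routing mechanics described above can be sketched in a few lines (a single-token sketch; the softmax over only the selected experts' logits follows Mixtral's described scheme, and the always-on shared expert follows DeepSeek's):

```python
import math

def moe_forward(x, router_logits, experts, shared_expert, top_k=2):
    """Sparse MoE routing sketch: send the token to the top_k routed experts
    plus one always-on shared expert, weighting routed outputs by a softmax
    over the selected experts' logits."""
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: -router_logits[i])[:top_k]
    exps = [math.exp(router_logits[i]) for i in ranked]
    weights = [e / sum(exps) for e in exps]
    out = list(shared_expert(x))                 # shared expert always fires
    for w, i in zip(weights, ranked):
        out = [o + w * e for o, e in zip(out, experts[i](x))]
    return out
```

The load-balancing problem is visible even here: nothing in the forward pass stops the router from sending every token to the same two experts, which is why production MoE training adds auxiliary balancing losses or bias adjustments.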

Sliding window attention. Mistral 7B applied sliding window attention, restricting each layer to the most recent 4,096 tokens. This reduces attention cost for long contexts at the price of limiting how far information can flow through any single layer. Gemma 2 alternates local sliding-window layers with global full-attention layers. Mistral dropped sliding window attention in v0.2, suggesting either that the approach has limits or that efficient full-attention implementations have made the trade-off less compelling.
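A sliding-window layer amounts to a banded causal mask (a sketch; `window` plays the role of Mistral's 4,096):

```python
def sliding_window_mask(seq_len, window):
    """Causal mask restricted to the last `window` positions.

    mask[i][j] is True when query position i may attend to key position j:
    j must be no later than i and no more than window-1 positions back.
    """
    return [[max(0, i - window + 1) <= j <= i for j in range(seq_len)]
            for i in range(seq_len)]
```

Information outside the window can still reach a query, but only indirectly, relayed through intermediate positions across successive layers, which is the information-flow limit the article mentions.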

Normalization placement. Gemma 2 adds a second normalization after each sublayer in addition to the Pre-LN before it, a pattern called double normalization. The motivation is training stability at scale; logit soft-capping, applying tanh to cap attention and output logits at fixed values, is another Gemma 2 stabilization technique not seen in other families. These are empirical discoveries that rarely show up in prominent papers but matter considerably for getting large models to train without diverging.
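Soft-capping itself is a one-liner (Gemma 2 reportedly uses a cap of 50 for attention logits and 30 for final output logits; the values here are illustrative):

```python
import math

def soft_cap(logits, cap=50.0):
    """Gemma 2-style logit soft-capping: squash logits into (-cap, cap) with
    tanh. Near zero the map is approximately the identity; at scale it
    saturates, preventing runaway logit growth during training."""
    return [cap * math.tanh(x / cap) for x in logits]
```

Unlike hard clipping, the tanh keeps gradients nonzero everywhere, so extreme logits are pulled back toward the cap smoothly rather than having their gradients zeroed out.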

The Vocabulary Size Signal

One detail the gallery captures that often gets overlooked is tokenizer vocabulary size. GPT-2 used 50,257 tokens. LLaMA 1 used 32,000. LLaMA 3 expanded to 128,256 using a Tiktoken-based BPE tokenizer. Qwen2.5 uses 151,936 tokens, driven by the need for efficient multilingual coverage, particularly for Chinese. Larger vocabularies mean fewer tokens per document, which reduces effective sequence length and therefore memory and compute during training and inference. For long-context applications this matters considerably, and the trend toward larger vocabularies reflects both multilingual ambitions and practical inference economics.

The value of Raschka’s gallery is not primarily documentary. It is that architectural choices which look like implementation details in individual model papers become legible as design patterns when viewed across fifteen models at once. You start to notice that every model using MoE also uses GQA or MLA; that every model targeting very long contexts uses RoPE with some form of scaling; that SwiGLU adoption is essentially universal despite appearing in no particularly prominent standalone paper.

Raschka’s book Build a Large Language Model (From Scratch) takes a complementary approach, walking through implementing a GPT-2-style model in PyTorch token by token and gradient by gradient. The gallery and the book together are probably the most pedagogically useful pair of LLM resources available for practitioners who want to understand the internals rather than treat models as black boxes.

The HN discussion noted some reasonable gaps: SSM-based alternatives like Mamba and RWKV, which challenge the transformer’s O(n²) attention complexity with recurrent architectures, are not fully represented. Hybrid models like Jamba, which interleaves Mamba and transformer layers, occupy an interesting middle ground. Whether these belong in an “LLM architecture” gallery depends on how you define the category, but they are worth tracking alongside it.

The canonical decoder-only block will keep shifting. MLA is likely to spread beyond DeepSeek. MoE will become more common as inference infrastructure catches up with the routing complexity. Context lengths will continue growing, putting more pressure on KV cache management. Having a visual map of where the field is now makes it easier to spot where those changes land.
