Nine Million Parameters Is Enough to Understand How Language Models Work
Source: hackernews
The category of “build a tiny LLM to understand the big ones” has been growing steadily since Andrej Karpathy published minGPT in 2020. A recent entry in this tradition, guppylm, which surfaced on Hacker News to considerable interest, distills the exercise to approximately 9 million parameters, 60,000 synthetic conversations, and roughly 130 lines of PyTorch that train in five minutes on a free Colab T4. The model has a fish personality and asserts that the meaning of life is food. That framing is pedagogically more significant than it might appear.
The Lineage
Karpathy’s minGPT established the template: a clean, decoder-only GPT-2-style transformer with no production infrastructure, just the essential model definition in a few hundred lines of Python. In 2022, nanoGPT followed with Flash Attention, gradient accumulation, multi-GPU support via DDP, and the ability to reproduce GPT-2’s 124M parameter results. The accompanying two-hour YouTube lecture reached a wide audience and established the form for a generation of derivative projects. In 2024, Karpathy pushed further with llm.c: GPT-2 training in roughly 1,000 lines of C, moving the lesson from PyTorch abstractions down to memory layout and CUDA kernel design.
GuppyLM fits between nanoGPT’s scope and more stripped-down inference-only approaches like picoGPT, which accomplishes next-token prediction in about 90 lines of NumPy but omits a training loop entirely. At 130 lines with full training included, guppylm is a more complete artifact. Nine million parameters places it at a scale where training is fast but outputs are coherent enough to study meaningfully, and the free-tier T4 constraint keeps the barrier to entry at zero.
What 130 Lines Actually Contains
A vanilla transformer at this scale decomposes into a small number of building blocks. The embedding layer maps token IDs to dense vectors; a learned positional embedding adds position information. At 9M parameters, the embedding table represents a meaningful fraction of total weight count, which gives immediate intuition for why vocabulary size and embedding dimension are real architectural decisions rather than incidental choices.
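To make the embedding share concrete, here is a rough parameter count for a GPT-2-style decoder. The configuration (a 4,096-token vocabulary, model width 384, 4 layers, context length 256) is hypothetical, chosen only because it lands near 9M parameters; it is not guppylm's actual configuration. The sketch also assumes the output head shares weights with the token embedding.

```python
def gpt_param_counts(vocab_size, d_model, n_layers, block_size):
    """Approximate parameter counts for a GPT-2-style decoder with a tied output head."""
    embeddings = vocab_size * d_model + block_size * d_model  # token + positional tables
    attention = 4 * d_model * d_model + 4 * d_model           # Q, K, V, output projections + biases
    mlp = 8 * d_model * d_model + 5 * d_model                 # d -> 4d -> d, with biases
    layer_norms = 4 * d_model                                 # two LayerNorms, scale + shift each
    blocks = n_layers * (attention + mlp + layer_norms)
    final_norm = 2 * d_model                                  # LayerNorm before the output head
    total = embeddings + blocks + final_norm
    return {"embeddings": embeddings, "blocks": blocks, "total": total}

# Hypothetical configuration, not taken from the project.
counts = gpt_param_counts(vocab_size=4096, d_model=384, n_layers=4, block_size=256)
share = counts["embeddings"] / counts["total"]
print(f"total: {counts['total']:,}  embedding share: {share:.1%}")
```

Under these assumptions the two embedding tables account for a bit under a fifth of all weights. Doubling the vocabulary adds about 1.6M parameters without adding any depth, which is the trade-off the paragraph above describes.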
The core operation is multi-head causal self-attention. Each head computes:
Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) * V
The causal mask prevents position i from attending to positions j > i, which is what makes autoregressive generation possible. In practice, the mask marks the entries strictly above the diagonal of the score matrix; those scores are set to negative infinity before the softmax, so future tokens receive exactly zero attention weight:
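import torch.nn.functional as F

# Q, K, V: (batch, heads, T, d_k); mask: boolean (T, T), True above the diagonal
attn = (Q @ K.transpose(-2, -1)) / (d_k ** 0.5)  # scaled dot-product scores
attn = attn.masked_fill(mask, float('-inf'))     # hide future positions
attn = F.softmax(attn, dim=-1)                   # each row sums to 1
out = attn @ V                                   # weighted sum of value vectors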
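The effect of the mask can be checked by hand on a tiny example. The following is a dependency-free sketch in plain Python, not the project's PyTorch code; the function name and the 3x3 score matrix are made up for illustration.

```python
import math

def causal_attention_weights(scores):
    """Row-wise softmax over a square score matrix, with future positions masked out."""
    n = len(scores)
    weights = []
    for i in range(n):
        # Position i may only see positions 0..i; later ones become -inf before the softmax.
        masked = [scores[i][j] if j <= i else float("-inf") for j in range(n)]
        row_max = max(masked)                           # subtract the max for numerical stability
        exps = [math.exp(s - row_max) for s in masked]  # exp(-inf) evaluates to 0.0
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

# Hypothetical, already-scaled scores for a 3-token sequence.
scores = [[0.5, 1.0, 2.0],
          [0.1, 0.3, 0.9],
          [0.2, 0.2, 0.2]]
for row in causal_attention_weights(scores):
    print([round(w, 3) for w in row])
```

The first row puts all of its weight on position 0 even though the raw scores for later positions are larger, and every entry above the diagonal comes out as exactly zero.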
The feed-forward block is two linear layers with a GELU activation between them, expanding to four times the model dimension and contracting back. The 4x ratio comes directly from the original 2017 Transformer paper (which used ReLU rather than GELU) and remains the default in GPT-style models, although gated variants such as SwiGLU use a different ratio.
Layer normalization placement matters more than documentation typically suggests. GPT-2-style pre-normalization, applied before each sublayer rather than after, substantially stabilizes training. The residual connection pattern looks like:
def forward(self, x):
    x = x + self.attn(self.ln1(x))  # pre-norm, then residual
    x = x + self.ff(self.ln2(x))
    return x
Residual connections give gradients a direct path through many layers during backpropagation. Remove the x + prefix from either line in a 6-layer model and training typically slows sharply or stalls. These are facts that are easy to state as propositions; implementing them and then removing them to observe the collapse is a different kind of understanding.
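A toy calculation hints at why the residual path matters. The sketch below is a deliberate caricature rather than anything from the project: each "layer" is a scalar multiplication by a small hypothetical weight, so the gradient reaching the input is just the product of the per-layer derivatives.

```python
def input_gradient(layer_slopes, residual):
    """Gradient of the output w.r.t. the input for a stack of scalar linear layers.

    Each layer computes f(x) = w * x. With a residual connection the layer
    becomes x + f(x), so its local derivative is 1 + w instead of w.
    """
    grad = 1.0
    for w in layer_slopes:
        grad *= (1.0 + w) if residual else w  # chain rule: multiply local derivatives
    return grad

slopes = [0.1] * 6  # hypothetical small weights for a 6-layer stack
print("without residual:", input_gradient(slopes, residual=False))
print("with residual:   ", input_gradient(slopes, residual=True))
```

Without the skip path the gradient shrinks by a factor of ten per layer; with it, the identity term keeps the product close to one. Real transformer layers are not scalar, but the same multiplicative structure applies.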
The Synthetic Conversations Are the More Interesting Choice
Most educational LLM projects train on TinyShakespeare: approximately 1MB of Shakespeare plays, 65 unique characters, results in minutes. It functions well as a technical benchmark, and the outputs are memorably incoherent Elizabethan text. What it does not demonstrate is how a model acquires a particular personality or domain character.
GuppyLM uses 60,000 synthetic conversations structured to establish a fish-themed worldview. This is a more instructive choice because it makes visible the mechanism by which language models acquire character.
A model trained on conversations where food is the answer to philosophical questions will internalize that pattern. When it tells you the meaning of life is food, it is sampling accurately from the distribution it was trained on. The personality is embedded in the weights from the training data forward; it is not a filter applied afterward or a configuration option layered on top of a neutral base model. This point tends to remain abstract when discussed in the context of large models, where pretraining runs to trillions of tokens of mixed web text and the contributing factors are opaque. Building a small model on focused synthetic data makes the same principle concrete and traceable.
The project explicitly invites you to fork it and substitute your own training data. The fish is a default, not a fixed identity. The same mechanism will produce whatever conversational character you put in its place, which is a more grounded way to understand what fine-tuning and RLHF accomplish at large scale than any documentation description offers.
What Implementation Reveals That APIs Don’t
Several aspects of language model behavior are genuinely difficult to grasp through API interaction alone. Temperature is documented as “controlling randomness,” which is accurate but not illuminating. In the implementation, temperature is a scalar divisor applied to the logits before the final softmax: higher values flatten the probability distribution across the vocabulary, lower values sharpen it. The line logits /= temperature in the sampling loop provides a different kind of understanding than the parameter description in a reference guide.
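The effect of that divisor is easy to see on a three-token vocabulary. This is a minimal sketch in plain Python, assuming hypothetical logits and helper names; it is not the project's sampling loop.

```python
import math
import random

def softmax_with_temperature(logits, temperature):
    """Divide logits by the temperature, then apply a numerically stable softmax."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_token(logits, temperature, rng=random):
    """Draw one token index from the temperature-adjusted distribution."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

logits = [2.0, 1.0, 0.0]  # hypothetical scores for a 3-token vocabulary
for t in (0.5, 1.0, 2.0):
    print(t, [round(p, 3) for p in softmax_with_temperature(logits, t)])
```

At a temperature of 0.5 the top token takes most of the probability mass; at 2.0 the three probabilities move closer together, so sampling picks the lower-ranked tokens more often.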
Context window limits have a similar character. The attention computation is O(n²) in sequence length because every position attends to every prior position. At small scale this is tangible: training with a context length of 1024 costs roughly four times the attention memory of 512. This is the physical reason context windows were historically bounded, and why extending them efficiently required new architectural approaches, from sliding window attention to state-space models like Mamba. The constraint is visible in a hundred lines of code; in an API it is just a number in the documentation.
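The factor of four follows directly from the size of the score matrix. The sketch below counts the bytes needed to hold one layer's attention scores when they are fully materialized; the head count, batch size, and float32 storage are hypothetical settings, not the project's.

```python
def attention_score_bytes(seq_len, n_heads, batch_size, bytes_per_value=4):
    """Bytes needed to hold one layer's full attention score matrices (naive attention)."""
    return batch_size * n_heads * seq_len * seq_len * bytes_per_value

# Hypothetical settings: 6 heads, batch of 32, float32 scores.
small = attention_score_bytes(seq_len=512, n_heads=6, batch_size=32)
large = attention_score_bytes(seq_len=1024, n_heads=6, batch_size=32)
print(f"512 tokens:  {small / 2**20:.0f} MiB per layer")
print(f"1024 tokens: {large / 2**20:.0f} MiB per layer")
print(f"ratio: {large / small:.1f}x")
```

Fused kernels such as Flash Attention avoid storing the full matrix, but the amount of computation still grows with the square of the sequence length.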
Performance degradation at long context in standard transformers is related to the same mechanism. Attention patterns become harder to maintain cleanly as the sequence grows, not because the model forgets in a human sense, but because each softmax row has to spread its probability mass over more positions: as the denominator of the softmax grows, the weight on any single relevant token tends to shrink. Watching a small model’s coherence degrade as you extend context past its training length makes this concrete in a way that benchmarks on large models cannot.
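The growing softmax denominator can be illustrated with a toy calculation. The sketch assumes a single relevant token with a fixed score competing against distractors that all score zero; the scores and the function name are hypothetical, and real attention patterns are less uniform than this.

```python
import math

def relevant_token_weight(n_positions, relevant_score, distractor_score=0.0):
    """Softmax weight on one relevant token competing with n_positions - 1 distractors."""
    relevant = math.exp(relevant_score)
    distractors = (n_positions - 1) * math.exp(distractor_score)
    return relevant / (relevant + distractors)

# Hypothetical scores: the relevant token scores 4.0, every other token scores 0.0.
for n in (16, 128, 1024):
    print(n, round(relevant_token_weight(n, relevant_score=4.0), 3))
```

With the score held fixed, the weight on the relevant token falls steadily as the number of positions grows, so the model needs ever larger score gaps to keep attention focused.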
Five Minutes Is a Curriculum Design Choice
Training on a free T4 in five minutes is not incidental to the project’s value. The T4 has 16GB of VRAM, more than sufficient for 9M parameters, but training duration determines how many experiments fit in a session. Five-minute runs allow up to twelve experiments per hour. You can double the layer count, observe the loss curve shift, restore the original configuration, adjust the learning rate, induce instability, and recover from it, all within a single afternoon.
This iteration speed changes the relationship between the practitioner and the training loop. Projects in this genre have converged on roughly this target because it transforms training from something you observe once into something you actively explore. TinyShakespeare is small enough that a character-level model trains on it in minutes, even on a laptop CPU. The nanoGPT documentation emphasizes fast iteration on readable results. GuppyLM continues that emphasis with the free-tier Colab constraint as an explicit design boundary.
The broader tradition of educational LLM projects that has grown since 2020 reflects something real about how understanding develops in this domain. Reading the transformer paper and using an API leaves a gap that narrows meaningfully only when you implement the attention mechanism yourself, observe training dynamics firsthand, and can explain in concrete terms why removing a residual connection stalls training. The fish that believes food is the meaning of life is the end product of that process, but the process is the point.