What Nine Million Parameters Actually Teach You About Language Models
Source: hackernews
There is a specific kind of understanding you get from building something small and complete that reading papers and API docs cannot provide. GuppyLM is a recent Show HN entry that hit 463 points by doing exactly that: a vanilla transformer, roughly nine million parameters, trained on sixty thousand synthetic conversations, fitting in about 130 lines of PyTorch, and completing a training run in five minutes on a free Colab T4 GPU. The fish character it learns to embody concludes that the meaning of life is food.
The project is worth looking at carefully, not because it introduces new ideas, but because the specific tradeoffs it makes are instructive.
What Nine Million Parameters Actually Contain
The number nine million sounds arbitrary until you work out where it comes from. A vanilla transformer has a handful of major parameter sources: token embeddings, multi-head self-attention projections, feed-forward sublayers, and output projections (often weight-tied to the input embeddings in small models).
For a model at this scale, a plausible configuration might be six transformer layers with a model dimension of 256, four attention heads, and a feed-forward dimension of 1024. The math runs like this:
Embedding (vocab=8192, d_model=256):                      ~2.1M params
Per layer:
    Attention (Q, K, V, O projections):   4 × (256 × 256)  = 262K
    Feed-forward (up + down projection):  2 × (256 × 1024) = 524K
    Layer norms (negligible):                                ~1K
    Per-layer subtotal:                                      ~787K
6 layers:                                                 ~4.7M params
Output projection (tied weights):                         0 additional
Total:                                                    ~6.8M params
Doubling the vocabulary to 16,384 adds roughly another 2.1M embedding parameters and lands you near nine million; learned positional embeddings, by contrast, add only tens of thousands. What this breakdown reveals is that the embeddings and the feed-forward layers dominate. The attention mechanism itself, the thing everyone focuses on when reading about transformers, is a smaller fraction of the parameter budget than the two linear layers in each feed-forward block.
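The arithmetic is easy to check in a few lines. The sketch below is not GuppyLM's code; the function name and the configurations are illustrative assumptions, and biases are ignored to match the back-of-envelope breakdown. Note that the number of attention heads never appears, because splitting d_model across heads does not change the size of the projections.

```python
def transformer_param_count(vocab_size, d_model, n_layers, d_ff,
                            max_seq_len=0, tie_output=True):
    """Approximate parameter count for a vanilla decoder-only transformer.

    Biases are ignored. Set max_seq_len > 0 to include learned positional
    embeddings. All names here are illustrative, not taken from GuppyLM.
    """
    embedding = vocab_size * d_model
    positional = max_seq_len * d_model       # learned positions; 0 disables
    attention = 4 * d_model * d_model        # Q, K, V, O projections
    feed_forward = 2 * d_model * d_ff        # up + down projection
    layer_norms = 2 * 2 * d_model            # two norms, scale + shift each
    per_layer = attention + feed_forward + layer_norms
    output = 0 if tie_output else vocab_size * d_model
    total = embedding + positional + n_layers * per_layer + output
    return {
        "embedding": embedding,
        "attention_all_layers": n_layers * attention,
        "feed_forward_all_layers": n_layers * feed_forward,
        "total": total,
    }


# The configuration from the breakdown above.
base = transformer_param_count(vocab_size=8192, d_model=256, n_layers=6, d_ff=1024)

# An assumed variant: doubled vocabulary plus 256 learned positions.
bigger = transformer_param_count(vocab_size=16384, d_model=256, n_layers=6,
                                 d_ff=1024, max_seq_len=256)

for name, counts in [("base", base), ("bigger", bigger)]:
    print(name, {k: f"{v / 1e6:.2f}M" for k, v in counts.items()})
```

In both configurations the feed-forward total is exactly twice the attention total, which is the point the breakdown makes.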
This is a counterintuitive fact that becomes obvious only when you sit down and write the parameter count out. Reading the “Attention Is All You Need” paper gives you the architecture diagram. Building the model gives you the parameter distribution.
Why Synthetic Conversations Instead of Raw Text
The dataset choice is more deliberate than it first appears. Most educational LLM projects, including Andrej Karpathy’s nanoGPT, train on character-level or token-level text corpora: Shakespeare, Wikipedia, the TinyStories dataset. These are natural choices because they minimize the preprocessing pipeline and let you focus on the model.
But training on raw text teaches the model to continue text, not to converse. When you want to understand how chat models work, the training distribution matters. A model trained to complete Shakespeare will complete text in a Shakespeare-like manner. A model trained on conversational turns, even synthetic ones, learns a different conditional structure: given a human message, generate an assistant-style response.
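To make that conditional structure concrete, here is a minimal sketch of how a conversation might be flattened into one training sequence. Everything in it is an assumption for illustration: the role markers, the whitespace tokenizer, and the choice to compute loss only on assistant tokens are hypothetical and may not match what GuppyLM actually does.

```python
# Hypothetical role markers; real projects pick their own special tokens.
USER, ASSISTANT, END = "<user>", "<assistant>", "<end>"


def build_example(turns):
    """Flatten (role, text) turns into tokens plus an assistant-only loss mask.

    Tokenization is a plain whitespace split, purely for illustration.
    """
    tokens, loss_mask = [], []
    for role, text in turns:
        marker = USER if role == "user" else ASSISTANT
        words = text.split() + [END]
        tokens.append(marker)
        loss_mask.append(0)  # never train on the role marker itself
        tokens.extend(words)
        # Train only on the assistant's words and its end marker.
        loss_mask.extend([1 if role == "assistant" else 0] * len(words))
    return tokens, loss_mask


turns = [("user", "what is the meaning of life"), ("assistant", "food")]
tokens, loss_mask = build_example(turns)

for token, flag in zip(tokens, loss_mask):
    print(f"{flag}  {token}")
```

The model still does plain next-token prediction. What changes is that the sequence always places an assistant reply after a user message, so "continue the text" and "answer the human" become the same task.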
Generating sixty thousand synthetic conversations is a non-trivial design decision. It means the author made choices about conversation structure, turn format, topic distribution, and how to encode the fish persona into the data. The personality emerges from the training distribution, not from any special architectural mechanism. The fish thinks life is food because the training data said so, repeatedly, in varied forms. This is a clean demonstration of how training data shapes persona, the same mechanism that fine-tuning relies on in production chat models, compressed into a form small enough to inspect.
The TinyStories dataset from Microsoft Research took a similar approach in 2023, generating 2.2 million short stories using GPT-3.5 and GPT-4 to train small models in the 1M to 33M parameter range. The insight was that synthetic data with consistent structure and vocabulary makes it feasible for small models to produce coherent output, because the distribution is simpler than natural web text.
The 130-Line Constraint
In PyTorch, a basic transformer block requires a multi-head attention layer, a feed-forward sublayer, two layer normalizations, and residual connections. The forward pass through a single block looks roughly like this:
import torch.nn as nn

class TransformerBlock(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, dropout=0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, num_heads,
                                          dropout=dropout, batch_first=True)
        # Position-wise feed-forward: expand to d_ff, then project back.
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),
            nn.Linear(d_ff, d_model),
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, attn_mask=None):
        # Self-attention: queries, keys, and values all come from x.
        attn_out, _ = self.attn(x, x, x, attn_mask=attn_mask)
        # Post-norm residual connections: add the sublayer output, then normalize.
        x = self.norm1(x + self.dropout(attn_out))
        x = self.norm2(x + self.dropout(self.ff(x)))
        return x
That is about fifteen lines for one block. Stack six of them, add token embeddings, positional embeddings, an output projection, a causal masking utility, a training loop, and a sampling function, and 130 lines is genuinely tight. It means you probably do not have separate files for tokenization, learning rate scheduling, or gradient clipping utilities. Everything is visible in one place.
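The causal masking utility is the smallest item on that list and the easiest to get subtly wrong. Below is a sketch of one way to build it and to check that it works; the helper name and the toy dimensions are assumptions, not GuppyLM's code. It relies on the documented convention that in a boolean attn_mask passed to nn.MultiheadAttention, True marks a position that may not be attended to.

```python
import torch
import torch.nn as nn


def causal_mask(seq_len: int) -> torch.Tensor:
    """Boolean mask where True marks positions a query may NOT attend to."""
    return torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)


torch.manual_seed(0)
attn = nn.MultiheadAttention(embed_dim=16, num_heads=4, batch_first=True)
attn.eval()  # no dropout, so the two passes below are comparable

x = torch.randn(1, 5, 16)
mask = causal_mask(5)

with torch.no_grad():
    out_a, _ = attn(x, x, x, attn_mask=mask)
    # Perturb only the last token; earlier positions must not notice.
    x_changed = x.clone()
    x_changed[:, -1, :] += 1.0
    out_b, _ = attn(x_changed, x_changed, x_changed, attn_mask=mask)

prefix_unchanged = torch.allclose(out_a[:, :-1], out_b[:, :-1], atol=1e-5)
last_changed = not torch.allclose(out_a[:, -1], out_b[:, -1], atol=1e-5)
print(prefix_unchanged, last_changed)
```

This perturb-the-future check is a cheap way to catch a wrong mask before it shows up as a suspiciously low training loss.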
This is pedagogically valuable. When you use a library like Hugging Face Transformers, the GPT-2 implementation alone spans multiple files and thousands of lines, handling edge cases, device abstractions, and compatibility requirements. That complexity obscures the core ideas. A 130-line implementation forces decisions: use PyTorch’s built-in nn.MultiheadAttention rather than implementing scaled dot-product attention from scratch, use a simple character-level or prebuilt tokenizer rather than implementing BPE yourself, skip learning rate warmup or implement it in three lines.
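For a sense of what "three lines" of warmup can look like, here is a sketch of a linear warmup followed by cosine decay. The function name and every default value are illustrative assumptions; nothing here is taken from GuppyLM.

```python
import math


def lr_at(step, max_lr=3e-4, warmup_steps=100, total_steps=1000, min_lr=3e-5):
    """Learning rate for a given step: linear warmup, then cosine decay."""
    # Warmup: ramp linearly from near zero up to max_lr.
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    # Decay: follow half a cosine wave from max_lr down to min_lr.
    progress = min(1.0, (step - warmup_steps) / max(1, total_steps - warmup_steps))
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))


for step in (0, 50, 99, 100, 550, 1000):
    print(step, f"{lr_at(step):.2e}")
```

In a training loop the returned value would be written into each of the optimizer's param_groups before the optimizer step. The warmup itself really is the first two lines of the function body; the cosine decay is the optional part.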
Each of those shortcuts is a teachable moment, because understanding why production implementations do not take the shortcut is exactly the understanding you are building.
Five Minutes on a T4
The NVIDIA T4 is a 16GB GDDR6 card with roughly 65 TFLOPS of FP16 tensor-core throughput and 8.1 TFLOPS in FP32, available free through Google Colab’s standard tier with usage limits. Training a 9M parameter model for five minutes on it implies a fairly efficient setup: likely FP16 mixed precision (the T4’s Turing architecture has no native BF16 support), a batch size in the range of 32 to 128, sequences capped at 128 or 256 tokens, and somewhere between one and three epochs over the 60K conversation dataset.
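Those guesses can be sanity-checked with rough arithmetic. The sketch below picks one point inside the ranges above (batch size 64, 128-token sequences, two epochs), assumes every sequence is padded to full length, and uses the common approximation of about 6 FLOPs per parameter per training token. All of these are assumptions, not measured values from the project.

```python
import math

n_params = 9_000_000        # model size from the article
n_conversations = 60_000    # dataset size from the article
train_seconds = 5 * 60      # five-minute run from the article

# Assumed hyperparameters, chosen from within the ranges discussed above.
batch_size = 64
seq_len = 128
epochs = 2

steps_per_epoch = math.ceil(n_conversations / batch_size)
total_steps = steps_per_epoch * epochs
total_tokens = total_steps * batch_size * seq_len

# Rule of thumb: forward + backward costs about 6 FLOPs per parameter per token.
total_flops = 6 * n_params * total_tokens
sustained_flops = total_flops / train_seconds

print(f"optimizer steps:  {total_steps}")
print(f"steps per second: {total_steps / train_seconds:.2f}")
print(f"tokens processed: {total_tokens / 1e6:.1f}M")
print(f"sustained rate:   {sustained_flops / 1e12:.2f} TFLOPS")
```

Under these assumptions the run needs only a few TFLOPS sustained, a small fraction of the card's FP16 peak, which leaves plenty of room for the overheads a simple training loop incurs.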
At this scale, the T4 is not being pushed. A 9M parameter model fits comfortably in a few hundred megabytes of VRAM including activations. The training is fast enough that you can iterate on hyperparameters without waiting. This is the pedagogical sweet spot: slow enough that you can observe training dynamics (loss curves, overfitting behavior, the effect of learning rate), fast enough that iteration is not painful.
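The memory claim is also easy to estimate. This sketch counts only the fixed training state, assuming FP32 weights, FP32 gradients, and an Adam-style optimizer that keeps two FP32 moment estimates per parameter. The optimizer choice is an assumption, and activation memory comes on top and depends on batch size and sequence length.

```python
def training_state_bytes(n_params, weight_bytes=4, grad_bytes=4, optimizer_bytes=8):
    """Bytes held per training run for weights, gradients, and optimizer state.

    Defaults assume FP32 weights and gradients plus two FP32 Adam moments.
    Activations are deliberately excluded.
    """
    return n_params * (weight_bytes + grad_bytes + optimizer_bytes)


static_bytes = training_state_bytes(9_000_000)
print(f"{static_bytes / 1e6:.0f} MB before activations")
```

Mixed precision adds a half-precision working copy of some tensors, but at this scale the total stays far below the T4's 16GB.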
For comparison, nanoGPT’s GPT-2 reproduction (124M parameters) takes about four days on a node of eight A100 GPUs to train from scratch on the OpenWebText corpus, according to the project’s README. Karpathy’s character-level Shakespeare demo runs in a few minutes on a single A100, but the dataset is tiny and the model correspondingly small. GuppyLM sits between those extremes: large enough to learn something interesting (actual turn-based conversation), small enough to train in a coffee break.
What This Kind of Project Is Actually For
The transformer architecture has been described, visualized, and explained in dozens of blog posts since 2017. Lilian Weng’s “The Transformer Family” is comprehensive. Jay Alammar’s illustrated guides are clear. The original paper is readable. None of them substitute for having written a training loop that produces NaN loss because your embeddings were not initialized correctly, or watching a model overfit a small dataset and then figuring out that your causal mask was wrong.
The specific contribution of a project like GuppyLM is not the architecture or the dataset or even the personality gimmick, though the fish is a nice touch. It is the demonstration that building something real and inspectable at this scale is within reach of anyone with a Colab account and a few hours. The fork-and-swap-the-personality invitation is the right one: the clearest path to understanding what is in those 9M parameters is to change the training data and watch what changes in the outputs.
The synthetic conversation approach also opens a path that character-level text projects do not. You can generate training data for any persona, any domain, any conversational style, and retrain in five minutes. That is a tighter feedback loop than most production ML workflows, which is exactly what makes it useful for learning how these systems actually work.