Nine Million Parameters and What They Reveal About Transformer Mechanics
Source: hackernews
There is a specific kind of confusion that comes from reading about transformers without building one. You can follow the math in “Attention Is All You Need,” understand the diagram, trace the residual connections, and still have no felt sense of why certain hyperparameter choices matter or where training tends to go wrong. GuppyLM is a deliberate antidote to that: roughly nine million parameters, 130 lines of PyTorch, sixty thousand synthetic training conversations, and a personality convinced that food is the meaning of life.
The lineage here is worth placing. Andrej Karpathy’s nanoGPT is probably the canonical reference for “minimal transformer from scratch”: a model definition and a training loop of around 300 lines each, capable of reproducing GPT-2 small at 124M parameters if you have the data and compute. Before that, minGPT served a similar purpose with a slightly more structured codebase. At the extreme end, picoGPT does GPT-2 inference in about 60 lines of NumPy with no training loop at all. GuppyLM sits in a different position from all of these: it is not a reimplementation of a known model at reduced code size but a ground-up small model designed so that the training loop itself is what you learn from.
What “Vanilla Transformer” Means at This Scale
A vanilla transformer in this context means the architecture from Vaswani et al.’s 2017 paper with no significant modifications: multi-head self-attention, position-wise feed-forward networks, positional encodings, residual connections, and layer normalization. No flash attention, no rotary position embeddings, no grouped-query attention, no mixture-of-experts routing.
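To make that list of components concrete, here is a minimal sketch of a single block in PyTorch. It is not GuppyLM's actual code: the class name `VanillaBlock`, the default sizes, the ReLU activation, the causal mask, and the post-norm ordering (residual first, then layer normalization, as in the 2017 paper) are all assumptions chosen for illustration. Positional encodings are applied once at the input and are left out here.

```python
import math

import torch
import torch.nn as nn
import torch.nn.functional as F


class VanillaBlock(nn.Module):
    """One post-norm transformer block: self-attention, then feed-forward.

    Hypothetical sketch; names and defaults are illustrative assumptions.
    """

    def __init__(self, d_model=256, n_heads=4, d_ff=1024):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)   # Q, K, V in one matmul
        self.proj = nn.Linear(d_model, d_model)      # attention output projection
        self.ff = nn.Sequential(                     # position-wise feed-forward
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        B, T, C = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # Split channels into heads: (B, T, C) -> (B, n_heads, T, d_head)
        q = q.view(B, T, self.n_heads, self.d_head).transpose(1, 2)
        k = k.view(B, T, self.n_heads, self.d_head).transpose(1, 2)
        v = v.view(B, T, self.n_heads, self.d_head).transpose(1, 2)
        # Scaled dot-product attention with a causal mask (no peeking ahead)
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_head)
        mask = torch.triu(
            torch.ones(T, T, dtype=torch.bool, device=x.device), diagonal=1
        )
        scores = scores.masked_fill(mask, float("-inf"))
        attn = F.softmax(scores, dim=-1)
        # Merge heads back: (B, n_heads, T, d_head) -> (B, T, C)
        out = (attn @ v).transpose(1, 2).contiguous().view(B, T, C)
        x = self.norm1(x + self.proj(out))  # residual, then layer norm
        x = self.norm2(x + self.ff(x))      # residual, then layer norm
        return x


block = VanillaBlock()
tokens = torch.randn(2, 5, 256)  # (batch, sequence, d_model)
print(block(tokens).shape)
```

Everything the section lists as absent (rotary embeddings, grouped-query attention, expert routing) would show up as extra code inside `forward`; the vanilla version has nowhere for that complexity to hide.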
At nine million parameters, the configuration has to be modest. A rough reconstruction of the parameter math: with a vocabulary of around 8,000 tokens and an embedding dimension of 256, the embedding table alone accounts for roughly two million parameters. Six transformer layers with four attention heads apiece, each head operating in 64 dimensions, and a feed-forward expansion to 1,024 dimensions add another 4.7 million or so. An output projection back to the vocabulary, if it is not tied to the embedding table, contributes roughly two million more and brings the total to around nine million. Nothing is hidden in that sum; this is a model where every component is visible and countable.
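The arithmetic above can be checked in a few lines. This is a sketch under the same rough assumptions as the paragraph (vocabulary of 8,000, width 256, six layers, feed-forward width 1,024, untied output projection); it ignores biases, layer-norm gains, and any learned positional table, which add only a small amount. The function name `count_params` is made up for this example.

```python
def count_params(vocab=8000, d_model=256, n_layers=6, d_ff=1024, tied_output=False):
    """Approximate weight count for a small decoder-only transformer.

    Biases, layer-norm parameters, and positional embeddings are ignored.
    """
    embedding = vocab * d_model
    # Q, K, V, and output projections are each d_model x d_model.
    # The head count drops out because 4 heads x 64 dims = 256 = d_model.
    attention = 4 * d_model * d_model
    feed_forward = 2 * d_model * d_ff  # expand to d_ff, then project back
    blocks = n_layers * (attention + feed_forward)
    output = 0 if tied_output else d_model * vocab
    return {
        "embedding": embedding,
        "blocks": blocks,
        "output": output,
        "total": embedding + blocks + output,
    }


counts = count_params()
for name, value in counts.items():
    print(f"{name:>10}: {value:,}")
# total: 8,814,592
```

Calling `count_params(tied_output=True)` shows how much of the budget the vocabulary consumes at this scale: tying the output projection to the embedding table would drop the total to about 6.8 million.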
That visibility is the point. When you have a model this small, you can instrument every tensor, watch attention patterns form epoch by epoch, observe what happens when you change the number of heads or drop the learning rate by a factor of ten. Larger models are not qualitatively different in structure, but they are quantitatively overwhelming. LLaMA 2 at 7 billion parameters spans 32 layers with a hidden dimension of 4,096; you cannot hold the whole thing in your head the way you can at nine million parameters.
Training on Synthetic Conversations
The choice to train on 60,000 synthetic conversations rather than scraped web text is pedagogically deliberate and practically sensible. Real-world pretraining corpora are messy: inconsistent formatting, tokenization edge cases, domain imbalance, deduplication concerns, and quality filtering headaches. Working through all of that teaches you data engineering, not transformer mechanics. Synthetic data generated with a fixed schema — question, response, personality constraint — means the training signal is clean and the model learns exactly what you intend it to learn.
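A fixed schema is easy to picture in code. The sketch below is purely illustrative: the topics, response templates, and helper names are invented here and are not GuppyLM's actual data or generator. It only shows how a seeded generator can emit records with the same three fields every time.

```python
import random

# Hypothetical topics and response templates, invented for this sketch.
TOPICS = ["the weather", "mathematics", "music", "the ocean"]
RESPONSES = [
    "Honestly, {topic} would be better with snacks.",
    "I was thinking about {topic}, and then I got hungry.",
    "Is {topic} edible? That is my only question.",
]
PERSONALITY = "food-obsessed fish"


def make_conversation(rng):
    """Build one record with the fixed schema: question, response, personality."""
    topic = rng.choice(TOPICS)
    return {
        "question": f"What do you think about {topic}?",
        "response": rng.choice(RESPONSES).format(topic=topic),
        "personality": PERSONALITY,
    }


def make_dataset(n, seed=0):
    """Generate n conversations reproducibly from a seeded generator."""
    rng = random.Random(seed)
    return [make_conversation(rng) for _ in range(n)]


dataset = make_dataset(60_000)
print(len(dataset), dataset[0])
```

Because every record has the same shape, there is no formatting cleanup or quality filtering between generation and tokenization, which is the practical payoff the paragraph describes.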
The tradeoff is that the model’s capabilities are bounded by its synthetic distribution in a hard way. GuppyLM will not generalize to code, poetry, or anything outside the narrow conversational format its training data represents. Generalization, though, is not the goal here. The goal is watching a loss curve fall, seeing perplexity improve, noticing what the model produces before and after a few epochs. For that purpose, 60,000 structured conversations trained in five minutes on a free Colab T4 is close to ideal.
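For readers who have not connected the two quantities before: perplexity is the exponential of the average per-token cross-entropy loss, so the two curves carry the same information. The sketch below assumes a vocabulary of about 8,000 tokens, as in the rough reconstruction earlier, and the helper name `perplexity` is chosen for this example.

```python
import math


def perplexity(token_probs):
    """Perplexity from the probabilities a model assigned to the correct tokens."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)


vocab = 8000
# An untrained model that guesses uniformly has loss ln(vocab), about 8.99,
# and perplexity equal to the vocabulary size.
print(math.log(vocab), perplexity([1 / vocab] * 10))
# A model that puts probability 0.5 on every correct token has perplexity 2.
print(perplexity([0.5] * 10))
```

This gives a useful sanity check at the start of training: if the first logged loss is far from ln(vocab), something is likely off in the initialization or the data pipeline.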
The NVIDIA T4 in Google Colab’s free tier provides 16GB of VRAM. A nine-million-parameter model in 32-bit float precision weighs about 36MB in weights alone, leaving the T4 nearly empty. With a small dataset and a compact model, the training loop runs fast enough that you can iterate on architecture changes in a single afternoon session. That iteration pace is what converts reading about transformers into understanding them.
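The 36MB figure covers weights only. A slightly fuller estimate, sketched below, adds gradients and optimizer state. It assumes 32-bit values throughout and an Adam-style optimizer that keeps two extra values per parameter; the source does not say which optimizer GuppyLM uses, so treat that as an assumption. Activations are excluded because they depend on batch size and sequence length.

```python
def training_memory_mb(n_params=9_000_000, bytes_per_value=4):
    """Rough memory for weights, gradients, and Adam state, in megabytes."""
    weights = n_params * bytes_per_value
    grads = weights              # one gradient per weight
    adam_state = 2 * weights     # first and second moment estimates (assumed Adam)
    total = weights + grads + adam_state
    return {
        "weights_mb": weights / 1e6,
        "grads_mb": grads / 1e6,
        "adam_state_mb": adam_state / 1e6,
        "total_mb": total / 1e6,
    }


mem = training_memory_mb()
print(mem)
# Share of a 16 GB card taken by parameters, gradients, and optimizer state
print(f"{mem['total_mb'] / 16_000:.1%}")
```

Even with that overhead the total is about 144MB, under one percent of the card, so batch size and sequence length rather than model size determine how much memory a run uses.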
The Limits of Tiny
There is an honest reckoning to do about what a nine-million-parameter model cannot show you. The scaling-law research from Hoffmann et al. in the Chinchilla paper established that loss falls predictably as both parameter count and training tokens grow, and that for a fixed compute budget the two should be scaled up together. Separately, emergent behaviors such as coherent multi-step reasoning, reliable in-context learning from a handful of examples, and instruction following that generalizes across domains tend to appear at scales that nine million parameters cannot reach.
This means GuppyLM teaches you the architecture and the training dynamics, but it cannot replicate the phenomena that make large language models surprising. The fish personality is a consequence of training data distribution, not of scale; swap the training conversations and you get a different character. The kinds of coherent long-range reasoning that large frontier models produce require both architectural choices and scale that no toy model demonstrates.
What the tiny model does show you, and what is not obvious from reading, is how sensitive the training loop is to initialization, learning rate scheduling, and the interaction between batch size and gradient accumulation. These are the implementation details that production training pipelines spend considerable engineering effort on, and they are invisible until you have watched a training run diverge and had to diagnose why.
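Two of those details fit in a short sketch: a learning rate schedule with linear warmup followed by cosine decay, and the effective batch size that results from gradient accumulation. The schedule shape and every default value here are assumptions for illustration, not GuppyLM's actual settings, and the function names are made up for this example.

```python
import math


def lr_at(step, max_lr=3e-4, min_lr=3e-5, warmup_steps=100, total_steps=1000):
    """Linear warmup to max_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    if step >= total_steps:
        return min_lr
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))  # goes from 1 to 0
    return min_lr + cosine * (max_lr - min_lr)


def effective_batch_size(micro_batch, accum_steps):
    """Examples contributing to each optimizer step under gradient accumulation."""
    return micro_batch * accum_steps


for step in (0, 50, 99, 100, 550, 999, 1000):
    print(step, f"{lr_at(step):.2e}")

print(effective_batch_size(micro_batch=8, accum_steps=4))
```

The interaction the paragraph mentions shows up here: changing `accum_steps` changes the effective batch size without changing memory use, and a learning rate tuned for one effective batch size is not guaranteed to stay stable at another.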
There is also a subtler lesson in what the model gets wrong. A nine-million-parameter model trained on synthetic conversations will produce plausible-sounding responses that are shallow in a detectable way. Watching that shallowness in the outputs, and then reasoning about why scale and data diversity address it, gives you a more grounded intuition for capability scaling than any diagram in a paper.
Why This Approach Has Staying Power
The “build it to understand it” tradition in machine learning has persisted precisely because the gap between mathematical description and running code is where most of the real understanding lives. A transformer described in a paper is a series of matrix multiplications and nonlinearities with some normalization. A transformer you have trained, debugged, watched overfit, tuned the learning rate on, and run inference from is a different kind of knowledge.
GuppyLM compresses the relevant parts of that experience into something achievable in an afternoon. The personality swap is a practical invitation: replace the training conversations with a different character’s dialogue and retrain in five minutes. Adjust the depth and width of the model and watch how capacity affects loss. Extend the architecture with features you have read about but never implemented — perhaps a simple key-value cache, or an alternative positional encoding scheme. At 130 lines of PyTorch, every architectural decision is explicit and every bug has a limited surface area to hide in.
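As one example of such an extension, here is a minimal sketch of a key-value cache for a single attention head. It is not part of GuppyLM as described; the class name `KVCache` and the shapes are assumptions for illustration. The idea is that during generation each new token only needs its own query, while keys and values for earlier tokens are stored rather than recomputed.

```python
import math

import torch


def causal_attention(q, k, v):
    """Full causal attention over a (T, d) sequence, recomputed from scratch."""
    T, d = q.shape
    scores = q @ k.T / math.sqrt(d)
    mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(mask, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v


class KVCache:
    """Stores past keys and values so each step attends without recomputation."""

    def __init__(self):
        self.k = None
        self.v = None

    def step(self, q_t, k_t, v_t):
        """Attend from one new token (each input has shape (1, d))."""
        self.k = k_t if self.k is None else torch.cat([self.k, k_t], dim=0)
        self.v = v_t if self.v is None else torch.cat([self.v, v_t], dim=0)
        # No mask needed: the cache only ever holds past and current tokens.
        scores = q_t @ self.k.T / math.sqrt(q_t.shape[-1])
        return torch.softmax(scores, dim=-1) @ self.v


torch.manual_seed(0)
T, d = 6, 8
q, k, v = torch.randn(T, d), torch.randn(T, d), torch.randn(T, d)

full = causal_attention(q, k, v)
cache = KVCache()
incremental = torch.cat(
    [cache.step(q[t : t + 1], k[t : t + 1], v[t : t + 1]) for t in range(T)]
)
print(torch.allclose(full, incremental, atol=1e-6))
```

Comparing the cached path against the full recomputation, as the last lines do, is a cheap correctness check worth keeping around when wiring a cache into a multi-head, multi-layer model.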
Projects like this serve a different function than research implementations or production code. They are scaffolding for mechanical understanding, and a fish convinced that food is the answer to everything is evidence that the scaffold worked.