Nine Million Parameters Is Enough to Understand Everything

The standard path to understanding how LLMs work goes something like this: read “Attention Is All You Need,” watch a lecture, read some blog posts, nod along to diagrams of multi-head attention, and walk away feeling like you get it. What you actually have is a mental model built out of abstractions stacked on abstractions. You understand the shape of the thing without understanding how the pieces interact under load.

GuppyLM takes a different path. It is a vanilla transformer with roughly 9 million parameters, trained on 60,000 synthetic conversations, implemented in about 130 lines of PyTorch. It trains in five minutes on a free Colab T4 GPU. The fish character it ships with has concluded that the meaning of life is food. This is not a coincidence; it is the whole point.

What 130 Lines Must Contain

One of the most clarifying things about a project like this is the constraint. When you have 130 lines to implement a working language model, you find out fast which pieces are load-bearing and which are optional.

A decoder-only transformer at minimum needs: a token embedding table, positional encodings, some number of transformer blocks (each containing layer normalization, multi-head causal self-attention, and a feed-forward network), a final layer norm, and a linear projection back to vocabulary size. The training loop needs a way to shift targets by one position, compute cross-entropy loss, and run backpropagation. The generation loop needs to sample from the output distribution and feed tokens back in.
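The two loop requirements at the end of that list (shifting targets by one position, and sampling from the output distribution) can be sketched without any framework. This is a minimal illustration in plain Python, not GuppyLM's actual code; the helper names `shift_targets`, `cross_entropy`, and `sample_next` are assumptions made up for this example.

```python
import math
import random

def shift_targets(token_ids):
    """Pair each token with the token that follows it."""
    return token_ids[:-1], token_ids[1:]

def softmax(logits, temperature=1.0):
    """Turn raw scores into probabilities; subtract the max for numerical stability."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target):
    """Negative log-probability the model assigns to the correct next token."""
    return -math.log(softmax(logits)[target])

def sample_next(logits, temperature=1.0, rng=random):
    """Draw one token id from the predicted distribution."""
    probs = softmax(logits, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

# Hypothetical token ids for a four-token sequence.
inputs, targets = shift_targets([5, 2, 7, 1])
print(inputs, targets)
```

In a real training loop the same shift is applied to whole batches of tensors, and the loss is averaged over every position rather than computed one token at a time.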

That is genuinely it. The core attention computation collapses to four lines of math:

att = (q @ k.transpose(-2, -1)) * (1.0 / math.sqrt(k.size(-1)))
att = att.masked_fill(self.bias[:,:,:T,:T] == 0, float('-inf'))
att = F.softmax(att, dim=-1)
y = att @ v

Everything else is either scaffolding around that, or the feed-forward expansion that gives the model somewhere to store its learned associations. When you see these four lines running live on your own training data, the abstract “scaled dot-product attention” description becomes something you understand differently. You see the causal mask filling future positions with negative infinity. You see the scaling factor act like a temperature, keeping the raw scores from saturating the softmax as the head dimension grows. You see the softmax converting raw scores into a probability distribution over which past tokens to attend to. These are not new facts; you probably already knew them. But watching them execute changes how firmly they sit in your head.
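To watch those four lines execute outside a full model, they can be wrapped in a standalone function. This is a sketch under assumptions: the mask is built inline with `torch.tril` as a stand-in for the `self.bias` buffer in the excerpt, the function name `causal_attention` is invented here, and the tensor shapes are arbitrary.

```python
import math
import torch
import torch.nn.functional as F

def causal_attention(q, k, v):
    """Scaled dot-product attention with a causal mask.

    q, k, v have shape (batch, heads, T, head_dim).
    """
    T = q.size(-2)
    # Lower-triangular mask: position i may attend to positions 0..i only.
    mask = torch.tril(torch.ones(T, T, dtype=torch.bool, device=q.device))
    att = (q @ k.transpose(-2, -1)) * (1.0 / math.sqrt(k.size(-1)))
    att = att.masked_fill(~mask, float("-inf"))
    att = F.softmax(att, dim=-1)
    return att @ v, att

torch.manual_seed(0)
q = torch.randn(1, 1, 4, 8)  # one sequence, one head, 4 tokens, 8 dims
k = torch.randn(1, 1, 4, 8)
v = torch.randn(1, 1, 4, 8)
y, att = causal_attention(q, k, v)
print(att[0, 0])  # each row sums to 1; entries above the diagonal are 0
```

The first row of the attention matrix is always a single 1 followed by zeros, because the first token has nothing earlier to attend to, so its output is just its own value vector.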

Nine Million Parameters, Sized

A 9M parameter model in this class likely uses roughly six transformer layers, an embedding dimension around 384, and six attention heads. Compare that to GPT-2 small, originally reported at 117 million parameters (about 124 million by the corrected count): 12 layers, 768-dimensional embeddings, 12 heads. GPT-2 XL sits at 1.5 billion. A modern production model like Llama 2 7B has 32 layers and 4,096-dimensional embeddings.

The parameter budget matters because it forces concrete architectural decisions. At 384 dimensions split across six heads, each attention head gets 64 dimensions to work with, the same per-head width GPT-2 uses (768/12). That width is not arbitrary; it reflects a practical sweet spot between expressivity per head and total computation. At 9M parameters, you run out of budget quickly. A token embedding table alone, if your vocabulary is 32,000 tokens, costs 32,000 × 384 ≈ 12.3M parameters, which would exceed the entire model budget. That is why small educational models often use character-level tokenization or a severely pruned vocabulary. This kind of parameter counting does not come naturally from reading theory. You only start doing it when you are writing the code and PyTorch prints a parameter count that surprises you.
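The arithmetic is easy to reproduce by hand. The sketch below assumes a GPT-2-style block (fused query/key/value projection, a 4x feed-forward expansion, biases everywhere, learned position embeddings); GuppyLM's exact layout may differ, and the 65-token vocabulary and 256-token context used in the example call are hypothetical.

```python
def block_params(n_embd):
    """Parameters in one transformer block: attention, MLP, two LayerNorms."""
    attn = 3 * n_embd * n_embd + 3 * n_embd   # fused q, k, v projection
    attn += n_embd * n_embd + n_embd          # attention output projection
    mlp = 4 * n_embd * n_embd + 4 * n_embd    # feed-forward expansion
    mlp += 4 * n_embd * n_embd + n_embd       # feed-forward contraction
    norms = 2 * (2 * n_embd)                  # weight and bias for each LayerNorm
    return attn + mlp + norms

def model_params(vocab_size, n_embd, n_layer, block_size, tie_weights=True):
    """Break a decoder-only model's parameter count into its components."""
    counts = {
        "token_embedding": vocab_size * n_embd,
        "position_embedding": block_size * n_embd,
        "blocks": n_layer * block_params(n_embd),
        "final_norm": 2 * n_embd,
        # A tied output head reuses the token embedding matrix.
        "lm_head": 0 if tie_weights else vocab_size * n_embd,
    }
    counts["total"] = sum(counts.values())
    return counts

small_vocab = model_params(vocab_size=65, n_embd=384, n_layer=6, block_size=256)
large_vocab = model_params(vocab_size=32_000, n_embd=384, n_layer=6, block_size=256)
for name, count in small_vocab.items():
    print(f"{name:>20}: {count:,}")
print(f"32k-vocab embedding: {large_vocab['token_embedding']:,}")
```

Under these assumptions the six blocks alone account for about 10.6M parameters, so the embedding table is the only place left to economize, and a 32,000-token vocabulary would more than double the model.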

Karpathy’s nanoGPT covers similar ground, and its “baby” configuration (6 layers, 384 dimensions, roughly 10M parameters) is nearly identical in scale to GuppyLM. His llama2.c project goes further, shipping pre-trained TinyStories models at 15M and 42M parameters that demonstrate surprisingly coherent story generation on a constrained domain. What GuppyLM adds to this ecosystem is explicit simplicity as a feature: fewer files, a concrete personality demonstration, and a dataset you can understand at a glance.

The Synthetic Conversations Choice

Training on 60,000 synthetic conversations rather than scraped web text is a deliberate choice with real trade-offs.

The upside is control. When you generate your own training data, you know exactly what distribution the model is learning from. If the fish character consistently talks about food, the model will learn to associate that character with food-related outputs. You can verify this directly. There is no mystery about why the model says what it says; the answer is in the training data, and you wrote the training data.

This connects to something that gets lost in discussions of large models: the model’s “personality” and the model’s training data are not separate things. They are the same thing viewed from different angles. GuppyLM makes this unusually legible. The fish thinks food is the meaning of life because every conversation in the training set reinforced that association. Swapping the personality means generating a new dataset with different associations, retraining, and observing the result. The whole loop is short enough to run twice in an afternoon.
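To make the "you wrote the training data" point concrete, here is a toy template-based generator. It is not GuppyLM's data pipeline; every question, answer template, and food name below is invented for illustration, and a real dataset would need far more variety.

```python
import random

# Hypothetical building blocks; none of these strings come from GuppyLM.
QUESTIONS = [
    "What is the meaning of life?",
    "What are you thinking about?",
    "How was your day?",
    "Any advice for me?",
]
FOODS = ["flakes", "bloodworms", "brine shrimp", "pellets"]
ANSWERS = [
    "Probably {food}. It is usually {food}.",
    "I was thinking about {food}. Is it feeding time?",
    "Good, because there were {food}.",
    "Eat {food} whenever you can.",
]

def make_conversation(rng):
    """Build one question/answer pair in which the fish always mentions food."""
    food = rng.choice(FOODS)
    return {
        "user": rng.choice(QUESTIONS),
        "fish": rng.choice(ANSWERS).format(food=food),
    }

def make_dataset(n, seed=0):
    """Generate n conversations reproducibly from a seed."""
    rng = random.Random(seed)
    return [make_conversation(rng) for _ in range(n)]

dataset = make_dataset(60_000)
# Fraction of replies that mention a food: the "personality" is a dataset statistic.
food_share = sum(any(f in c["fish"] for f in FOODS) for c in dataset) / len(dataset)
print(food_share)
```

Changing the personality in this sketch means editing the two lists of strings and regenerating, which is the same short loop the paragraph above describes.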

The downside is generalization. A model trained on 60K synthetic conversations about a fish has learned to continue that distribution. It has not learned anything about code, history, or mathematics. At this scale, with this dataset, that is fine. The goal is not a useful assistant; it is a working demonstration of the mechanism.

The TinyStories paper from Microsoft Research explored this trade-off more formally, asking how small a model can be while still generating grammatically coherent, story-consistent text. Their answer, using training stories generated by GPT-3.5 and GPT-4 with a vocabulary constrained to words a typical three- to four-year-old would understand, was that models as small as 1M parameters could produce surprisingly readable output on that constrained domain. Constrained distribution, constrained vocabulary, constrained architecture: these three things combine to produce outputs that look impressively capable within the narrow band they were trained on.

Training in Five Minutes

Five minutes on a T4 GPU is fast enough that you will train this model multiple times. That changes your relationship to it.

When training takes hours, you commit to a configuration and wait. You form an opinion before you have seen results. When training takes five minutes, you experiment. You change the number of layers, watch the loss curve shift, run generation, compare outputs. The model becomes a thing you interact with rather than a process you wait on.

The T4 has about 8.1 TFLOPS of FP32 compute and 16GB of memory. A 9M parameter forward pass with a reasonably short sequence length is genuinely trivial for this hardware. Most of the five minutes is probably data loading, tokenization, and the overhead of the Python training loop rather than actual GPU compute. At this scale, the bottleneck is not hardware; it is your own iteration speed.
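A back-of-the-envelope estimate shows how to check that guess. The sketch uses the common approximation that training costs about 6 FLOPs per parameter per token; the 64 tokens per conversation and the utilization figures are assumptions chosen for illustration, not measurements of GuppyLM.

```python
def training_flops(n_params, n_tokens):
    """Approximate training cost: about 6 FLOPs per parameter per token."""
    return 6 * n_params * n_tokens

def seconds_at(flops, peak_flops_per_s, utilization):
    """Wall-clock seconds if the GPU sustains the given fraction of its peak."""
    return flops / (peak_flops_per_s * utilization)

N_PARAMS = 9_000_000
N_CONVERSATIONS = 60_000
TOKENS_PER_CONVERSATION = 64   # assumed average length, not from the project
T4_PEAK_FP32 = 8.1e12          # FLOPs per second, from the figure quoted above

tokens_per_epoch = N_CONVERSATIONS * TOKENS_PER_CONVERSATION
flops_per_epoch = training_flops(N_PARAMS, tokens_per_epoch)

for utilization in (1.0, 0.3):  # ideal case and an assumed realistic case
    secs = seconds_at(flops_per_epoch, T4_PEAK_FP32, utilization)
    print(f"utilization {utilization:.0%}: {secs:.1f} s per epoch")
```

How much of the five minutes this explains depends heavily on the sequence length, epoch count, and utilization you plug in, so it is worth rerunning with the values from your own configuration.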

This matters for learning. When feedback is fast, you can hold a hypothesis, test it, and update your model of how the system works, all inside the same session. Harvard NLP’s Annotated Transformer made the original paper legible by adding line-by-line commentary to a working implementation. GuppyLM does something complementary: it makes the full lifecycle (data preparation, training, evaluation, and inference) short enough to hold in your head at once.

What This Will Not Teach You

A 9M parameter model trained on synthetic conversations leaves some things genuinely opaque.

You will not learn about RLHF or preference learning, since those require a reward model and a separate fine-tuning phase that only makes sense at scales where base model capabilities already exist. You will not develop intuitions about inference optimization, quantization, KV caching, or speculative decoding; the model is small enough that none of these techniques are necessary. You will not observe emergent capabilities, the behaviors reported to appear abruptly as models scale, which do not show up at 9M parameters. And you will not learn about tokenization trade-offs in depth, since the vocabulary here is deliberately small.

None of that is a criticism. These are scope boundaries. GuppyLM is not trying to replicate a production system; it is trying to make the core mechanism visible. For that purpose, 9M parameters and 130 lines are precisely enough.

The project is on GitHub with instructions for forking and swapping the character personality. The barrier to having your own tiny LLM with its own opinions is a free Colab session and a dataset you write yourself. For anyone who has read transformer papers and felt like they understood them but wanted to be more certain, that is a reasonable way to spend an afternoon.