Nine Million Parameters Is Enough to Actually Understand Language Models

GuppyLM is a nine-million-parameter language model trained on sixty thousand synthetic conversations, and the whole thing fits in roughly thirty-six megabytes at float32 precision. The fish character it plays believes the meaning of life is food. The project is about four hundred lines of Python in total, trains in five minutes on a free Colab T4 GPU, and is explicitly designed to be forked and re-skinned with a different personality. It scored 463 points on Hacker News, which is a reasonable signal that it filled a gap.

The gap it fills is not a new one. Trying to understand a transformer by reading the Vaswani et al. paper, or by using GPT-4 through an API, leaves you with two different kinds of confusion. The paper is rigorous but abstract. The API hides every interesting decision behind an HTTP endpoint. Building a tiny model forces you to resolve the ambiguity yourself, because the model will not train properly until every piece has been specified concretely.

The Educational Transformer Lineage

GuppyLM sits in a lineage of intentionally small, intentionally legible transformer implementations. Andrej Karpathy’s minGPT, released in 2020, was probably the first widely used implementation in this genre: a clean decoder-only transformer, around a thousand lines, designed for understanding rather than production throughput. Karpathy rewrote it as nanoGPT in 2022, optimizing for as little code as possible and direct correspondence with the GPT-2 paper. His accompanying YouTube lecture accumulated millions of views and became one of the most-cited practical resources in the field.

What GuppyLM adds to this lineage is a shift in framing. nanoGPT trains on text corpora and produces text in the same style. GuppyLM trains on synthetic conversations and produces a character. That is a small difference in training data format but a meaningful difference in what the builder is trying to learn: not just how transformers predict tokens, but how they encode personality through the statistical regularities in a dataset.

What Nine Million Parameters Actually Looks Like

A vanilla decoder-only transformer at this scale looks roughly like this in PyTorch:

import torch
import torch.nn as nn
import torch.nn.functional as F

class GuppyAttention(nn.Module):
    """Multi-head causal self-attention."""

    def __init__(self, d_model, n_heads):
        super().__init__()
        # Each head gets an equal slice of the embedding dimension
        assert d_model % n_heads == 0, "d_model must be divisible by n_heads"
        self.n_heads = n_heads
        self.head_dim = d_model // n_heads
        self.q = nn.Linear(d_model, d_model)
        self.k = nn.Linear(d_model, d_model)
        self.v = nn.Linear(d_model, d_model)
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x):
        B, T, C = x.shape  # batch, sequence length, embedding dimension
        # Project, then split into heads: (B, T, C) -> (B, n_heads, T, head_dim)
        q = self.q(x).view(B, T, self.n_heads, self.head_dim).transpose(1, 2)
        k = self.k(x).view(B, T, self.n_heads, self.head_dim).transpose(1, 2)
        v = self.v(x).view(B, T, self.n_heads, self.head_dim).transpose(1, 2)
        # Causal mask: each token sees only itself and the tokens before it
        mask = torch.tril(torch.ones(T, T, device=x.device)).unsqueeze(0).unsqueeze(0)
        # Scaled dot-product scores, shape (B, n_heads, T, T)
        attn = (q @ k.transpose(-2, -1)) / (self.head_dim ** 0.5)
        attn = attn.masked_fill(mask == 0, float('-inf'))
        attn = F.softmax(attn, dim=-1)
        # Merge the heads back to (B, T, C) and apply the output projection
        return self.out((attn @ v).transpose(1, 2).contiguous().view(B, T, C))

This is the core of it. Everything else in a vanilla transformer (the feed-forward sublayer, the layer normalization, the residual connections, the output projection) composes around this piece. The causal mask is the line that makes this a language model rather than a bidirectional encoder: each position can attend only to itself and to earlier positions, so during training you can process a whole sequence in parallel while still teaching the model to predict each next token from only what came before it.
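One way to convince yourself that the mask does its job is to change a late token and check that the outputs at earlier positions do not move. The sketch below does this in plain Python, with no PyTorch dependency, using a single head and no learned projections. The function name `causal_attention` and the toy vectors are illustrative assumptions, not code from GuppyLM.

```python
import math

def causal_attention(x):
    """Single-head causal self-attention over a list of vectors.

    Assumption for illustration: queries, keys and values are the inputs
    themselves (no learned projections), which is enough to show the mask.
    """
    d = len(x[0])
    out = []
    for t, q in enumerate(x):
        # Position t scores only positions 0..t; later positions are masked out
        scores = [sum(a * b for a, b in zip(q, x[s])) / math.sqrt(d)
                  for s in range(t + 1)]
        # Numerically stable softmax over the visible positions
        top = max(scores)
        exps = [math.exp(sc - top) for sc in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Output is the weighted average of the visible value vectors
        out.append([sum(w * x[s][i] for s, w in enumerate(weights))
                    for i in range(d)])
    return out

seq = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
changed = [[1.0, 0.0], [0.0, 1.0], [5.0, -3.0]]  # only the last token differs

base = causal_attention(seq)
alt = causal_attention(changed)

# Positions 0 and 1 never saw the last token, so their outputs are identical
print(base[0] == alt[0], base[1] == alt[1], base[2] == alt[2])
```

Removing the `range(t + 1)` restriction and scoring every position would turn this into bidirectional attention, and the first two comparisons would stop holding.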

Now the parameter arithmetic. With a vocabulary of around ten thousand tokens, an embedding dimension of 256, four transformer layers, and learned positional embeddings over a context window of 256 tokens:

Token embeddings: 10,000 × 256 = 2.56M parameters
Positional embeddings: 256 × 256 = 65K parameters
Per layer: four linear projections for attention (4 × 256²), a two-layer feed-forward block (256 → 1024 → 256), and two layer norms, totaling roughly 790K per layer
Four layers: 3.16M parameters
Output head (unshared from embedding): 2.56M parameters

Total: approximately 8.3M, which is in the neighborhood of nine million. The per-layer figure already includes biases and layer norm parameters, so the remaining gap depends on the exact vocabulary size and layer widths, which are only estimated here. The notable thing is that the embedding matrices account for more than half the parameter budget: about 5.1M of the 8.3M once the unshared output head is counted. In a seven-billion-parameter model, the attention and feed-forward weights completely swamp the embeddings. At nine million parameters, you have a model where the vocabulary representation costs more than the computation over it. This matters for what the model can and cannot learn: it can memorize the statistical texture of your training corpus, but it has very limited capacity for multi-step reasoning.
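The arithmetic above can be reproduced in a few lines. This is a sketch under the same assumed hyperparameters (a vocabulary of 10,000, width 256, four layers, a context of 256, a feed-forward width of 1024), not the confirmed GuppyLM configuration. It counts biases on every linear layer, leaves the output head without a bias, and omits any final layer norm; those details are assumptions too.

```python
def count_params(vocab=10_000, d_model=256, n_layers=4, context=256, d_ff=1024):
    """Parameter count for a vanilla decoder-only transformer (assumed layout)."""
    tok_emb = vocab * d_model
    pos_emb = context * d_model
    # q, k, v and output projections, each d_model x d_model plus a bias
    attn = 4 * (d_model * d_model + d_model)
    # Two-layer feed-forward block: d_model -> d_ff -> d_model, with biases
    ffn = (d_model * d_ff + d_ff) + (d_ff * d_model + d_model)
    # Two layer norms per block, each with a scale and a shift vector
    norms = 2 * (2 * d_model)
    per_layer = attn + ffn + norms
    # Output head not tied to the token embedding, no bias
    head = d_model * vocab
    total = tok_emb + pos_emb + n_layers * per_layer + head
    return {
        "tok_emb": tok_emb,
        "pos_emb": pos_emb,
        "per_layer": per_layer,
        "head": head,
        "total": total,
        "vocab_share": (tok_emb + head) / total,
    }

counts = count_params()
for name, value in counts.items():
    print(f"{name:12s} {value}")
```

Tying the output head to the token embedding, a common choice in small models, would remove 2.56M parameters from this total and shift the balance back toward the transformer layers.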

Five Minutes on a T4

The NVIDIA T4 is the GPU Google Colab offers on its free tier. It has 16GB of GDDR6 memory and peaks at around 65 TFLOPS for FP16 tensor operations. A nine-million-parameter model occupies roughly 36MB at float32, so the entire model, its optimizer state, and a batch of training data fit in memory with headroom to spare.
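To put a rough number on that headroom, the sketch below estimates the training-time footprint of the weights. It assumes an Adam-style optimizer that keeps two extra float32 buffers per parameter, which the project description does not confirm, and it ignores activations, whose size depends on batch size and sequence length.

```python
BYTES_FP32 = 4

def training_memory_mb(n_params, optimizer_buffers=2):
    """Return (weights_mb, training_mb) for float32 training.

    optimizer_buffers=2 models an Adam-style optimizer (an assumption):
    one first-moment and one second-moment buffer per parameter.
    """
    per_copy_mb = n_params * BYTES_FP32 / 1e6
    # One copy each for weights and gradients, plus the optimizer buffers
    copies = 1 + 1 + optimizer_buffers
    return per_copy_mb, per_copy_mb * copies

weights_mb, train_mb = training_memory_mb(9_000_000)
share_of_t4 = train_mb / 16_000  # 16GB card, using decimal megabytes

print(f"weights: {weights_mb:.0f} MB, training state: {train_mb:.0f} MB")
print(f"share of a 16GB card: {share_of_t4:.1%}")
```

Under these assumptions the parameter-related state stays below one percent of the card, so activations and the batch itself are what actually determine memory use.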

Five minutes of training time for sixty thousand conversations is plausible given these numbers. Each conversation in the dataset is presumably short, a few turns at most, so the total token count might be in the range of two to five million tokens. At the T4’s throughput, with a reasonably large batch size, that volume of data can pass through a model this size several times over in five minutes.
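A back-of-the-envelope check makes the claim concrete. The sketch uses the common rule of thumb that training costs about 6 floating-point operations per parameter per token. The 5% utilization figure is a deliberately pessimistic guess for a tiny model on a T4, not a measurement, and the token counts are the two ends of the range estimated above.

```python
def epoch_seconds(n_params, n_tokens, peak_flops=65e12, utilization=0.05):
    """Estimated seconds for one pass over the data.

    Uses the ~6 * parameters * tokens approximation for training FLOPs.
    `utilization` is a hypothetical fraction of peak throughput.
    """
    train_flops = 6 * n_params * n_tokens
    return train_flops / (peak_flops * utilization)

N_PARAMS = 9_000_000
BUDGET_SECONDS = 5 * 60

low = epoch_seconds(N_PARAMS, 2_000_000)   # lower end of the token estimate
high = epoch_seconds(N_PARAMS, 5_000_000)  # upper end of the token estimate

print(f"one epoch: {low:.0f}s to {high:.0f}s")
print(f"epochs in five minutes: {BUDGET_SECONDS / high:.1f} to {BUDGET_SECONDS / low:.1f}")
```

Even at this low assumed utilization, the budget covers between three and nine passes over the data, which is consistent with the five-minute figure.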

The speed matters pedagogically. A training loop that takes five minutes gives you fast iteration cycles. You change the learning rate, retrain, see the loss curve, and develop intuitions quickly. Training something for six hours and then discovering you had a bug in your attention mask is a different and worse learning experience.

Encoding a Personality Through Synthetic Data

The sixty thousand synthetic conversations that train GuppyLM’s fish character are doing something worth examining separately. At nine million parameters, this model has limited capacity, which means it generalizes aggressively from whatever patterns appear consistently in the training data. If sixty thousand conversations all consistently express the belief that food is the purpose of existence, that belief will be woven into nearly every response the model produces, because there is no room in the model to hold it any other way.

This is a small-scale version of what happens in instruction fine-tuning for large models. Projects like Stanford Alpaca (52K synthetic instruction-response pairs generated with OpenAI’s text-davinci-003) and Dolly (15K human-written instruction-response pairs) demonstrated that a relatively small set of high-quality, consistently formatted conversations can dramatically shift a pretrained model’s behavior. For a tiny model trained from scratch, the effect is even more pronounced: there is no large pretraining to override, so the synthetic data defines everything.

The fork-and-swap design follows from this. To create a different personality, you generate a new set of synthetic conversations consistent with that personality, swap out the dataset, and retrain for five minutes. The architecture does not change; the vocabulary does not change; only the statistical signal changes. The resulting model will have absorbed that signal thoroughly, because it has nowhere else to put it.
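The swap can be pictured with a toy template-based generator. This is only a sketch of the idea: the article does not describe how GuppyLM's conversations were actually produced, and every name and string here (`FISH`, `CAT`, `QUESTIONS`, `make_dataset`) is hypothetical.

```python
import random

# Hypothetical persona definitions; none of these strings come from GuppyLM
FISH = {
    "name": "Guppy",
    "openers": ["Blub.", "Oh, hello.", "Is it feeding time?"],
    "beliefs": ["Food is the meaning of life.", "Flakes are the answer."],
}
CAT = {
    "name": "Mittens",
    "openers": ["Mrrp.", "You again."],
    "beliefs": ["Naps are the meaning of life.", "The sunbeam is mine."],
}
QUESTIONS = [
    "What is the meaning of life?",
    "How are you today?",
    "What do you think about?",
]

def make_conversation(persona, rng):
    """One single-turn exchange that always ends on a persona belief."""
    question = rng.choice(QUESTIONS)
    reply = f"{rng.choice(persona['openers'])} {rng.choice(persona['beliefs'])}"
    return f"User: {question}\n{persona['name']}: {reply}"

def make_dataset(persona, n, seed=0):
    """Generate n conversations reproducibly from a fixed seed."""
    rng = random.Random(seed)
    return [make_conversation(persona, rng) for _ in range(n)]

dataset = make_dataset(FISH, 1000)
cat_dataset = make_dataset(CAT, 1000)  # same code path, different personality
print(dataset[0])
```

Because every generated reply carries one of the persona's beliefs, a small model trained on this output has no alternative pattern to learn. A real dataset would need far more variety in phrasing and turn structure than these few templates provide.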

What You Cannot Learn From the API

Using GPT-4 through an API teaches you how to prompt. Building GuppyLM teaches you why temperature works: you see that the logit vector is divided by the temperature before softmax, so values below one sharpen the distribution and values above one flatten it. You also see why a temperature of zero cannot be computed literally, since it would mean dividing by zero, and is instead handled as greedy argmax selection. It is the limit of ever-sharper sampling, not a setting that makes the model more confident.
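The mechanism fits in a dozen lines of plain Python. The function names below are illustrative rather than taken from GuppyLM, and treating zero temperature as a separate greedy path is one common convention, assumed here.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Divide logits by the temperature, then apply a stable softmax."""
    if temperature <= 0:
        raise ValueError("temperature must be positive; use greedy() for T = 0")
    scaled = [value / temperature for value in logits]
    top = max(scaled)
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def greedy(logits):
    """What 'temperature zero' means in practice: pick the largest logit."""
    return max(range(len(logits)), key=lambda i: logits[i])

logits = [2.0, 1.0, 0.0]
sharp = softmax_with_temperature(logits, 0.5)  # below 1: more peaked
plain = softmax_with_temperature(logits, 1.0)  # ordinary softmax
flat = softmax_with_temperature(logits, 2.0)   # above 1: closer to uniform

for label, probs in (("T=0.5", sharp), ("T=1.0", plain), ("T=2.0", flat)):
    print(label, [round(p, 3) for p in probs])
print("greedy choice:", greedy(logits))
```

The logits themselves never change, so the model is not more or less certain at any temperature; only the sampling distribution derived from them is reshaped.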

You learn why context length is expensive: the attention scores form a T × T matrix, so the computation scales quadratically with sequence length, and you see this in the code rather than in a blog post. You learn what teacher forcing means and why it diverges from inference: during training you feed ground-truth tokens as inputs regardless of what the model predicted, which means the model never sees its own errors compounding. You learn what perplexity is, namely the exponential of the cross-entropy loss, by watching that loss decrease and then generating text to see whether it improved.
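Two of these ideas are small enough to write down directly. The sketch below shows the input/target shift that teacher forcing relies on and the relationship between cross-entropy and perplexity. The token ids and probabilities are made up for illustration, and the helper names are hypothetical.

```python
import math

def teacher_forcing_pairs(tokens):
    """Inputs are the ground-truth tokens; targets are the same tokens shifted by one."""
    return tokens[:-1], tokens[1:]

def cross_entropy(target_probs):
    """Mean negative log-probability the model assigned to each true next token."""
    return -sum(math.log(p) for p in target_probs) / len(target_probs)

def perplexity(target_probs):
    """Perplexity is the exponential of the mean cross-entropy (in nats)."""
    return math.exp(cross_entropy(target_probs))

inputs, targets = teacher_forcing_pairs([5, 9, 2, 7])
print("inputs:", inputs, "targets:", targets)

# An untrained model guessing uniformly over a 10,000-token vocabulary
uniform = perplexity([1 / 10_000] * 3)
# A model that puts made-up probabilities 0.5, 0.25, 0.5 on the true tokens
better = perplexity([0.5, 0.25, 0.5])
print(f"uniform guess: {uniform:.0f}, after learning: {better:.2f}")
```

Perplexity can be read as the number of equally likely choices the model is effectively picking between at each step, which is why a uniform guess scores the vocabulary size.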

These are not things you can absorb from documentation. They require a running system with modifiable code.

The Right Scale for Learning

Nine million parameters sits in a useful pedagogical range. It is large enough to produce coherent, contextually responsive text given the right training data, and small enough to train on free hardware in minutes. It is small enough that you can read every line of the implementation without losing track of the whole, and large enough that the techniques involved are the same techniques used in models a thousand times larger.

Karpathy’s nanoGPT occupies a similar range, and its impact on ML education has been substantial. GuppyLM’s particular contribution is shifting the target from text style to character, which is a more concrete and immediately legible output to evaluate: either the fish acts like a fish or it does not, and the reasons why are visible in the training data.

If you have spent time reading about transformers without fully internalizing how they work, building something at this scale is a direct path to fixing that. The repository is designed to be approachable, and the five-minute training cycle means you can experiment without committing a weekend to it. The fish’s philosophy of life is not the point. The point is that after you have trained it, you will know exactly where that philosophy came from.