Nine Million Parameters Is Enough to Understand How LLMs Work

Nine million parameters. That is small enough to train for free in five minutes on a Colab T4, yet large enough to produce grammatically coherent text with a distinct personality. The guppylm project is a deliberate exercise in minimalism: a vanilla transformer, 60K synthetic conversations, and roughly 130 lines of PyTorch. The model plays a fish convinced that the meaning of life is food.

The joke is the point. When the personality of your model is that specific and that absurd, you know the training worked. The outputs are not generic. They are fish-brained, consistently so, and that consistency is what confirms the model has learned something beyond random word association.

The Educational Tradition

This project sits in a well-established lineage. Andrej Karpathy released makemore and then nanoGPT as explicit teaching tools, both emphasizing that you learn more from implementing attention yourself than from reading the original “Attention Is All You Need” paper ten times. The nanoGPT codebase trains a GPT-style model on Shakespeare in a few hundred lines of Python. Karpathy followed that with llm.c, a C implementation of GPT-2 training that makes the memory layout explicit in a way Python abstracts away. Each project strips something away to reveal something else.

guppylm strips away scale. Nine million parameters puts it well below GPT-2 Small (124M parameters, first reported as 117M) and roughly in the range of the smallest models people train for character-level language generation. At that size, every design decision is visible. You are not debugging distributed training or managing gradient checkpointing. You are watching loss curves go down in real time and seeing exactly how much capacity the model is burning on your training set.

What 130 Lines Actually Contains

A vanilla transformer decoder in PyTorch is not trivial to write from scratch, but it is tractable. Those 130 lines cover token embeddings mapping vocabulary indices to dense vectors, positional encodings injecting sequence position information, multi-head self-attention with causal masking so each token only attends to prior tokens, feedforward sublayers applying a two-layer MLP with a nonlinearity, layer normalization before each sublayer, a language modeling head projecting back to vocabulary logits, and cross-entropy loss over next-token predictions.

Here is the conceptual core of what multi-head causal attention looks like at this scale:

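import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, num_heads):
        super().__init__()
        assert d_model % num_heads == 0, "d_model must divide evenly across heads"
        self.num_heads = num_heads
        self.head_dim = d_model // num_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)  # fused Q, K, V projection
        self.out = nn.Linear(d_model, d_model)      # output projection

    def forward(self, x):
        B, T, C = x.shape
        qkv = self.qkv(x).reshape(B, T, 3, self.num_heads, self.head_dim)
        # Move heads ahead of time: (B, T, 3, H, D) -> (3, B, H, T, D),
        # so the matmul below compares positions, not heads.
        q, k, v = qkv.permute(2, 0, 3, 1, 4)
        attn = (q @ k.transpose(-2, -1)) / (self.head_dim ** 0.5)  # (B, H, T, T)
        # Causal mask: position i may only attend to positions j <= i.
        attn = attn.masked_fill(
            torch.tril(torch.ones(T, T, device=x.device)) == 0,
            float('-inf')
        )
        attn = F.softmax(attn, dim=-1)
        # Merge heads back: (B, H, T, D) -> (B, T, C), then project.
        out = (attn @ v).transpose(1, 2).reshape(B, T, C)
        return self.out(out)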

That masked softmax is the whole trick. Everything else in the file is scaffolding around that operation.

At 9M parameters, the model is likely using 4 to 8 transformer layers, 4 to 8 attention heads, and an embedding dimension somewhere between 256 and 512, though not the top of every range at once: with a standard 4x feedforward expansion each layer costs about 12 × d_model² weights, so four layers at d_model = 512 already come to roughly 12.6M before embeddings. The exact configuration matters less than the ratio: depth, width, and context length all trade off against each other, and at this scale you can run ablations and see the effect within the same training session.
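To see how quickly a parameter budget gets spent, a rough counter helps. The sketch below is an estimate only: it ignores biases and LayerNorm weights, assumes learned positional embeddings and a tied output head, and the vocabulary size (4096), context length (128), and the three configurations are illustrative guesses, not values taken from the guppylm repository.

```python
def transformer_param_count(vocab_size, d_model, n_layers, context_len,
                            ffn_mult=4, tie_embeddings=True):
    """Rough weight count for a decoder-only transformer (biases, LayerNorm omitted)."""
    embed = vocab_size * d_model            # token embedding table
    pos = context_len * d_model             # learned positional embeddings (assumed)
    attn = 4 * d_model * d_model            # Q, K, V and output projections
    ffn = 2 * ffn_mult * d_model * d_model  # the two linear layers of the MLP
    head = 0 if tie_embeddings else vocab_size * d_model
    return embed + pos + n_layers * (attn + ffn) + head

# Hypothetical configurations, for illustration only.
configs = [
    dict(d_model=256, n_layers=8, ffn_mult=4),
    dict(d_model=384, n_layers=6, ffn_mult=2),
    dict(d_model=512, n_layers=4, ffn_mult=4),
]
for cfg in configs:
    total = transformer_param_count(vocab_size=4096, context_len=128, **cfg)
    print(cfg, f"{total / 1e6:.1f}M")
```

Under these assumptions the three configurations come to about 7.4M, 8.7M, and 14.7M weights. Width is the expensive knob: the per-layer cost grows with the square of d_model, while depth only multiplies it.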

The Training Data Choice

Sixty thousand synthetic conversations is a deliberate constraint. Karpathy trained nanoGPT on Shakespeare because it is public domain, stylistically consistent, and small enough that the model can memorize large portions of it without generalization pressure. guppylm uses synthetic conversation data for a different reason: it needs to train a character, not just a style.

Synthetic data for a small model has real advantages. There is no noise from web scraping, no contradictory signal from diverse human authors, and the personality signal is dense enough that even a 9M parameter model can pick it up. The tradeoff is that the model will not generalize to topics outside the training distribution. Ask the fish about networking protocols and you will get something broken. Ask it about food and you will get something surprisingly coherent.

This is not a limitation to paper over. It is a demonstration of how LLMs actually work. The model does not understand anything; it has learned a conditional probability distribution over tokens, shaped by the character of the training data. The fish personality is not a feature bolted on after training. It is baked into the probability mass.
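The same idea shows up in the smallest possible language model. The sketch below is a word-level bigram counter, not a transformer, and the four training lines are made up for illustration; it only shows how a conditional distribution inherits its character from the data it was counted on.

```python
from collections import Counter, defaultdict

def train_bigram(corpus):
    """Count next-word frequencies and normalize them into P(next | previous)."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.split()
        for prev, nxt in zip(words, words[1:]):
            counts[prev][nxt] += 1
    return {prev: {w: c / sum(nxts.values()) for w, c in nxts.items()}
            for prev, nxts in counts.items()}

# Made-up training lines in the spirit of the fish character.
corpus = [
    "life is food",
    "love is food",
    "the answer is food",
    "water is nice",
]
model = train_bigram(corpus)
print(model["is"])            # {'food': 0.75, 'nice': 0.25}
print("protocols" in model)   # False: nothing to say outside the training data
```

Three quarters of the probability mass after "is" lands on "food" because three quarters of the training lines put it there. A transformer conditions on far more context than one word, but the fish's obsession has the same source.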

What You Learn That Papers Cannot Give You

Reading about attention mechanisms and implementing them are different activities. The implementation forces you to handle things papers gloss over.

Causal masking. The transformer paper describes it, but you learn why it exists by writing the torch.tril mask and then forgetting it: the model cheats by looking ahead and reaches near-zero loss without learning anything real.
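The leak itself is easy to reproduce without a training run. The following is a dependency-free sketch of single-head attention with q = k = v = x and no learned weights, an assumption made to keep it short. It changes only the last token of a sequence and checks whether earlier positions notice. It shows the information leak, not the resulting collapse in training loss.

```python
import math

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]   # exp(-inf) is 0.0, so masked slots vanish
    total = sum(exps)
    return [e / total for e in exps]

def attention(x, causal=True):
    """Single-head self-attention over a list of vectors, with q = k = v = x."""
    T, d = len(x), len(x[0])
    out = []
    for i in range(T):
        scores = []
        for j in range(T):
            if causal and j > i:
                scores.append(float("-inf"))   # future position: masked out
            else:
                dot = sum(a * b for a, b in zip(x[i], x[j]))
                scores.append(dot / math.sqrt(d))
        weights = softmax(scores)
        out.append([sum(w * x[j][c] for j, w in enumerate(weights))
                    for c in range(d)])
    return out

seq = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
changed = [row[:] for row in seq]
changed[2] = [5.0, -3.0]     # alter only the last ("future") token

# With the mask, outputs at positions 0 and 1 ignore the change.
masked_same = attention(seq)[:2] == attention(changed)[:2]
# Without it, position 0 sees the altered future token.
leaky_same = attention(seq, causal=False)[0] == attention(changed, causal=False)[0]
print(masked_same, leaky_same)   # True False
```

The same perturbation test works as a unit test on a real PyTorch attention module: change a later token and assert that earlier outputs are unchanged.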

Loss dynamics. Cross-entropy loss on a vocabulary of size V starts at roughly ln(V), which is the loss of a uniform random predictor. Watching training loss drop fast and then plateau, next to the loss on held-out conversations, shows how much the model has learned versus memorized. At 9M params on 60K conversations, you will likely see overfitting, with training loss still falling while held-out loss stalls or rises. That is useful information, not a failure.
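The starting value is worth computing once so you recognize it on the loss curve. A small sketch: 65 is the character vocabulary of nanoGPT's Shakespeare example and 50257 is GPT-2's BPE vocabulary, while 4096 is only a guess at a small tokenizer, not a number taken from guppylm.

```python
import math

def uniform_baseline_loss(vocab_size):
    """Cross-entropy (in nats) of a model that assigns 1/V to every token."""
    return -math.log(1.0 / vocab_size)

for v in (65, 4096, 50257):
    print(v, round(uniform_baseline_loss(v), 2))
# 65 4.17
# 4096 8.32
# 50257 10.82
```

If the first logged loss is far above this number, the initialization is probably off. If it starts far below, something is leaking.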

Context window effects. With a small context window, the model cannot see anything that has scrolled out of the window. A large one trains slower and burns more memory, because the attention matrix grows with the square of the sequence length. At 9M parameters you can afford to experiment with both within a free Colab session and observe the tradeoff directly.
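A back-of-the-envelope calculation makes the memory side of that tradeoff concrete. The sketch assumes a plain implementation that materializes the full attention weight tensor in float32. The batch size of 32 and the 6 heads are hypothetical, and fused attention kernels avoid storing this tensor at all.

```python
def attention_score_bytes(batch, n_heads, context_len, bytes_per_value=4):
    """Memory for one layer's (B, H, T, T) attention weights, float32 by default."""
    return batch * n_heads * context_len * context_len * bytes_per_value

# Hypothetical batch size and head count, for illustration only.
for T in (128, 256, 512, 1024):
    mib = attention_score_bytes(32, 6, T) / 2**20
    print(f"T={T:5d}  {mib:8.1f} MiB per layer")
```

Doubling the context length quadruples this tensor, which is why a longer window shows up in memory use well before it shows up in model quality.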

The feedforward layer’s role. Attention is the mechanism that pulls information across sequence positions. The feedforward sublayer applies per-position transformations. Understanding their distinct roles becomes clearer when you can disable one and watch what breaks, something that is not practical at GPT-4 scale.

None of this is new science. Researchers understood all of it in 2017. But for an engineer building intuition about why these systems behave the way they do, a five-minute training run with visible loss curves is a faster path than any paper.

The Gap Between Nine Million and Seven Billion

The common objection to educational LLM projects is that the lessons do not transfer to production scale. A 9M parameter model training on a single GPU cannot teach you about distributed training, gradient checkpointing, RLHF, or inference serving infrastructure. That is true.

But that criticism misidentifies the purpose. The gap between using a model and understanding one is large, and it does real damage in practice. Engineers who build on top of LLMs without understanding the mechanism tend to anthropomorphize outputs, misread failure modes, and make poor architectural choices when designing LLM-powered systems. A model that consistently believes food is the meaning of life is a concrete demonstration that personality is a property of the training distribution, not of some emergent understanding the model has developed.

Projects in the nanoGPT lineage exist to close that conceptual gap. You do not need to train a 7B model to understand what attention is doing. You need to write the masked softmax once, watch the loss curve once, and inspect a few output samples at different training checkpoints. After that, you have something durable.

The Hacker News thread on guppylm, which garnered over 460 points, reflects genuine appetite for this kind of project. Comments consistently land on the same point: reading The Illustrated Transformer is good, but building one is better.

Running It Yourself

The practical barrier is low. A free Colab notebook with a T4 GPU, five minutes of wall time, and a public repository is about as accessible as this kind of project gets. The suggestion to swap the personality by swapping the training data is the right exercise: take the synthetic conversation format, generate your own 60K examples with a different character, and retrain. Watch what changes in the outputs.
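One way to start on that exercise is a template generator. Everything in the sketch below is hypothetical: the pirate persona, the templates, and the user/assistant dictionary format are placeholders, and the real project's data format may well differ, so adapt the output to whatever the repository expects.

```python
import random

# Hypothetical persona templates; the real project's data format may differ.
QUESTIONS = ["what is the meaning of life?", "how are you today?", "what do you want?"]
TOPICS = ["treasure", "rum", "the open sea"]
REPLIES = [
    "arr, it be all about {topic}.",
    "nothing matters but {topic}.",
    "ask me about {topic} instead.",
]

def make_conversations(n, seed=0):
    """Generate n (user, assistant) pairs from the templates, reproducibly."""
    rng = random.Random(seed)
    return [
        {"user": rng.choice(QUESTIONS),
         "assistant": rng.choice(REPLIES).format(topic=rng.choice(TOPICS))}
        for _ in range(n)
    ]

data = make_conversations(60_000)
print(len(data), data[0])
```

Three questions, three replies, and three topics give only 27 distinct exchanges, so a model trained on this would simply memorize them. A usable dataset needs far more template variety, or a larger model generating the conversations.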

At 9M parameters, the model will learn your character’s vocabulary and speech patterns but not factual knowledge. That distinction, between statistical style and factual grounding, is one of the more important things to internalize before building anything serious on top of a language model. The fish already demonstrates it. It just thinks everything important relates to food.