Nine Million Parameters and What They Reveal About Transformers

The tradition of building software to understand it runs deep in programming culture. You write a toy OS to understand memory management, a lexer to understand parsing, a tiny web server to understand HTTP. These projects work because they strip away what production systems add for robustness and scale, leaving only the essential mechanics visible. Large language models have been resistant to this approach, because “large” is in the name and the compute requirements seemed to put the from-scratch build out of reach for most developers.

GuppyLM challenges that assumption. It is a roughly 9-million-parameter transformer, trained on 60,000 synthetic conversations, implemented in about 130 lines of PyTorch, and it trains in five minutes on a free Google Colab T4. The model has a fish personality, trained to believe that the meaning of life is food, and that personality is the point. Fork it and swap in any character you want.

The project sits in a well-established tradition of minimal transformer implementations. Andrej Karpathy’s nanoGPT is probably the most cited example: a few hundred lines of PyTorch that reproduce GPT-2 training at various scales, with a Shakespeare character-level demo that has become a rite of passage for anyone learning language modeling. minGPT preceded it with the same priorities of clarity over performance. More recently, llm.c took the philosophy into C, removing framework abstractions to expose the raw matrix operations. GuppyLM sits squarely in this lineage, but it tilts specifically toward instruction-following behavior on synthetic chat data rather than pure language modeling on text corpora.

What 9 Million Parameters Represent

The number 9M sounds both large and small depending on context. GPT-3 has 175 billion parameters; BERT-base has 110 million; Llama 3.2’s smallest release is 1 billion. At 9 million, GuppyLM is well below the range most people associate with “language model,” and the count tells you something concrete about architecture when you work backward through the math.

A decoder-only transformer with embedding dimension d_model, L transformer blocks, and vocabulary size V contains roughly:

Token embedding: V × d_model parameters
Per transformer block: 4 × d_model² for the attention projections (Q, K, V, and output) plus 8 × d_model² for the feed-forward network with a 4x hidden expansion, totaling 12 × d_model² per block
Output head: V × d_model (often weight-tied with the token embedding)

For 9M total with d_model=256, six transformer blocks, and a vocabulary of 8,192 tokens with untied output weights:

Embeddings: 2 × 8192 × 256 ≈ 4.19M
Six transformer blocks: 6 × 12 × 256² ≈ 4.72M
Total: approximately 8.9M, with bias terms, layer norm parameters, and any learned positional embeddings bringing the count close to 9M

That configuration is consistent with the stated size. The original transformer paper used d_model=512 with eight attention heads; GuppyLM at d_model=256 simply has narrower representations throughout. The core structure is the same from GuppyLM up to a model like GPT-3: what changes is d_model (256 here, 12,288 in the 175B GPT-3), the number of layers (6 versus 96), and the number of attention heads. Scale is a quantitative expansion of the same operations, not a qualitative architectural change.
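The arithmetic above is easy to check in code. The following is a minimal sketch rather than GuppyLM's actual configuration: the function name `estimate_params` is hypothetical, the values d_model=256, six blocks, and V=8192 are the illustrative ones used in this section, and biases, layer norms, and positional embeddings are deliberately ignored.

```python
def estimate_params(d_model: int, n_layers: int, vocab_size: int,
                    tied: bool = False, ff_mult: int = 4) -> dict:
    """Rough decoder-only transformer parameter count (weight matrices only)."""
    attn = 4 * d_model * d_model              # Q, K, V, and output projections
    ff = 2 * ff_mult * d_model * d_model      # two linear layers with 4x expansion
    blocks = n_layers * (attn + ff)           # 12 * d_model^2 per block by default
    embed = vocab_size * d_model              # token embedding table
    head = 0 if tied else vocab_size * d_model  # output projection, unless weight-tied
    return {"embeddings": embed + head, "blocks": blocks,
            "total": embed + head + blocks}


if __name__ == "__main__":
    counts = estimate_params(d_model=256, n_layers=6, vocab_size=8192)
    for name, n in counts.items():
        print(f"{name:>10}: {n / 1e6:.2f}M")
```

Calling the same function with `tied=True` drops the total to about 6.8M, which is why the estimate above assumes untied output weights.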

The Transformer Block at Minimum Scale

At 130 lines of PyTorch, there is nowhere to hide complexity. A minimal decoder-only transformer block looks like this:

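import torch.nn as nn

class TransformerBlock(nn.Module):
    """Post-norm block: attention and feed-forward, each followed by a residual add and layer norm."""
    def __init__(self, d_model, num_heads, d_ff, dropout=0.1):
        super().__init__()
        # batch_first=True means inputs are (batch, seq_len, d_model)
        self.attn = nn.MultiheadAttention(
            d_model, num_heads, dropout=dropout, batch_first=True
        )
        # Position-wise feed-forward network; d_ff is typically 4 * d_model
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),
            nn.Linear(d_ff, d_model),
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.drop = nn.Dropout(dropout)

    def forward(self, x, mask=None):
        # mask: (seq_len, seq_len); for a boolean mask, True marks positions
        # that may NOT be attended to (the future, for a causal model)
        attn_out, _ = self.attn(x, x, x, attn_mask=mask)
        x = self.norm1(x + self.drop(attn_out))
        x = self.norm2(x + self.drop(self.ff(x)))
        return x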

Multi-head self-attention, residual connection, layer norm; feed-forward network, residual connection, layer norm. Stack this block six times with a token embedding and positional encoding at the input, attach an output projection at the end, and you have a complete decoder-only language model. The component inventory is short enough to hold in your head, and at small scale you can inspect the attention weight matrices directly without the memory budget that larger models require.
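The stacking described above can be sketched end to end. This is an illustration under stated assumptions, not GuppyLM's source: the class names `Block` and `TinyLM` are hypothetical, the hyperparameters are the illustrative ones from the parameter estimate earlier (with four heads and a 128-token context chosen arbitrarily), and learned positional embeddings are assumed.

```python
import torch
import torch.nn as nn


class Block(nn.Module):
    """The post-norm block from the text, restated so this sketch runs on its own."""
    def __init__(self, d_model, num_heads, d_ff, dropout=0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, num_heads, dropout=dropout, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.drop = nn.Dropout(dropout)

    def forward(self, x, mask=None):
        attn_out, _ = self.attn(x, x, x, attn_mask=mask)
        x = self.norm1(x + self.drop(attn_out))
        return self.norm2(x + self.drop(self.ff(x)))


class TinyLM(nn.Module):
    """Hypothetical decoder-only model: embeddings, a stack of blocks, an output head."""
    def __init__(self, vocab_size=8192, d_model=256, num_heads=4, n_layers=6, max_len=128):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)  # learned positions (assumption)
        self.blocks = nn.ModuleList(
            [Block(d_model, num_heads, 4 * d_model) for _ in range(n_layers)]
        )
        self.head = nn.Linear(d_model, vocab_size, bias=False)  # untied output projection

    def forward(self, idx):
        _, T = idx.shape
        pos = torch.arange(T, device=idx.device)
        x = self.tok_emb(idx) + self.pos_emb(pos)
        # Boolean causal mask: True above the diagonal blocks attention to the future
        mask = torch.triu(torch.ones(T, T, dtype=torch.bool, device=idx.device), diagonal=1)
        for block in self.blocks:
            x = block(x, mask)
        return self.head(x)  # (batch, T, vocab_size) next-token logits


if __name__ == "__main__":
    model = TinyLM()
    n_params = sum(p.numel() for p in model.parameters())
    logits = model(torch.randint(0, 8192, (2, 16)))
    print(f"{n_params:,} parameters, logits shape {tuple(logits.shape)}")
```

With these settings the count comes to 8,965,632 parameters, which shows how biases, layer norms, and positional embeddings close most of the gap between the weight-only estimate and 9M.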

One detail worth noting: PyTorch’s nn.MultiheadAttention abstracts away the scaled dot-product attention itself, hiding the division by sqrt(d_k) and the softmax as internal operations, and it applies a causal mask only if the caller passes one in through attn_mask. If you want to see those steps as explicit code, nanoGPT’s CausalSelfAttention class implements attention manually, which is more verbose but more transparent. GuppyLM trades that visibility for brevity, which is a reasonable choice for an entry-level project but worth being aware of when deciding what you want to learn from the exercise.
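For readers who do want to see the hidden steps, the following is a minimal single-head sketch of scaled dot-product attention with an explicit causal mask. It is not taken from GuppyLM or nanoGPT: the function name `causal_self_attention` is hypothetical, and the learned Q/K/V projections and the multi-head split are omitted to keep the core computation visible.

```python
import math
import torch


def causal_self_attention(q, k, v):
    """Single-head scaled dot-product attention with a causal mask.

    q, k, v: tensors of shape (batch, seq_len, d_k).
    Returns the attended values and the attention weights.
    """
    d_k = q.size(-1)
    T = q.size(-2)
    # Similarity of every query with every key, scaled by sqrt(d_k)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)          # (batch, T, T)
    # Causal mask: position i may only attend to positions j <= i
    future = torch.triu(torch.ones(T, T, dtype=torch.bool, device=q.device), diagonal=1)
    scores = scores.masked_fill(future, float("-inf"))
    weights = torch.softmax(scores, dim=-1)                     # each row sums to 1
    return weights @ v, weights


if __name__ == "__main__":
    torch.manual_seed(0)
    x = torch.randn(1, 5, 8)
    out, w = causal_self_attention(x, x, x)
    print(w[0])  # lower-triangular: zero weight on future positions
```

Because the first position can only attend to itself, its output is exactly its own value vector, which is a quick sanity check that the mask is oriented correctly.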

Synthetic Data and the Decomposition of Personality

The fish personality serves as both the demo hook and the most instructive design choice in GuppyLM. The model’s personality comes entirely from its training data: 60,000 synthetic conversations in which a fish-character responds to questions through the lens of food, water, and aquatic survival. The model has no concept of “being a fish” in any meaningful sense; it has a distribution over tokens that, given the context of a conversation, produces fish-flavored responses because that pattern was reinforced during training.

The 60,000 conversations are almost certainly generated by a larger model. The Self-Instruct paper formalized this approach in late 2022: use a capable model to generate diverse instruction-response pairs, then train a smaller model on those pairs. Stanford’s Alpaca demonstrated this at 7B scale: 52,000 instruction-following examples generated by OpenAI’s text-davinci-003 were enough to produce a reasonably capable instruction follower from a LLaMA base. GuppyLM is the educational reduction of that same pipeline, where the goal is understanding rather than capability.
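The shape of such a pipeline is easy to see in miniature. The sketch below swaps the larger generating model for fixed templates, which is a deliberate simplification and not how GuppyLM's data was produced; every question, phrase, and function name in it is hypothetical.

```python
import random

# Hypothetical fragments for illustration only; not taken from GuppyLM's dataset.
QUESTIONS = [
    "What is the meaning of life?",
    "How was your day?",
    "What do you think about physics?",
]
OPENERS = ["Blub.", "Oh!", "Hmm."]
TOPICS = ["food", "the water temperature", "the big shadow outside the tank"]


def make_conversation(rng: random.Random) -> dict:
    """Build one user/assistant pair in a fixed fish persona."""
    question = rng.choice(QUESTIONS)
    reply = f"{rng.choice(OPENERS)} I mostly think about {rng.choice(TOPICS)}."
    return {"user": question, "assistant": reply}


def make_dataset(n: int, seed: int = 0) -> list:
    """Generate n conversations reproducibly from a seeded generator."""
    rng = random.Random(seed)
    return [make_conversation(rng) for _ in range(n)]


if __name__ == "__main__":
    for example in make_dataset(3):
        print(example)
```

Whatever the generator is, the persona lives entirely in the replies: the question about physics gets the same fish-flavored answer as every other question, because nothing else appears in the data.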

What the synthetic data approach demonstrates clearly is the decomposition between knowledge and style. The fish model has a coherent conversational style, because the training data consistently modeled that style, but no genuine world knowledge, because the training data did not include it. The weights encoding “how a fish answers questions” and the weights that would encode “what a fish knows about biology” come from entirely different training signal. That separation, which is subtle and contested at the frontier of large model research, is sharp and obvious at GuppyLM’s scale; the model will answer questions about physics in the voice of a fish using information it does not have, which makes the boundary between style and knowledge impossible to miss.

Training Dynamics on a Free GPU

The Colab T4 provides about 8.1 TFLOPS of FP32 compute and 15GB of VRAM. For a 9M parameter model with batch size 32 and sequence length 128, the forward pass costs roughly 2 × 9×10⁶ × 128 ≈ 2.3×10⁹ FLOPs per sequence, or about 7.4×10¹⁰ per batch of 32. The backward pass costs roughly twice the forward pass, for about 2.2×10¹¹ FLOPs per training step. If the GPU sustains 30 to 50 percent of its FP32 peak, that works out to roughly 10 to 20 steps per second, so one pass over 60,000 examples (1,875 steps) takes about two to three minutes. That is consistent with a five-minute training run covering two or three epochs.
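A back-of-envelope version of this kind of estimate can be written out directly. The sketch assumes the common rule of thumb of about 6 FLOPs per parameter per token for a full training step (2 for the forward pass, about 4 for the backward pass); the function names are hypothetical and the utilization figures are guesses rather than measurements.

```python
def training_step_flops(n_params: float, batch_size: int, seq_len: int) -> float:
    """Approximate FLOPs for one training step: ~6 per parameter per token."""
    tokens = batch_size * seq_len
    return 6 * n_params * tokens


def steps_per_second(step_flops: float, peak_flops: float, utilization: float) -> float:
    """Training steps per second at a given fraction of peak throughput."""
    return peak_flops * utilization / step_flops


if __name__ == "__main__":
    step = training_step_flops(9e6, batch_size=32, seq_len=128)
    steps_per_epoch = 60_000 / 32
    for util in (0.3, 0.5):  # assumed utilization, not measured
        rate = steps_per_second(step, peak_flops=8.1e12, utilization=util)
        print(f"util={util:.0%}: {rate:.1f} steps/s, "
              f"{steps_per_epoch / rate / 60:.1f} min/epoch")
```

Counting per token rather than per sequence matters here: a batch of 32 sequences of 128 tokens is 4,096 tokens per step.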

The training is fast enough to observe in real time and to iterate on. You can change the learning rate, watch the effect on the loss curve, and understand what that means in concrete terms. You can train to convergence, then keep training past it and watch the model begin to memorize its training examples. These concepts are abstract when described in papers; they become concrete when you can inspect model outputs on held-out prompts as validation loss climbs. At 9M parameters and 60K training examples, changes in training data composition also have a visible and immediate effect: add a few thousand conversations on a new topic and the model’s behavior shifts measurably within minutes.
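The loop behind those observations is short. The following sketch uses a stand-in model (an embedding followed by a linear layer) and random tokens rather than GuppyLM or its data, so the names and sizes are all assumptions; what it shows is the shifted next-token loss and the train-versus-validation comparison that makes memorization visible.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def next_token_loss(model, tokens):
    """Cross-entropy between the prediction at position t and the token at t+1."""
    inputs, targets = tokens[:, :-1], tokens[:, 1:]
    logits = model(inputs)  # (batch, T-1, vocab)
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))


def train(model, train_tokens, val_tokens, steps=200, lr=1e-2):
    """Full-batch training; records (train_loss, val_loss) after every step."""
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    history = []
    for _ in range(steps):
        model.train()
        loss = next_token_loss(model, train_tokens)
        opt.zero_grad()
        loss.backward()
        opt.step()
        model.eval()
        with torch.no_grad():
            val = next_token_loss(model, val_tokens)
        history.append((loss.item(), val.item()))
    return history


if __name__ == "__main__":
    torch.manual_seed(0)
    vocab = 50
    # Stand-in model, not a transformer: it only sees the previous token
    model = nn.Sequential(nn.Embedding(vocab, 32), nn.Linear(32, vocab))
    # Random tokens contain nothing to generalize from, so any gain is memorization
    train_tokens = torch.randint(0, vocab, (8, 16))
    val_tokens = torch.randint(0, vocab, (8, 16))
    history = train(model, train_tokens, val_tokens)
    print("first (train, val):", history[0])
    print("last  (train, val):", history[-1])
```

On data like this the training loss should fall steadily while the validation loss fails to improve, which is the same signature to look for when training a real model past convergence.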

Comparison with nanoGPT

nanoGPT targets a different learning mode. It trains on real text, which means the model learns statistical regularities of natural language rather than a synthetic personality. At character-level Shakespeare scale it produces plausible-sounding English; at GPT-2 scale on OpenWebText it reproduces the pretraining behavior of an early-generation language model. nanoGPT is the better choice for understanding pretraining dynamics, loss scaling, and the statistical structure of natural language.

GuppyLM’s instruction-following format on synthetic data is more representative of how modern chat models are trained. Production LLMs go through pretraining on raw text, then supervised fine-tuning on instruction-response pairs, then alignment steps. GuppyLM skips the pretraining stage and trains directly on instruction-response pairs, which means the model learns conversational format quickly but has no broad world knowledge. That is a useful trade for the educational goal: understanding why instruction tuning produces a different kind of model than pure text prediction, without needing to run a GPT-2 scale pretraining job to see the difference.

The 130-line implementation is shorter than nanoGPT’s training script partly because it delegates more to PyTorch built-ins. Both approaches have merit; the question is what you want to learn. If you want to see scaled dot-product attention as explicit matrix operations, implement it manually. If you want to see how synthetic instruction data shapes model behavior with minimal surrounding code, GuppyLM gets you there faster.

What Building This Teaches

The practical lesson from GuppyLM, and from similar minimal implementations, is that the transformer architecture is not magic. The components are individually simple: linear projections, softmax, elementwise addition, layer normalization. The emergent capability of large models comes from training at scale on broad data, not from architectural complexity that is inaccessible at small scale.

When you interact with a large language model through an API, the model appears to understand your question in some deep sense. Building GuppyLM makes clear that no operation in the forward pass corresponds to “understanding” in a way that maps onto human cognition. The fish says food is the meaning of life because the weights that minimize training loss over fish-personality conversations produce that output when given that input. The forward pass is deterministic, every intermediate value in it can be inspected, and nothing in it resembles belief at the operational level.

Understanding this mechanism is a prerequisite for building useful things on top of these systems and for reasoning clearly about what they can and cannot do. The Hacker News thread on the project notes that nine million parameters is not representative of modern models, which is true; GuppyLM’s purpose is pedagogical rather than representative. It is a hands-on explanation of the transformer compressed to five minutes on a free GPU, and for that purpose the fish is well-chosen.