What Nine Million Parameters Teach You About Transformers

The guppylm project is a roughly 9-million-parameter language model built from scratch in about 130 lines of PyTorch, trained on 60,000 synthetic conversations, and capable of completing a full training cycle in about five minutes on a free Google Colab T4 GPU. The model has a fish personality and believes the meaning of life is food. That last detail matters more than it sounds.

What the architecture looks like at 9M parameters

A vanilla transformer at 9 million parameters is a constrained design space. The “Attention Is All You Need” paper introduced a base model with 65M parameters and a large variant at 213M. Getting to 9M while keeping the transformer structure intact means making deliberate tradeoffs across every dimension.

A typical configuration at this scale uses 4 to 6 transformer layers, 4 to 8 attention heads, an embedding dimension around 256 to 384, and a feed-forward inner dimension of 4x the embedding size. With a vocabulary of around 8,000 to 32,000 tokens and a context length of 256 to 512 tokens, the parameter count assembles roughly as follows:

Token embeddings: vocab_size × d_model. At 8K tokens and d_model=256, that is about 2M parameters.
Per-layer attention: four weight matrices (Q, K, V, and output projection), each d_model × d_model. At 256 dimensions, that is 4 × 65,536 = 262,144 per layer.
Per-layer feed-forward: two linear transforms at d_model × (4 × d_model) and back. At 256 dimensions: 2 × 256 × 1024 = 524,288 per layer.
Six layers with both components produce roughly 4.7M parameters in the transformer blocks alone.
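
The tally above can be reproduced in a few lines of plain Python. This is a rough sketch, not code from the project: the function name and the choice to ignore biases, layer norm gains, and positional embeddings are assumptions made for illustration, and the configuration (8K vocabulary, d_model=256, six layers) is just the example used in this section.

```python
def transformer_param_count(vocab_size, d_model, n_layers, ff_mult=4, tie_embeddings=True):
    """Approximate weight count for a decoder-only transformer.

    Assumption: biases, layer norm gains, and positional embeddings are
    ignored because they are small next to the matrices counted here.
    """
    embedding = vocab_size * d_model
    attention_per_layer = 4 * d_model * d_model        # Q, K, V, output projection
    ff_per_layer = 2 * d_model * (ff_mult * d_model)   # up-projection and down-projection
    blocks = n_layers * (attention_per_layer + ff_per_layer)
    # An untied output head is a second vocab_size x d_model matrix.
    output_head = 0 if tie_embeddings else vocab_size * d_model
    return {
        "embedding": embedding,
        "attention_per_layer": attention_per_layer,
        "ff_per_layer": ff_per_layer,
        "blocks": blocks,
        "output_head": output_head,
        "total": embedding + blocks + output_head,
    }


budget = transformer_param_count(vocab_size=8000, d_model=256, n_layers=6)
for name, count in budget.items():
    print(f"{name:>20}: {count:,}")
```

Changing `vocab_size` or `tie_embeddings` and rerunning shows quickly how much of the budget moves into or out of the embedding table.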

The embedding table is usually the largest single weight matrix even at small vocabulary sizes, and at larger vocabularies (32K) it can dominate the whole budget. At 32K × 384 that is 12.3M parameters before a single transformer layer runs, which is why small educational models either tie the output projection to the input embeddings or use a modest vocabulary. The parameter budget forces clarity about what is expensive.
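
Weight tying itself is a one-line change in PyTorch. The sketch below is illustrative only: the class name `TinyLMHead` is hypothetical, the transformer blocks are omitted, and nothing here is taken from the guppylm source. It shows that the input embedding and the output projection have the same shape and can share one tensor, so the 32K × 384 matrix is counted once.

```python
import torch
import torch.nn as nn


class TinyLMHead(nn.Module):
    """Embedding plus output projection sharing one weight matrix (hypothetical example)."""

    def __init__(self, vocab_size: int, d_model: int):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.lm_head = nn.Linear(d_model, vocab_size, bias=False)
        # Both weights have shape (vocab_size, d_model), so they can share storage.
        self.lm_head.weight = self.embed.weight

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        hidden = self.embed(token_ids)   # (batch, seq, d_model); blocks would go here
        return self.lm_head(hidden)      # (batch, seq, vocab_size)


model = TinyLMHead(vocab_size=32_000, d_model=384)
# parameters() yields a shared tensor only once, so the tied matrix is not double counted.
n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params:,}")
```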

This is the first concrete lesson from building at small scale: you understand where parameters live. Reading about transformers in the abstract, the embedding table can seem like a bookkeeping detail. Building one where it consumes 40% of your total budget makes the tradeoff concrete and permanent.

Why synthetic conversations work for personality injection

The 60,000 synthetic conversation training set is the part of guppylm that tells you the most about how large-scale fine-tuning works.

Pretraining large models involves enormous corpora of diverse text, which gives the model broad world knowledge but no particular personality or consistent behavioral style. The character that people associate with deployed assistants comes mostly from the fine-tuning stage: supervised fine-tuning on curated conversations, then often reinforcement learning from human feedback. What guppylm demonstrates in miniature is that the format and content of the training conversations shape the model’s expressed behavior almost entirely.

The fish personality that believes the meaning of life is food is not an emergent property of the architecture. It comes from the training conversations, which presumably frame the world consistently from that perspective across thousands of examples. This is the same mechanism by which you give a large model a persona through instruction tuning, except here the entire model is small enough to see the effect clearly and to iterate on it cheaply.

Andrej Karpathy’s nanoGPT trains on OpenWebText or Shakespeare and produces output that reflects those corpora’s register and vocabulary. Change the dataset and the character of the output changes completely while the architecture stays identical. GuppyLM takes this a step further by making the personality explicit and swappable: fork the repo, replace the conversation templates, retrain in five minutes, and get a different character. The project treats training data as the primary interface to the model’s behavior rather than as a fixed artifact.
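
To make "replace the conversation templates" concrete, here is a minimal sketch of template-based synthetic data generation. Everything in it is hypothetical: the template strings, the `User:` / `Fish:` turn format, and the function names are invented for illustration and are not the project's actual data format.

```python
import random

# Hypothetical persona templates; swapping these lists changes the character.
QUESTIONS = [
    "What is the meaning of life?",
    "How are you today?",
    "What do you think about {topic}?",
]
TOPICS = ["the weather", "music", "computers"]
ANSWERS = [
    "Food. It is always food.",
    "I am swimming and thinking about food.",
    "I do not know about {topic}, but is it food?",
]


def make_conversation(rng: random.Random) -> str:
    """Fill one question template and one answer template with the same topic."""
    topic = rng.choice(TOPICS)
    question = rng.choice(QUESTIONS).format(topic=topic)
    answer = rng.choice(ANSWERS).format(topic=topic)
    return f"User: {question}\nFish: {answer}\n"


def make_dataset(n: int, seed: int = 0) -> list[str]:
    """Generate n conversations reproducibly from a fixed seed."""
    rng = random.Random(seed)
    return [make_conversation(rng) for _ in range(n)]


dataset = make_dataset(60_000)
print(dataset[0])
```

A real template set would need far more variety than this, but the structure is the point: the persona lives entirely in the strings, and the model code never changes.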

Stanford’s Alpaca work from 2023 followed the same logic at a larger scale: take a pretrained 7B LLaMA model, generate 52,000 instruction-following examples with OpenAI’s text-davinci-003 (a GPT-3.5-series model), fine-tune for a few hours, and get a model that follows instructions far better than the base. The fine-tuning data was built around that behavior, so the model expressed it. GuppyLM compresses this entire pipeline into something visible on a single GPU in five minutes, which makes the lesson hard to miss.

The 130-line constraint and what it forces out

The constraint of 130 lines of PyTorch is pedagogically deliberate. PyTorch ships nn.MultiheadAttention, nn.TransformerEncoderLayer, and nn.Transformer, which let you assemble a working model in a dozen lines without understanding any of the internals. Building from scratch at 130 lines forces a different level of engagement.

You have to write the scaled dot-product attention explicitly, assembling the Q, K, V projections and the softmax over the similarity matrix. The feed-forward sublayer and its activation function have to appear as code. Layer normalization, residual connections, and positional encoding all have to be present and legible:

# Scaled dot-product attention, written out
scores = (q @ k.transpose(-2, -1)) / math.sqrt(self.head_dim)
scores = scores.masked_fill(mask == 0, float('-inf'))
attn = F.softmax(scores, dim=-1)
out = attn @ v

When you write this yourself, the masking step, which prevents the model from attending to future tokens during training, stops being a footnote and becomes something you have to get right or the model learns nothing useful. The dimension mismatch error when you forget to reshape the attention heads is a more durable lesson than any diagram.
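
For context, this is roughly what the fragment looks like inside a complete module, including the head reshape and the causal mask. It is a sketch under stated assumptions rather than the project's code: the class name, the fused `qkv` projection (equivalent in parameter count to separate Q, K, and V matrices), and the bias-free linear layers are all choices made here for brevity.

```python
import math

import torch
import torch.nn as nn
import torch.nn.functional as F


class CausalSelfAttention(nn.Module):
    """Multi-head self-attention with a causal mask (illustrative sketch)."""

    def __init__(self, d_model: int, n_heads: int, max_len: int):
        super().__init__()
        assert d_model % n_heads == 0, "d_model must divide evenly across heads"
        self.n_heads = n_heads
        self.head_dim = d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model, bias=False)  # fused Q, K, V
        self.proj = nn.Linear(d_model, d_model, bias=False)     # output projection
        # Lower-triangular mask: position i may attend only to positions <= i.
        self.register_buffer("mask", torch.tril(torch.ones(max_len, max_len)))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, C = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # Split channels into heads: (B, T, C) -> (B, n_heads, T, head_dim).
        q = q.view(B, T, self.n_heads, self.head_dim).transpose(1, 2)
        k = k.view(B, T, self.n_heads, self.head_dim).transpose(1, 2)
        v = v.view(B, T, self.n_heads, self.head_dim).transpose(1, 2)

        scores = (q @ k.transpose(-2, -1)) / math.sqrt(self.head_dim)
        scores = scores.masked_fill(self.mask[:T, :T] == 0, float("-inf"))
        attn = F.softmax(scores, dim=-1)
        out = attn @ v                                         # (B, n_heads, T, head_dim)
        # Merge heads back: (B, n_heads, T, head_dim) -> (B, T, C).
        out = out.transpose(1, 2).contiguous().view(B, T, C)
        return self.proj(out)


torch.manual_seed(0)
attn_layer = CausalSelfAttention(d_model=64, n_heads=4, max_len=16)
x = torch.randn(2, 8, 64)
y = attn_layer(x)
print(y.shape)
```

A quick way to check the mask is to perturb the last token of the input and confirm that the outputs at all earlier positions do not change.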

Jay Mody’s picoGPT achieved GPT-2 inference in about 60 lines of NumPy, which is an impressive compression exercise but uses a pretrained checkpoint. Building the training loop adds tokenization, a data loader, the language modeling loss, an optimizer with learning rate scheduling, and the forward pass from scratch. The 130-line figure for guppylm includes training, which is meaningfully more complex than inference-only.
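
Of the pieces that training adds, the language modeling loss is the one most often gotten subtly wrong, because predictions and targets have to be offset by one position. The following is a minimal sketch of that shift, with a hypothetical function name and random stand-in data in place of a real model and tokenizer.

```python
import math

import torch
import torch.nn.functional as F


def next_token_loss(logits: torch.Tensor, token_ids: torch.Tensor) -> torch.Tensor:
    """Cross-entropy between the prediction at position t and the token at t + 1.

    logits: (batch, seq, vocab_size); token_ids: (batch, seq)
    """
    pred = logits[:, :-1, :]    # predictions made at positions 0 .. T-2
    target = token_ids[:, 1:]   # the tokens that actually came next
    return F.cross_entropy(pred.reshape(-1, pred.size(-1)), target.reshape(-1))


torch.manual_seed(0)
vocab_size = 100
token_ids = torch.randint(0, vocab_size, (4, 16))
# All-zero logits stand in for a model that predicts every token with equal probability.
logits = torch.zeros(4, 16, vocab_size)
loss = next_token_loss(logits, token_ids)
print(loss.item(), math.log(vocab_size))
```

With uniform predictions the loss equals ln(vocab_size), which gives a useful sanity check: a freshly initialized model should start near that value (about 9.0 for an 8K vocabulary) and fall from there.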

llm.c occupies the opposite end of this tradeoff: GPT-2 training in pure C without any deep learning framework. The C version is instructive for different reasons, showing the raw memory layout of tensors, manual backward passes, and the gap between mathematical operation and fast GPU kernel. GuppyLM is positioned for clarity at the architecture level: PyTorch handles gradient computation, keeping the attention mechanism and transformer structure visible without requiring CUDA expertise.

Training on T4 in five minutes

The five-minute training time on a Colab T4 comes from the combination of a small model, a modest dataset, and a short training run. The T4 peaks at about 65 TFLOPS of FP16 tensor-core throughput, though real training runs sustain only a fraction of that. Training a 9M parameter model for a few thousand steps with batch size 32 and sequence length 256 involves roughly:

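FLOPs per training step: approximately 6 × N × T for the forward and backward pass combined (about 2 × N × T of that is the forward pass), where N is the parameter count and T is tokens per batch. At 9M params and 32 × 256 = 8,192 tokens per batch, that is around 4.4 × 10^11 FLOPs (roughly 440 billion) per step.
Even at 10,000 training steps: about 4.4 × 10^15 FLOPs total, which the T4 could finish in about a minute at its 65 TFLOPS peak, and in several minutes at the fraction of peak a small model realistically sustains.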
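
The estimate can be reproduced with a few lines of arithmetic, using the common 6 × N × T rule of thumb for one full training step (forward plus backward). The function names are hypothetical, and the utilization values are assumptions chosen to bracket what a small model might sustain, not measurements.

```python
def training_flops(n_params: int, batch_size: int, seq_len: int, n_steps: int):
    """Return (FLOPs per step, total FLOPs) using the 6 * N * T approximation."""
    tokens_per_step = batch_size * seq_len
    flops_per_step = 6 * n_params * tokens_per_step
    return flops_per_step, flops_per_step * n_steps


def wall_clock_seconds(total_flops: float, peak_flops: float, utilization: float) -> float:
    """Time to finish if the GPU sustains `utilization` of its peak rate."""
    return total_flops / (peak_flops * utilization)


per_step, total = training_flops(n_params=9_000_000, batch_size=32, seq_len=256, n_steps=10_000)
print(f"per step: {per_step:.3e} FLOPs, total: {total:.3e} FLOPs")

T4_PEAK_FP16 = 65e12  # peak tensor-core figure; sustained throughput is lower
for utilization in (1.0, 0.3, 0.1):  # assumed values, for illustration only
    seconds = wall_clock_seconds(total, T4_PEAK_FP16, utilization)
    print(f"utilization {utilization:.0%}: {seconds / 60:.1f} minutes")
```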

The training is not long enough to push the model toward general competence on any benchmark. It is long enough to see loss curves drop, coherent short responses form, and the personality take shape. That feedback loop, where you change the conversation templates and rerun in five minutes, is the actual teaching mechanism.

Most learners who work through transformer papers or study architecture diagrams do not internalize the design the way someone who has debugged a dimension mismatch in the multi-head reshaping step does. The short cycle time makes iteration cheap enough that building and rebuilding is a practical learning path rather than an ambitious one.

GuppyLM sits alongside nanoGPT, picoGPT, and llm.c in a small but growing set of minimal LLM implementations that prioritize legibility over capability. The projects differ in language, dependency choice, and target audience, but share the same premise: that building a working language model, even a tiny one with a fish personality, is a more durable form of understanding than reading about one. The fish is convinced the meaning of life is food because the training data said so, consistently, thousands of times. That is how it works at every scale.