What 130 Lines of PyTorch Actually Teach You About Language Models

Source: hackernews

There is a particular kind of understanding that only comes from building something yourself. Reading about attention mechanisms is useful. Using a pre-trained model through an API is useful. Neither of them is the same as watching a loss curve drop for the first time on a model you assembled line by line.

GuppyLM is the latest entry in what has become a recognizable genre: the from-scratch educational language model. Nine million parameters, roughly 130 lines of PyTorch, 60,000 synthetic conversations, trains in five minutes on a free Colab T4. The personality is a fish who believes the meaning of life is food. Fork it and replace the fish with whatever character you want.

This kind of project gets dismissed in some circles as a toy. That dismissal misses the point entirely.

The Lineage

Andrej Karpathy established the template for this genre with minGPT and later nanoGPT. The goal was never to build something competitive with production systems. It was to strip the architecture down until every component was visible and traceable. nanoGPT accomplishes this in two files of roughly 300 lines each: model.py and train.py. Its “baby GPT” configuration runs with 6 layers, 6 attention heads, a 384-dimensional embedding space, and a 256-token context window, landing at around 10 million parameters. The accompanying YouTube lecture is among the most widely watched resources for understanding the transformer from first principles.

Karpathy also built makemore, which takes a different pedagogical path: it starts with a bigram model, then builds up layer by layer to a full transformer, training on a list of names. The progression matters. You see exactly what each architectural addition contributes. Later came llm.c, which implements GPT-2 training in pure C, for people who wanted to understand the systems layer underneath PyTorch.

At the minimal extreme, picoGPT by Jay Mody implements GPT-2 inference in about 90 lines of NumPy. No PyTorch, no autograd. Just matrix multiplications and softmax. It does not train, only infers, but the clarity it provides about what inference actually is has made it a reference point in its own right.

GuppyLM occupies a distinct niche within this lineage. It is not trying to reproduce GPT-2 faithfully the way nanoGPT does. It is not trying to eliminate dependencies the way picoGPT does. It is building a complete pipeline, including synthetic training data generation and a specific character persona, with the minimum code necessary to make that pipeline work.

What a Vanilla Transformer Actually Looks Like

The phrase “vanilla transformer” in the context of these projects refers to the decoder-only architecture from Attention Is All You Need (Vaswani et al., 2017), minus the encoder and cross-attention blocks, following the conventions GPT-2 established. In practice, that means four things stacked in a loop.

First, token embeddings and positional embeddings. A lookup table maps integer token IDs to vectors of dimension d_model. A second lookup table, indexed by position, adds order information, since the attention mechanism is otherwise order-agnostic. Both are learned. The original paper used sinusoidal positional encodings, but the GPT models use learned position embeddings and most educational projects have followed suit.
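A minimal sketch of that embedding step, assuming PyTorch. The vocabulary size, context length, and width below are illustrative values, not GuppyLM's actual configuration:

```python
import torch
import torch.nn as nn

vocab_size, block_size, d_model = 256, 64, 128   # illustrative values only

tok_emb = nn.Embedding(vocab_size, d_model)      # token ID -> learned vector
pos_emb = nn.Embedding(block_size, d_model)      # position index -> learned vector

idx = torch.randint(0, vocab_size, (2, 16))      # (batch, seq_len) of token IDs
positions = torch.arange(idx.size(1))            # 0, 1, ..., seq_len - 1
x = tok_emb(idx) + pos_emb(positions)            # position vectors broadcast over the batch
```

The result has shape (batch, seq_len, d_model) and is what the first attention block receives.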

Second, multi-head causal self-attention. Each token attends to every preceding token (and itself) through a scaled dot-product operation:

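import math
import torch.nn.functional as F

# Q, K, V: (batch, heads, seq_len, d_k); mask: 1 where attention is allowed, 0 elsewhere
scores = (Q @ K.transpose(-2, -1)) / math.sqrt(d_k)    # scaled dot-product similarities
scores = scores.masked_fill(mask == 0, float('-inf'))  # hide future positions
attn = F.softmax(scores, dim=-1)                       # normalize over visible positions
out = attn @ V                                         # weighted sum of values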

The causal mask, which sets every score above the diagonal to negative infinity, is what makes the model autoregressive. Without it, each position could attend to the very tokens it is being trained to predict, so the training loss would collapse while generation, where future tokens do not exist yet, would be incoherent. The Q, K, and V projections each add d_model × d_model parameters. The output projection adds another d_model × d_model. For a 128-dimensional model, that is four weight matrices of 128×128 each, 65,536 parameters per attention block before biases. The head count changes how those matrices are split, not how large they are.
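To make the mask and the parameter arithmetic concrete, here is a self-contained sketch. The sizes are illustrative, the fused Q/K/V projection is one common layout rather than anything confirmed about GuppyLM, and biases are switched off so the count covers only the four weight matrices:

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
B, T, d_model, n_head = 1, 5, 128, 4      # illustrative sizes
d_k = d_model // n_head

# One fused projection for Q, K, V plus the output projection (no biases).
qkv_proj = nn.Linear(d_model, 3 * d_model, bias=False)
out_proj = nn.Linear(d_model, d_model, bias=False)

x = torch.randn(B, T, d_model)
q, k, v = qkv_proj(x).split(d_model, dim=-1)
# Reshape each to (B, n_head, T, d_k) so every head attends independently.
q, k, v = (t.view(B, T, n_head, d_k).transpose(1, 2) for t in (q, k, v))

mask = torch.tril(torch.ones(T, T))       # 1 on and below the diagonal
scores = (q @ k.transpose(-2, -1)) / math.sqrt(d_k)
scores = scores.masked_fill(mask == 0, float("-inf"))
attn = F.softmax(scores, dim=-1)          # rows sum to 1 over allowed positions
out = out_proj((attn @ v).transpose(1, 2).reshape(B, T, d_model))

n_params = sum(p.numel() for p in (*qkv_proj.parameters(), *out_proj.parameters()))
print(n_params)  # 4 * 128 * 128 = 65536
```

Every entry of attn above the diagonal is exactly zero, which is the whole effect of the mask: position t never sees positions after t.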

Third, a feed-forward network with a 4× expansion:

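import torch.nn as nn

d_model = 128  # example width, matching the attention example above
ffn = nn.Sequential(
    nn.Linear(d_model, 4 * d_model),  # expand to 4x the model width
    nn.GELU(),
    nn.Linear(4 * d_model, d_model),  # project back down to d_model
)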

The 4× ratio comes directly from the original paper and has remained standard. The GPT models use GELU in place of the original paper's ReLU, which gives a smoother curve around zero; educational implementations almost universally follow that choice.

Fourth, layer normalization and residual connections wrapping each of the above. Pre-norm placement (applying LayerNorm before the sublayer, not after) was one of GPT-2’s departures from the original paper and tends to produce more stable training.
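The four pieces can be assembled into one block roughly as follows. This is a sketch, not GuppyLM's code: it uses PyTorch's built-in nn.MultiheadAttention as a stand-in for hand-written attention, and the width and head count are illustrative.

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    """Pre-norm transformer block: x + sublayer(LayerNorm(x)), applied twice."""

    def __init__(self, d_model: int = 128, n_head: int = 4):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_head, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        T = x.size(1)
        # Boolean mask: True marks positions a token may NOT attend to.
        causal = torch.triu(
            torch.ones(T, T, dtype=torch.bool, device=x.device), diagonal=1
        )
        h = self.ln1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=causal, need_weights=False)
        x = x + attn_out                 # residual around attention
        x = x + self.ffn(self.ln2(x))    # residual around feed-forward
        return x

block = Block()
y = block(torch.randn(2, 10, 128))       # output keeps the input shape
```

A full model is then the embeddings, a stack of these blocks, a final LayerNorm, and a linear layer projecting back to vocabulary logits.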

At 9 million parameters with a character-level or small-vocabulary tokenizer and perhaps 4-6 layers, GuppyLM’s architecture sits comfortably in this space. The 130-line count is plausible because once you understand the pattern, each block is about 10-15 lines of clean PyTorch.
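Parameter counts at this scale are easy to estimate by hand. The helper below is a rough sketch that ignores biases and LayerNorm weights and assumes the output head shares weights with the token embedding. GuppyLM's exact configuration is not given here, so the example plugs in the nanoGPT “baby GPT” shape quoted earlier, with an assumed 65-symbol character vocabulary:

```python
def gpt_param_count(n_layer: int, d_model: int, vocab_size: int, block_size: int) -> int:
    """Approximate weight count for a GPT-2-style decoder.

    Ignores biases and LayerNorm parameters; assumes a tied output head.
    """
    attention = 4 * d_model * d_model                  # Q, K, V and output projections
    feed_forward = 2 * d_model * (4 * d_model)         # expand to 4x, project back
    embeddings = (vocab_size + block_size) * d_model   # token + position tables
    return n_layer * (attention + feed_forward) + embeddings

# vocab_size=65 is an assumption (character-level tokenizer on a small corpus).
baby_gpt = gpt_param_count(n_layer=6, d_model=384, vocab_size=65, block_size=256)
print(f"{baby_gpt:,}")  # 10,740,096
```

Almost all of the weight sits in the 12 × d_model² per layer; at these sizes the embedding tables are a rounding error, though that changes quickly with a larger vocabulary.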

The Synthetic Data Choice

The choice to train on 60,000 synthetic conversations rather than Tiny Shakespeare or web text is worth examining. It reflects a lesson the research community learned from the TinyStories paper (Eldan and Li, 2023): for small models, the breadth of the training distribution is often a bigger bottleneck than model capacity. TinyStories showed that a 28-million-parameter model trained on stories written with the vocabulary of a typical 3- to 4-year-old could produce surprisingly coherent, structured prose. The model’s capacity was not the limiting factor; the breadth of what it was asked to cover was.

Synthetic conversations have a similar property. Natural conversation has relatively constrained vocabulary and clear structural patterns: a question, an answer, some elaboration. A 9-million-parameter model can learn these patterns thoroughly. Compare this to training on web text, where the vocabulary is enormous, the register shifts constantly, and the structural signals are weak. The model would spend most of its capacity on distributional coverage it cannot fully exploit.

The personality injection is an extension of this idea. By generating training data where a fish character consistently expresses opinions about food, the model learns to associate those patterns with the character’s identity. This is a simplified version of what instruction tuning does: the fine-tuning data shapes not just what the model knows but how it expresses knowledge. The fish’s conviction that the meaning of life is food is not a quirk; it is the training distribution asserting itself.
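One way such a dataset could be produced is with simple templates. Everything in this sketch is hypothetical: the questions, answers, speaker labels, and the template approach itself are illustrative, and GuppyLM's real generator may work quite differently.

```python
import random

# Hypothetical templates and vocabulary, for illustration only.
QUESTIONS = [
    "What is the meaning of life?",
    "What did you do today?",
    "What makes you happy?",
]
FOODS = ["flakes", "pellets", "bloodworms", "algae"]
ANSWERS = [
    "Food. Especially {food}.",
    "I swam around and then I ate {food}.",
    "Honestly, {food}. Is it feeding time yet?",
]

def make_conversation(rng: random.Random) -> str:
    """Return one user/fish exchange as a single training string."""
    question = rng.choice(QUESTIONS)
    answer = rng.choice(ANSWERS).format(food=rng.choice(FOODS))
    return f"User: {question}\nFish: {answer}\n"

rng = random.Random(0)   # seeded so the dataset is reproducible
dataset = [make_conversation(rng) for _ in range(60_000)]
```

With only a handful of templates the 60,000 samples are mostly duplicates, so a usable generator needs far more variety. The shape of the idea holds, though: whatever the character says consistently in the data is what the model will say.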

What You Learn That the API Does Not Teach

Building one of these models teaches you things that are genuinely hard to learn otherwise.

The relationship between learning rate and loss stability becomes visceral when you watch a training run diverge because you chose 1e-2 instead of 3e-4. The theory of why learning rate matters is available in any textbook; the intuition for how sensitive that relationship is only forms through repeated experience.

Gradient clipping is similar. Most tutorials mention torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0) without explaining why it is there. Build a model without it and train on data with a long-tail distribution of sequence lengths, and you will discover what a gradient explosion looks like in practice.
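A single training step with both knobs in place might look like the sketch below. The linear layer is a stand-in for the transformer, and the logits are deliberately scaled up to provoke large gradients, so none of the numbers reflect a real run:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(16, 4)                                    # stand-in for the transformer
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

x, y = torch.randn(8, 16), torch.randint(0, 4, (8,))
loss = nn.functional.cross_entropy(model(x) * 1000.0, y)    # scaled to force large gradients

optimizer.zero_grad()
loss.backward()
# clip_grad_norm_ returns the total gradient norm measured BEFORE clipping.
norm_before = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
norm_after = torch.sqrt(sum(p.grad.pow(2).sum() for p in model.parameters()))
optimizer.step()
```

Logging norm_before at every step is a cheap way to see spikes coming: a run that is about to diverge usually shows the pre-clip norm climbing well before the loss does.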

Attention patterns become concrete rather than abstract. When your model starts learning, you can visualize the attention weights and watch them shift from near-uniform to peaked distributions. The heads specialize in ways that vary by task. On a name-generation task, some heads learn positional patterns; on a conversation task, some heads learn to track speaker identity. None of this is surprising in hindsight, but watching it emerge from random initialization is clarifying.
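One simple way to quantify the shift from near-uniform to peaked attention is the entropy of each attention row. The helper below is an illustrative sketch; the two score tensors are synthetic stand-ins for an untrained head and a sharply focused one, not weights from a real model:

```python
import torch
import torch.nn.functional as F

def attention_entropy(attn: torch.Tensor) -> torch.Tensor:
    """Mean row entropy in nats, per head; attn has shape (B, heads, T, T)."""
    ent = -(attn * torch.log(attn.clamp_min(1e-12))).sum(dim=-1)   # (B, heads, T)
    return ent.mean(dim=(0, 2))                                     # one value per head

T = 8
mask = torch.tril(torch.ones(T, T))
# All-equal scores: every visible position gets the same weight.
uniform_scores = torch.zeros(1, 1, T, T).masked_fill(mask == 0, float("-inf"))
# A large diagonal score: each token attends almost entirely to itself.
peaked_scores = (20.0 * torch.eye(T)).view(1, 1, T, T).masked_fill(mask == 0, float("-inf"))

uniform = attention_entropy(F.softmax(uniform_scores, dim=-1))
peaked = attention_entropy(F.softmax(peaked_scores, dim=-1))
```

Tracked per head over the course of training, this number falls as heads specialize, which gives a scalar to plot alongside the attention heatmaps.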

The tokenizer choice matters more than most treatments suggest. Character-level tokenization is simple and produces models that can generate any character sequence, but it forces the model to allocate capacity to character-level spelling patterns that subword tokenization handles trivially. Byte-pair encoding trades that capacity allocation for a larger vocabulary and a more complex preprocessing step. For a 9-million-parameter model, the choice has real effects on what the model can learn within its capacity budget.
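For reference, a character-level tokenizer is only a few lines. This sketch builds the vocabulary from a single example string; a real one would build it from the full training corpus:

```python
text = "the meaning of life is food"

# Character-level: the vocabulary is every distinct character seen.
chars = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(chars)}   # character -> integer ID
itos = {i: ch for ch, i in stoi.items()}       # integer ID -> character

def encode(s: str) -> list[int]:
    """Map a string to one token ID per character."""
    return [stoi[ch] for ch in s]

def decode(ids: list[int]) -> str:
    """Map token IDs back to the original string."""
    return "".join(itos[i] for i in ids)

ids = encode(text)
```

The cost is visible immediately: the sequence has one token per character, so a fixed context window covers far less text than it would with subword tokens.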

Where This Sits in the Ecosystem

The educational LLM genre has produced a progressively richer set of tools over the past few years. nanoGPT established the baseline. makemore added the progressive pedagogical structure. llm.c pushed into the systems layer. picoGPT demonstrated that you could strip the implementation to its mathematical core. TinyStories and related work refined our understanding of what small models can actually learn.

GuppyLM adds one thing that the others mostly skip: a complete, self-contained pipeline from synthetic data generation through trained persona. That completeness has pedagogical value. Most educational projects hand you a dataset and start from the training loop. GuppyLM includes the data generation step, which means you can trace the full arc from “I have an idea for a character” to “I have a model that embodies that character.” For someone building their first language model, that end-to-end clarity may matter more than architectural sophistication.

The five-minute training time on a free Colab GPU is not a minor detail. It means the feedback loop is short enough to experiment meaningfully. You can change a hyperparameter, retrain, and observe the effect in the same sitting. That iteration speed is what makes the learning stick.

Fork it. Change the fish to something else. Watch what happens when you alter the synthetic conversation patterns. The point was never the fish.
