
The Annotation That Held Transformers Together for Five Years

Source: huggingface

When HuggingFace released Transformers v4 in November 2020, the library had around 40 model architectures and roughly 1,000 checkpoints on the Hub. By the time v5 shipped in December 2025, those numbers had grown to 400+ architectures and 750,000+ Hub checkpoints. Daily installs went from around 20,000 to about 3 million. The library had become, functionally, the runtime layer of the open-source AI ecosystem.

That growth happened under the weight of a design decision that was smart in 2020 and increasingly painful by 2024: one model, one file.

What “One Model, One File” Actually Meant

The philosophy was defensible. If you want someone to understand a model implementation, give them a single file that contains everything: the configuration class, embeddings, attention, encoder, decoder, output heads. No chasing imports across five modules. The entire architecture, self-contained.

This worked well when the library had 40 models. It worked less well when RoBERTa needed to be mostly BERT, ALBERT needed to be mostly BERT with a few tweaks, and DistilBERT needed to be mostly BERT with some layers removed. Each of those models had its own modeling_*.py file, and each file contained hundreds of lines copied verbatim from modeling_bert.py.

The library’s answer was a CI-enforced annotation: # Copied from transformers.models.bert.modeling_bert.BertSelfAttention. If you changed the source, the CI would check that all downstream copies were updated to match. It was a constraint system built on comments, and it actually worked reasonably well for years. Contributors could find where code came from, reviewers could track divergence, and the one-file guarantee held.
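The consistency check behind that annotation can be sketched in plain Python. This is a toy illustration of the idea, not the actual CI utility; all function names are stand-ins, and the real tool compares source text (with name substitutions) rather than bytecode.

```python
# Toy sketch of the idea behind the "# Copied from" CI check.
# All names are illustrative stand-ins for the real transformers utilities.

def bert_self_attention(x):        # stand-in for the BERT source function
    return x * 2

# Copied from bert_self_attention
def roberta_self_attention(x):     # stand-in for the downstream copy
    return x * 2

def copies_match(source_fn, copy_fn) -> bool:
    """Return True if two function bodies are still equivalent."""
    a, b = source_fn.__code__, copy_fn.__code__
    return a.co_code == b.co_code and a.co_consts == b.co_consts
```

If someone edits the BERT source without updating the RoBERTa copy, a check like this fails and CI blocks the merge, which is exactly the discipline the annotation enforced.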

The problem was friction at scale. Adding a new model meant copying hundreds of lines, threading through the annotation requirements, and then maintaining that copy indefinitely. The GPT-2, GPT-J, GPT-Neo, and GPT-NeoX models are all variations on a theme. So are the dozen-plus LLaMA variants. The library was spending enormous energy enforcing consistency across files that differed by a handful of lines.

The v5 Solution: Modular Files with a Converter

The modular transformers system introduced in v5 is a two-layer design. Contributors write a modular_<model>.py file that uses normal Python inheritance to express only what’s different about their model. A converter tool then unravels that modular file into the standard flat modeling_<model>.py that users and downstream tools consume. The end-user API doesn’t change.

Here’s what a modular RoBERTa definition looks like:

from torch import nn
from ..bert.configuration_bert import BertConfig
from ..bert.modeling_bert import BertModel, BertEmbeddings

class RobertaConfig(BertConfig):
    model_type = "roberta"

class RobertaEmbeddings(BertEmbeddings):
    def __init__(self, config):
        super().__init__(config)
        # RoBERTa's only real difference here: position embeddings
        # that respect the padding token index.
        self.padding_idx = config.pad_token_id
        self.position_embeddings = nn.Embedding(
            config.max_position_embeddings, config.hidden_size,
            padding_idx=self.padding_idx
        )

class RobertaModel(BertModel):
    def __init__(self, config):
        super().__init__(config)
        # Swap in the RoBERTa-specific embeddings; everything else is BERT.
        self.embeddings = RobertaEmbeddings(config)

The converter reads this file, traverses the inheritance chain, and generates the full flat modeling_roberta.py with everything inlined. Users still get the one-file model; contributors only write the diff.

Running the converter is a single command:

python utils/modular_model_converter.py your_model

The converter has some specific behaviors worth understanding. Inheritance is flattened only one level at a time, so the generated output doesn’t contain abstract intermediate classes. When you call super().__init__(...), the converter inlines the parent’s __init__ body at that call site rather than generating a super() call in the output file. Functions that the modular file imports are automatically pulled into the generated file along with their transitive dependencies, so if apply_rotary_pos_emb calls rotate_half, both end up in the generated file.
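The super().__init__ inlining is easiest to see with a toy before/after pair. This is an illustration of the flattening behavior described above, not the converter's actual output; all class names and attributes are hypothetical.

```python
# "Modular" form: the child calls the parent's __init__, as a contributor
# would write it in a modular_<model>.py file.
class ParentConfig:
    def __init__(self):
        self.hidden_size = 768
        self.num_layers = 12

class ChildModular(ParentConfig):
    def __init__(self):
        super().__init__()       # converter inlines the parent body here
        self.num_layers = 24     # child-specific override

# "Generated" form: roughly what the flat file contains after flattening,
# with the parent's assignments copied in at the super() call site.
class ChildFlat:
    def __init__(self):
        self.hidden_size = 768   # inlined from ParentConfig.__init__
        self.num_layers = 12     # inlined from ParentConfig.__init__
        self.num_layers = 24     # child-specific override
```

Both forms produce identical instances; the flat one just no longer depends on the parent class existing.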

There’s also a mechanism for removing things. If a child config doesn’t need an attribute from its parent, del self.attribute in the __init__ body signals the converter to exclude that attribute’s assignment from the generated code:

class Olmo2Config(OlmoConfig):
    def __init__(self, ..., rms_norm_eps=1e-5, **kwargs):
        super().__init__(...)
        self.rms_norm_eps = rms_norm_eps
        del self.clip_qkv  # excluded from the generated flat file

And if a child class needs to remove a method that exists on the parent, raising AttributeError in the method body signals the converter to omit it entirely from the generated output. This is used in the GemmaTokenizer, which inherits from LlamaTokenizer but doesn’t need get_spm_processor.
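The convention looks like this in practice. The class names follow the GemmaTokenizer example above; the method bodies are illustrative stand-ins rather than the real tokenizer code.

```python
# Pure-Python sketch of the method-removal convention.
class LlamaTokenizerSketch:
    def get_spm_processor(self):
        return "sentencepiece processor"

class GemmaTokenizerSketch(LlamaTokenizerSketch):
    def get_spm_processor(self):
        # Raising AttributeError signals the converter to omit this
        # method from the generated flat file entirely.
        raise AttributeError("Not needed for GemmaTokenizer")
```

In the generated flat file, the method simply doesn't exist; the AttributeError body only ever lives in the modular source.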

The net effect is that adding a new model now requires writing only the lines that differ from an existing architecture. Classes that aren’t explicitly redefined but are needed as dependencies get auto-generated based on the closest parent. The docs provide a table mapping common components (MoE layers, RoPE variants, sliding window attention, fused QKV) to their canonical source models, so contributors have a clear starting point for inheritance.

AttentionInterface: Centralizing the Other Scattered Code

Attention implementations had a parallel problem. Each model’s forward() method contained its own attention code, with small variations for whether it was using SDPA, FlashAttention 2, or the eager fallback. The AttentionInterface registry in v5 decouples attention backends from model code entirely.

The registry supports eager, SDPA, FlashAttention 2, FlashAttention 3, FlexAttention, and paged variants of each. You can set the backend at load time:

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-1B", attn_implementation="flash_attention_2"
)

Or switch at runtime without reloading the model:

model.set_attn_implementation("sdpa")

You can also register a custom attention function and use it by name:

from transformers import AttentionInterface
from transformers.integrations.sdpa_attention import sdpa_attention_forward

def my_attention(module, query, key, value, attention_mask, **kwargs):
    print("entering attention")
    return sdpa_attention_forward(module, query, key, value, attention_mask, **kwargs)

AttentionInterface.register("my_attention", my_attention)
model = AutoModelForCausalLM.from_pretrained(model_id, attn_implementation="my_attention")

Multimodal models can use different backends per sub-model:

model = AutoModelForImageTextToText.from_pretrained(
    "facebook/chameleon-7b",
    attn_implementation={"vision_config": "sdpa", "text_config": "flash_attention_2"}
)

Hub-based kernel loading is also new in v5: you can pass a Hub identifier as the attn_implementation string, and the library will pull the registered kernel directly, without requiring a separate package install for every attention variant.
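The dispatch logic this implies can be sketched with a minimal resolver. This is not the real transformers internals, just an illustration of the naming convention: plain names hit a local registry, while anything shaped like a Hub repo id would trigger a kernel fetch.

```python
# Minimal sketch of the name-resolution idea (names and values illustrative).
LOCAL_BACKENDS = {"eager": "<eager impl>", "sdpa": "<sdpa impl>"}

def resolve_attention(name: str) -> str:
    if "/" in name:                     # looks like a Hub repo id, "org/repo"
        return f"hub-kernel:{name}"     # the real library downloads the kernel
    return LOCAL_BACKENDS[name]         # otherwise use a registered backend
```

The practical upshot is that publishing a new attention kernel no longer requires shipping a pip package; a Hub repo id is enough for users to opt in.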

What v5 Dropped

The TensorFlow and Flax backends are sunset in v5; PyTorch is now the sole supported framework. This is a pragmatic call: maintaining three framework implementations for every new model architecture multiplied the maintenance surface without proportional benefit. JAX users are directed to MaxText and other ecosystem projects that provide JAX-native implementations.

The fast/slow tokenizer distinction is also gone. The tokenizers Rust library is now the default for everything; the slow Python tokenizer implementations have been deprecated. Similarly, image processor “slow” variants are no longer supported, with the torchvision-based fast implementations taking over.

Python 3.10 and PyTorch 2.4 are now minimum requirements. If your toolchain is behind those versions, v4 is still available, but v5 draws a line and moves forward.

The Maintenance Infrastructure Point

The interesting thing about Transformers v5, viewed from a distance, is how much of it is about sustaining the library rather than extending it. The modular converter, the AttentionInterface registry, the backend consolidation, the Python version floor: these are all changes that make it more tractable to keep adding models and features at the pace the library has been growing.

The # Copied from annotation was a reasonable answer in 2020, when scaling to 400 architectures was hard to foresee. By 2024, it was visibly holding things back. The modular system keeps the one-file guarantee for users while removing the copy-paste requirement for contributors. That’s a narrow but important design win: it changes the internal maintenance cost without changing the external contract.

For the ecosystem that has built on Transformers over the past five years (vLLM, SGLang, ONNX Runtime, llama.cpp, MLX), the library’s stability as a shared foundation is more important than any individual new feature. A version that makes it easier to add 400 more architectures without tripling the maintenance burden is, in a practical sense, the most important kind of release.
