
What 400 Architectures Taught the Transformers Team About Code Generation

Source: huggingface

When Hugging Face shipped the first version of Transformers in late 2019, the design was deliberately simple: one model, one file. Every architecture lived in a self-contained modeling_<model>.py with no inheritance from other models, no shared base classes beyond the minimal PyTorch primitives. The rationale was user-friendliness. Someone reading BertForMaskedLM should be able to understand it completely without tracing through three layers of parent classes.

That design held up for about five years and about forty architectures. By the time the v5 release candidate shipped on December 1, 2025, the library contained over 400 architectures and was serving 3 million installs per day. The single-file principle had not changed. The maintenance cost had become severe.

The Annotation That Held Things Together

The workaround was a CI-enforced comment syntax: # Copied from transformers.models.bert.modeling_bert.BertSelfAttention. When RoBERTa’s attention layer was identical to BERT’s, the file would contain the full copy of the class plus this annotation. A linter verified that the copies had not drifted from their sources.
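
That verification is, at heart, a normalized string comparison: strip the annotation, undo the model rename, and diff against the source class. A toy sketch of the idea (hypothetical snippets and a hypothetical checker, not the real linter):

```python
# Toy version of the v4 "Copied from" check. The real linter lives in the
# transformers repo; these snippets and this checker are illustrative only.
bert_src = """
class BertSelfAttention(nn.Module):
    def forward(self, hidden_states):
        return self.attend(hidden_states)
"""

roberta_src = """
# Copied from transformers.models.bert.modeling_bert.BertSelfAttention with Bert->Roberta
class RobertaSelfAttention(nn.Module):
    def forward(self, hidden_states):
        return self.attend(hidden_states)
"""

def check_copy(source: str, copy: str, rename: tuple) -> bool:
    old, new = rename
    # Drop the annotation line, undo the rename, compare against the source.
    body = "\n".join(
        line for line in copy.strip().splitlines()
        if not line.startswith("# Copied from")
    )
    return body.replace(new, old) == source.strip()
```

If a fix lands in BERT but not in the copy, the comparison fails and CI rejects the change, which is exactly the mechanism that forced the mechanical propagation described below.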

At 40 models, this was a reasonable engineering choice. At 400 models, with GPT-2, BERT, RoBERTa, ALBERT, and hundreds of descendants all sharing 90% of their code, it meant contributor PRs were dominated by mechanical propagation. Fix a bug in BertSelfAttention and you had to update every downstream copy, rebuild the annotations, and survive a CI run that checked each one. The library had become its own maintenance obligation.

The attention mechanism problem was worse. Every new backend (Flash Attention 2, Flash Attention 3, PyTorch SDPA) required a new subclass per model:

from torch import nn

# v4: one attention subclass per backend, repeated for every model
class LlamaAttention(nn.Module):
    def forward(self, hidden_states, **kwargs):
        ...  # eager path

class LlamaFlashAttention2(LlamaAttention):
    def forward(self, hidden_states, **kwargs):
        ...  # FA2 path

class LlamaSdpaAttention(LlamaAttention):
    def forward(self, hidden_states, **kwargs):
        ...  # SDPA path

LLAMA_ATTENTION_CLASSES = {
    "eager": LlamaAttention,
    "flash_attention_2": LlamaFlashAttention2,
    "sdpa": LlamaSdpaAttention,
}

With 400 architectures and three to four backends each, the library contained on the order of 1,600 attention classes that were all implementing the same scaled dot-product operation with minor variations. This was not an abstraction problem; it was a copy-paste problem at industrial scale.

The Two-Layer Solution

Transformers v5 introduces modular model definitions: a two-layer system in which contributors write a modular_<model>.py expressing only what differs from a parent, and a converter script generates the traditional flat modeling_<model>.py. The generated file is checked into the repository and is what users read and debug. The two layers serve different audiences.

RoBERTa differs from BERT in exactly one place: the embedding layer adds a padding_idx and adjusts position embeddings accordingly. In v4, modeling_roberta.py was hundreds of lines of near-identical BERT code with that one difference buried in the middle. In v5, the modular source is about thirty lines:

from torch import nn

from ..bert.modeling_bert import BertModel, BertEmbeddings, BertForMaskedLM

class RobertaEmbeddings(BertEmbeddings):
    def __init__(self, config):
        super().__init__(config)
        self.padding_idx = config.pad_token_id
        self.position_embeddings = nn.Embedding(
            config.max_position_embeddings,
            config.hidden_size,
            padding_idx=self.padding_idx
        )

class RobertaModel(BertModel):
    def __init__(self, config):
        super().__init__(config)
        self.embeddings = RobertaEmbeddings(config)

The converter script, python utils/modular_model_converter.py your_model, expands this into a complete flat file. It inlines super().__init__() calls rather than preserving them as runtime inheritance, so the generated file remains fully self-contained. Two conventions control what survives inlining: del self.attribute inside an __init__ strips that attribute from the inlined parent code, and a method overridden with raise AttributeError("") is omitted entirely from the output. The converter also traces implicit dependencies: if a child model (Olmo2, say, building on Olmo) overrides OlmoDecoderLayer but not the OlmoMLP it references, the converter auto-generates a pass-through Olmo2MLP(OlmoMLP) so the generated file remains consistent.
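
The del and raise AttributeError conventions mirror what those same statements do at Python runtime; the converter reproduces the effect statically, by editing the inlined parent source rather than executing anything. A minimal sketch with hypothetical classes:

```python
# Hypothetical parent/child pair illustrating the two modular conventions.
# At runtime, plain Python already behaves this way; the converter bakes the
# same result into the generated flat file.
class OlmoStyleParent:
    def __init__(self):
        self.mlp = "mlp"
        self.rotary_emb = "rope"   # attribute the child wants to drop

    def legacy_path(self):
        return "legacy"

class OlmoStyleChild(OlmoStyleParent):
    def __init__(self):
        super().__init__()         # converter: inlined, not kept as a call
        del self.rotary_emb        # converter: stripped from the inlined __init__

    def legacy_path(self):
        raise AttributeError("")   # converter: method omitted from the output
```

The generated child would therefore contain an __init__ with mlp but no rotary_emb, and no legacy_path method at all.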

The converter enforces one constraint: inheritance is flattened to exactly one level. modular_c.py cannot inherit from modular_b.py which inherits from modular_a.py. This prevents deep modular chains that would make the generated output unpredictable.

AttentionInterface

The attention class explosion gets a different fix. Transformers v5 introduces AttentionInterface, a global dispatch registry that replaces per-model attention subclasses. A single LlamaAttention class resolves its backend at runtime:

class LlamaAttention(nn.Module):
    def forward(self, hidden_states, position_embeddings, attention_mask, **kwargs):
        # compute q, k, v...
        attention_interface: Callable = eager_attention_forward
        if self.config._attn_implementation != "eager":
            attention_interface = ALL_ATTENTION_FUNCTIONS[self.config._attn_implementation]

        attn_output, attn_weights = attention_interface(
            self, query_states, key_states, value_states,
            attention_mask, **kwargs,
        )

The registry supports Flash Attention 2, Flash Attention 3, PyTorch SDPA, FlexAttention, and paged variants of each. Custom backends register with a single call:

AttentionInterface.register("my_attention", my_attention_fn)
model = AutoModelForCausalLM.from_pretrained(model_id, attn_implementation="my_attention")
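
A compatible backend is just a function with the call shape shown in the LlamaAttention snippet above: it receives the attention module plus query/key/value tensors and returns an (output, weights) pair. A minimal eager-style sketch; the exact keyword arguments each model passes through are an assumption here, not the library's documented contract:

```python
import torch

# Hypothetical custom backend. Inputs are assumed to be shaped
# (batch, heads, seq_len, head_dim); returns (attn_output, attn_weights).
def my_attention_fn(module, query, key, value, attention_mask=None, scaling=None, **kwargs):
    if scaling is None:
        scaling = query.shape[-1] ** -0.5
    scores = torch.matmul(query, key.transpose(-2, -1)) * scaling
    if attention_mask is not None:
        scores = scores + attention_mask  # additive mask, -inf at masked positions
    weights = torch.softmax(scores, dim=-1)
    return torch.matmul(weights, value), weights
```

Because dispatch happens through the registry rather than a class hierarchy, this one function serves every model that routes through the interface, instead of requiring a subclass per architecture.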

The implementation backend can be switched at runtime without reloading weights, which was not possible in v4:

model.set_attn_implementation("sdpa")

For multimodal models with separate vision and text backbones, the implementation can be set per-backbone:

model = AutoModelForImageTextToText.from_pretrained(
    "facebook/chameleon-7b",
    attn_implementation={"vision_config": "sdpa", "text_config": "flash_attention_2"}
)

Pre-compiled kernels distributed through the Hub can also register directly to AttentionInterface on import, bypassing PyTorch and CUDA version compatibility issues:

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-1B",
    attn_implementation="kernels-community/flash-attn2"
)

DRY, With an Asterisk

The design philosophy the v5 team settled on is “DRY*”: don’t repeat yourself for contributors, but repeat yourself deliberately for users. The asterisk covers the entire rationale.

The generated files exist because a user reading modeling_roberta.py should not have to mentally trace through modeling_bert.py to understand how the model works. That was the original motivation for the single-file principle, and v5 preserves it. What changes is that contributors no longer write the flat file by hand. The duplication is generated rather than maintained.

This resolves a genuine tension in library design. The ergonomics of reading code and the ergonomics of writing code pull in opposite directions: reading benefits from everything in one place; writing benefits from expressing only what changes. Code generation lets the library have both.

What Changes for Contributors and Downstream Users

New model contributions added under the modular system automatically inherit future base class improvements without follow-up PRs. If the library adds a new attention backend to Llama, all models that inherit from it in their modular definitions will get it in the next generation pass. In v4, that improvement required a wave of separate PRs touching each downstream model.

For users, the visible API changes minimally. AutoModelForCausalLM.from_pretrained() and pipeline() work as before. The attn_implementation argument was already present in v4; in v5 it routes through the registry rather than selecting a subclass. The main breaking changes are the removal of TensorFlow and Flax backends, the removal of slow Python tokenizers in favor of the Rust-backed tokenizers library, and minimum requirements of Python 3.10 and PyTorch 2.4.

The v5 changes are documented in the official blog post from December 2025. As of early 2026, the release is still in RC with paged attention APIs marked post-RC. The core modular system and AttentionInterface are considered stable, and models added since the December announcement have been absorbed without the mechanical copy-paste work that dominated v4 contributor cycles.

The library’s surface area is still large and the contributor experience still has rough edges. But the # Copied from annotation is gone, ~1,600 attention classes have been collapsed into a dispatch table, and new architectures can be expressed in thirty lines instead of three hundred. That is the kind of maintenance improvement that compounds over time.
