Better Loss, Worse Biology: What a $165 mRNA Training Run Reveals About Sequence Modeling

The number that gets attention in the OpenMed codon model post is $165. Fifty-five GPU hours on four A100s, spanning twenty-five species, producing six production-ready models. That is a legitimate story about how cheap compute and open datasets have lowered the bar for serious biological ML work. But the more interesting data is buried in the ablation tables, and it concerns a problem that anyone applying NLP methodology to molecular biology will eventually run into: your loss function might not be measuring what you think it is.

What Codon Optimization Actually Is

The genetic code has sixty-four three-nucleotide codons: sixty-one encode the twenty amino acids and three signal stop. Most amino acids are encoded by multiple synonymous codons, so the same protein sequence can be written many different ways at the DNA level. Organisms do not use those synonymous codons uniformly. Evolution has shaped each species’ genome toward a particular pattern of codon usage that reflects, among other things, the relative abundance of tRNA molecules in the cell. When codons read by rare tRNAs appear frequently in a gene, the ribosome can stall. When a heterologous gene, such as a therapeutic mRNA delivered to a human patient, is written using codons that are optimal for the source organism but not for the host, expression can drop by orders of magnitude.

This is why codon optimization exists as a discipline, and why the Codon Adaptation Index (CAI), defined by Sharp and Li in 1987, remains a primary evaluation target. CAI measures how closely a sequence’s codon usage matches the most highly expressed reference genes in a target organism. For mRNA therapeutics, synthetic biology, and recombinant protein production, optimizing toward high CAI in the host is a concrete engineering goal with measurable downstream consequences.
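To make the metric concrete, here is a minimal sketch of a CAI calculation in the spirit of Sharp and Li: each codon gets a weight equal to its count divided by the count of the most used synonym in a reference set, and the CAI of a gene is the geometric mean of those weights. The two-family codon table, the toy reference counts, and the 0.5 pseudocount for unseen codons are illustrative assumptions, not the OpenMed evaluation code.

```python
import math
from collections import Counter

# Toy synonymous families (assumption: a real table covers the full genetic code).
SYNONYMS = {
    "Lys": ["AAA", "AAG"],
    "Ala": ["GCT", "GCC", "GCA", "GCG"],
}

def relative_adaptiveness(reference_codons):
    """w(codon) = count(codon) / count(most used synonym) in the reference set."""
    counts = Counter(reference_codons)
    weights = {}
    for family in SYNONYMS.values():
        if not any(counts[c] for c in family):
            continue  # amino acid never seen in the reference: no weights
        # Assumed convention: unseen codons get a 0.5 pseudocount so that
        # a single rare codon does not force the whole CAI to zero.
        adjusted = {c: counts[c] or 0.5 for c in family}
        top = max(adjusted.values())
        for c in family:
            weights[c] = adjusted[c] / top
    return weights

def cai(codons, weights):
    """Geometric mean of the weights; codons without a weight are skipped."""
    logs = [math.log(weights[c]) for c in codons if c in weights]
    if not logs:
        raise ValueError("no scorable codons in sequence")
    return math.exp(sum(logs) / len(logs))

# Hypothetical reference set of highly expressed genes, flattened to codons.
reference = ["AAG"] * 8 + ["AAA"] * 2 + ["GCC"] * 6 + ["GCT"] * 3 + ["GCA"]
w = relative_adaptiveness(reference)
print(cai(["AAG", "GCC"], w))            # 1.0: every codon is the preferred one
print(round(cai(["AAA", "GCT"], w), 3))  # 0.354: both codons are disfavored
```

Because the score is a geometric mean, a few strongly disfavored codons pull it down much more than an arithmetic average would.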

The natural framing for the language modeling problem: treat each codon as a token, train a masked language model on coding sequences, and let the model learn the organism’s codon preference distribution. The trained model can then score candidate synonymous sequences, and optimization becomes a guided search.
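The "guided search" step can be sketched as follows: enumerate the synonymous encodings of a protein and keep the one a scoring function likes best. Everything here is a hypothetical illustration. The four-entry codon table is truncated, `toy_score` is a stand-in for a trained model's score, and exhaustive enumeration only works for very short peptides; a practical search over real proteins would need something like beam search or sampling.

```python
from itertools import product

# Truncated synonymous-codon table (assumption: illustration only).
SYNONYMOUS_CODONS = {
    "M": ["ATG"],
    "A": ["GCT", "GCC", "GCA", "GCG"],
    "K": ["AAA", "AAG"],
    "W": ["TGG"],
}

def synonymous_candidates(protein):
    """Yield every codon sequence that encodes the given protein."""
    choices = [SYNONYMOUS_CODONS[aa] for aa in protein]
    for combo in product(*choices):
        yield list(combo)

def best_candidate(protein, score):
    """Return the candidate with the highest score (first one wins ties)."""
    return max(synonymous_candidates(protein), key=score)

def toy_score(codons):
    """Stand-in for a model score: counts codons ending in G or C."""
    return sum(codon[-1] in "GC" for codon in codons)

candidates = list(synonymous_candidates("MAKW"))
print(len(candidates))                      # 8 = 1 * 4 * 2 * 1
print(best_candidate("MAKW", toy_score))    # ['ATG', 'GCC', 'AAG', 'TGG']
```

The number of candidates is the product of the family sizes, so it grows exponentially with protein length, which is why the model's score has to guide the search rather than rank an exhaustive list.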

The 64-Token Vocabulary and What It Forces

One underappreciated design choice in this work is how clean the codon tokenization is. Each codon maps to exactly one token. The vocabulary has 64 codon tokens plus five special tokens for a total of 69, or 94 in the multi-species variant that adds twenty-five species-conditioning tokens. There is no subword segmentation, no ambiguity, no fragmentation of biological units across token boundaries. Each position in the sequence corresponds to exactly one amino acid position in the protein.
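The vocabulary arithmetic is easy to reproduce. The sketch below builds a codon vocabulary and a fixed-stride tokenizer; the special-token names follow common RoBERTa conventions and the species-token names are placeholders, both assumptions rather than the tokens OpenMed actually ships.

```python
from itertools import product

# Assumed special-token names (RoBERTa-style); the source only says there are five.
SPECIAL_TOKENS = ["<s>", "<pad>", "</s>", "<unk>", "<mask>"]
CODONS = ["".join(p) for p in product("ACGT", repeat=3)]  # all 64 triplets

def build_vocab(species_tokens=()):
    """Map every token to an integer id: specials, then codons, then species."""
    tokens = SPECIAL_TOKENS + CODONS + list(species_tokens)
    return {token: idx for idx, token in enumerate(tokens)}

def tokenize(cds, vocab):
    """Split a coding sequence into non-overlapping codons and look up ids."""
    cds = cds.upper()
    if len(cds) % 3 != 0:
        raise ValueError("coding sequence length must be a multiple of 3")
    unk = vocab["<unk>"]
    return [vocab.get(cds[i:i + 3], unk) for i in range(0, len(cds), 3)]

single_species = build_vocab()
# Placeholder species tokens; the real names are not listed in full in the post.
multi_species = build_vocab([f"[SPECIES_{i}]" for i in range(25)])

print(len(single_species))  # 69
print(len(multi_species))   # 94
print(len(tokenize("ATGGCTAAATGG", single_species)))  # 4 codons, 4 tokens
```

The fixed stride of three is the whole tokenizer: there is nothing to learn and nothing to merge, which is the contrast with k-mer and subword schemes.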

This contrasts with nucleotide-level models like DNABERT, which operate on k-mers or individual bases and have to learn codon structure from the data. The codon tokenization encodes that structure directly, which is probably a significant reason why RoBERTa converges so cleanly here. The architecture sees 64 symbols with a specific biological meaning rather than a large alphabet of overlapping substrings.

The Perplexity Paradox

The most instructive comparison in the OpenMed work is not between architectures. It is between two versions of the same architecture trained with different schedules.

CodonRoBERTa-large v1 and v2 are identical models: 312 million parameters, 24 layers, 1024-dimensional embeddings, RoBERTa masked language modeling objective. The only differences are a halved learning rate (from 1×10⁻⁴ to 5×10⁻⁵) and doubled warmup steps (from 1,000 to 2,000). The results:
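A small sketch shows how different the two schedules are early in training. It assumes a linear warmup from zero to the peak rate; the source gives only the peak learning rates and warmup lengths, so the warmup shape is an assumption and any decay after warmup is left out.

```python
def lr_at_step(step, peak_lr, warmup_steps):
    """Linear warmup to peak_lr, then hold (decay omitted: not given in the source)."""
    return peak_lr * min(1.0, step / warmup_steps)

# Values taken from the comparison above.
V1 = {"peak_lr": 1e-4, "warmup_steps": 1000}
V2 = {"peak_lr": 5e-5, "warmup_steps": 2000}

for step in (250, 500, 1000, 2000):
    v1 = lr_at_step(step, **V1)
    v2 = lr_at_step(step, **V2)
    print(f"step {step:>4}: v1={v1:.2e}  v2={v2:.2e}  ratio={v1 / v2:.1f}")
```

Under this assumption, halving the peak and doubling the warmup compound: during the first 1,000 steps v2 takes steps a quarter the size of v1's, and the gap only narrows to a factor of two once both warmups have finished.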

Version    Perplexity    CAI Spearman
v1         4.01          0.025
v2         4.10          0.404

v2 has slightly worse perplexity but a CAI correlation roughly sixteen times higher (0.404 versus 0.025). The model that is better at predicting masked tokens is worse at capturing the codon preferences that matter biologically.

This is not paradoxical once you think carefully about what perplexity measures. Masked language model training minimizes prediction loss on the actually observed tokens. A model trained too aggressively learns to recover masked codons by memorizing local sequence context: what codon tends to appear after this particular pattern in this dataset. That captures distributional associations in the training corpus rather than the underlying biological preference structure. The codon frequency preferences that CAI measures are a global property of the genome, not a local sequence motif.

The slower training schedule appears to push the model toward learning that global structure. The model that fits slightly less tightly to the training distribution generalizes to the actual codon usage biases. The practically meaningful metric improved by a factor of sixteen while perplexity moved in the wrong direction.

For anyone building biological sequence models, this is a sharp illustration of why domain-specific evaluation metrics are not optional. A held-out perplexity score on genomic sequences is insufficient evidence that a model has learned anything biologically useful. The standard NLP evaluation scaffold will happily report improving numbers while the model’s biological signal degrades.

Why ModernBERT Failed

The architecture comparison tells a related story. ModernBERT, released in late 2024, represents genuine architectural progress: rotary position embeddings, Flash Attention 2, better scaling behavior, and strong performance on code and scientific text. The OpenMed team initialized ModernBERT-base from its pretrained checkpoint and trained it on the same codon sequences as their RoBERTa variants.

ModernBERT achieved a perplexity of 26.24. CodonRoBERTa-base, trained from random initialization, achieved 4.01 on the same test set. That is a roughly 6.5-fold gap, and the ModernBERT score is worse than the 6-million-parameter CodonBERT baseline from Sanofi.

The difference is initialization. ModernBERT was pretrained on English text, code, and scientific papers. Its weights encode statistical patterns from those domains. When fine-tuned on codon sequences, the model has to unlearn its own initialization. The residual stream has been shaped by natural language; the attention patterns reflect co-occurrence statistics from words and subwords that have no structural relationship to codons.

This is a known failure mode in transfer learning, but it is worth stating clearly because the instinct to reach for the architecturally superior model is reasonable. ModernBERT is not a bad architecture. The lesson is that pretraining domain matters more than architectural improvements when the target domain is sufficiently different from the source domain. A randomly initialized RoBERTa trained from scratch on codon sequences outperforms a state-of-the-art architecture pretrained on the wrong distribution.

For genomics work specifically, this reinforces that in-domain pretraining from scratch remains the right approach, even as architectural innovations in natural language processing accelerate. The ESM protein language models from Meta and the Nucleotide Transformer from InstaDeep both reach the same conclusion from different angles: biological sequences require biological pretraining data.

Multi-Species Conditioning

The multi-species model scales this methodology across nineteen bacterial species, three yeast species, and three mammalian species: human, mouse, and Chinese hamster, the source of the CHO cells that are the standard workhorse of biopharmaceutical protein production. The tokenization adds one species token per organism, prepended to each sequence:

[HUMAN] ATG GCT AAA TGG ...
[ECOLI] ATG GCT AAA TGG ...

This is a straightforward conditioning approach that works well. The species tokens give the model a context signal that allows it to maintain organism-specific codon preference distributions within a single set of weights. The alternative, training twenty-five separate models, would cost roughly the same compute but produce models that cannot transfer knowledge across organisms.
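The conditioning step amounts to one extra token at the front of the codon sequence. In the sketch below, the two species tokens come from the example above; the dictionary keys, the function name, and the omission of any begin-of-sequence token are assumptions made for illustration.

```python
# Only the two tokens shown in the example above; the other 23 are omitted.
SPECIES_TOKENS = {"human": "[HUMAN]", "ecoli": "[ECOLI]"}

def condition_on_species(cds, species):
    """Prepend the species token to the codon-split coding sequence."""
    if len(cds) % 3 != 0:
        raise ValueError("coding sequence length must be a multiple of 3")
    codons = [cds[i:i + 3] for i in range(0, len(cds), 3)]
    return [SPECIES_TOKENS[species]] + codons

print(" ".join(condition_on_species("ATGGCTAAATGG", "human")))
# [HUMAN] ATG GCT AAA TGG
print(" ".join(condition_on_species("ATGGCTAAATGG", "ecoli")))
# [ECOLI] ATG GCT AAA TGG
```

The codon tokens are identical in both outputs; only the first token differs, and that single token is what lets one set of weights hold several organism-specific preference distributions.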

The transfer learning result for E. coli validates this clearly. Fine-tuning on 8,547 E. coli sequences, starting from the multi-species base, produces a specialist that outperforms the base model. That is a very small dataset for fine-tuning a 312-million-parameter model. The result reflects genuine cross-species codon preference knowledge embedded in the shared weights from the 381,283-sequence multi-species pretraining corpus.

The approach also creates a natural path for extending coverage to organisms with limited annotated coding sequences. Prepend the closest species token at inference time, then fine-tune on however many sequences are available. The multi-species base provides the starting distribution; even a small amount of organism-specific data can shift it meaningfully.

The Full Pipeline

OpenMed frames CodonRoBERTa as the final stage of an end-to-end protein engineering stack: ESMFold for structure prediction from amino acid sequence, ProteinMPNN for inverse folding (designing sequences that fold into a target backbone), and CodonRoBERTa for codon-optimizing the resulting sequences for expression in the target host. The average pTM score of 0.79 from ESMFold on their thirty-chain test set is consistent with correct fold topology. ProteinMPNN recovering 42% of native amino acid identity using only backbone geometry is within the expected range for that model on diverse structures.

Each stage in this pipeline is open-source and commercially licensable under permissive terms. The combined cost of the CodonRoBERTa training, around $165 for all six models across all species, means the hardest part of this stack to reproduce is not compute or access. It is knowing which hyperparameter settings will produce biologically meaningful representations rather than just low perplexity.

What Accessibility Means Here

NUWA, a competing codon optimization model, was trained on 115 million sequences and has not been publicly released. mRNABERT used 18 million sequences. OpenMed built competitive models from 381,000 sequences and published everything under Apache-2.0, including training code, data download scripts, and all six model checkpoints on Hugging Face.

The accessibility argument for open biological ML is often made in terms of scientific reproducibility. The more immediate practical point is that organizations working on mRNA therapeutics for neglected diseases, or academic labs engineering proteins for non-commercial research, can run this pipeline without a large compute budget or proprietary model access. The models load with standard Hugging Face transformers tooling and require 16 GB of VRAM for inference.

The biology has not changed. The 64 codons, the tRNA availability constraints, the CAI metric, the need to match codon usage to the expression host: all of that predates transformers by decades. What the OpenMed work demonstrates is that the modeling cost has dropped to the point where the limiting factor is understanding what you are actually trying to measure, not whether you can afford to find out. The $165 is not the story. The sixteen-fold improvement from changing a learning rate schedule is.