
Caption Quality, Noise Schedules, and Why Text-to-Image Training Recipes Outrank Architecture

Source: huggingface

There is a recurring pattern in ML research where the field fixates on architecture while the biggest gains sit in the training recipe. Photoroom’s recent ablation study on their PRX-1.2B text-to-image model, published in February 2026, is a precise demonstration of this. The model was originally released as part of their ongoing PRX series; the study systematically isolates individual training decisions and ranks them by measured impact. The results are worth dwelling on, because the ranking does not match intuitions formed from reading architecture papers.

The Setup

The baseline is a 1.2B parameter single-stream transformer trained with vanilla flow matching at 256x256 resolution in Flux VAE latent space. Training runs for 100k steps on 1M synthetic MidJourney V6 images. The optimizer is AdamW with lr=1e-4, betas=(0.9, 0.95), epsilon=1e-15. The text encoder is GemmaT5. EMA is disabled. The baseline lands at FID=18.20, CMMD=0.41, DINO-MMD=0.39 at 3.95 batches/sec.
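Concretely, each step of vanilla flow matching draws a random timestep, linearly interpolates between noise and the clean latent, and regresses the constant velocity along that path. A minimal NumPy sketch of the training pair (my illustration, not Photoroom's code; sign conventions for the path direction vary between implementations):

```python
import numpy as np

def flow_matching_pair(latents, rng):
    """Sample one training pair for vanilla (linear) flow matching.

    latents: clean VAE latents of shape (B, C, H, W).
    Returns the interpolated input x_t, the per-sample timesteps t, and
    the constant-velocity target the denoiser regresses with MSE.
    """
    noise = rng.standard_normal(latents.shape)
    t = rng.uniform(size=(latents.shape[0], 1, 1, 1))  # one t per sample
    x_t = (1.0 - t) * noise + t * latents              # straight-line path
    velocity_target = latents - noise                  # d(x_t)/dt along the path
    return x_t, t, velocity_target
```

The model learns to predict `velocity_target` from `x_t`, `t`, and the text embedding; sampling then integrates the predicted velocity from pure noise toward data.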

The choice to run controlled ablations rather than report a single final recipe is the most valuable thing about the study. Every intervention is isolated. That isolation is what makes the ranking trustworthy.

The Dominant Factor Is Caption Length

The largest single swing in the entire study is not an architectural change. It is whether you use long, descriptive captions or short ones.

Switching from long captions (approximately 160 words describing lighting, composition, color palette, atmosphere, and style) to short ones (roughly one sentence) moves FID from 18.20 to 36.84, CMMD from 0.41 to 0.98, and DINO-MMD from 0.39 to 1.14. None of the architectural interventions in the study come close to this magnitude.

The mechanism is straightforward once stated. Each training step associates a latent representation with a text embedding. A richer text embedding imposes a more precise conditioning signal, which gives the model more information to work with per gradient update. Short captions allow the model to learn only coarse associations; the fine-grained structure in the image has no corresponding signal in the text to anchor it.

This has a direct practical consequence. If you are training a text-to-image model and you do not have a pipeline for generating detailed captions, no amount of architecture improvement will recover the quality you are leaving on the table. The Photoroom team uses a recaptioning pipeline to generate their long captions, which is consistent with the approach taken in earlier influential work on training data quality such as the LAION recaptioning experiments and the synthetic data pipeline described in the DataComp work.

Tokenizer Quality Is the Second Largest Lever

After caption richness, the next most impactful intervention is the VAE or tokenizer used to encode images into the latent space the diffusion model operates in. The baseline uses the Flux VAE. Switching to either Flux2-AE (32 channels) or REPA-E-VAE produces approximately a 6-point FID improvement. REPA-E-VAE lands at FID=12.08 with 3.39 batches/sec, while Flux2-AE reaches FID=12.07 at 1.79 batches/sec.

The 55% throughput penalty of Flux2-AE reflects its higher channel count. REPA-E-VAE achieves essentially identical quality at nearly twice the training throughput, which makes it the practical default unless you are willing to pay in compute.
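Some back-of-envelope arithmetic makes the channel-count tradeoff concrete. The 8x spatial downsample, the 16-channel Flux VAE latent, and 2x2 patchification below are my assumptions about the Flux-family setup, not figures from the study:

```python
def latent_tokens(image_size, channels, downsample=8, patch=2):
    """Sequence length and per-token width after VAE encode + patchify."""
    side = image_size // downsample          # latent spatial side
    n_tokens = (side // patch) ** 2          # DiT sequence length
    token_dim = channels * patch * patch     # features entering the patch embed
    return n_tokens, token_dim

# Doubling channels leaves the token count unchanged but doubles the
# per-token features (and makes the autoencoder itself heavier to run).
flux_like = latent_tokens(256, channels=16)    # (256, 64)
flux2_like = latent_tokens(256, channels=32)   # (256, 128)
```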

What this tells you is that the latent space quality determines an upper bound on what the diffusion model can learn. If the tokenizer cannot faithfully encode fine image detail, the model cannot reconstruct it regardless of training time. This is a familiar concept from language modeling, where tokenizer design has measurable downstream effects on model capability, but it is underappreciated in the image generation literature relative to the attention paid to denoising network architectures.

REPA Alignment Helps Early, Then Gets in the Way

Representation alignment (REPA) adds a patch-level auxiliary loss between intermediate model features and a frozen vision teacher. The PRX team evaluates both DINOv2 and DINOv3 as teachers. DINOv3 achieves FID=14.64 versus DINOv2’s 16.60, at a 12.4% versus 7.4% throughput reduction respectively.

The nuance here is in the scheduling. The study finds that REPA should be disabled after approximately 200k steps. The explanation is capacity mismatch: the diffusion model outgrows the teacher’s representation as training continues, and the alignment loss starts constraining the model toward representations that are less expressive than what it could learn without the constraint. This is a meaningful implementation detail that is easy to miss if you only read the original REPA paper without running ablations at longer training durations.
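The mechanics are simple enough to sketch. The version below is my simplification (the real projection head and loss weighting differ); it shows the patch-level cosine alignment and the step-based cutoff:

```python
import numpy as np

def repa_loss(model_feats, teacher_feats, step, disable_after=200_000):
    """Patch-level REPA-style alignment loss with a training-step cutoff.

    model_feats:   (B, N, D) projected intermediate diffusion features.
    teacher_feats: (B, N, D) frozen vision-teacher features (e.g. DINOv3).
    Past `disable_after`, the auxiliary loss is switched off so the
    teacher stops constraining the model.
    """
    if step >= disable_after:
        return 0.0
    a = model_feats / np.linalg.norm(model_feats, axis=-1, keepdims=True)
    b = teacher_feats / np.linalg.norm(teacher_feats, axis=-1, keepdims=True)
    cos = (a * b).sum(axis=-1)            # per-patch cosine similarity
    return float((1.0 - cos).mean())      # minimized when features align
```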

Improved REPA (iREPA), which replaces the MLP projection head with a 3x3 convolutional projection and adds spatial normalization, shows consistent behavior with DINOv2 but degrades when using DINOv3. The inconsistency is enough reason to avoid it as a default.

Token Routing Effectiveness Scales With Resolution

TREAD (Token Routing for Efficient Architecture-agnostic Diffusion) routes approximately 50% of tokens through an identity bypass, skipping contiguous layer blocks. At 256x256, this yields only a 4.1% throughput gain while degrading FID by 3.41 points. The token count at this resolution is too low for routing to be effective.

At 1024x1024 in pixel space with 32x32 patches, the picture reverses: TREAD improves FID from 17.42 to 14.10 (a 19% improvement), CMMD from 0.71 to 0.46, DINO-MMD from 0.56 to 0.37, while increasing throughput by 23% from 1.33 to 1.64 batches/sec. This is a rare case where a sparsification technique simultaneously improves quality and speed, which happens because forcing the model to route tokens creates an implicit selection pressure toward the most semantically informative tokens.
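The routing idea itself is compact. Here is a sketch of the mechanism (an illustration, not the TREAD implementation; `block` stands in for the contiguous layers being bypassed):

```python
import numpy as np

def tread_route(tokens, block, keep_frac=0.5, rng=None):
    """Route a random subset of tokens through `block`; the rest bypass it.

    tokens: (B, N, D). Dropped tokens take the identity path and are
    scattered back into their original positions afterwards.
    """
    rng = rng or np.random.default_rng()
    batch, n, _ = tokens.shape
    k = int(n * keep_frac)
    out = tokens.copy()                          # identity bypass
    for i in range(batch):
        idx = rng.choice(n, size=k, replace=False)
        out[i, idx] = block(tokens[i, idx])      # only k of n tokens pay compute
    return out
```

At 256x256 the sequence is short, so halving it saves little; at 1024x1024 the savings dominate, consistent with the numbers above.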

SPRINT takes a different approach: dense early layers, 75% token dropout in sparse middle layers, then dense fusion with a residual. At 1024x1024 it reaches 1.89 batches/sec (a 42% throughput gain over the dense baseline) with FID=16.90. It is faster than TREAD but produces slightly noisier output; the choice between them is a straightforward speed-versus-quality tradeoff.

The BF16 Weight Storage Problem

Storing model weights in BF16 (not just using BF16 for compute) causes significant convergence degradation: FID goes from 18.20 to 21.87, CMMD from 0.41 to 0.61. The sensitive operations are LayerNorm, RMSNorm, attention softmax, and RoPE, which need FP32 for numerical stability, along with weight storage and optimizer state, which need FP32 so that small gradient updates are not rounded away.

This is not a new finding in principle. The distinction between mixed-precision compute (BF16 activations and matrix multiplications) and precision of weight storage has been discussed in the context of large language model training. But it is useful to see it quantified in the image generation context, particularly because many training scripts default to BF16 everywhere for memory savings. The correct configuration is BF16 autocast for compute, FP32 for weight storage and optimizer state.
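The failure mode is easy to reproduce in isolation. NumPy has no bfloat16, so float16 stands in below; bfloat16 carries even fewer mantissa bits (7 versus 10), so the effect in real training is worse:

```python
import numpy as np

def accumulate(weight_dtype, update=1e-4, steps=1000):
    """Apply many tiny updates to a weight stored at a given precision."""
    w = np.array(1.0, dtype=weight_dtype)
    step_val = np.array(update, dtype=weight_dtype)
    for _ in range(steps):
        # The sum is rounded back into the stored dtype. If the update is
        # below half the dtype's spacing at w, it is rounded away entirely.
        w = (w + step_val).astype(weight_dtype)
    return float(w)

full = accumulate(np.float32)   # ~1.1: the updates accumulate
half = accumulate(np.float16)   # 1.0: every single update rounds to nothing
```

Keeping an FP32 master copy of the weights (and optimizer state) while running matmuls in BF16 avoids exactly this stall.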

Optimizer Choice: Muon Over AdamW

Switching from AdamW to Muon produces FID=15.55, CMMD=0.36, DINO-MMD=0.35 from the baseline of 18.20/0.41/0.39. This is a meaningful improvement across all metrics with no throughput cost. The practical constraint is that Muon currently supports DDP only; FSDP requires community variants.

Muon orthogonalizes the gradient update for each weight matrix, which improves the optimization geometry of transformer layers. Its adoption has been growing in the language model community, and the PRX results suggest the same dynamics apply to image diffusion transformers.
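The heart of Muon is a Newton-Schulz iteration that replaces each matrix's momentum-smoothed gradient with an approximately orthogonal matrix. A NumPy sketch (the quintic coefficients are the ones published in the Muon reference implementation; momentum, update scaling, and distributed logic are omitted):

```python
import numpy as np

def ns_orthogonalize(g, steps=5):
    """Approximately orthogonalize g, pushing its singular values toward 1.

    The quintic iteration keeps singular values in a band around 1 rather
    than converging exactly, which is sufficient for the Muon update.
    """
    a, b, c = 3.4445, -4.7750, 2.0315
    x = g / (np.linalg.norm(g) + 1e-7)   # Frobenius norm bounds the spectral norm
    transpose = x.shape[0] > x.shape[1]
    if transpose:
        x = x.T                          # iterate on the wide orientation
    for _ in range(steps):
        s = x @ x.T
        x = a * x + (b * s + c * (s @ s)) @ x   # applies f(sigma) to each singular value
    return x.T if transpose else x
```

Because all singular values of the result sit near 1, every direction in the weight matrix receives a comparably sized step, instead of a few dominant directions absorbing most of the update.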

The Ranking and What It Implies

Ordering interventions by FID improvement:

  1. Caption richness: 18.64 FID points
  2. Tokenizer quality: approximately 6.0 FID points
  3. REPA alignment: 3.56 FID points
  4. Muon optimizer: 2.65 FID points
  5. Token routing (TREAD at 1024x1024): enables stable high-res training with substantial quality gains

The top two items are about data representation, not model architecture. The third and fourth are training efficiency tools. Architecture variants like contrastive flow matching or x-prediction matter primarily for enabling pixel-space training at scale or for specific use cases, not as standalone quality improvements at the resolutions most commonly trained.

For practitioners, this study is a strong argument for auditing your data pipeline before your model architecture. Long, accurate captions and a high-quality VAE are not glamorous contributions to a paper, but they move metrics more than most architecture papers published in 2025. The full source code for the PRX training framework is promised in Part 3, which will make these findings reproducible rather than theoretical. That reproducibility is what will determine whether the community actually updates its priors.
