The Training Decisions That Compound: Lessons from Photoroom's T2I Ablations
Source: huggingface
Photoroom published the second part of their PRX training series in early February 2026, documenting a methodical set of ablations across training techniques for text-to-image diffusion models. Looking back at it now, what stands out is not any single technique but the overall shape of the results: the decisions that get the least attention in the research literature often account for more FID points than the decisions that get the most.
The baseline they start from is deliberately minimal: flow matching with AdamW, 100K steps on 1M synthetic MidJourney V6 images at 256x256, T5Gemma as the text encoder, rotary positional embeddings, and no EMA. It reaches an FID of 18.2. Every technique in the paper is evaluated against that anchor with three quality metrics (FID, CMMD via CLIP embeddings, DINO-MMD via DINOv2 features) plus throughput in batches per second. The throughput tracking matters because every percentage point of slowdown compounds across long training runs.
Caption Length: The Largest Lever in the Study
The biggest quality gap in the entire paper comes from caption length. Replacing long, detailed captions with short ones pushes FID from 18.2 to 36.84. DINO-MMD goes from 0.39 to 1.14. That is an 18.64 FID gap from a data pipeline decision.
The mechanism is straightforward once you think about it. Short captions leave most of the visual degrees of freedom unspecified. The model, faced with a short label and an enormous space of valid completions, learns to average over that space. The result is blurry in composition, texture, and detail. A long caption that specifies a rabbit’s fur coloring, ear droop, posture, eye color, and the objects visible in the background gives the model a fully constrained target per sample. The network learns to place things; with short captions, it learns a diffuse distribution over where things might plausibly go.
The practical consequence is that caption quality is a data infrastructure problem before it is a modeling problem. A team running on short captions and investing effort in architecture search is optimizing the wrong constraint. Recaptioning with a capable vision-language model, or constructing captions that fully describe scene composition and texture, routinely moves the metric more than the next paper’s technique.
Numerical Precision as a Silent Tax
The precision result is the most directly actionable finding in the paper, and the one most likely to surface as a subtle bug in training scripts derived from other codebases.
Storing model weights in BF16 rather than FP32 (while still using BF16 autocast for compute) costs 3.67 FID points: from 18.20 to 21.87, with CMMD degrading from 0.41 to 0.61. The operations most sensitive to this are layer normalization and RMSNorm statistics, attention logits and softmax, rotary positional embeddings, and optimizer state dynamics. All of these involve accumulated small errors that compound across layers and training steps.
BF16 provides 8 exponent bits and 7 mantissa bits, giving it the same numerical range as FP32 but far less precision per value (FP32 carries 23 mantissa bits). The standard mixed-precision setup that avoids this problem keeps master weights in FP32 and uses autocast only for forward and backward passes. In PyTorch this looks like:
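    import torch

    # Parameters and optimizer state stay in FP32; autocast only lowers compute precision.
    # BF16 has FP32's exponent range, so no GradScaler (loss scaling) is needed.
    optimizer.zero_grad(set_to_none=True)
    with torch.autocast(device_type='cuda', dtype=torch.bfloat16):
        loss = model(x)
    loss.backward()
    optimizer.step()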
The weights in the optimizer remain in FP32 throughout. The common shortcut of casting the model, or a checkpoint that training later resumes from, to BF16 to save memory or storage gives up a meaningful chunk of quality for no algorithmic reason. Because training still converges, the resulting FID degradation is easy to mistake for run-to-run noise rather than a discrete failure mode.
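To make the precision argument concrete, here is a minimal sketch in plain Python that emulates BF16 rounding on the FP32 bit pattern and applies many small updates to a single stored weight. The helper names (`to_bf16`, `to_fp32`, `apply_updates`) and the numbers (a weight of 1.0, 1000 updates of 1e-3) are illustrative assumptions, not values from the paper.

```python
import struct


def to_fp32(x: float) -> float:
    """Round a Python float (FP64) to the nearest FP32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def to_bf16(x: float) -> float:
    """Round to the nearest BF16 value (round-to-nearest-even on the FP32 bits).

    Handles ordinary finite values only; NaN and infinity are not special-cased.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # BF16 keeps the top 16 bits of the FP32 pattern: sign, 8 exponent bits,
    # 7 mantissa bits. Add a rounding bias, then drop the low 16 bits.
    bias = 0x7FFF + ((bits >> 16) & 1)
    bits = ((bits + bias) >> 16) << 16
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def apply_updates(weight: float, update: float, steps: int, store) -> float:
    """Add `update` repeatedly, rounding the stored weight with `store` each step."""
    for _ in range(steps):
        weight = store(weight + update)
    return weight


# Hypothetical numbers: one weight at 1.0 receiving 1000 updates of 1e-3 each.
w_fp32 = apply_updates(1.0, 1e-3, 1000, to_fp32)
w_bf16 = apply_updates(1.0, 1e-3, 1000, to_bf16)
print(f"FP32 storage: {w_fp32:.4f}")
print(f"BF16 storage: {w_bf16:.4f}")
```

Near 1.0, adjacent BF16 values are 2^-7 (about 0.0078) apart, so each 1e-3 update rounds back to the value it started from and the BF16-stored weight never moves, while the FP32-stored weight ends close to 2.0. Real training is less extreme than this toy case, but the same rounding is what erodes small updates when weights live in BF16.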
Representation Alignment: Effective, Conditional, and Best Used as Burn-In
REPA, introduced in a 2024 paper by Yu et al. and accepted at ICLR 2025, aligns the diffusion model’s intermediate token representations with the patch features of a frozen pretrained vision encoder during training. The combined loss is:
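$$\mathcal{L} = \mathcal{L}_{\mathrm{FM}} + \lambda \, \mathcal{L}_{\mathrm{REPA}}$$
where
$$\mathcal{L}_{\mathrm{REPA}} = -\mathbb{E}\left[\frac{1}{N} \sum_{n=1}^{N} \mathrm{sim}\left(y_{0,[n]},\; h_\phi\left(h_{t,[n]}\right)\right)\right]$$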
The flow matching loss trains the velocity prediction; the REPA loss pushes intermediate representations toward the semantic geometry of DINOv2 or DINOv3. The effect with DINOv3 as teacher is an FID drop from 18.2 to 14.64 with a 12% throughput cost (3.95 to 3.46 batches per second).
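The alignment term can be sketched without any framework. The version below takes sim to be cosine similarity, averages it over the N patch positions of a single sample, and negates it. The function names, the identity projection standing in for h_φ, and the toy feature vectors are all assumptions made for illustration; a real implementation would operate on batched tensors with a learned projection.

```python
import math
from typing import Callable, Sequence

Vector = Sequence[float]


def cosine_sim(a: Vector, b: Vector, eps: float = 1e-8) -> float:
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / max(norm, eps)


def repa_loss(
    teacher_patches: Sequence[Vector],
    student_tokens: Sequence[Vector],
    project: Callable[[Vector], Vector],
) -> float:
    """Negative mean similarity between teacher patches and projected student tokens."""
    n = len(teacher_patches)
    total = sum(
        cosine_sim(y, project(h)) for y, h in zip(teacher_patches, student_tokens)
    )
    return -total / n


def combined_loss(fm_loss: float, repa: float, lam: float) -> float:
    """Flow matching loss plus the weighted alignment term."""
    return fm_loss + lam * repa


def identity_projection(h: Vector) -> Vector:
    """Stand-in for the learned projection head (hypothetical)."""
    return list(h)


# Toy features: two patch positions, two feature dimensions.
teacher = [[1.0, 0.0], [0.0, 2.0]]
aligned = [[3.0, 0.0], [0.0, 0.5]]      # same directions as the teacher
orthogonal = [[0.0, 1.0], [1.0, 0.0]]   # perpendicular to the teacher

loss_aligned = repa_loss(teacher, aligned, identity_projection)
loss_orthogonal = repa_loss(teacher, orthogonal, identity_projection)
print(loss_aligned, loss_orthogonal)
print(combined_loss(fm_loss=0.8, repa=loss_aligned, lam=0.5))
```

Because the term is a negated similarity, perfectly aligned representations give the minimum value of -1 and unrelated ones give values near 0. The weight λ (0.5 here, chosen arbitrarily) sets how hard the alignment pulls against the flow matching objective.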
The burn-in finding is the part worth emphasizing. REPA provides the most return early in training, when the model’s internal representations are least organized. Photoroom recommends applying it for roughly the first 200K steps, then disabling it. Keeping the auxiliary loss active throughout accumulates throughput cost without proportional continued benefit, since the representational geometry the teacher is enforcing has largely been established.
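A rough sketch of the burn-in schedule and what it costs in wall-clock time follows. The throughput figures (3.95 and 3.46 batches per second) and the 200K-step burn-in come from the text above; the total run length of one million steps, the λ value, and the function names are hypothetical, and the estimate assumes one batch per step with constant throughput.

```python
def repa_weight(step: int, burn_in_steps: int, lam: float) -> float:
    """Weight on the REPA term: `lam` during burn-in, zero afterwards."""
    return lam if step < burn_in_steps else 0.0


def training_hours(
    total_steps: int,
    repa_steps: int,
    plain_bps: float = 3.95,  # batches/s without REPA (from the text)
    repa_bps: float = 3.46,   # batches/s with REPA (from the text)
) -> float:
    """Wall-clock hours when the first `repa_steps` steps run with REPA enabled."""
    repa_steps = min(repa_steps, total_steps)
    seconds = repa_steps / repa_bps + (total_steps - repa_steps) / plain_bps
    return seconds / 3600.0


TOTAL_STEPS = 1_000_000  # hypothetical run length, not from the source
BURN_IN_STEPS = 200_000  # the recommended burn-in window

schedules = [
    ("no REPA", 0),
    ("burn-in only", BURN_IN_STEPS),
    ("always on", TOTAL_STEPS),
]
for label, steps in schedules:
    print(f"{label:>12}: {training_hours(TOTAL_STEPS, steps):.1f} h")
```

Under these assumptions the burn-in schedule pays the slower throughput for only a fifth of the run, so most of the wall-clock gap between "no REPA" and "always on" is recovered.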
The REPA-E variant is more structurally interesting: it applies the alignment loss to the VAE’s latent space rather than the diffusion model’s intermediate tokens, training both the VAE and diffusion model end-to-end under a combined objective. This achieves FID 12.08 at 3.39 batches per second, compared to FID 12.07 for the much heavier Flux2 autoencoder at 1.79 batches per second. Roughly equivalent quality at nearly twice the throughput.
The Muon Optimizer
The paper includes a swap from AdamW to Muon, an optimizer that applies Nesterov momentum and then orthogonalizes the gradient update matrix via Newton-Schulz iteration before applying it to weights. This orthogonalization step produces better-conditioned parameter updates: the update is forced to distribute change more evenly across directions in weight space, rather than concentrating on the directions where gradients happen to be largest.
The result is a 2.65 FID gain (18.20 to 15.55) at no throughput cost and no architectural change. Muon gained significant attention in language model training contexts before seeing uptake in image generation; the Photoroom numbers suggest the benefit transfers cleanly. Standard DDP implementations are available, and FSDP variants for larger-scale use exist in the community.
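The orthogonalization step can be illustrated in a few lines of plain Python. This sketch uses the classic cubic Newton-Schulz iteration for clarity; Muon implementations typically use a tuned higher-order polynomial and run on GPU tensors, so treat this as an illustration of the idea rather than the optimizer's actual code. The helper names and the 2x3 example matrix are hypothetical.

```python
import math
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of two nested-list matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a: Matrix) -> Matrix:
    return [list(col) for col in zip(*a)]


def newton_schulz_orthogonalize(g: Matrix, steps: int = 12) -> Matrix:
    """Push every singular value of `g` toward 1 (assumes full rank, rows <= cols)."""
    # Scale by the Frobenius norm so all singular values are at most 1,
    # which keeps the iteration inside its convergence region.
    fro = max(math.sqrt(sum(v * v for row in g for v in row)), 1e-12)
    x = [[v / fro for v in row] for row in g]
    for _ in range(steps):
        # Cubic Newton-Schulz step: X <- 1.5 X - 0.5 (X X^T) X
        xxt_x = matmul(matmul(x, transpose(x)), x)
        x = [
            [1.5 * xv - 0.5 * yv for xv, yv in zip(x_row, y_row)]
            for x_row, y_row in zip(x, xxt_x)
        ]
    return x


# Hypothetical 2x3 momentum matrix with unequal singular values.
update = [[3.0, 0.0, 1.0], [1.0, 2.0, 0.0]]
ortho = newton_schulz_orthogonalize(update)
gram = matmul(ortho, transpose(ortho))  # should be close to the 2x2 identity
for row in gram:
    print([round(v, 6) for v in row])
```

After the iteration the rows of the update are orthonormal, which is the sense in which change is spread evenly across directions: the large and small singular values of the original matrix have both been mapped to approximately 1.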
Because the 18.2 baseline already uses long captions and FP32 weight storage, adding Muon on top of those two choices takes FID to 15.55 before touching architecture at all.
Token Routing: Resolution-Dependent Economics
TREAD and SPRINT both implement token sparsification during the forward pass. TREAD randomly routes a fraction of tokens around a contiguous block of layers and reinjects them downstream. SPRINT adds structure: dense processing in early layers, sparse middle layers (roughly 25% of tokens processed), then re-expansion and fusion with a dense residual stream.
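The routing idea in TREAD can be sketched as a toy function: pick a random subset of tokens to send through a block of layers, let the rest bypass it unchanged, and write the processed tokens back into their original positions. The function name, the `process_ratio` parameter, and the stand-in block are hypothetical; this is not the TREAD or SPRINT reference code.

```python
import random
from typing import Callable, List, Tuple

Token = List[float]


def route_around_block(
    tokens: List[Token],
    block: Callable[[List[Token]], List[Token]],
    process_ratio: float,
    rng: random.Random,
) -> Tuple[List[Token], List[int]]:
    """Run `block` on a random subset of tokens; the rest bypass it unchanged."""
    n = len(tokens)
    n_process = max(1, round(n * process_ratio))
    processed_idx = sorted(rng.sample(range(n), n_process))
    processed = block([tokens[i] for i in processed_idx])
    out = list(tokens)  # bypassed tokens are reinjected as they were
    for i, tok in zip(processed_idx, processed):
        out[i] = tok
    return out, processed_idx


def toy_block(batch: List[Token]) -> List[Token]:
    """Stand-in for a contiguous block of transformer layers (hypothetical)."""
    return [[v + 100.0 for v in tok] for tok in batch]


tokens = [[float(i)] for i in range(8)]
routed, processed_idx = route_around_block(
    tokens, toy_block, process_ratio=0.5, rng=random.Random(0)
)
print(processed_idx)
print(routed)
```

The block only ever sees `process_ratio` of the sequence, which is where the compute saving comes from; the bypassed tokens still reach the later layers, so no information is discarded outright.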
At 256x256, the economics are poor. TREAD costs about 3.4 FID points (18.2 to 21.61) for a 4% throughput gain. At 1024x1024, the picture reverses. Self-attention scales quadratically in sequence length, so at higher resolution the cost of processing all tokens through all layers is substantially larger, and routing tokens around blocks of layers provides real savings. TREAD at 1024x1024 improves FID from 17.42 to 14.10 while increasing throughput by 23%. SPRINT takes throughput up by 42% with a smaller quality gain.
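The quadratic argument is easy to check with a few lines of arithmetic. The sketch below assumes an 8x VAE downsampling factor and a patch size of 2, neither of which is stated in the source, and counts token pairs as a proxy for self-attention cost.

```python
def num_tokens(image_size: int, vae_downsample: int = 8, patch_size: int = 2) -> int:
    """Tokens per image, under an assumed 8x VAE downsample and 2x2 patches."""
    side = image_size // (vae_downsample * patch_size)
    return side * side


def attention_pairs(tokens: int) -> int:
    """Token pairs scored by one self-attention layer (quadratic in length)."""
    return tokens * tokens


for size in (256, 1024):
    n = num_tokens(size)
    print(f"{size}x{size}: {n} tokens, {attention_pairs(n):,} attention pairs per layer")

# Hypothetical routing rate: half the tokens bypass the routed layers.
n_hi = num_tokens(1024)
routed_fraction = attention_pairs(n_hi // 2) / attention_pairs(n_hi)
print(f"attention cost in routed layers: {routed_fraction:.0%} of dense")
```

Whatever the actual downsampling and patch size, going from 256x256 to 1024x1024 multiplies the token count by 16 and the attention pair count by 256, so halving the tokens in a block of layers removes far more absolute compute at the higher resolution.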
The resolution dependence is the key insight. Token routing is not a universally useful technique; it is a technique whose value scales with token count per sample. Evaluating it at 256x256 and concluding it is not worth the quality cost misses what it actually does at the resolution where it matters.
Synthetic Data for Bootstrapping, Real Data for Texture
The data strategy section compares 1M MidJourney V6 synthetic images against 1M Pexels photographs, evaluated against Unsplash as a reference distribution. Lower FID is better, so real data comes out ahead on that metric (16.6 versus 18.2 for synthetic data), and it also produces better photographic quality on DINO-MMD. What synthetic data offers instead is fast early convergence and clean compositional structure.
The recommended approach is sequential: synthetic data first for fast early convergence and clean compositional structure, then fine-tuning on real data to recover photographic texture and distribution coverage. The fine-tuning stage in the paper uses 3,350 curated image-text pairs for 20K steps after the main pretraining run. The ratio of fine-tuning to pretraining steps (roughly 1:5) reflects how disproportionate the influence of small, high-quality datasets becomes once the model already has good base structure.
What the Numbers Actually Say
The individual results each have value. The compound picture is more useful. The techniques requiring the most engineering investment (REPA-E VAE alignment, JiT pixel-space training, TREAD at scale) have the most conditional value: they matter more at specific resolutions, run lengths, or architectural configurations. The techniques requiring the least investment (correct float precision, detailed captions, optimizer selection) provide unconditional value across configurations.
Caption quality and float precision together represent somewhere between 20 and 25 FID points of improvement that require no novel algorithm and no architecture change. They show up in ablation papers because someone bothered to measure them carefully. Most practitioners encounter them as unexplained variance in their runs. The contribution of a paper like this is making that variance legible.
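For reference, the FID deltas quoted in this article can be tallied against the 18.2 baseline in a few lines. All numbers are the ones reported above. Adding the caption and precision deltas together assumes the two effects are additive, which the separate ablations do not test, so the total is an estimate.

```python
BASELINE_FID = 18.2

# FID of each single-change variant at 256x256, as quoted in the article.
variant_fid = {
    "short captions": 36.84,
    "BF16 weight storage": 21.87,
    "TREAD at 256x256": 21.61,
    "Muon optimizer": 15.55,
    "REPA with DINOv3": 14.64,
}

# Positive delta means worse than baseline, negative means better.
deltas = {name: round(fid - BASELINE_FID, 2) for name, fid in variant_fid.items()}
for name, delta in sorted(deltas.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<22} {delta:+.2f}")

# Assumes additivity, which the ablations do not establish.
avoidable = round(deltas["short captions"] + deltas["BF16 weight storage"], 2)
print(f"captions + precision:  {avoidable:+.2f}")
```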