
The Priority Stack in Text-to-Image Training

Source: huggingface

The Photoroom team’s ablation study on their PRX-1.2B text-to-image model, published in early February 2026, is one of the more useful public documents on the practical mechanics of training diffusion transformers. Most published work in this space reports end results from fully optimized systems; this one systematically isolates variables and publishes the intermediate numbers. The findings push back on some common assumptions about where training effort is best spent.

The Setup

PRX is a 1.2B parameter single-stream diffusion transformer with global attention and RoPE positional encoding, trained on a 1M-image synthetic dataset (MidJourney V6) at 256×256 resolution using a FLUX-compatible VAE. The baseline uses AdamW with a 1e-4 learning rate and GemmaT5 as the text encoder. Each ablation runs for 100K steps, with FID, CMMD (CLIP Maximum Mean Discrepancy), and DINO-MMD as quality metrics alongside training throughput measured in batches per second. The baseline lands at FID 18.2.

The ablations span representation alignment (REPA and variants), training objectives beyond vanilla flow matching (contrastive flow, x-prediction), token routing for efficiency (TREAD and SPRINT), data decisions (caption length, synthetic versus real images), the Muon optimizer as an AdamW replacement, a supervised fine-tuning phase, and numerical precision.

The Priority Stack

The clearest pattern in the results is that improvements closest to the data and the latent space outweigh improvements to the training objective or attention mechanism. A lot of engineering attention in the image generation space goes to the latter two; the numbers here suggest the priorities should be inverted.

Switching from the default FLUX VAE to REPA-E, a latent space with better structure and learnability, drops FID from 18.2 to 12.08, a gain of about six points at a 14% throughput cost. Switching from long, dense captions to short ones moves FID from 18.2 to 36.84, roughly doubling it. These are the two largest effects in the study, and both involve what the model sees rather than how it processes that signal.

By contrast, adding REPA representation alignment (aligning intermediate diffusion features to a frozen vision encoder via cosine auxiliary loss, from Yu et al. 2024) yields a 3.56-point FID improvement with DINOv3 as the teacher at a 7 to 13% throughput cost. Switching to the Muon optimizer gives 2.65 FID points. Contrastive flow matching adds marginal gains.

The practical priority ordering: fix your latent space, fix your captions, then address training dynamics.
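The ordering can be checked directly from the reported numbers. The script below is only a bookkeeping sketch: it recomputes each FID change against the 18.2 baseline using the figures quoted in this article. The dictionary labels are informal names chosen here, not identifiers from the study.

```python
# FID after 100K steps at 256x256, as quoted in this article (lower is better).
BASELINE_FID = 18.2
fid_by_change = {
    "REPA-E latent space": 12.08,
    "REPA with DINOv3 teacher": 14.64,
    "Muon optimizer": 15.55,
    "TREAD at 256x256": 21.6,
    "BF16 weight storage": 21.87,
    "short captions": 36.84,
}

def fid_deltas(baseline, results):
    """Return (label, delta, relative change) rows sorted by absolute effect size."""
    rows = [
        (label, fid - baseline, (fid - baseline) / baseline)
        for label, fid in results.items()
    ]
    return sorted(rows, key=lambda row: abs(row[1]), reverse=True)

ranked = fid_deltas(BASELINE_FID, fid_by_change)
for label, delta, relative in ranked:
    # Negative deltas are improvements over the baseline, positive ones regressions.
    print(f"{label:26s} {delta:+6.2f} FID ({relative:+.0%})")
```

Sorted by magnitude, the caption and latent-space rows come first and the optimizer, alignment, and routing rows follow, which is the ordering the study argues for.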

Why Caption Length Matters This Much

The caption quality finding is striking because the magnitude is large and the mechanism is not obviously about inference-time alignment. Dense captions averaging 80 to 150 words describing spatial layout, object relationships, colors, texture, lighting, and style improve convergence behavior throughout training, not just text-following at inference.

The authors frame it as uncertainty collapse: when the conditioning signal is rich and unambiguous, the model can commit to a specific well-posed solution per training example rather than averaging across a distribution of plausible images that fit a vague description. A short caption like “a rabbit on a table” is consistent with dozens of compositionally distinct images; a long caption specifying fur texture, ear position, surface grain, and background lighting has far fewer valid realizations. Gradient signal per step is more consistent, and the model learns to commit earlier.
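A toy version of the averaging argument may help. Under a squared-error objective, the best single prediction for a given conditioning is the mean of the targets that share it, and the loss that remains is their variance. The sketch below uses made-up scalar "images" and is an illustration of that general fact, not a reproduction of the authors' analysis.

```python
from statistics import fmean

def best_l2_prediction(targets):
    """The constant that minimises mean squared error is the mean of the targets."""
    return fmean(targets)

def irreducible_mse(targets):
    """Squared error left over even when the optimal constant is predicted."""
    centre = best_l2_prediction(targets)
    return fmean([(t - centre) ** 2 for t in targets])

# Hypothetical 1-D stand-ins for images: one scalar per training example.
# A vague caption matches many different targets; a dense caption matches few.
targets_by_caption = {
    "a rabbit on a table": [-2.0, -1.0, 0.0, 1.0, 2.0],
    "a grey lop-eared rabbit on an oak table, soft window light": [0.9, 1.0, 1.1],
}

for caption, targets in targets_by_caption.items():
    print(f"{irreducible_mse(targets):.4f}  {caption}")
```

The vague caption leaves a large residual that no model can remove, so its gradients pull toward an average; the dense caption leaves almost none.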

This connects to the DALL-E 3 paper (Betker et al., 2023), which showed that synthetic long captions generated by a captioning model applied to training images dramatically improve prompt following and multi-object compositionality. PixArt-α (arXiv 2310.00426) made the same finding: a smaller dataset with high-quality captions outperformed one with eight times the images paired with short or noisy text. The Photoroom results quantify this at training time rather than only at inference evaluation, which adds something concrete to the existing literature.

The recommended approach is to train primarily on long synthetic captions and fine-tune on a mix of long and short to handle the shorter prompts users send at inference.
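One way to express that schedule is a per-phase probability of swapping in the short caption. The sketch below assumes each example carries both a long and a short caption. The phase names and the 0.3 mixing probability are hypothetical values for illustration, since the article does not give a ratio.

```python
import random

# Hypothetical probabilities of using the short caption in each training phase.
SHORT_CAPTION_PROB = {"pretrain": 0.0, "finetune": 0.3}  # 0.3 is an assumed value

def pick_caption(example, phase, rng):
    """Return the short caption with the phase's probability, else the long one."""
    use_short = rng.random() < SHORT_CAPTION_PROB[phase]
    return example["short"] if use_short else example["long"]

example = {
    "long": "A grey lop-eared rabbit sitting on a worn oak table, lit from the left by soft window light.",
    "short": "a rabbit on a table",
}
rng = random.Random(0)  # fixed seed so the sketch is repeatable
finetune_batch = [pick_caption(example, "finetune", rng) for _ in range(1000)]
short_share = sum(c == example["short"] for c in finetune_batch) / len(finetune_batch)
print(f"share of short captions during fine-tuning: {short_share:.2f}")
```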

BF16 Weight Storage as a Silent Failure Mode

The numerical precision result deserves attention from anyone encountering quality regressions that don’t track to obvious causes. The baseline keeps weights in FP32 and uses BF16 autocast for compute. Storing the weights themselves in BF16 instead raises FID from 18.2 to 21.87, a roughly 20% quality regression.

BF16 has wide dynamic range but only 7 bits of mantissa. The precision loss accumulates across thousands of training steps in operations that are numerically sensitive: normalization layers (LayerNorm, RMSNorm), attention softmax, RoPE positional encoding computations, and optimizer state updates. The correct configuration is torch.autocast with dtype=torch.bfloat16 for forward and backward passes while maintaining FP32 master weights, which is documented in PyTorch’s AMP guide but easy to misconfigure. The quality impact is large enough that it presents as an unexplained model regression rather than a precision issue, which makes it difficult to isolate without knowing what to look for.
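The update-loss mechanism can be seen without any deep learning framework. With 7 mantissa bits, adjacent BF16 values between 1 and 2 are 2^-7 (about 0.0078) apart, so an update smaller than half that gap is rounded away when the weight is stored back in BF16. The sketch below emulates BF16 storage by rounding float32 bit patterns with the standard library. The weight value, update size, and step count are made-up numbers, and the rounding helper ignores NaN and infinity.

```python
import struct

def to_fp32(value):
    """Round a Python float to the nearest float32 value."""
    return struct.unpack("<f", struct.pack("<f", value))[0]

def to_bf16(value):
    """Round a Python float to the nearest bfloat16 value (ties to even)."""
    bits = struct.unpack("<I", struct.pack("<f", value))[0]
    # bfloat16 is the top 16 bits of float32; add a bias so truncation rounds.
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    bits = ((bits + rounding_bias) >> 16) << 16
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFFFFFF))[0]

def apply_updates(weight, update, steps, store):
    """Add the same small update repeatedly, storing the weight via `store`."""
    for _ in range(steps):
        weight = store(weight + update)
    return weight

# Hypothetical numbers: a weight near 1.0 nudged by 1e-4 for 1000 steps.
bf16_weight = apply_updates(1.0, 1e-4, 1000, to_bf16)
fp32_weight = apply_updates(1.0, 1e-4, 1000, to_fp32)
print(f"BF16 storage: {bf16_weight:.4f}   FP32 master weights: {fp32_weight:.4f}")
```

In this toy case the BF16-stored weight never moves while the FP32 copy drifts to about 1.1. Real training mixes update sizes, so the loss is partial rather than total, but the direction of the effect is the same.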

Token Routing Only Pays Off at Scale

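TREAD (Token Routing for Efficient Architecture-agnostic Diffusion training) sends only a subset of tokens through a span of the transformer’s attention and FFN blocks during training; the remaining tokens bypass those blocks and rejoin the sequence deeper in the network. In the original TREAD formulation the subset is chosen at random rather than by a learned importance scorer. The results show a sharp resolution dependence.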
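The routing idea can be sketched with plain lists. In the code below, blocks inside a chosen span process only a kept subset of tokens, while the other tokens hold their values and rejoin afterwards. The selection rule here is uniform random sampling, which is an assumption of this sketch; any other rule for choosing the subset would slot into the same place. The function and parameter names are hypothetical, and the "blocks" are toy callables rather than attention layers.

```python
import random

def route_tokens(tokens, blocks, route_start, route_end, keep_ratio, rng):
    """Apply blocks in order; blocks in [route_start, route_end) see only a subset."""
    count = len(tokens)
    kept = sorted(rng.sample(range(count), max(1, round(count * keep_ratio))))
    tokens = list(tokens)
    for index, block in enumerate(blocks):
        if route_start <= index < route_end:
            # Only the kept tokens are processed; the rest bypass this block.
            processed = block([tokens[i] for i in kept])
            for i, value in zip(kept, processed):
                tokens[i] = value
        else:
            # Outside the routed span every token is processed as usual.
            tokens = block(tokens)
    return tokens, kept

def add_one(xs):
    """Toy block: counts how many blocks each token has passed through."""
    return [x + 1 for x in xs]

output, kept = route_tokens(
    tokens=[0] * 8, blocks=[add_one] * 4,
    route_start=1, route_end=3, keep_ratio=0.5, rng=random.Random(0),
)
print("kept indices:", kept)
print("blocks seen per token:", output)
```

Kept tokens pass through all four toy blocks and bypassed tokens through two, which is where the compute saving comes from.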

At 256×256, TREAD gives roughly 4% throughput improvement (3.95 to 4.11 batches per second) at the cost of FID rising from 18.2 to 21.6. The tradeoff is unfavorable. At 1024×1024, the numbers invert: TREAD improves FID from 17.42 to 14.10 while increasing throughput from 1.33 to 1.64 batches per second. Quality improves and training speed improves simultaneously.

The explanation comes down to token count. At low resolution, the token count is modest, routing overhead is proportionally expensive relative to savings, and the model has enough capacity to benefit from processing all tokens uniformly. At high resolution, the much larger token count dominates compute, routing provides substantial FLOP reduction, and processing only a subset of tokens in the routed blocks may provide a regularization benefit in addition to the efficiency gain. SPRINT, a more aggressive sparsification approach, shows the same resolution-dependent pattern but with higher variance at 1024×1024.
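The scale of the difference is easy to work out. The sketch below assumes an 8× VAE downsampling factor and a 2×2 patch size, which are common choices but are assumptions here and not figures taken from the study. The throughput numbers are the ones quoted above.

```python
def latent_tokens(image_size, vae_downsample=8, patch_size=2):
    """Sequence length for a square image under the assumed VAE and patch sizes."""
    side = image_size // (vae_downsample * patch_size)
    return side * side

def speedup(before, after):
    """Relative throughput change, e.g. 0.04 means 4% more batches per second."""
    return after / before - 1.0

low_res, high_res = latent_tokens(256), latent_tokens(1024)
print(f"tokens at 256x256:   {low_res}")
print(f"tokens at 1024x1024: {high_res}")
# Global attention compares every token with every other token.
print(f"attention pair ratio: {(high_res / low_res) ** 2:.0f}x")
print(f"TREAD speedup at 256:  {speedup(3.95, 4.11):.1%}")
print(f"TREAD speedup at 1024: {speedup(1.33, 1.64):.1%}")
```

Under these assumptions the sequence grows 16-fold and the attention work 256-fold, so skipping tokens at 1024×1024 removes far more compute than it does at 256×256.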

This matters for high-resolution training planning specifically. Token routing at that scale is not simply an efficiency technique; it can improve generation quality while reducing compute, which is a more compelling argument than efficiency alone.

Muon as a Serious AdamW Alternative

Muon (Momentum Orthogonalized by Newton-Schulz), developed by Keller Jordan through the modded-nanogpt community speedrun project, applies Newton-Schulz iteration to orthogonalize gradient update matrices before applying them. This approximates steepest descent under the spectral norm, capturing correlations between parameters within each weight matrix that AdamW’s diagonal preconditioner misses. Conceptually it sits between AdamW and full second-order methods like Shampoo, with Newton-Schulz iterations cheap enough to apply per step without storing large factor matrices.
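To make the orthogonalization step concrete, here is a dependency-free sketch on a small matrix. It uses the classic cubic Newton-Schulz iteration, X ← 1.5·X − 0.5·(X Xᵀ)X, which is a stand-in: Muon as published uses a tuned quintic polynomial and only a handful of steps, and runs on accelerator tensors, not Python lists. The example matrix is made up.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def newton_schulz_orthogonalize(update, steps=15):
    """Push every singular value of `update` toward 1 (assumes rows <= cols)."""
    norm = math.sqrt(sum(v * v for row in update for v in row))
    x = [[v / norm for v in row] for row in update]  # singular values now <= 1
    for _ in range(steps):
        gram = matmul(x, transpose(x))   # X X^T
        correction = matmul(gram, x)     # X X^T X
        x = [
            [1.5 * xv - 0.5 * cv for xv, cv in zip(x_row, c_row)]
            for x_row, c_row in zip(x, correction)
        ]
    return x

# Hypothetical 2x3 momentum matrix with unequal singular values.
momentum = [[2.0, 0.0, 1.0], [0.0, 1.0, -1.0]]
orthogonal_update = newton_schulz_orthogonalize(momentum)
gram = matmul(orthogonal_update, transpose(orthogonal_update))
print([[round(v, 6) for v in row] for row in gram])
```

The product of the result with its own transpose comes out as the identity up to rounding: the update keeps its directions but gives every direction equal magnitude, which a per-element preconditioner cannot do.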

The Photoroom results show Muon drops FID from 18.2 to 15.55 with faster early convergence and cleaner training progress in the initial phase. This is consistent with language model training results, where Muon reaches target validation loss in fewer steps than AdamW at comparable compute budgets. The practical constraint is that Muon applies cleanly to weight matrices under DDP but requires community variants for FSDP, and biases, embeddings, and 1D parameters still need AdamW. It is a configuration decision with meaningful FID payoff, not a drop-in replacement.
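The parameter split described above might look like the following. This is a sketch under assumed naming conventions: the parameter names, the shapes, and the keyword list are hypothetical, and a real model would need the rule adapted to its own module names.

```python
def split_for_muon(named_shapes, adamw_keywords=("embed", "bias", "norm")):
    """Send 2-D weight matrices to Muon and everything else to AdamW."""
    muon, adamw = [], []
    for name, shape in named_shapes:
        is_matrix = len(shape) == 2
        excluded = any(keyword in name for keyword in adamw_keywords)
        (muon if is_matrix and not excluded else adamw).append(name)
    return muon, adamw

# Hypothetical parameter names and shapes for a single transformer block.
params = [
    ("token_embed.weight", (32000, 1024)),  # 2-D, but an embedding table
    ("blocks.0.attn.qkv.weight", (3072, 1024)),
    ("blocks.0.attn.qkv.bias", (3072,)),
    ("blocks.0.norm1.weight", (1024,)),
    ("blocks.0.mlp.fc1.weight", (4096, 1024)),
]
muon_params, adamw_params = split_for_muon(params)
print("Muon: ", muon_params)
print("AdamW:", adamw_params)
```

The embedding table is the case worth noticing: it is two-dimensional, so a shape check alone would hand it to Muon, which is why the sketch also filters by name.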

REPA as Burn-In, Not Permanent Constraint

The representation alignment results carry a nuance the summary numbers alone don’t convey. REPA aligns intermediate diffusion transformer features to a frozen vision encoder using a cosine auxiliary loss alongside the main flow matching objective. At 100K steps, REPA-DINOv3 reaches FID 14.64 versus baseline 18.2, and generated images show cleaner global structure earlier in training.

The authors recommend disabling REPA after roughly 200K steps. The argument is capacity mismatch: as training progresses, the denoising network needs to model aspects of images that the vision encoder was not trained to represent, including texture synthesis, fine detail generation, and noise-level-specific processing patterns. Forcing continued alignment to a frozen semantic encoder constrains what the diffusion model can learn in later training stages. Used as initialization pressure on the representation structure, REPA is useful; maintained throughout training, it becomes a ceiling on generalization.
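The burn-in schedule amounts to a step-gated auxiliary term. The sketch below writes the alignment loss as one minus the mean per-token cosine similarity between projected diffusion features and teacher features, which differs from a negative-cosine formulation only by a constant. The 0.5 weight, the hard cutoff with no ramp-down, and all function names are assumptions for illustration; only the roughly 200K-step cutoff comes from the article.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def repa_loss(projected_tokens, teacher_tokens):
    """Mean (1 - cosine similarity) over matching token positions."""
    sims = [cosine_similarity(p, t) for p, t in zip(projected_tokens, teacher_tokens)]
    return 1.0 - sum(sims) / len(sims)

def total_loss(flow_loss, projected_tokens, teacher_tokens, step,
               repa_weight=0.5, repa_cutoff_step=200_000):
    """Flow matching loss plus the alignment term, which is dropped after the cutoff."""
    if step >= repa_cutoff_step:
        return flow_loss
    return flow_loss + repa_weight * repa_loss(projected_tokens, teacher_tokens)

# Hypothetical 2-token, 2-dimensional features: one aligned token, one orthogonal.
projected = [[1.0, 0.0], [0.0, 2.0]]
teacher = [[3.0, 0.0], [5.0, 0.0]]
print(total_loss(0.8, projected, teacher, step=10_000))   # alignment active
print(total_loss(0.8, projected, teacher, step=250_000))  # alignment disabled
```

After the cutoff the teacher no longer contributes gradients, so the diffusion features are free to drift from the encoder's representation.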

iREPA, which replaces the MLP projector with a 3×3 convolutional head and applies spatial normalization to remove global overlay, shows inconsistent results across teacher models and is not recommended as a default.

What the Pattern Says

Running through all of these results, the pattern holds consistently: data quality and representation quality at the input layer have larger leverage than training objective refinements. The latent space change is the largest improvement over the baseline (about 6.1 FID points), and caption length is the largest sensitivity overall: shortening captions costs 18.6 points, more than any other change in the study. Optimizer, alignment techniques, and token routing add meaningful but smaller gains.

This should affect how teams allocate engineering effort when training new models. The components that feel like infrastructure, the VAE, the captioning pipeline, the numerical precision configuration, drive more of the outcome than the architectural choices or objective functions that attract more attention in published work. Getting those right before tuning training dynamics is the more efficient path.

The Photoroom team has indicated they plan to release the full training framework and run a public 24-hour speedrun combining the best techniques from the study. A reference implementation with these decisions made explicit, and with the ablation results as context for why each choice was made, would give the community a more honest starting point than the current scattered set of baselines, each making different implicit assumptions about latent quality, caption richness, and training precision.
