What Actually Moves the Needle When Training Text-to-Image Models
Source: huggingface
Ablation studies are the closest thing machine learning has to controlled experiments, and good ones are rare. Most papers report a final configuration without showing which components carried their weight and which were incidental. Photoroom’s PRX Part 2 is one of the better examples I’ve seen: a 1.2B-parameter flow-matching text-to-image model, trained from scratch, with each variable isolated against a fixed baseline. Published in February 2026, it’s worth looking at carefully, because the priority ordering it reveals cuts against the usual assumptions.
The baseline is precise: 100k steps on 1M synthetic MidJourney v6 images at 256x256, global batch size of 256, AdamW with lr=1e-4 and betas=(0.9, 0.95), the Flux VAE, a GemmaT5 text encoder, RoPE positional encoding, and a standard rectified flow matching objective. That starting point produces an FID of 18.20. Every number below is measured against it.
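To keep the moving parts straight, the baseline can be summarized as a config block; the key names here are illustrative, not taken from the actual PRX training code:

```python
# Illustrative summary of the PRX baseline; key names are hypothetical.
BASELINE_CONFIG = {
    "steps": 100_000,
    "dataset": "1M synthetic MidJourney v6 images",
    "resolution": (256, 256),
    "global_batch_size": 256,
    "optimizer": "AdamW",
    "lr": 1e-4,
    "betas": (0.9, 0.95),
    "vae": "Flux VAE",
    "text_encoder": "GemmaT5",
    "pos_encoding": "RoPE",
    "objective": "rectified flow matching",
    "baseline_fid": 18.20,
}
```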
The Latent Space Is the Ceiling
The single largest improvement in the entire study comes from upgrading the VAE. Swapping in FLUX2-AE, which uses 32 latent channels instead of the typical 16, drops FID from 18.20 to 12.07. Paired with an end-to-end approach called REPA-E, which trains the tokenizer jointly rather than using a frozen one, the result is nearly identical at 12.08. Both are approximately 6 FID points better than the baseline, a larger delta than any other intervention tested.
This matters because practitioners tend to treat the VAE as a fixed component: choose the Stable Diffusion VAE or the Flux VAE, freeze it, train the denoiser. The PRX results put a number on what that assumption costs. A better latent space is worth more than any single architectural or loss change, and it comes at a throughput cost of roughly 14-55% depending on the approach, which is a real trade-off but a knowable one.
The broader lesson here is consistent with the DiT paper (Peebles and Xie, 2022): scaling diffusion transformers with compute is productive, but VAE quality acts as a hard ceiling on what any denoiser can learn. You cannot denoise your way past tokenizer loss.
Caption Quality Swamps Architecture
The data ablation is the most striking result. Training on long, descriptive captions yields an FID of 18.20; short captions yield 36.84. That gap is larger than any other single variable in the study, including the latent space upgrade.
The specific difference being tested is between multi-clause captions that describe composition, lighting, materials, and object relationships versus sparse tags or short phrases. This aligns with what Google’s Imagen paper (Saharia et al., 2022) showed about text encoders: richer semantic signal, whether from better encoders or better descriptions, compounds through training in ways that are difficult to recover from later. The model simply cannot learn to condition properly on attributes it never sees described precisely during training.
For anyone sourcing or generating their own training data, this is the highest-leverage intervention available. Before tuning the learning rate or trying a new loss function, recaptioning the dataset with a capable language model is probably worth doing first.
Flow Matching Objectives: Why x-Prediction Outperforms Velocity
The standard rectified flow objective predicts the velocity field: the direction and speed at which the model should push a noisy sample toward the data distribution. The PRX study tests an alternative called x-prediction, also called “Back to Basics” or JiT, where the model directly predicts the clean image instead. The velocity is then derived from the clean image prediction at inference time.
At 256x256 in latent space, x-prediction drops FID from 18.20 to 16.80 with no throughput change. The intuition is that clean images lie on the data manifold, making them structurally constrained targets, while velocity fields are unconstrained vectors that vary arbitrarily with timestep. Predicting a face is a different kind of problem than predicting the direction a noisy face should move, and the former has more natural structure to exploit.
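The relationship between the two parameterizations is a one-line identity. The sketch below assumes the common interpolation x_t = (1 - t) * x0 + t * noise, whose velocity target is v = noise - x0; PRX's exact convention may differ, so treat this as illustrative:

```python
def velocity_from_x_prediction(x_t, x0_hat, t):
    """Recover the flow-matching velocity from a clean-image prediction.

    Assumes x_t = (1 - t) * x0 + t * noise, so v = noise - x0, which
    rearranges to v = (x_t - x0) / t. This convention is an assumption,
    not necessarily PRX's exact parameterization.
    """
    return (x_t - x0_hat) / t

# Sanity check with scalars standing in for latent values:
x0, noise, t = 2.0, -1.0, 0.5
x_t = (1 - t) * x0 + t * noise       # interpolated noisy sample
v_true = noise - x0                   # ground-truth velocity target
assert velocity_from_x_prediction(x_t, x0, t) == v_true
```

Because the identity is exact, the model can train on the structurally easier target (the clean image) while inference still runs as an ordinary velocity-field ODE solve.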
What makes this practically interesting is that x-prediction also enables stable pixel-space training at 1024x1024 resolution. The throughput cost compared to the 256x256 latent baseline is only about 3x, which is surprisingly manageable for VAE-free high-resolution training. This sidesteps the tokenizer ceiling problem entirely at the cost of compute, which is a legitimate trade-off depending on your target application.
The Min-SNR loss weighting from Hang et al. addresses a related but distinct problem: across different timesteps, the effective loss scale varies by orders of magnitude, creating conflicting optimization pressures. Clamping the per-timestep weight to min(SNR(t), 5) / SNR(t) has been shown to produce 3.4x faster convergence and is essentially free to implement. It is worth noting that PRX did not test Min-SNR directly, but it is a well-established baseline that predates this work and is supported natively in Diffusers via the --snr_gamma flag.
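The weighting itself is a two-line function, which is part of why it is such a cheap baseline to try; this sketch follows the formula above, while library implementations may differ in detail:

```python
def min_snr_weight(snr_t, gamma=5.0):
    # Clamp the per-timestep loss weight: min(SNR(t), gamma) / SNR(t).
    # High-SNR (low-noise) timesteps are down-weighted; low-SNR
    # timesteps are left untouched.
    return min(snr_t, gamma) / snr_t

assert min_snr_weight(20.0) == 0.25  # high-SNR step, heavily down-weighted
assert min_snr_weight(2.0) == 1.0    # low-SNR step, weight unchanged
```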
REPA Is a Burn-In, Not a Permanent Objective
Representation Alignment (REPA) adds an auxiliary loss that aligns the denoising transformer’s patch-level features with a frozen pretrained vision encoder, either DINOv2 or DINOv3. The combined loss is:
L = L_FM + λ * L_REPA
where the alignment term maximizes cosine similarity between noisy intermediate features and clean encoder features.
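A minimal sketch of the combined loss, using plain Python lists for patch features; real REPA passes the denoiser features through a learned projection MLP before the similarity, which is omitted here, and the lambda value is illustrative:

```python
import math

def cosine_similarity(a, b):
    # Cosine similarity between two feature vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def repa_loss(denoiser_feats, encoder_feats):
    # Negative mean cosine similarity over patches: minimizing this
    # maximizes alignment with the frozen encoder's features.
    sims = [cosine_similarity(f, g)
            for f, g in zip(denoiser_feats, encoder_feats)]
    return -sum(sims) / len(sims)

def combined_loss(l_fm, denoiser_feats, encoder_feats, lam=0.5):
    # L = L_FM + lambda * L_REPA  (lam here is illustrative)
    return l_fm + lam * repa_loss(denoiser_feats, encoder_feats)
```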
With DINOv3, this drops FID from 18.20 to 14.64 at a 12.7% throughput cost. The original REPA paper reported a 17.5x training speedup for SiT, reaching FID 1.42 on ImageNet, which established it as a serious technique. The PRX study adds an important practical finding: REPA should be used as a burn-in phase of roughly 200k steps, then switched off.
The reason is that the alignment loss, once the model has reached a reasonable feature space, starts limiting the model’s ability to develop its own generative representations. The vision encoder is trained for recognition, not generation; prolonged alignment pulls the denoiser’s internal representations toward features that are useful for classifying images rather than synthesizing them. Treating REPA as a curriculum component rather than a permanent objective is the right framing.
Resolution Changes Which Techniques Are Beneficial
Token routing methods like TREAD and SPRINT selectively bypass computation for a fraction of tokens, either by routing them around attention blocks or through sparse middle layers. At 256x256, both hurt quality: TREAD raises FID from 18.20 to 21.61 while gaining only 4% throughput. That is a bad trade at this resolution.
At 1024x1024, the picture reverses. TREAD drops FID from 17.42 to 14.10 while increasing throughput by 23%. SPRINT reaches 42% throughput improvement at a moderate quality cost. The difference is that at higher resolutions, the quadratic attention cost dominates, and routing tokens effectively gives the model a way to focus computation where it matters.
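The mechanism can be sketched in a few lines. Real routers score tokens to decide which to process; keeping a fixed prefix, as below, is a deliberate simplification, and the function names are hypothetical:

```python
def route_tokens(tokens, keep_ratio, expensive_block):
    """TREAD-style routing sketch: send only a fraction of tokens
    through the expensive block and pass the rest through unchanged.
    Real routers select tokens by a learned or scheduled score; this
    prefix selection is a simplification for illustration."""
    k = max(1, int(len(tokens) * keep_ratio))
    processed = [expensive_block(t) for t in tokens[:k]]
    return processed + tokens[k:]  # bypassed tokens skip the block entirely

# With half the tokens routed, half the expensive work is skipped:
out = route_tokens([1, 2, 3, 4], 0.5, lambda t: t * 10)
assert out == [10, 20, 3, 4]
```

The savings scale with the cost of `expensive_block`, which is why the same trick that is a wash at 256x256 pays off once quadratic attention over a much longer token sequence dominates the step time.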
This result has a general implication: evaluate optimizations at your target inference resolution, not at whatever resolution is convenient for quick experiments. A technique that looks harmful at 256x256 may be one of your best options at 1024x1024, and vice versa.
Optimizer and Precision Pitfalls
Two findings here are worth flagging separately because they affect implementation rather than design.
The Muon optimizer, which applies better-conditioned gradient updates than AdamW through a Newton-Schulz orthogonalization step, reduces FID from 18.20 to 15.55 at no throughput cost. This is a substantial improvement for essentially free, and the community FSDP implementation makes it accessible for distributed training setups beyond the official DDP version.
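The orthogonalization at Muon's core is worth seeing concretely. The sketch below uses the classical cubic Newton-Schulz iteration on plain nested lists; Muon itself uses a tuned quintic variant applied to momentum-averaged gradients, so this is the textbook form, not Muon's exact update:

```python
def matmul(A, B):
    # Naive matrix multiply over nested lists.
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def transpose(A):
    return [list(row) for row in zip(*A)]

def newton_schulz_orthogonalize(G, steps=15):
    """Cubic Newton-Schulz iteration X <- 1.5*X - 0.5*X*X^T*X, which
    drives the singular values of X toward 1, i.e. orthogonalizes it.
    Converges when G's singular values lie in (0, sqrt(3)). Muon uses
    a quintic variant with tuned coefficients; this is the classical
    cubic form for illustration."""
    X = [row[:] for row in G]
    for _ in range(steps):
        XXtX = matmul(matmul(X, transpose(X)), X)
        X = [[1.5 * X[i][j] - 0.5 * XXtX[i][j]
              for j in range(len(X[0]))] for i in range(len(X))]
    return X
```

Replacing the raw gradient with its orthogonalized counterpart equalizes the scale of the update across directions, which is the "better-conditioned" property the FID improvement is attributed to.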
The BF16 weight storage bug is a cautionary finding. Using BF16 for the actual stored weights, rather than only for autocast during the forward pass, silently degrades FID from 18.20 to 21.87. The distinction is subtle: BF16 autocast means the weights are still FP32 in memory, with BF16 used only for matrix multiplications. Storing the weights themselves in BF16 introduces enough precision loss in normalization layers, attention softmax operations, RoPE positional encoding, and the optimizer state to cost nearly 4 FID points. This is the kind of bug that passes silently because training still converges; it just converges to a worse solution.
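The failure mode is easy to reproduce in miniature. BF16 keeps only 8 mantissa bits, so a small optimizer update applied to a BF16-stored weight can round away to nothing; the sketch below emulates BF16 by truncating the low 16 bits of a float32 (truncation rather than round-to-nearest, a simplification of real hardware):

```python
import struct

def to_bf16(x):
    """Emulate bfloat16 storage by truncating the low 16 bits of the
    float32 representation. Real hardware rounds to nearest; truncation
    is a simplification that shows the same loss of precision."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]

# A typical small weight update near 1.0 vanishes entirely when the
# weight itself is stored in BF16 -- the update is silently lost:
w, update = 1.0, 1e-3
assert to_bf16(w + update) == 1.0
```

Keeping an FP32 master copy of the weights and using BF16 only inside autocast avoids exactly this accumulation of lost updates.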
The Priority Ordering
Pulling these results together, a rough priority ordering emerges for practitioners training text-to-image models from scratch:
- Latent space quality, whether through a higher-capacity VAE or end-to-end tokenizer training
- Caption quality and length, which outweighs most other interventions
- Training objective, with x-prediction consistently outperforming velocity prediction
- Representation alignment as a burn-in phase, not a permanent loss
- Optimizer choice, where Muon gives a free improvement over AdamW
- Resolution-appropriate efficiency techniques, evaluated at target resolution
- Implementation correctness, particularly around weight precision
What is notable about this ordering is that the top two items are about infrastructure and data, not model architecture. The architecture decisions that receive the most attention in research papers (transformer depth, attention mechanisms, positional encodings) are further down the list than the choices that look like prerequisites. The PRX study is a useful corrective to the habit of treating those prerequisites as settled while obsessing over the interesting parts.