Caption Quality, Latent Space, and Silent Precision Bugs: What PRX Ablations Reveal About Training Priorities
Source: huggingface
The usual conversation about text-to-image model quality centers on architecture choices and training objectives: DiT versus MMDiT, flow matching versus DDPM. The implicit assumption is that these macro decisions are where the leverage is. Photoroom’s PRX ablation series, published in February 2026, challenges that framing. The series documents every technique they tested on their 1.2 billion parameter diffusion transformer, including the ones that failed, and the resulting picture of what moves FID scores is different from where most of the discourse lands.
PRX is a text-to-image model built entirely from scratch, trained on relatively modest data, and released under Apache 2.0 with full training code and weights on Hugging Face. Part 1 covered architecture experiments; Part 2, which is what this post focuses on, is a controlled ablation study over a fixed baseline: PRX-1.2B, 100k training steps, 1 million synthetic MidJourneyV6 images at 256x256, batch size 256, AdamW at 1e-4, and the Flux VAE as the latent space. Every experiment isolates a single variable. That discipline is what makes the findings portable to other setups.
The Baseline and the Measurement
The baseline configuration achieves FID 18.2, CMMD 0.41, and DINO-MMD 0.39 at 3.95 batches per second. FID and CMMD are measuring different things: FID captures global distributional similarity, while CMMD (CLIP Maximum Mean Discrepancy) is more sensitive to semantic alignment. DINO-MMD operates in a patch-level feature space and tracks structural fidelity. Having all three available means a technique that improves one metric at the cost of another is visible rather than hidden.
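As a mental model for what the MMD-based metrics compute, here is a minimal MMD² estimate between two embedding sets, using random vectors as stand-ins for CLIP image embeddings. This is a sketch, not the official metric: the real CMMD uses CLIP features and the kernel settings from the CMMD paper, and the bandwidth here is arbitrary.

```python
# Sketch of an MMD^2 estimate between two embedding sets, illustrating what
# CMMD-style metrics measure. Inputs are random stand-ins, not CLIP features.
import numpy as np

def rbf_kernel(a, b, sigma=10.0):
    # Pairwise squared distances via ||a - b||^2 = ||a||^2 + ||b||^2 - 2ab
    sq = (a**2).sum(1)[:, None] + (b**2).sum(1)[None, :] - 2 * a @ b.T
    return np.exp(-sq / (2 * sigma**2))

def mmd2(x, y, sigma=10.0):
    # Mean kernel within each set minus twice the cross term; zero when the
    # two sets are distributionally identical under the kernel.
    return (rbf_kernel(x, x, sigma).mean()
            + rbf_kernel(y, y, sigma).mean()
            - 2 * rbf_kernel(x, y, sigma).mean())

rng = np.random.default_rng(0)
real = rng.normal(0.0, 1.0, (512, 64))        # stand-ins for CLIP embeddings
fake_close = rng.normal(0.1, 1.0, (512, 64))  # slightly shifted distribution
fake_far = rng.normal(1.0, 1.0, (512, 64))    # strongly shifted distribution
assert mmd2(real, fake_close) < mmd2(real, fake_far)
```

The useful property relative to FID is that MMD makes no Gaussian assumption about the embedding distribution; the kernel choice controls what kinds of differences it is sensitive to.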
The experiments cover five categories: representation alignment, training objectives, token routing, data design, and practical training details. The effects range from negligible to transformative, and the ordering of those effects is the main thing worth understanding.
Caption Quality Is the Biggest Lever
The most dramatic single finding in the post is the effect of caption length. Switching from long, descriptive captions to short captions causes FID to jump from 18.2 to 36.84 and CMMD from 0.41 to 0.98. DINO-MMD triples. This is not a small regression; it is a collapse. The throughput cost of long captions is zero.
The mechanism is worth thinking through carefully. Long captions provide more tokens for the cross-attention layers to attend to, which creates a richer supervision signal per image. They also constrain the denoising target more precisely: a model learning from “a red vintage motorcycle parked on a cobblestone street in afternoon light” has less averaging behavior to fall into than one learning from “motorcycle.” Diffusion models produce blurry outputs when the training signal is ambiguous, and caption ambiguity is a direct contributor. The practical recommendation from the experiments is to use long captions throughout pre-training and add a short-caption fine-tuning stage at the end if inference with short prompts is required.
This finding has implications beyond PRX. The community has long debated recaptioning strategies for large-scale datasets, and projects like LAION-5B and various filtered subsets have used different caption sources with varying verbosity. The PRX results suggest that verbosity may matter as much as semantic accuracy, which has usually received more attention in the recaptioning literature.
Latent Space Quality Dominates Objective Tweaks
Improving the latent space contributes approximately 6 FID points, more than any objective or architecture change in the study.
The baseline uses the Flux VAE (8x spatial compression, 16 latent channels). Switching to REPA-E-VAE, a VAE trained with representation alignment to vision foundation models, drops FID from 18.2 to 12.08 and CMMD from 0.41 to 0.26, with a 14% throughput cost. Switching to Flux2-AE, which uses 32 latent channels and 32x spatial compression, achieves similar FID (12.07) and dramatically better CMMD (0.09), but at a 55% throughput penalty.
The lesson is that the model can only learn as much as the latent space lets it see. A richer, better-structured latent space makes the denoising task easier at every point in training. Techniques like REPA (representation alignment, which adds a loss term pulling transformer hidden states toward frozen vision encoder embeddings) contribute meaningful improvements, with DINOv3 as the teacher dropping FID to 14.64 at a 12% throughput cost, but they are secondary to the latent space quality itself.
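A REPA-style alignment term is simple to sketch: project an intermediate transformer hidden state through a small MLP and pull it toward frozen vision-encoder patch features with a cosine loss. The shapes and the projector architecture below are illustrative assumptions, not PRX's exact configuration; in PRX the teacher features would come from a frozen DINOv3.

```python
# Hedged sketch of a REPA-style representation-alignment loss. The projector
# and dimensions are illustrative; the teacher tensor stands in for frozen
# DINOv3 patch embeddings.
import torch
import torch.nn.functional as F

def repa_loss(hidden, teacher_feats, projector):
    # hidden:        (B, N, d_model)   intermediate DiT token states
    # teacher_feats: (B, N, d_teacher) frozen encoder patch embeddings
    proj = projector(hidden)                                # (B, N, d_teacher)
    sim = F.cosine_similarity(proj, teacher_feats, dim=-1)  # (B, N)
    return (1.0 - sim).mean()                               # 0 when aligned

B, N, d_model, d_teacher = 2, 256, 768, 1024
projector = torch.nn.Sequential(
    torch.nn.Linear(d_model, d_model), torch.nn.SiLU(),
    torch.nn.Linear(d_model, d_teacher),
)
hidden = torch.randn(B, N, d_model)
teacher = torch.randn(B, N, d_teacher)        # stand-in for DINOv3 features
loss = repa_loss(hidden, teacher, projector)  # added to the main training loss
```

In training, this term is weighted and summed with the flow-matching objective; only the projector and the transformer receive gradients, never the teacher.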
Most papers on improving diffusion training focus on objectives and attention mechanisms. The VAE is often treated as a fixed component. Photoroom’s results suggest it should be treated as a first-class design decision with more leverage than most objective-level changes.
The Muon Optimizer and the Optimizer Assumption
A less expected finding is the magnitude of the optimizer effect. Replacing AdamW with Muon, an optimizer that applies Newton-Schulz matrix orthogonalization to Nesterov-momentum updates, drops FID from 18.2 to 15.55 and CMMD from 0.41 to 0.36, with negligible throughput impact.
A 2.7-point FID improvement from an optimizer switch, without any other changes, is a meaningful result. The usual assumption in large-scale training is that Adam-family optimizers are essentially interchangeable at the same learning rate; the Muon result suggests the preconditioner quality matters more than that assumption implies. The main practical constraint is that the official PyTorch implementation requires DDP; FSDP-compatible community variants exist but require more setup.
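The core of Muon is compact enough to sketch: a quintic Newton-Schulz iteration that drives the singular values of the update matrix toward 1, so every direction in the weight matrix gets a similarly sized step. The coefficients follow the public Muon implementation; the NumPy wrapper and the surrounding optimizer loop (momentum accumulation, learning-rate scaling) are illustrative simplifications.

```python
# Sketch of the Newton-Schulz orthogonalization at the heart of Muon.
# The quintic coefficients come from the public Muon implementation; the
# rest of the optimizer (momentum, per-layer scaling) is omitted.
import numpy as np

def newton_schulz(G, steps=5, eps=1e-7):
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + eps)   # normalize so the iteration converges
    transpose = G.shape[0] > G.shape[1]
    if transpose:
        X = X.T                         # iterate on the wide orientation
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X   # quintic polynomial step
    return X.T if transpose else X

rng = np.random.default_rng(0)
G = rng.normal(size=(64, 256))          # stand-in for a gradient matrix
O = newton_schulz(G)
s = np.linalg.svd(O, compute_uv=False)  # singular values cluster near 1
```

The polynomial step only needs matmuls, which is why the orthogonalization adds almost no throughput cost on accelerators.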
BF16 Weight Storage: A Silent Regression
The precision bug finding is notable because it illustrates how implementation details can quietly undermine everything else. Storing model weights in BF16, rather than just using BF16 for computation while keeping weights in FP32, causes FID to degrade from 18.2 to 21.87, CMMD from 0.41 to 0.61, and DINO-MMD from 0.39 to 0.57.
The most sensitive components are LayerNorm and RMSNorm, attention softmax logits, rotary position embeddings (RoPE), and the optimizer state. BF16 has 8 exponent bits and 7 mantissa bits; FP32 has 8 and 23. For layers where small numerical differences compound across training steps, BF16 weight storage introduces systematic error that accumulates. The correct configuration is BF16 autocast for forward and backward computation, with FP32 for weight and optimizer state storage. This is what mixed-precision training frameworks implement by default, but misconfiguration is easy, and the resulting regression does not produce obvious training instability, just quietly worse metrics.
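The safe recipe is a few lines in PyTorch. The model below is a toy stand-in, but the dtype structure is the point: parameters and optimizer state stay FP32, and only the forward/backward math runs in BF16 under autocast.

```python
# Sketch of the safe mixed-precision setup: FP32 weights and optimizer state,
# BF16 only for computation via autocast. Model and data are toy stand-ins.
import torch

model = torch.nn.Sequential(torch.nn.Linear(64, 64), torch.nn.GELU(),
                            torch.nn.Linear(64, 64))        # params stay FP32
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

x = torch.randn(8, 64)
with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    loss = model(x).pow(2).mean()   # matmuls run in BF16 inside this context
loss.backward()                     # gradients land in FP32, matching params
opt.step()
opt.zero_grad()

# The buggy variant is `model.to(torch.bfloat16)`, which stores the weights
# themselves in BF16 and produces the quiet regression measured above.
assert all(p.dtype == torch.float32 for p in model.parameters())
```

On GPU the same pattern uses `device_type="cuda"`; the key invariant is that `model.parameters()` never leaves FP32.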
Token Routing: Resolution-Dependent
The results for token routing techniques, specifically TREAD and SPRINT, illustrate a useful general principle: techniques designed for high-resolution training can be counterproductive at low resolution.
TREAD routes a subset of tokens to bypass contiguous transformer layer blocks and re-injects them at a later point, achieving up to 50% sparsification without dropping tokens. At 256x256, this hurts quality (FID 18.2 to 21.61) with minimal throughput benefit. At 1024x1024 with X-Prediction, TREAD improves both quality (FID 17.42 to 14.10) and throughput (1.33 to 1.64 batches per second). SPRINT, which uses a three-stage dense-sparse-dense routing strategy, shows a similar pattern.
The reason is token count. At 256x256 with 16x16 patches, the sequence is only a 16x16 grid, 256 tokens, low enough that attention is not the bottleneck. Token sparsification at this scale just degrades the attention graph. At 1024x1024 with 32x32 patches, the grid is 32x32, 1,024 tokens, and the token count is the dominant cost; routing buys real efficiency that funds quality improvements through larger batches or longer training.
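A TREAD-style forward pass can be sketched as bookkeeping around an otherwise ordinary block stack. The blocks here are toy linear layers, and the routing span and keep ratio are illustrative assumptions, not PRX's settings; only the route-and-reinject structure is the point.

```python
# Illustrative sketch of TREAD-style token routing: a random subset of tokens
# is processed through a contiguous span of blocks while the rest bypass it,
# and the two groups are merged back afterwards. Blocks are toy stand-ins.
import torch

def routed_forward(tokens, blocks, skip_start, skip_end, keep_ratio=0.5):
    B, N, D = tokens.shape
    for i, block in enumerate(blocks):
        if i == skip_start:
            n_keep = int(N * keep_ratio)
            idx = torch.randperm(N)[:n_keep]   # tokens kept inside the span
            bypassed = tokens                  # full set, saved for re-injection
            tokens = tokens[:, idx]            # only the subset is processed
        if i == skip_end:
            merged = bypassed.clone()          # re-inject: processed tokens
            merged[:, idx] = tokens            # overwrite their original slots
            tokens = merged
        tokens = block(tokens)
    return tokens

blocks = torch.nn.ModuleList([torch.nn.Linear(32, 32) for _ in range(6)])
out = routed_forward(torch.randn(2, 256, 32), blocks, skip_start=2, skip_end=4)
```

With `keep_ratio=0.5`, the blocks inside the span see half the tokens, which only pays off when attention over the full sequence is the dominant cost, hence the resolution dependence.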
This resolution-dependence is worth keeping in mind when evaluating papers that propose token routing methods. Results reported at 256x256 or 512x512 may not generalize to higher resolutions in either direction.
X-Prediction and the Tokenizer-Free Option
One of the more technically interesting findings concerns the X-Prediction training objective. Standard rectified flow matching, used in both SD3 and FLUX, trains the model to predict the velocity vector v_θ(z_t, t) = z_1 - z_0, the direction from noise toward data. X-Prediction instead trains the model to predict the clean image x_θ(z_t, t) directly, then converts to velocity as needed via v_θ = (x_θ - z_t) / (1 - t).
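The conversion is one line of bookkeeping. The sketch below checks it against the interpolation convention implied by the formula above (z_0 noise, z_1 data, z_t their linear interpolation); the clamp near t = 1 is an assumption for numerical safety, not necessarily PRX's handling.

```python
# Sketch of the X-Prediction to velocity conversion, under the convention
# z_t = (1 - t) * z0 (noise) + t * z1 (data), v = z1 - z0.
import torch

def velocity_from_x(x_pred, z_t, t, eps=1e-4):
    # v_hat = (x_hat - z_t) / (1 - t); clamp guards the division near t = 1
    return (x_pred - z_t) / (1.0 - t).clamp(min=eps)

# Sanity check: with a perfect x_hat the recovered velocity equals z1 - z0.
z0, z1 = torch.randn(2, 16), torch.randn(2, 16)   # noise, data
t = torch.tensor(0.3)
z_t = (1 - t) * z0 + t * z1
v = velocity_from_x(z1, z_t, t)
assert torch.allclose(v, z1 - z0, atol=1e-5)
```

Because the conversion is exact, a sampler built for velocity prediction can consume an X-Prediction model unchanged; only the training target differs.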
At 256x256 in latent space, results are mixed: FID improves (18.2 to 16.8) but semantic metrics degrade slightly. The more interesting case is at 1024x1024 in pixel space, using 32x32 patches without a VAE. Here, X-Prediction enables stable training at approximately 3x the computational cost of 256x256 latent training, which is a favorable ratio given the resolution jump. Combined with TREAD at this scale, quality improves further. The practical significance is that X-Prediction makes pixel-space training at production resolutions tractable, eliminating the VAE as a dependency and its associated compression artifacts.
What the Speedrun Confirms
Part 3 of the series runs a 24-hour training experiment on 32 H200 GPUs (approximately $1,500 total) combining all the validated techniques: X-Prediction in pixel space, perceptual losses including LPIPS and a DINO-based term, TREAD routing, REPA with DINOv3, the Muon optimizer, and a two-stage training schedule at 512px then 1024px on roughly 8.7 million images. The resulting model shows strong prompt following and consistent aesthetic quality. Remaining issues are undertraining artifacts rather than structural failures, which suggests the technique stack scales predictably with more compute rather than hitting a ceiling.
A Reordered Priority List
Reading the ablation results as a priority ordering is the most direct way to apply them. The biggest gains come from caption quality (no compute cost), latent space quality (REPA-E-VAE at 14% throughput cost), and optimizer choice (Muon, near-zero cost). Representation alignment via REPA adds meaningful improvement at moderate throughput cost. Token routing helps significantly at high resolution but hurts at low resolution. Objective-level changes like contrastive flow matching or X-Prediction have conditional value that depends on resolution and context.
The architecture and training objective discussions that dominate most of the field’s attention are not at the top of this list. That is the central practical lesson of the PRX ablation series, and it is one that is difficult to derive from papers that study single variables in isolation rather than ranking them within a unified framework. Systematic ablation at this scale is expensive and unglamorous work, which is precisely why it is rare, and why Photoroom’s decision to publish everything, including the failures, is worth taking seriously.
The full training code and weights are available on Hugging Face.