Five Techniques, One Training Run: How Photoroom Built a $1,500 Text-to-Image Model
Source: huggingface
The number that catches attention first is $1,500. Photoroom trained a text-to-image diffusion model from scratch, on 32 H200 GPUs over 24 hours, for roughly that amount at $2 per GPU-hour. The model produces coherent, prompt-following images at 1024px resolution, and the training code is fully open source.
The price point is a result of five specific technical choices, each of which independently reduces the compute needed to reach a given capability level, or reduces per-step cost, or both. Understanding those choices is more useful than the cost figure itself.
This is the third part of Photoroom’s PRX series. Parts one and two tested individual architecture and training choices in isolation; part three combines everything that showed improvement into a single run under strict budget constraints, producing a recipe others can reproduce.
Pixel Space and the VAE Problem
Every major text-to-image system built after Stable Diffusion, including SDXL, FLUX, and their derivatives, runs the diffusion process in latent space. A VAE compresses images to a lower-dimensional representation, the denoiser learns to operate there, and at inference the VAE decoder converts back to pixels. This keeps sequence lengths manageable, but it introduces the VAE as a quality ceiling: everything the denoiser learns is bounded by what the VAE can reconstruct.
Photoroom’s recipe trains directly in pixel space using an x-prediction formulation, following Li and He’s “Back to Basics: Let Denoising Generative Models Denoise” (2025). Instead of predicting added noise (epsilon prediction) or a score, the model directly predicts the clean image x₀. At patch size 32, a 512px image produces 256 tokens; a 1024px image produces 1024 tokens. Sequence lengths are longer than a latent approach at comparable compression ratios, but the VAE is eliminated entirely, removing one model from the training dependency chain and one potential failure mode from the quality ceiling.
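The token counts follow from simple patch arithmetic; a minimal sketch (the helper name is ours, not from the codebase):

```python
def seq_len(image_px: int, patch: int = 32) -> int:
    """Token count for a square image cut into non-overlapping patch x patch tiles."""
    assert image_px % patch == 0
    return (image_px // patch) ** 2

# 512px -> 256 tokens, 1024px -> 1024 tokens, matching the recipe.
```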
Training runs in two phases: 100,000 steps at 512px with batch size 1024, then 20,000 steps at 1024px with batch size 512. The second phase sharpens detail without disturbing compositions established at the lower resolution.
TREAD: Routing Tokens Around the Middle
The efficiency technique that most directly reduces per-step compute is TREAD (Krause et al., ICCV 2025). At the second transformer block, 50% of the token stream is branched off and reinjected at the penultimate block, bypassing all intermediate layers. The remaining 50% passes through every block normally.
This is not token dropping. The bypassed tokens are preserved throughout and reintroduced before the final layers, so the model retains full spatial information. The effect is that, on each step, roughly half the tokens take a shallow path through the network while the other half takes the full depth, cutting compute in the intermediate blocks. Because the routed subset changes from step to step, the model learns to produce useful predictions whether or not a given token received deep processing.
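The routing mechanics can be sketched with a toy forward pass; the `block` function below is a stand-in for a real transformer block, and the exact block indices are those described above:

```python
import numpy as np

rng = np.random.default_rng(0)

def block(x):
    # Stand-in for a transformer block (the real ones are attention + MLP).
    return x + 0.1 * np.tanh(x)

def tread_forward(tokens, n_blocks=12, route_ratio=0.5):
    """Sketch of TREAD routing: after block 2, a random half of the token
    stream skips the middle blocks and is re-merged, at its original
    positions, before the penultimate block."""
    x = block(block(tokens))                        # blocks 1-2: all tokens
    n = x.shape[0]
    routed = rng.permutation(n)[: int(n * route_ratio)]
    kept = np.setdiff1d(np.arange(n), routed)
    saved = x[routed].copy()                        # bypassed tokens, preserved
    deep = x[kept]
    for _ in range(n_blocks - 4):                   # middle blocks: kept tokens only
        deep = block(deep)
    merged = np.empty_like(x)
    merged[kept] = deep
    merged[routed] = saved                          # reinjection point
    return block(block(merged))                     # final two blocks: all tokens
```

The middle loop is where the savings come from: it runs on half the sequence, while the first and last pairs of blocks always see every token.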
The more consequential side effect is what TREAD enables for classifier-free guidance. Standard CFG requires two forward passes: a conditional pass with the prompt and an unconditional pass without it, with the final output extrapolated between them. Photoroom implements a self-guidance scheme where the guidance contrast comes from the dense prediction (all tokens through all layers) versus the routed prediction (half the tokens taking the shortcut), both on the same conditional input. No separate unconditional branch is needed. The scheme also avoids the instability that vanilla CFG can produce with undertrained token-sparse models, which is a practical concern at this training scale.
REPA: Alignment with Pretrained Vision Features
Diffusion transformers are slow to build up coherent intermediate representations. Without external supervision, the model discovers its own feature hierarchy through denoising objectives alone, which takes many steps before semantic structure emerges in the middle layers.
REPA (Representation Alignment, Yu et al., 2024) addresses this with an auxiliary loss at an intermediate transformer block, the 8th in this recipe, that pulls the model’s internal representations toward features from a pretrained vision model. The teacher is DINOv3 (Siméoni et al., 2025), and the loss weight is 0.5. The alignment loss applies only to non-routed tokens, since bypassed tokens never reach the aligned block, and REPA is disabled entirely during the 1024px phase, by which point the representations should be stable.
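The alignment term can be sketched as a per-token cosine-similarity loss between the denoiser’s block-8 activations (after a learned projection, omitted here) and frozen teacher features; `repa_loss` is our naming, and the exact distance used in the recipe is an assumption:

```python
import numpy as np

def repa_loss(hidden, teacher, weight=0.5, eps=1e-8):
    """Negative-cosine-style alignment: near 0 when student and teacher
    features point the same way per token, up to 2 * weight when opposed."""
    h = hidden / (np.linalg.norm(hidden, axis=-1, keepdims=True) + eps)
    t = teacher / (np.linalg.norm(teacher, axis=-1, keepdims=True) + eps)
    return weight * (1.0 - (h * t).sum(axis=-1)).mean()
```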
The practical effect is that the model enters later training stages with already-structured semantic representations rather than constructing them from scratch. Prior ablation work on REPA has shown meaningful reductions in the step count required to reach a given FID. In a run constrained to 24 hours, compressing the early convergence curve has a direct impact on output quality at the fixed compute budget.
Muon and Structured Gradient Updates
The optimizer is Muon, applied to all 2D weight matrices: the main attention and MLP weights in the transformer. Biases, layer norms, and embeddings use Adam with betas of 0.9 and 0.95 and a learning rate of 1e-4.
Muon emerged from the NanoGPT speedrun community and has since been applied across a range of transformer training experiments. For 2D matrices, it applies Nesterov momentum and then orthogonalizes the update using Newton-Schulz iterations. This makes the effective step size more uniform across parameter directions, regardless of gradient magnitude differences between dimensions. Transformer weight matrices tend to have structured gradient distributions, and the orthogonalization step compensates for directional imbalances that Adam handles less efficiently, producing faster convergence for equivalent wall-clock time.
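The orthogonalization step can be sketched with the quintic Newton-Schulz iteration from the public Muon reference implementation (the polynomial coefficients come from that implementation; the learning rate below is illustrative, since the write-up does not quote Muon’s):

```python
import numpy as np

def newton_schulz(G, steps=5):
    """Approximately orthogonalize G: drive its singular values toward 1
    while keeping its singular vectors, via a quintic iteration."""
    a, b, c = 3.4445, -4.7750, 2.0315   # coefficients from the Muon reference code
    X = G / (np.linalg.norm(G) + 1e-7)  # Frobenius norm bounds the spectral norm
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * (A @ A)) @ X
    return X.T if transposed else X

def muon_update(momentum, grad, beta=0.95, lr=0.02):
    """One Muon step for a 2D weight: Nesterov momentum, then
    orthogonalize the update direction."""
    momentum = beta * momentum + grad
    direction = newton_schulz(grad + beta * momentum)  # Nesterov lookahead
    return momentum, -lr * direction
```

Because the returned direction has near-unit singular values, the step size is roughly uniform across parameter directions, which is the property described above.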
The specific implementation here is muon_fsdp_2, a version adapted for Fully Sharded Data Parallelism across 32 GPUs. Muon’s convergence advantages are consistent with what the speedrun community documented for language models, now extended to image generation.
Perceptual Losses Across All Noise Levels
Two auxiliary perceptual losses run alongside the primary denoising objective: LPIPS at weight 0.1 and a DINOv2-based perceptual loss at weight 0.01, both computed on full images, at every noise level during training.
Perceptual losses are standard in image restoration and GAN-based synthesis, but applying them at high noise levels during diffusion training is less conventional. The PixelGen paper (Ma et al.) established that this produces better pixel-level coherence in pixel-space training, where the model works with raw patches rather than compressed latents. The denoising objective alone does not prevent perceptually incoherent outputs that are numerically valid predictions; the perceptual losses add a direct signal that closes that gap between correctness and visual quality.
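In code, the combined objective is just a weighted sum; `lpips_fn` and `dino_fn` below are placeholders for the real LPIPS network and DINOv2 feature distance, and the function name is ours:

```python
import numpy as np

def training_loss(x0_pred, x0, lpips_fn, dino_fn):
    """x-prediction MSE plus the two perceptual terms at the weights
    quoted in the recipe (0.1 for LPIPS, 0.01 for the DINOv2 loss)."""
    mse = float(((x0_pred - x0) ** 2).mean())
    return mse + 0.1 * lpips_fn(x0_pred, x0) + 0.01 * dino_fn(x0_pred, x0)
```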
Synthetic Data as Knowledge Distillation
The training dataset is roughly 8.7 million images from entirely synthetic sources: 1.7M from a FLUX-generated dataset, 6M from FLUX-Reason-6M, and 1M from a Midjourney v6 dataset recaptioned using Gemini 1.5. No real photographs, no web-scraped content.
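If sampling is uniform over images (an assumption; the write-up does not state the sampling scheme), the effective mix weights follow directly from the dataset sizes:

```python
# Dataset sizes from the recipe, in millions of images.
sizes = {"flux_generated": 1.7, "flux_reason_6m": 6.0, "midjourney_v6": 1.0}
total = sum(sizes.values())                        # 8.7M images
mix = {name: n / total for name, n in sizes.items()}
# FLUX-Reason-6M dominates at ~69% of samples seen during training.
```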
Training on outputs from FLUX and Midjourney is knowledge distillation at the data level. The smaller model learns to approximate a distribution shaped by much larger compute budgets, with high-quality captions describing those outputs. The remaining failure modes, texture glitches and occasional anatomy errors on complex prompts, are attributed by the authors to limited data diversity rather than structural problems with the recipe. That is a testable claim: broader synthetic data should address exactly those artifacts without requiring changes to the architecture or training setup.
What the Recipe Reveals
Each of the five techniques targets a specific bottleneck. REPA and Muon compress the step count needed to reach coherent outputs. TREAD reduces per-step compute and simplifies inference-time guidance. Perceptual losses close the gap between the denoising objective and visual quality. Pixel-space training eliminates the VAE as a dependency and removes an architectural quality ceiling. None of these were invented for this run; all come from 2024 and 2025 papers. The contribution is selecting the ones that survive a hard compute budget and combining them without interference.
The speedrun format enforces a discipline that most research does not: it makes the efficiency question explicit rather than treating it as secondary. When you have 24 hours and a fixed GPU count, every technique must justify its compute cost against alternatives. The choices that survive tend to be the ones that work in real training economics, not just in unconstrained settings with unlimited reruns.
The full write-up on Hugging Face includes the complete training configuration, hyperparameter details, and sample outputs from the final checkpoint.