
Sixteen Teams, One Architecture: The Case for Disaggregated RL Training

Source: Hugging Face

There is a common pattern in systems design where a sufficiently constrained problem has only one solution. You can approach it from different directions, starting with different codebases, different organizational priorities, and different algorithms, and you still end up with the same architecture. The Hugging Face survey of sixteen open-source reinforcement learning libraries for LLMs documents exactly this phenomenon, and the architecture they all converged on is worth understanding in detail.

The Problem: Generation Is Slow, Training Is Fast

A synchronous RL training loop executes phases in strict sequence: sample prompts, call model.generate(), score rewards, compute advantages, run the forward and backward passes, step the optimizer, sync weights, repeat. The immediate problem with this sequence is that model.generate() is autoregressive. It produces tokens sequentially, one at a time, and the GPU is memory-bandwidth-bound throughout. Training hardware sits idle for the entire duration.
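The phase structure above can be sketched in a few lines. Everything here is a toy stand-in for illustration, not any library's API:

```python
# Minimal sketch of one synchronous RL step; every component is a placeholder.
def generate(prompts):
    # Autoregressive decoding: sequential, memory-bandwidth-bound, slow.
    return [p + " -> completion" for p in prompts]

def score(completions):
    # Reward model or verifier; here just a length-based placeholder.
    return [float(len(c)) for c in completions]

def train_step(batch):
    # Forward/backward pass and optimizer step: fast, batch-parallel.
    return sum(r for _, _, r in batch) / len(batch)

def sync_weights():
    # Blocking weight broadcast back to the sampler.
    pass

prompts = [f"q{i}" for i in range(4)]
completions = generate(prompts)   # training GPUs sit idle during this call
rewards = score(completions)
mean_reward = train_step(list(zip(prompts, completions, rewards)))
sync_weights()
```

The strict sequencing is the point: nothing after `generate()` can start until the slowest completion finishes.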

The severity of this depends on rollout length. For short preference tasks, generation might take a few seconds and the synchronous overhead is manageable. For chain-of-thought reasoning tasks, the picture changes entirely. A GRPO training step at 32K output tokens with G=8 completions over 64 prompts generates roughly 16 million tokens. On a single H100 running a 32B model at approximately 1,200 output tokens per second, that single generation step takes around 3.7 hours. Scaling to 8 inference GPUs brings this to roughly 28 minutes. The training step itself might take a few minutes. The rest of the time, training GPUs wait.
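The arithmetic behind those figures is easy to reproduce (the throughput and cluster numbers are taken directly from the text):

```python
# Token budget for one GRPO generation step at the configuration above.
num_prompts, G, max_tokens = 64, 8, 32_768
tokens = num_prompts * G * max_tokens            # ~16.8M tokens per step

tps_per_gpu = 1_200                              # 32B model on one H100 (from the text)
hours_single = tokens / tps_per_gpu / 3600       # ~3.9 hours on one GPU
minutes_8gpu = tokens / (8 * tps_per_gpu) / 60   # ~29 minutes on eight GPUs
```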

This is not a matter of insufficient hardware. It reflects a fundamental asymmetry: autoregressive generation requires sequential token production with KV cache access patterns that do not parallelize across time, while gradient computation is embarrassingly parallel across the batch dimension and across layers. These two operations have different rooflines, different memory access patterns, and different scaling laws. Running them on the same hardware in alternating phases penalizes both.

GRPO Eliminated the Critic and Made Things Worse

The algorithms driving reasoning model training today, primarily GRPO and REINFORCE++, are critic-free. They eliminate the value network and compute advantages by comparing G completions sampled for the same prompt: each completion’s reward is scored against the group mean. This saves roughly half the training memory and removes the instability inherent in critic learning.
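The group-relative advantage computation reduces to a few lines. The std normalization shown here is one common GRPO variant, not the only formulation:

```python
import statistics

def grpo_advantages(rewards):
    # Advantage = (reward - group mean), normalized by the group std.
    mu = statistics.mean(rewards)
    sd = statistics.pstdev(rewards) or 1.0   # guard against zero-variance groups
    return [(r - mu) / sd for r in rewards]

# G=4 completions for one prompt, scored 0/1 by a verifier.
advs = grpo_advantages([1.0, 0.0, 1.0, 0.0])   # -> [1.0, -1.0, 1.0, -1.0]
```

No value network appears anywhere: the other completions in the group are the baseline.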

But critic-free algorithms require more rollouts per prompt (G=8 to G=32 in common configurations), and each group’s batch is gated by its slowest completion. Chain-of-thought sequences vary from a few hundred tokens to 32K or more. The last sequence in a group blocks reward computation for all others. Straggler effects that were manageable in short-rollout settings become dominant in reasoning training.

Critic-free methods also cause the policy to change faster per update, since there is no value function providing a stable baseline to slow policy drift. Faster drift means that staying within a small off-policy error bound requires more frequent weight synchronization between the inference pool and the training pool, which in a synchronous system adds more blocking overhead.

The Architecture That Emerged

Sixteen different teams, including groups at ByteDance, NVIDIA, Google, Meta, Alibaba, Ant Group, AllenAI, and NousResearch, arrived at the same structural solution independently. The inference pool and training pool run on separate GPU allocations, connected by a rollout buffer.

Inference Pool (vLLM / SGLang)
         |
         v
  Rollout Buffer (FIFO, bounded or unbounded)
         |
         v
Training Pool (FSDP / DeepSpeed / Megatron)
         |
         +-- periodic weight sync --> Inference Pool

Both loops run concurrently at their own pace. The inference pool generates sequences continuously and pushes completed (prompt, completion, logprob) tuples into the buffer. The training pool pulls batches from the buffer whenever a full batch is ready, computes gradients, and periodically transfers updated weights back to the inference pool. Neither side waits for the other.
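A single-process analogue of the two concurrent loops, with a bounded queue standing in for the rollout buffer and threads standing in for the two GPU pools (all names are illustrative):

```python
import queue
import threading

buffer = queue.Queue(maxsize=8)          # bounded rollout buffer

def inference_loop(n_rollouts, policy_version):
    # Producer: push (prompt, completion, logprob, version) as each finishes.
    for i in range(n_rollouts):
        buffer.put((f"prompt-{i}", f"completion-{i}", -1.5, policy_version))

def training_loop(n_steps, batch_size, out):
    # Consumer: pull a full batch whenever one is available.
    for _ in range(n_steps):
        batch = [buffer.get() for _ in range(batch_size)]
        out.append(batch)                # gradient step would go here

steps = []
producer = threading.Thread(target=inference_loop, args=(16, 0))
consumer = threading.Thread(target=training_loop, args=(4, 4, steps))
producer.start(); consumer.start()
producer.join(); consumer.join()
```

The `maxsize` bound does double duty: it applies backpressure when training falls behind, and it caps how stale a buffered rollout can get.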

Eight of the sixteen libraries use Ray for orchestration. This concentration is not coincidental. Ray’s actor model maps directly onto the RL component structure: each role (inference server, trainer, reward model, environment) becomes a Ray actor with declared resource requirements. The shared-memory object store enables zero-copy tensor transfer between co-located actors, which matters when rollout batches can be tens of gigabytes. Long RL runs spanning days to weeks also benefit from Ray’s actor restart policies. The alternatives include native Python concurrency, Redis pub/sub, HTTP microservices, and JAX/XLA cross-mesh execution, each making different tradeoffs on simplicity versus scalability.

Weight Synchronization: Where the Complexity Lives

Transferring updated weights from the training pool back to the inference pool is technically the most demanding part of the architecture. A 32B model in bfloat16 is roughly 64GB. A full NCCL broadcast from the training process group to the inference engine takes 100 to 500 milliseconds on a modern cluster, during which the inference engine is interrupted and in-flight generation requests must either be paused or aborted.

Three optimizations have emerged from this constraint. First, verl packs parameters into configurable uint8 buckets and broadcasts them over a dedicated NCCL communicator separate from the training process group, reducing sync latency to approximately 20ms. Second, LoRA adapter-only synchronization transfers only the adapter parameters (roughly 50MB at rank 32) rather than the full model, making sync sub-millisecond and nearly eliminating the overhead for fine-tuning workloads. Thirteen of the sixteen surveyed libraries support LoRA; eight implement adapter-only sync. Third, PipelineRL from ServiceNow swaps weights between individual transformer forward passes, with interrupts lasting 1 to 10 milliseconds, so running sequences never need to be aborted.
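The payload arithmetic explains why adapter-only sync is so attractive (the 50MB adapter figure is taken from the text; the parameter count is the nominal 32B):

```python
params = 32e9
bf16_bytes = 2
full_sync_gb = params * bf16_bytes / 1e9       # 64 GB per full-model sync
adapter_mb = 50                                 # rank-32 LoRA adapters (from the text)
reduction = full_sync_gb * 1000 / adapter_mb    # ~1300x smaller payload
```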

The interrupt strategy varies widely across libraries. Some drain in-flight requests and sync during the resulting quiet window. Others abort requests and resubmit with prefix cache resumption. Atropos from NousResearch restarts the inference process entirely on each sync, which is operationally simple but expensive. NeMo-RL from NVIDIA uses CUDA IPC for zero-copy weight transfer between co-located actors, avoiding serialization overhead entirely.

Staleness: The Correctness Tradeoff

Because training pulls from a buffer filled by an earlier version of the policy, every async RL system trains on stale rollouts to some degree. The policy that generated a rollout and the policy computing the loss are not identical, which introduces off-policy error. PPO’s clipping objective provides some tolerance for this, but GRPO’s group-relative advantage computation is more sensitive to large policy shifts.

Three approaches have emerged. Hard version rejection tags each sample with the policy version at generation time and discards samples older than a threshold; NeMo-RL’s max_trajectory_age_steps parameter is a typical implementation. Buffer depth bounding limits how many steps behind a rollout can be by constraining the buffer capacity, providing a coarser guarantee without per-sample bookkeeping. Importance sampling reweights stale samples by the ratio of current policy probability to generation-time policy probability, then clips the ratio to control variance; ROLL from Alibaba implements six IS variants including CISPO, TIS, and TOPR.

Most mature libraries combine at least two of these strategies, typically depth bounding plus optional clipped IS with a cap between 1.5 and 5.0 to prevent high-variance gradient estimates from destabilizing training.
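The two per-sample mechanisms compose naturally. A sketch, where the `max_trajectory_age_steps` semantics follow NeMo-RL's description above and everything else is illustrative:

```python
import math

def fresh_enough(sample_version, current_version, max_trajectory_age_steps=4):
    # Hard version rejection: discard rollouts older than the threshold.
    return current_version - sample_version <= max_trajectory_age_steps

def clipped_is_weight(logp_current, logp_behavior, cap=2.0):
    # Truncated importance sampling: ratio of current-policy probability to
    # generation-time probability, clipped to bound gradient variance.
    return min(math.exp(logp_current - logp_behavior), cap)

keep = fresh_enough(sample_version=3, current_version=5)   # within the age limit
drop = fresh_enough(sample_version=0, current_version=6)   # too stale, rejected
w_onpolicy = clipped_is_weight(-2.0, -2.0)                 # ratio 1.0, no correction
w_clipped = clipped_is_weight(-1.0, -4.0)                  # large ratio, hits the cap
```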

The Unsolved Problems

The survey identifies several challenges that none of the sixteen libraries fully address, and they point toward where the next round of divergence will occur before the ecosystem converges again.

The most technically interesting is training-inference consistency for Mixture-of-Experts models. MoE gating is floating-point arithmetic, and vLLM and Megatron can route tokens to different experts for identical inputs due to FP32 accumulation differences between kernels. This means the model activations seen during training differ from those seen during generation, breaking the importance sampling assumption and injecting noise into advantage estimates. DeepSeek-V3.2 addresses this through “Keep Routing,” which records expert routing decisions during inference and enforces them during the training forward pass. No current open-source library implements this, and doing so requires the inference API to return routing metadata alongside the standard (token_ids, logprobs, finish_reason) response.
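A toy illustration of the record-and-replay idea behind Keep Routing. This is not DeepSeek's implementation; top-k selection over raw scores stands in for the real gating network:

```python
def route(gate_scores, k=2, forced=None):
    # Pick the top-k experts, or replay a recorded routing decision.
    if forced is not None:
        return forced
    return sorted(range(len(gate_scores)), key=lambda i: -gate_scores[i])[:k]

# Inference: record the routing alongside token_ids/logprobs.
recorded = route([0.10, 0.90, 0.40, 0.60])       # experts [1, 3]

# Training: FP32 accumulation differences perturb the gate scores
# enough to flip the ranking...
free = route([0.10, 0.90, 0.60, 0.40])           # experts [1, 2] -- a mismatch
# ...unless the recorded decision is enforced.
replayed = route([0.10, 0.90, 0.60, 0.40], forced=recorded)
```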

Process reward model pipelines present a related bottleneck. A PRM forward pass over a 32K-token reasoning trace can cost as much as generating the trace itself. At G=8 completions per prompt, reward scoring may become the next throughput ceiling after generation. PRIME-RL has a pipelined orchestrator-trainer architecture that addresses this; most other libraries treat reward computation as a synchronous blocking call.

Multi-agent training sits outside the current abstraction entirely. All sixteen libraries model the training unit as a single (prompt, completion, reward) triple. Episodes involving tool calls, multi-turn dialogue, and inter-agent messages require advantage computation over directed graphs of turns, with version tracking at the episode level rather than the sample level. MiniMax’s Forge framework, used for MiniMax-M2.5 with context lengths up to 200K tokens and over 100,000 distinct agent scaffolds, demonstrates what production-scale agentic RL requires, but none of the surveyed open-source libraries have implemented episode-level abstractions.

What Convergence Means

When many independent teams solve the same problem and reach the same answer, it usually means the problem structure constrained the solution space more than the teams’ individual approaches did. In this case, the constraints are GPU memory bandwidth limits, the sequential nature of autoregressive generation, algorithmic requirements for reasonably fresh rollouts, and the mathematical structure of off-policy RL corrections. The disaggregated architecture is not a clever hack or a premature optimization; it is the minimum viable response to those constraints given the direction reasoning model training has taken.

The Hugging Face survey makes it plausible that the same convergence will happen in the unsolved domains as well. MoE training-inference consistency, PRM async pipelines, and episode-level abstractions for multi-agent RL are today’s open problems. Once enough teams independently hit the same walls, the ecosystem will likely produce another round of convergent solutions, and those solutions will probably look similar across organizations for the same reason they do today.
