
The Architecture Every RL Training Library Independently Reinvented

Source: huggingface

When sixteen independent teams, working at ByteDance, Google, Meta, NVIDIA, Tsinghua University, and more, all converge on the same core architecture without coordinating with each other, the architecture is probably right for reasons that have nothing to do with taste. It is being forced by the hardware. This is what the Hugging Face survey of async RL training systems actually reveals, and it is more interesting than a library comparison.

The shared pattern is disaggregation: separate the inference workers (the ones generating rollouts) from the training workers (the ones running backprop), let them run on distinct GPU pools, and connect them with an asynchronous weight synchronization channel. Every major library in the RL-for-LLMs space, from verl to NeMo-RL to open-instruct to Tunix, landed here independently.

Why Generation Cannot Share a GPU with Training

Autoregressive text generation and gradient computation have fundamentally different GPU utilization profiles. Generation is memory-bandwidth-bound: each forward pass reads the KV cache for every token in every in-flight sequence, and the compute intensity (FLOPs per byte of memory accessed) is low. Training is compute-bound: the backward pass is dominated by large matrix multiplications with high arithmetic intensity.

These two workloads compete for the same resource in opposite ways. Inference engines maximize throughput using continuous batching, PagedAttention, and speculative decoding, techniques that make assumptions about memory layout that FSDP’s all-gather pattern violates. Training frameworks maximize compute utilization with pipeline parallelism and gradient checkpointing, which stall memory reads in ways that inference KV cache management cannot tolerate.

The throughput gap is large and grows with model size. On a single H100, a 7B model generates roughly 6,000 tokens per second; a 32B model generates around 1,200. For a typical batch of 512 rollouts with 8K output length, that is 11 minutes and 56 minutes respectively, per batch, just for generation. Training throughput on the same hardware can process the same batch in under 5 minutes for the 32B model. The ratio worsens as sequence lengths increase; in reasoning tasks, chain-of-thought outputs regularly reach 32K tokens.
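These figures are easy to sanity-check. A quick back-of-envelope calculation using the throughput numbers quoted above (taking 8K as 8,000 output tokens):

```python
# Back-of-envelope check of the figures above, using the quoted
# single-H100 throughput numbers (tokens per second).
BATCH_ROLLOUTS = 512
OUTPUT_TOKENS = 8_000  # "8K output length"

def minutes_per_batch(tokens_per_second: float) -> float:
    return BATCH_ROLLOUTS * OUTPUT_TOKENS / tokens_per_second / 60

print(f"7B  @ 6000 tok/s: {minutes_per_batch(6000):.1f} min per batch")  # 11.4
print(f"32B @ 1200 tok/s: {minutes_per_batch(1200):.1f} min per batch")  # 56.9
```

The per-batch generation time scales linearly with sequence length, which is why 32K-token chain-of-thought rollouts make the gap so much worse.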

A synchronous baseline, where generation, reward scoring, and backprop all block each other, typically achieves 20 to 40 percent GPU utilization across the full RL loop for large models. Most of the wall clock is spent with either the inference GPUs or the training GPUs idle.

The Disaggregated Architecture

The solution is architecturally simple even if the implementation details are not:

Inference Pool  ->  Rollout Buffer  ->  Training Pool
                                            |
                <---- weight sync ----------+

The inference pool runs a fast inference server (vLLM and SGLang are the dominant choices) that generates under the current policy. The training pool runs the optimizer, usually FSDP2 or Megatron-Core, on completed rollouts drawn from the buffer. After each optimizer step, updated weights flow back to the inference server. The buffer absorbs timing mismatches between the two pools, at the cost of introducing staleness: samples in the buffer were generated under a slightly older policy version.
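The data flow can be sketched in a few lines with a thread per pool and a bounded queue as the rollout buffer. This is a toy illustration of the logical separation, not any library's API; the "policy" here is just an integer version counter:

```python
import queue
import threading

rollout_buffer = queue.Queue(maxsize=8)  # bounded: caps how stale samples get
policy_version = 0
lock = threading.Lock()

def inference_pool(num_rollouts: int) -> None:
    # Generates rollouts under whatever policy version is current.
    for i in range(num_rollouts):
        with lock:
            version = policy_version
        rollout_buffer.put({"tokens": [i], "version": version})  # blocks if full

def training_pool(num_steps: int, batch_size: int) -> None:
    global policy_version
    for _ in range(num_steps):
        batch = [rollout_buffer.get() for _ in range(batch_size)]
        # ... backprop on `batch` would happen here ...
        with lock:
            policy_version += 1  # weight sync: the inference pool sees the bump

gen = threading.Thread(target=inference_pool, args=(16,))
train = threading.Thread(target=training_pool, args=(4, 4))
gen.start(); train.start(); gen.join(); train.join()
print("final policy version:", policy_version)  # 4
```

The bounded `maxsize` is doing real work here: it is exactly the depth-bounding staleness control discussed below, enforced by backpressure rather than by an explicit check.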

Ray is the orchestration substrate of choice for eight of the sixteen libraries surveyed. Its actor model lets you allocate heterogeneous resources per component type, and its Plasma object store allows zero-copy transfer of rollout batches between workers without serialization overhead. Google’s Tunix uses JAX with a ThreadPoolExecutor instead; NousResearch’s Atropos treats each component as a standalone HTTP microservice. The orchestration mechanism matters less than the logical separation of concerns.

Where Teams Diverge

The split into inference and training pools is universal. Teams made genuinely different choices on three axes that affect correctness and throughput in real deployments.

Weight synchronization latency ranges from sub-millisecond to several seconds depending on the transport. Naive NCCL calls on individual parameter tensors can take 100 to 500 milliseconds for a 32B model. Bucketing parameters into 1GB contiguous buffers before the NCCL broadcast reduces this to around 20 milliseconds, a 25x improvement that verl’s implementation demonstrated. NeMo-RL and MILES use CUDA IPC for same-node transfers, reaching sub-millisecond latency. For cross-datacenter scenarios, inclusionAI’s Awex engine and the Mooncake transfer engine implement RDMA-backed P2P transfers, benchmarked at synchronizing a trillion-parameter model across 256 H20 GPUs in roughly 16 seconds.
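The bucketing idea is simple to sketch: pack parameter tensors into contiguous buffers of at most 1 GB each, so the sync issues a handful of large NCCL broadcasts instead of one per tensor. A minimal planner, with illustrative tensor sizes (real implementations copy into a preallocated flat buffer before broadcasting):

```python
BUCKET_BYTES = 1 << 30  # 1 GB per bucket

def plan_buckets(param_sizes_bytes):
    """Greedily group parameter indices into contiguous <=1GB buckets."""
    buckets, current, current_bytes = [], [], 0
    for i, size in enumerate(param_sizes_bytes):
        if current and current_bytes + size > BUCKET_BYTES:
            buckets.append(current)
            current, current_bytes = [], 0
        current.append(i)
        current_bytes += size
    if current:
        buckets.append(current)
    return buckets

# A 32B model in bf16 is ~64 GB of weights. With 300 tensors of ~220 MB
# each (toy numbers), bucketing cuts 300 broadcasts down to 75.
sizes = [220 * 2**20] * 300
print(len(plan_buckets(sizes)))  # 75
```

Fewer, larger calls amortize per-call launch and synchronization overhead, which is where the reported 25x improvement comes from.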

Staleness management splits libraries across three philosophies. Per-sample version rejection (NeMo-RL, TorchForge) tags each rollout with a model version number and hard-drops samples beyond a threshold; this is theoretically clean but wastes the compute that generated the dropped samples. Depth bounding (SkyRL, Atropos, Tunix) limits queue depth architecturally so staleness cannot exceed a fixed number of training steps. Importance sampling correction (verl, MILES, ROLL, open-instruct) reweights gradient contributions by the ratio of current to old policy log-probabilities, preserving throughput at the cost of increased gradient variance. Production systems like PRIME-RL and AReaL combine all three.
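Two of these philosophies compose naturally in a few lines. A toy sketch of per-sample version rejection combined with a clipped importance-sampling weight (the threshold, clip value, and field names are illustrative, not any library's defaults):

```python
import math

MAX_STALENESS = 2   # drop samples more than 2 policy versions old
CLIP = 2.0          # cap IS weights to bound gradient variance

def weight_for(sample, current_version):
    """Return the gradient weight for a rollout sample, or None to drop it."""
    if current_version - sample["version"] > MAX_STALENESS:
        return None  # per-sample version rejection
    # Importance sampling: ratio of current to behavior-policy likelihood.
    ratio = math.exp(sample["logp_current"] - sample["logp_behavior"])
    return min(ratio, CLIP)

fresh = {"version": 5, "logp_current": -1.0, "logp_behavior": -1.2}
stale = {"version": 1, "logp_current": -1.0, "logp_behavior": -1.2}
print(weight_for(fresh, 5))  # exp(0.2), about 1.22
print(weight_for(stale, 5))  # None: 4 versions stale, hard-dropped
```

Clipping is what trades bias for variance here: unclipped ratios are unbiased but can explode when the buffer runs deep.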

Partial rollout handling is the least discussed axis and the most consequential for agentic workloads. When a weight sync fires mid-generation, a sequence may be half-generated under policy version N and half under version N+1. Most libraries simply block new sequence starts, let in-flight sequences complete under the old weights, and then sync. SkyRL and SLIME implement prefix-resume: they abort in-flight sequences at the sync boundary, save the partial token IDs and KV cache, sync the new weights, and continue generation from the saved prefix under the new policy. PipelineRL takes the most aggressive approach, swapping weights between individual forward passes within a single generation step, meaning consecutive tokens in a sequence may come from different policy versions.
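Prefix-resume can be sketched as a generation loop that checks for a pending sync at every step. The toy `policy` callable below stands in for a real engine's forward pass, and real implementations also checkpoint the KV cache alongside the token IDs:

```python
def generate(prompt_ids, max_new, policy, sync_pending):
    """Generate up to max_new tokens, pausing at a sync boundary."""
    tokens = list(prompt_ids)
    for _ in range(max_new):
        if sync_pending():
            # Abort here: return the partial prefix so generation can
            # continue under the new weights once the sync completes.
            return tokens, False  # not finished
        tokens.append(policy(tokens))
    return tokens, True

# Toy policy: next token = current sequence length.
policy = lambda toks: len(toks)

full, done = generate([0], 5, policy, sync_pending=lambda: False)
print(full, done)  # [0, 1, 2, 3, 4, 5] True

# A sync fires after two tokens; generation pauses with a partial prefix.
signals = iter([False, False, True])
partial, done = generate([0], 5, policy, sync_pending=lambda: next(signals))
print(partial, done)  # [0, 1, 2] False
# ...weights sync here; then resume from the saved prefix under the new policy:
resumed, done = generate(partial, 3, policy, sync_pending=lambda: False)
print(resumed, done)  # [0, 1, 2, 3, 4, 5] True
```

Note that the resumed sequence mixes tokens from two policy versions, which is exactly why token-level version tagging matters for the staleness corrections above.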

The Correctness Problems Nobody Has Solved Yet

The DeepSeek-V3 training report surfaced two subtle correctness issues that none of the sixteen libraries currently address.

MoE routing inconsistency: the vLLM inference router and the Megatron training router may select different expert subsets for identical inputs due to floating-point rounding and differing implementations. A fix called “Keep Routing” requires recording the exact expert routing decisions during sampling and enforcing those paths during the training forward pass. Without it, the active-parameter subspace seen during training is discontinuous from what generated the tokens, which corrupts gradient estimates silently.
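"Keep Routing" amounts to making expert selection part of the rollout record. A sketch under a toy top-k router (all names and the rounding example are illustrative, not DeepSeek's or any library's implementation):

```python
def topk_experts(scores, k=2):
    """Indices of the k highest-scoring experts (ties broken by index)."""
    return sorted(range(len(scores)), key=lambda i: (-scores[i], i))[:k]

def sample_step(router_scores, rollout_record):
    # During sampling: record the exact experts chosen for this token.
    experts = topk_experts(router_scores)
    rollout_record.append(experts)
    return experts

def train_step(recorded_experts):
    # During training: ignore the (possibly diverging) training-side
    # router's choice and enforce the recorded path instead.
    return recorded_experts

record = []
# The two routers disagree by a rounding-level epsilon on experts 0 and 1:
inference_scores = [0.30, 0.299999, 0.25, 0.15]
training_scores  = [0.299999, 0.30, 0.25, 0.15]
chosen = sample_step(inference_scores, record)
print(topk_experts(training_scores))  # [1, 0]: training router flips the order
print(train_step(record[0]) == chosen)  # True: recorded path is enforced
```

Without the recorded path, the training pass would run tokens through a different active-parameter subspace than the one that generated them.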

Sampling mask mismatch: inference engines apply top-p or top-k masking to zero out low-probability tokens before sampling. The training forward pass sees the full vocabulary distribution. This violates the importance-sampling identity because the effective action spaces differ. The fix requires returning the truncation mask from the inference server and applying it during the training forward pass.
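The mismatch is easy to see numerically: renormalizing over the truncated support changes every kept token's probability, so the training-side log-prob must be computed under the same mask the sampler used. A toy top-k example over a four-token vocabulary:

```python
import math

def log_softmax(logits, mask=None):
    """Log-probabilities over the (optionally masked) vocabulary."""
    kept = [(i, l) for i, l in enumerate(logits) if mask is None or mask[i]]
    z = math.log(sum(math.exp(l) for _, l in kept))
    return {i: l - z for i, l in kept}

logits = [2.0, 1.0, 0.1, -3.0]
topk_mask = [True, True, False, False]  # top-k=2 mask applied at inference

full = log_softmax(logits)               # what a naive training pass computes
masked = log_softmax(logits, topk_mask)  # what the sampler actually used
print(full[0], masked[0])  # the two log-probs for token 0 differ
```

Any importance-sampling ratio built from `full` instead of `masked` is computed against the wrong denominator, which is the identity violation described above.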

Both are correctness requirements rather than performance optimizations. Both require extending the current de facto data contract from (token_ids, logprobs, finish_reason) to include routing paths and sampling masks, a breaking change the ecosystem has not yet standardized. Every library that does importance-sampling correction is, in the strict sense, computing incorrect gradient estimates until this is resolved.

TRL’s Choices

The survey closes with the design decisions for TRL’s own upcoming async trainer. The framing is instructive: bounded queue with per-token model version tagging from day one, which avoids the architectural debt of retrofitting token-level provenance after the fact; NCCL with bucketed 1GB transfers for weight sync; and two experimental strategies for partial rollout handling, prefix-resume and abort-and-retry, exposed as configurable options. The explicit rejection of double-buffering as a starting point is notable. A one-batch lookahead feels like the minimal viable async approach, but it locks you into coarse-grained staleness tracking without token-level version information, which you will need anyway once you move to longer agentic rollouts.
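Per-token version tagging can be sketched as carrying a parallel version array alongside the token IDs. The field names and helper below are hypothetical, not TRL's actual data structures:

```python
from dataclasses import dataclass, field

@dataclass
class Rollout:
    token_ids: list = field(default_factory=list)
    versions: list = field(default_factory=list)  # policy version per token

    def append(self, token_id: int, policy_version: int) -> None:
        self.token_ids.append(token_id)
        self.versions.append(policy_version)

    def max_staleness(self, current_version: int) -> int:
        # With token-level provenance, staleness is per token, not per
        # sample: a prefix-resumed rollout can mix versions freely.
        return current_version - min(self.versions)

r = Rollout()
for tok in (11, 12):
    r.append(tok, policy_version=3)  # generated before a weight sync
r.append(13, policy_version=4)       # generated after a prefix-resume
print(r.max_staleness(current_version=4))  # 1
```

Coarse batch-level tagging would have to treat this whole rollout as version 3 or version 4; the per-token array is what makes mixed-version sequences tractable later.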

The convergence across sixteen teams is a useful calibration signal. The design space is large, the tradeoffs are real, and there are genuine unsolved problems. But the outermost boundary is constrained by GPU physics, and the libraries that will matter in production are the ones that treated that constraint as a first-class architectural requirement from the beginning.
