When sixteen independent teams (at ByteDance, Google, NVIDIA, Meta, Alibaba, Tsinghua University, and several smaller organizations) all converge on the same architecture without coordinating, the architecture is probably not a design choice at all. The Hugging Face survey of async RL training systems makes this point by cataloguing the shared pattern across verl, NeMo-RL, Tunix, open-instruct, Atropos, PipelineRL, PRIME-RL, ROLL, SkyRL, SLIME, AReaL, Awex, and others: separate the inference workers generating rollouts from the training workers running backprop, connect them through an asynchronous weight synchronization channel, and let both pools run continuously at their own pace.
The pattern is well understood in distributed RL. It was not invented in 2024.
The 2018 Precedent
DeepMind published IMPALA (Importance Weighted Actor-Learner Architecture) in 2018. The core design separated actors running the current policy to generate trajectories from a learner computing gradient updates. Actors pushed trajectories to a shared queue without waiting for the learner. The learner pulled from that queue and updated the policy without waiting for any individual actor. Updated weights flowed back to actors on a schedule.
The inevitable consequence was staleness: a trajectory might be generated under a policy version two or three optimizer steps older than the current learner checkpoint. IMPALA addressed this with V-trace, an off-policy correction that reweighted trajectory contributions by a clipped importance sampling ratio. Ape-X extended the pattern to distributed prioritized experience replay. R2D2 extended it to recurrent policies. By 2020, the actor-learner split was the default approach for large-scale distributed RL.
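Concretely, the V-trace target for the value at state \(x_s\) can be written as follows (in the IMPALA paper's notation, with behavior policy \(\mu\) and learner policy \(\pi\); \(\bar\rho\) and \(\bar c\) are the clipping thresholds):

```latex
v_s = V(x_s) + \sum_{t=s}^{s+n-1} \gamma^{\,t-s} \Bigl(\prod_{i=s}^{t-1} c_i\Bigr) \delta_t V,
\qquad
\delta_t V = \rho_t \bigl(r_t + \gamma V(x_{t+1}) - V(x_t)\bigr),
```
```latex
\rho_t = \min\!\left(\bar\rho,\ \frac{\pi(a_t \mid x_t)}{\mu(a_t \mid x_t)}\right),
\qquad
c_i = \lambda \min\!\left(\bar c,\ \frac{\pi(a_i \mid x_i)}{\mu(a_i \mid x_i)}\right).
```

The clipped ratios \(\rho_t\) and \(c_i\) are what bound the variance injected by stale trajectories; the same clipped-importance-ratio idea reappears in the LLM libraries' staleness corrections below.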
The LLM RL community rebuilt this from 2023 onward, largely independently.
Why Rediscovery Was Not Obvious
The structural insight is the same, but the implementation constraints are substantially harder in the LLM case, and the difficulty comes from a specific property of autoregressive generation.
In IMPALA, the policy is a small convolutional network. A single forward pass completes in microseconds. Actors can run on CPUs. The staleness problem is bounded by the ratio of actor throughput to learner throughput, adjustable by scaling actor count. The memory footprint of a single actor is small enough to ignore.
In LLM RL, the policy is a transformer with billions of parameters. Autoregressive generation requires a separate forward pass per output token, with the KV cache for all in-flight sequences held in GPU memory throughout generation. A 32B model generating 512 rollouts of 8,192 tokens on a single H100 takes roughly 56 minutes per batch. Training that same model on the completed rollouts takes under 5 minutes on the same hardware. The generation-to-training time ratio is around 11:1, and it worsens as output length increases; in chain-of-thought reasoning tasks, rollouts regularly reach 32,768 tokens.
The memory-bandwidth constraint creates a fundamental incompatibility when both workloads share a GPU. Generation is memory-bandwidth-bound: arithmetic intensity is low because each forward pass reads the entire KV cache for every token position in every in-flight sequence. Training is compute-bound: the backward pass saturates tensor cores with large matrix multiplications. Inference engines like vLLM and SGLang also make memory-layout assumptions that FSDP's all-gather pattern violates: the all-gather's bursts of memory traffic stall the steady reads that KV cache management depends on.
A synchronous baseline serializing generation, reward scoring, and backprop typically achieves 20 to 40 percent GPU utilization across the full RL loop. The remaining wall clock is spent with one side idle.
The Shared Topology
All sixteen libraries in the survey converge on the same logical layout:
Inference Pool (vLLM / SGLang)
        |
        v
Rollout Buffer (bounded FIFO)
        |
        v
Training Pool (FSDP2 / Megatron-Core)
        |
        +-- weight sync --> Inference Pool
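The topology is a producer-consumer loop with a bounded buffer closing the cycle. A minimal single-process sketch, using Python threads as stand-ins for GPU pools (all class and field names are illustrative, not any library's API):

```python
import queue
import threading

ROLLOUT_BUFFER = queue.Queue(maxsize=8)  # bounded FIFO: also caps staleness depth

class InferencePool(threading.Thread):
    """Generates rollouts continuously under the latest weights it has seen."""
    def __init__(self, n_rollouts):
        super().__init__()
        self.weights_version = 0
        self.n_rollouts = n_rollouts

    def run(self):
        for i in range(self.n_rollouts):
            rollout = {"tokens": [i], "version": self.weights_version}
            ROLLOUT_BUFFER.put(rollout)  # blocks when the buffer is full

    def sync_weights(self, version):
        self.weights_version = version  # real systems: NCCL broadcast / CUDA IPC

class TrainingPool(threading.Thread):
    """Consumes rollouts, runs optimizer steps, pushes weights back."""
    def __init__(self, inference, n_steps):
        super().__init__()
        self.inference = inference
        self.step = 0
        self.n_steps = n_steps

    def run(self):
        while self.step < self.n_steps:
            batch = ROLLOUT_BUFFER.get()  # blocks when the buffer is empty
            self.step += 1                # stand-in for backprop + optimizer step
            self.inference.sync_weights(self.step)

inference = InferencePool(n_rollouts=16)
trainer = TrainingPool(inference, n_steps=16)
inference.start(); trainer.start()
inference.join(); trainer.join()
```

Neither thread ever waits for the other except through the buffer itself, which is the whole point: backpressure replaces the barrier synchronization of the serialized baseline.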
Ray handles orchestration for eight of the sixteen libraries. Its actor model maps directly onto RL component structure, with each component type (inference server, trainer, reward model) as a Ray actor with declared resource requirements. The Plasma object store provides zero-copy tensor transfer between workers, which matters when rollout batches run into tens of gigabytes.
The convergence on this topology is genuine. Where teams diverge is on three sub-problems that turn out to have nontrivial engineering surface.
Weight Synchronization at Scale
Naive NCCL calls on individual parameter tensors take 100 to 500 milliseconds for a 32B model. verl (ByteDance) demonstrated that packing parameters into 1GB contiguous buffers before broadcasting reduces this to around 20 milliseconds, a 25x improvement. NeMo-RL and MILES use CUDA IPC for same-node transfers, reaching sub-millisecond latency. Eight of the sixteen libraries implement LoRA adapter-only synchronization, which shrinks the transfer payload from hundreds of gigabytes to around 50 megabytes and transforms synchronization from a bottleneck into a near-zero overhead operation. For multi-datacenter scenarios, Awex from inclusionAI uses RDMA-backed peer-to-peer transfers and benchmarked synchronizing a trillion-parameter model across 256 H20 GPUs in roughly 16 seconds.
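The 1GB-buffer trick amortizes per-call launch overhead across thousands of parameter tensors. A sketch of the bucketing step (the broadcast itself would then be one NCCL call per bucket; this is an illustrative reconstruction, not verl's actual code):

```python
def pack_into_buckets(params, budget_bytes=1 << 30):
    """Greedily pack parameter tensors into contiguous broadcast buckets.

    params: list of (name, nbytes) pairs in model order.
    Each returned bucket is copied into one contiguous buffer and
    broadcast with a single collective call, instead of one call per tensor.
    """
    buckets, current, current_bytes = [], [], 0
    for name, nbytes in params:
        if current and current_bytes + nbytes > budget_bytes:
            buckets.append(current)
            current, current_bytes = [], 0
        current.append(name)
        current_bytes += nbytes
    if current:
        buckets.append(current)
    return buckets

# Toy model: ten 400MB tensors pack pairwise into five ~800MB buckets.
buckets = pack_into_buckets([(f"layer{i}", 400 * 2**20) for i in range(10)])
```

With thousands of small tensors collapsed into a few dozen buckets, the fixed cost of each collective launch stops dominating the transfer time.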
Staleness Management
This is where the IMPALA lineage is most visible. The LLM libraries use three approaches, sometimes in combination.
Per-sample version rejection (NeMo-RL, TorchForge) tags each rollout with a model version number and hard-drops samples beyond a staleness threshold. The guarantee is clean but wastes the compute that generated the dropped samples, which becomes significant for long-sequence rollouts.
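The rejection rule itself is a one-liner over version tags. A sketch (threshold value and field names are illustrative, not NeMo-RL's or TorchForge's API):

```python
STALENESS_THRESHOLD = 2  # max optimizer steps between generation and training

def filter_fresh(rollouts, current_version, threshold=STALENESS_THRESHOLD):
    """Hard-drop rollouts generated more than `threshold` steps ago.

    Each rollout carries the model version it was generated under;
    anything older than the threshold is discarded, wasting the compute
    that produced it but guaranteeing bounded staleness.
    """
    return [r for r in rollouts if current_version - r["version"] <= threshold]
```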
Depth bounding (SkyRL, Atropos, Tunix) caps queue depth architecturally, so staleness cannot exceed a fixed number of training steps regardless of timing variation. The guarantee is coarser than per-sample tagging but has no compute waste.
Importance sampling correction (verl, MILES, ROLL, open-instruct) reweights gradient contributions by the ratio of the current policy log-probability to the old policy log-probability for each token. ROLL implements six IS variants including CISPO, TIS, and TOPR. Production deployments like PRIME-RL and AReaL combine all three approaches.
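The common core of these IS variants is a per-token ratio of current to behavior policy probability, computed from log-probs and clipped to bound variance. A minimal sketch (the clip value and one-sided clipping are illustrative; CISPO, TIS, and TOPR differ in exactly how and where they clip):

```python
import math

def importance_weights(logp_new, logp_old, clip=2.0):
    """Per-token importance ratios pi_new / pi_old, clipped from above.

    logp_new: log-probs of the sampled tokens under the current policy.
    logp_old: log-probs recorded by the inference server at generation time.
    Clipping bounds the variance that stale samples inject into the gradient,
    exactly as the rho-bar truncation does in V-trace.
    """
    return [min(math.exp(ln - lo), clip) for ln, lo in zip(logp_new, logp_old)]
```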
Partial Rollout Handling
This sub-problem has no equivalent in classic distributed RL because single-step environment interaction does not produce partial trajectories. When a weight sync fires mid-generation, a sequence in flight may be half-generated under policy version N and half under version N+1.
Most libraries handle this by blocking new sequence starts, letting in-flight sequences complete under the old weights, and then syncing. SkyRL and SLIME implement prefix-resume: abort in-flight sequences at the sync boundary, save the partial token IDs and KV cache state, sync new weights, resume generation from the saved prefix under the updated policy. PipelineRL (ServiceNow) takes the most aggressive approach, swapping weights between individual transformer forward passes within a single generation step, so consecutive tokens in a sequence may come from different policy versions.
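Prefix-resume reduces to saving the in-flight state at the sync boundary and continuing generation from that prefix under the new weights. A sketch of the bookkeeping (field and function names are hypothetical; in real systems the saved KV cache for the prefix is reused rather than recomputed):

```python
from dataclasses import dataclass, field

@dataclass
class PartialRollout:
    """State saved when an in-flight sequence is aborted at a sync boundary."""
    prompt: list
    generated: list = field(default_factory=list)
    started_version: int = 0

def resume(partial, new_version, generate_fn, max_new_tokens):
    """Resume generation from the saved prefix under the updated policy.

    generate_fn(prefix, n) stands in for the inference engine. The returned
    version pair records that tokens before the boundary came from the old
    policy and tokens after it from the new one.
    """
    prefix = partial.prompt + partial.generated
    tail = generate_fn(prefix, max_new_tokens - len(partial.generated))
    return prefix + tail, (partial.started_version, new_version)
```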
The Correctness Debt
The DeepSeek-V3 training report surfaced two structural issues that none of the sixteen libraries currently address.
MoE routing inconsistency: when using mixture-of-experts models, the vLLM inference router and the Megatron training router may select different expert subsets for identical inputs due to floating-point rounding and differing kernel implementations. The active-parameter subspace during training is then discontinuous from what generated the tokens, corrupting gradient estimates without producing obvious failures. The fix, called “Keep Routing,” requires recording exact expert routing decisions during sampling and enforcing those paths during the training forward pass. No open-source library implements this yet.
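The failure mode and the fix can both be shown in a few lines. The sketch below is an illustration of the Keep Routing idea, not DeepSeek's implementation: near a routing tie, kernel-level rounding differences flip the top-k selection, and forcing the recorded path removes the divergence.

```python
def route(gate_scores, k, forced=None):
    """Top-k expert selection, optionally forced to a recorded path.

    gate_scores: per-expert router scores for one token.
    forced: expert indices recorded at sampling time. When given, the
    training forward pass uses them verbatim instead of re-running top-k,
    so tiny numerical differences cannot flip the selection.
    """
    if forced is not None:
        return list(forced)
    ranked = sorted(range(len(gate_scores)), key=lambda i: -gate_scores[i])
    return sorted(ranked[:k])

# Same input, slightly different kernel rounding near a tie:
scores_infer = [0.30, 0.29, 0.10]
scores_train = [0.29, 0.30, 0.10]
recorded = route(scores_infer, k=1)  # path recorded at sampling time
```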
Sampling mask mismatch: inference engines apply top-p or top-k masking before sampling, but the training forward pass sees the full vocabulary distribution. This violates the importance-sampling identity because the effective action spaces differ. The fix requires returning the truncation mask from the inference server and applying it during the training forward pass.
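The mechanics of the fix: the inference server records which tokens survived truncation, and the trainer restricts its distribution to that same support before computing log-probabilities. A stdlib-only sketch (function names are illustrative):

```python
def top_p_mask(probs, p):
    """Keep-mask the inference server applied before sampling: the smallest
    set of highest-probability tokens whose cumulative mass reaches p."""
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    mask, cum = [False] * len(probs), 0.0
    for i in order:
        mask[i] = True
        cum += probs[i]
        if cum >= p:
            break
    return mask

def renormalize(probs, mask):
    """Apply the recorded mask to the training-side distribution and
    renormalize, so training and inference share one effective action space."""
    total = sum(q for q, keep in zip(probs, mask) if keep)
    return [q / total if keep else 0.0 for q, keep in zip(probs, mask)]

probs = [0.5, 0.3, 0.15, 0.05]
mask = top_p_mask(probs, p=0.8)     # returned by the inference server
masked = renormalize(probs, mask)   # applied in the training forward pass
```

Without the renormalization step, the trainer's log-probabilities assign mass to tokens the sampler could never have produced, which is exactly the broken importance-sampling identity described above.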
Both require extending the de facto data contract from (token_ids, logprobs, finish_reason) to include routing paths and sampling masks, a breaking change the ecosystem has not standardized. Until it does, every library that applies importance-sampling correction is computing incorrect gradient estimates in the strict mathematical sense. Whether this matters in practice depends on how often MoE routing diverges across implementations and how aggressively top-p truncates the distribution, questions without published benchmarks.
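An extended contract might look like the following (field names are hypothetical; no standard exists yet, which is the point):

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class RolloutRecord:
    # Current de facto contract
    token_ids: List[int]
    logprobs: List[float]
    finish_reason: str
    # Extensions needed for mathematically sound IS correction
    routing_paths: Optional[List[List[int]]] = None   # expert ids per token (MoE)
    sampling_masks: Optional[List[List[bool]]] = None # top-p/top-k keep-mask per token
```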
What Convergence Signals
Convergence like this carries information. When teams with different infrastructure stacks, languages, and organizational constraints all reach the same topology independently, the shape of the solution is being forced by external constraints rather than chosen from a design space.
For video game RL in 2018, the constraint was the throughput mismatch between policy execution and gradient computation. For LLM RL in 2025, the constraint is the same, with additional pressure from autoregressive generation latency, KV cache memory management, and the incompatibility of memory-bandwidth-bound and compute-bound workloads sharing a GPU without degrading both.
The open questions are protocol questions rather than architectural ones. The ecosystem needs to standardize what information the inference server must return for training to be mathematically sound, covering routing paths and sampling masks at minimum. The architecture itself appears settled.