
Four Engineering Problems Between a Sparse MoE Model and Agentic RL Training

Source: huggingface

Agentic reinforcement learning, where a model learns by acting in an environment across multiple steps, has matured from a research direction into a production engineering problem. The difference between RLHF-style preference training and true agentic RL extends well beyond the reward signal to the entire data pipeline, the rollout infrastructure, and in several cases, the backward pass itself.

LinkedIn published a retrospective on their work enabling agentic RL for GPT-OSS, OpenAI’s open-weight sparse Mixture-of-Experts model, using the verl framework. The post, which first appeared in January 2026, documents four engineering problems that had to be solved before training was numerically stable at all: a PPO on-policy violation specific to MoE architectures, a training-inference mismatch between vLLM and FSDP, a missing backward pass for attention sinks in FlashAttention V3, and memory constraints that required sequence parallelism. Each fix is instructive in isolation. Together, they reveal where the field’s infrastructure debt actually lives.

What GPT-OSS Is and Why It Creates Problems

GPT-OSS is OpenAI’s open-weight sparse MoE language model, released in 20B and 120B variants and built to approach the quality of a much larger dense model while activating only a few billion parameters per token. The architecture uses a learned router to dispatch each token to a subset of expert FFN layers, keeping FLOPs low while the total parameter count stays high. It also incorporates attention sinks, a mechanism that builds on Xiao et al.’s StreamingLLM paper, where learnable scalar parameters per attention head act as virtual tokens in the softmax computation. These parameters let the model route “attend to nothing” probability mass to a stable target rather than spreading it arbitrarily across content tokens, which helps keep attention well-behaved across long sequences.

Both features are well-motivated for inference. Sparse MoE reduces per-token FLOP cost. Attention sinks prevent softmax distribution collapse under extended context windows. Neither was designed with RL training in mind, and verl 0.3.0 was not built expecting either of them.

The PPO On-Policy Violation

PPO’s clipped objective is built on the importance sampling ratio between the new policy and the old policy. When training is on-policy, meaning the policy that generated the rollout is the same one being updated, that ratio must equal exactly 1.0 at the start of each update step. In a dense transformer with deterministic kernels, computing log-probabilities twice over the same inputs produces the same result. In a sparse MoE model, the routing decision is sensitive to floating-point ordering and, in some implementations, includes stochasticity. Two forward passes over identical inputs may dispatch tokens to different experts, producing different log-probabilities for the same generated sequence.
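To make the floating-point sensitivity concrete, the toy sketch below sums the same partial products in two different orders, the way two differently scheduled reduction kernels might. The helper names and the numbers are hypothetical and deliberately chosen to sit on a near-tie; real routers work with learned logits over many experts, but the failure mode is the same in kind.

```python
def accumulate(terms, reverse=False):
    """Sum partial products in a fixed order, as one reduction schedule might."""
    total = 0.0
    for t in (reversed(terms) if reverse else terms):
        total += t
    return total


def top1_expert(per_expert_terms, reverse=False):
    """Return (chosen expert index, router logits) for one token."""
    logits = [accumulate(terms, reverse) for terms in per_expert_terms]
    # max() returns the first maximal index, so exact ties go to the lower expert id.
    chosen = max(range(len(logits)), key=lambda e: logits[e])
    return chosen, logits


# Hypothetical partial products (hidden state x router weight) for two experts.
terms = [
    [0.6, 0.0, 0.0],  # expert 0
    [0.1, 0.2, 0.3],  # expert 1
]

expert_fwd, logits_fwd = top1_expert(terms, reverse=False)
expert_rev, logits_rev = top1_expert(terms, reverse=True)

print("forward order:", expert_fwd, logits_fwd)
print("reverse order:", expert_rev, logits_rev)
```

The inputs are identical in both calls; only the accumulation order changes, and the selected expert changes with it. Once a token lands on a different expert, its log-probability is computed by different weights.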

The fix is concise:

if on_policy:
    old_log_prob = log_prob.detach()

By forcing the old log-probability to equal the newly computed one, detached from the graph, the importance ratio is guaranteed to be 1 without changing any token generation behavior. The subtlety is that this resolves only the on-policy case. Off-policy RL requires genuine importance sampling ratios, and MoE routing instability makes those unreliable too, which makes the off-policy regime harder than it looks on paper. LinkedIn’s solution addresses the immediate problem cleanly and defers the more difficult one.
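A minimal sketch of what the override does to the ratio, in plain Python with lists standing in for tensors. The function name `ppo_ratios` and the log-probability values are illustrative assumptions. Copying the list stands in for `log_prob.detach()`, which in an autograd framework would also keep gradients flowing through the numerator.

```python
import math


def ppo_ratios(log_prob, old_log_prob, on_policy=False):
    """Per-token importance ratios exp(log_prob - old_log_prob)."""
    if on_policy:
        # Stand-in for `old_log_prob = log_prob.detach()`.
        old_log_prob = list(log_prob)
    return [math.exp(new - old) for new, old in zip(log_prob, old_log_prob)]


# Hypothetical log-probs from two forward passes over the same tokens.
# The middle token was routed to a different expert on the second pass.
first_pass = [-1.20, -0.35, -2.10]
second_pass = [-1.20, -0.41, -2.10]

ratios_recomputed = ppo_ratios(second_pass, first_pass)
ratios_on_policy = ppo_ratios(second_pass, first_pass, on_policy=True)

print(ratios_recomputed)  # middle ratio drifts away from 1
print(ratios_on_policy)   # every ratio is exactly 1.0
```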

The Training-Inference Stack Divergence

vLLM and SGLang, the standard inference engines for RL rollouts, optimize aggressively for throughput. FSDP and the training-time FlashAttention configuration prioritize numerical precision. When these two stacks compute log-probabilities over the same sequence, they produce values that are close but not equal. This discrepancy is small in any individual computation and large in aggregate: it pushes the importance sampling ratio away from 1 even after the MoE fix, converting nominally on-policy training into something functionally off-policy.

The solution is sequence-level importance sampling as a rollout correction, which accounts for the distributional shift between inference-computed and training-computed log-probabilities. LinkedIn observed that this stabilized gradient norms on single-step tasks like GSM8K. This problem is not specific to GPT-OSS; it is a structural consequence of using separate inference and training kernels, and it will surface for anyone running RL with vLLM or SGLang rollouts against an FSDP training stack. The MoE router alignment paper from late 2025 analyzes a related class of these instabilities and reaches similar conclusions about the need for explicit correction.
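As a rough illustration of a sequence-level correction, the sketch below computes one importance weight per sequence from the gap between training-side and rollout-side log-probabilities and uses it to rescale that sequence’s loss. The function names, the truncation cap, and the numbers are assumptions for illustration; the retrospective does not spell out these details, and verl’s own implementation may differ.

```python
import math


def sequence_is_weight(train_log_probs, rollout_log_probs, cap=2.0):
    """Weight pi_train(seq) / pi_rollout(seq), truncated at `cap`.

    The cap is an assumed variance-control choice, not taken from the source.
    """
    log_ratio = sum(t - r for t, r in zip(train_log_probs, rollout_log_probs))
    return min(math.exp(log_ratio), cap)


def corrected_sequence_loss(token_losses, train_log_probs, rollout_log_probs, cap=2.0):
    """Mean token loss for one sequence, rescaled by its importance weight."""
    weight = sequence_is_weight(train_log_probs, rollout_log_probs, cap)
    return weight * sum(token_losses) / len(token_losses)


# Hypothetical log-probs for the same three sampled tokens.
train_lp = [-1.00, -0.50, -2.00]    # recomputed by the training stack
rollout_lp = [-1.02, -0.49, -2.01]  # reported by the inference engine

w_small_gap = sequence_is_weight(train_lp, rollout_lp)
w_no_gap = sequence_is_weight(train_lp, train_lp)
w_capped = sequence_is_weight([-0.1, -0.1], [-3.0, -3.0])

print(w_small_gap, w_no_gap, w_capped)
```

When the two stacks agree, the weight is 1 and the update is unchanged. The weight departs from 1 only as far as the stacks disagree over the whole sequence.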

The Attention Sink Backward Pass

This is the most technically novel part of the retrospective. Attention sinks in GPT-OSS are implemented as learnable scalar parameters, one per attention head. In the forward pass, each head’s sink parameter is concatenated to the attention score matrix before softmax, and the resulting probability mass attributed to the sink is discarded from the output. The sink absorbs probability without contributing any value vector to the output:

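import math
import torch

# Q, K, V are assumed to be tensors of shape [B, H, N, d]; sink_param has shape [H].

# Standard attention
scores = Q @ K.transpose(-2, -1) / math.sqrt(d)        # [B, H, N_q, N_k]
probs  = torch.softmax(scores, dim=-1)
output = probs @ V

# Attention with sink: broadcast one learnable scalar per head into an extra key column
sink_col      = sink_param.view(1, -1, 1, 1).expand(*scores.shape[:-1], 1)
combined      = torch.cat([scores, sink_col], dim=-1)  # [B, H, N_q, N_k+1]
probs_full    = torch.softmax(combined, dim=-1)
probs_content = probs_full[..., :-1]                   # drop sink column
output = probs_content @ V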

The forward pass can be borrowed from vLLM’s FlashAttention fork. The backward pass had not been built.

The gradient through the sink parameter requires computing how its presence in the softmax denominator affected the content token attention weights and therefore the loss. Write S_h for head h’s learned sink scalar, P_{i,h} for the probability that query i assigns to the sink, P_ij for the probability it assigns to content key j, and ∂L/∂P for the gradient of the loss with respect to those probabilities. The standard softmax backward rule gives, with k ranging over all N_k + 1 columns, sink included:

∂L/∂S_h = Σ_i P_{i,h} (∂L/∂P_{i,h} - Σ_k P_ik ∂L/∂P_ik)

Because the sink has no corresponding value vector and its probability is dropped from the output, the term ∂L/∂P_{i,h} is zero. That removes both the leading term and the sink’s own entry in the inner sum. The gradient simplifies to, with j now ranging over the content keys only:

∂L/∂S_h = -Σ_i P_{i,h} Σ_j P_ij ∂L/∂P_ij

In concrete terms: the sink parameter’s gradient is the negative of the attention-weighted sum of the gradients with respect to the content probabilities, further weighted by the sink’s own probability. The sink learns in proportion to how much probability mass it absorbed and how that absorption shaped the distribution over content tokens. The sign is negative because redirecting mass to the sink reduces the probability over content tokens; gradient descent will adjust the sink to increase or decrease this redirection depending on whether the resulting output improved the loss.
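The simplified expression can be sanity-checked numerically. The sketch below is a toy single-head setup in plain Python, with scalar values per key and made-up numbers, not the FlashAttention kernel. It takes the inner gradient with respect to the attention probabilities and compares the closed form against a central finite difference on the sink scalar. All function names are hypothetical.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]


def loss(scores, sink, values, upstream):
    """L = sum_i upstream[i] * output[i], where output drops the sink column."""
    total = 0.0
    for row, g in zip(scores, upstream):
        probs = softmax(row + [sink])[:-1]  # content probabilities only
        total += g * sum(p * v for p, v in zip(probs, values))
    return total


def sink_grad(scores, sink, values, upstream):
    """Closed form: -sum_i P_sink(i) * sum_j P_ij * dL/dP_ij."""
    grad = 0.0
    for row, g in zip(scores, upstream):
        full = softmax(row + [sink])
        p_content, p_sink = full[:-1], full[-1]
        d_probs = [g * v for v in values]  # dL/dP_ij for this toy loss
        grad -= p_sink * sum(p * d for p, d in zip(p_content, d_probs))
    return grad


scores = [[0.2, -1.0, 0.5], [1.5, 0.3, -0.7]]  # two queries, three content keys
values = [1.0, -2.0, 0.5]
upstream = [0.7, 1.3]
sink = 0.1

eps = 1e-6
numeric = (loss(scores, sink + eps, values, upstream)
           - loss(scores, sink - eps, values, upstream)) / (2 * eps)
analytic = sink_grad(scores, sink, values, upstream)

print(analytic, numeric)
```

The two numbers agree to well within finite-difference error, which is the same kind of check one would want against any hand-written backward kernel.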

Without this backward pass, the sink parameters receive no gradient during RL training. In that configuration, LinkedIn observed complete training collapse on instruction-following tasks and non-convergence on the ReTool multi-turn agentic coding benchmark. A training loop that silently skips gradient updates for part of the model is broken in a way that may not be immediately apparent from loss curves alone, since the model still trains on everything else.

FlashAttention V2 has no attention sink support at all. FlashAttention V3, as of verl 0.3.0, had a forward pass available from vLLM’s fork but no backward pass. The LinkedIn team implemented the backward pass themselves and indicated plans to contribute it to the main FlashAttention V3 repository following internal review.

Sequence Parallelism for Agentic Context Windows

Agentic RL demands long context windows. A model that interacts with a code interpreter across multiple turns, reading tool outputs and writing intermediate results, accumulates history that single-turn RLHF would never require. With 16 H200 nodes and FSDP alone, even the 20B GPT-OSS variant hit memory limits at 16k response length combined with 8k prompt context.

Sequence parallelism distributes the sequence dimension across GPUs, reducing per-GPU activation memory proportionally. Attention is the complication: each query must attend to all keys and values in the full sequence, which requires all-to-all communication before and after the attention computation. Non-attention layers, including the MoE FFN layers, proceed without additional synchronization on their local sequence chunks.

The implementation has to handle attention sinks correctly under sequence parallelism. Sink parameters are per-head scalars, but their contribution to the softmax denominator must be computed consistently across distributed key-value chunks. The team integrated sequence parallelism with their FlashAttention V3 sink support and validated stable scaling on the 20B model at the context lengths that multi-turn agentic tasks require. Figures from the retrospective show the 120B variant also benefiting from the attention sink fix, suggesting the corrections generalize across model scales.
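One way to see the consistency requirement is a toy single-query sketch, assuming a layout where each rank holds one chunk of keys and values. Per-chunk softmax statistics are merged with the usual log-sum-exp rescaling, and the sink is added to the denominator exactly once after the merge. The function names and numbers are hypothetical, and the retrospective does not describe the team’s implementation at this level of detail.

```python
import math


def attend_full(scores, values, sink):
    """Reference: softmax over all keys plus the sink, sink column dropped."""
    m = max(scores + [sink])
    exps = [math.exp(s - m) for s in scores]
    denom = sum(exps) + math.exp(sink - m)
    return sum(e * v for e, v in zip(exps, values)) / denom


def chunk_stats(scores, values):
    """Local (max, sum of exps, exp-weighted value sum) for one KV chunk."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    return m, sum(exps), sum(e * v for e, v in zip(exps, values))


def attend_chunked(score_chunks, value_chunks, sink):
    """Merge per-chunk statistics; the sink enters the denominator once."""
    stats = [chunk_stats(s, v) for s, v in zip(score_chunks, value_chunks)]
    m = max([local_max for local_max, _, _ in stats] + [sink])
    denom = math.exp(sink - m)  # sink contribution, counted a single time
    numer = 0.0
    for local_max, local_sum, local_weighted in stats:
        scale = math.exp(local_max - m)  # rescale to the shared global max
        denom += scale * local_sum
        numer += scale * local_weighted
    return numer / denom


scores = [0.2, -1.0, 0.5, 1.5, 0.3, -0.7]
values = [1.0, -2.0, 0.5, 0.3, 0.8, -1.1]
sink = 0.1

full = attend_full(scores, values, sink)
chunked = attend_chunked([scores[:3], scores[3:]], [values[:3], values[3:]], sink)

print(full, chunked)
```

If each chunk folded the sink into its own local denominator before merging, the sink would be counted once per rank and the merged output would no longer match the single-device result.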

What the Retrospective Actually Documents

The problems described here are reproducible. Any organization training a sparse MoE model with RL, using vLLM or SGLang for rollouts and FSDP for training, will encounter the same PPO integrity problem and the same stack divergence. Any model incorporating learnable attention sink parameters, a design choice likely to spread given its demonstrated benefits for long-context inference stability, will require a working backward pass through those sinks before RL training is correct.

verl’s HybridEngine co-locates training and inference workers to reduce synchronization overhead and weight-copy costs, which is a genuine improvement over fully separate pipeline architectures. Even with that design, the gap between inference-optimized and training-optimized code paths requires explicit correction. The corrections documented in this retrospective are not workarounds for unusual edge cases; they address the fundamental tension between what inference systems optimize for and what training systems require from the same model. The backward pass for attention sinks is the clearest example: an architectural feature added to improve inference behavior in production required custom gradient derivation before it could participate in learning at all. That gap between shipping a feature and training through it is where the current work on agentic RL infrastructure is happening.
