
Sequence Parallelism That Actually Works: Breaking Down Ulysses

Source: huggingface

Training on long sequences has always been the wall you eventually hit. Attention is O(n²) in memory and compute, and once you push past tens of thousands of tokens, a single GPU just runs out of headroom. Data parallelism doesn’t help here — each GPU still has to hold the full sequence for attention computation. So how do you get to a million tokens?

The HuggingFace team just published a detailed walkthrough of Ulysses Sequence Parallelism, now integrated into Accelerate, Transformers Trainer, and TRL’s SFTTrainer. The results are genuinely impressive and the design is clean enough that it’s worth understanding how it works rather than just copying the config.

The Core Idea

Ulysses splits the sequence across GPUs along the sequence dimension. Each GPU computes Q, K, V projections for its local chunk. Then — and this is the key move — an all-to-all collective redistributes the data so each GPU now holds all sequence positions but only a subset of attention heads. Standard FlashAttention runs locally on each GPU. Another all-to-all brings it back to sequence-sharded form.

Two collectives. That’s it.
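The layout swap those two all-to-alls implement can be simulated on a single process. The sketch below uses toy shapes (P ranks, sequence length n, h heads, head dim d — all hypothetical numbers, not from the post) to show how sequence-sharded tensors become head-sharded and back:

```python
import numpy as np

# Toy single-process sketch of the Ulysses layout swap (shapes are illustrative).
P, n, h, d = 4, 8, 4, 2  # "GPUs", sequence length, heads, head dim

# Before attention: each rank holds its sequence chunk for ALL heads.
# Shape per rank: (n // P, h, d).
q = np.arange(n * h * d, dtype=np.float32).reshape(n, h, d)
seq_sharded = [q[r * (n // P):(r + 1) * (n // P)] for r in range(P)]

# First all-to-all: afterwards each rank holds ALL sequence positions
# but only h // P heads -- shape (n, h // P, d) -- so standard
# FlashAttention can run unchanged on the local heads.
head_sharded = [
    np.concatenate([chunk[:, r * (h // P):(r + 1) * (h // P)] for chunk in seq_sharded])
    for r in range(P)
]

# Second all-to-all: invert the swap, restoring (n // P, h, d) per rank.
restored = [
    np.concatenate([chunk[r * (n // P):(r + 1) * (n // P)] for chunk in head_sharded], axis=1)
    for r in range(P)
]
assert all(np.array_equal(a, b) for a, b in zip(restored, seq_sharded))
```

In a real run each list element lives on a different GPU and the regrouping is a single `all_to_all` collective, but the index arithmetic is exactly this.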

Compare that to Ring Attention, which requires P-1 sequential point-to-point transfers. Ulysses does O(n·d/P) communication per GPU vs O(n·d) for ring. At scale that gap matters a lot, and all-to-all uses the full bisection bandwidth in a single step rather than chaining point-to-point hops.
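A quick back-of-envelope comparison makes the gap concrete. The numbers below are illustrative (a 64K sequence with a made-up hidden size), not measurements from the post:

```python
# Per-GPU communication volume, in elements (illustrative numbers).
n, d, P = 64_000, 2_560, 4   # sequence length, hidden size, GPUs -- hypothetical

ulysses = n * d // P         # all-to-all: each GPU exchanges O(n*d/P) elements
ring = n * d                 # ring attention: O(n*d) streamed through each GPU

print(ulysses, ring, ring / ulysses)  # ratio grows linearly with P
```

Double the SP group and Ulysses' per-GPU volume halves, while ring's stays flat.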

What the Numbers Look Like

Benchmarks on 4x H100 80GB with Qwen3-4B:

  • At 8K tokens, SP=4 and DP=4 use roughly the same memory (~22 GB). No free lunch there.
  • At 32K tokens with SP=4, you’re getting 2.1x throughput vs the single-GPU baseline.
  • At 64K tokens: 3.7x throughput, and you’re processing sequences 12x longer than what fits on a single GPU at all.

The loss curves match data-parallel training within logging precision, as long as you scale gradient accumulation steps with sp_size to equalize the token budget. That equivalence check matters — it confirms you’re not just getting faster training at the cost of different dynamics.
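The token-budget bookkeeping behind that equivalence check is simple arithmetic. A hedged sketch (variable names here are illustrative, not the Trainer's internals):

```python
# Equalizing tokens per optimizer step when switching from DP to SP.
seq_len, n_gpus = 32_768, 4

# Pure data parallel: every GPU processes its own full sequence.
dp_tokens = seq_len * n_gpus * 1            # grad_accum = 1

# SP=4: the four GPUs jointly process ONE sequence per step, so scale
# gradient accumulation by sp_size to keep the token budget identical.
sp_size = 4
grad_accum = sp_size
sp_tokens = seq_len * (n_gpus // sp_size) * grad_accum

assert dp_tokens == sp_tokens
```

Without that scaling, the SP run would take an optimizer step on a quarter of the tokens, and the loss curves would diverge for reasons that have nothing to do with the parallelism itself.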

Using It

The integration is refreshingly simple. With SFTTrainer:

from accelerate.utils import ParallelismConfig, DeepSpeedSequenceParallelConfig
from trl import SFTConfig

parallelism_config = ParallelismConfig(
    sp_backend="deepspeed",
    sp_size=4,
    sp_handler=DeepSpeedSequenceParallelConfig(
        sp_attn_implementation="flash_attention_2",
    ),
)

training_args = SFTConfig(
    parallelism_config=parallelism_config,
    max_length=32768,
    pad_to_multiple_of=4,  # must equal sp_size
    packing=True,
)

Two hard constraints to keep in mind: sequence_length % sp_size == 0, and you need at least as many attention heads as GPUs in the SP group. Most modern models have 32+ heads, so the latter rarely bites you.
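Both constraints are cheap to verify before launching a run. A small helper like this (my own illustration, not part of TRL or Accelerate) fails fast instead of crashing mid-training:

```python
# Sanity-check the two Ulysses constraints up front (illustrative helper).
def check_ulysses_constraints(seq_len: int, num_heads: int, sp_size: int) -> None:
    if seq_len % sp_size != 0:
        raise ValueError(f"sequence length {seq_len} is not divisible by sp_size {sp_size}")
    if num_heads < sp_size:
        raise ValueError(f"need at least {sp_size} attention heads, model has {num_heads}")

check_ulysses_constraints(seq_len=32_768, num_heads=32, sp_size=4)  # passes
```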

The 2D Parallelism Angle

The more interesting deployment pattern is combining SP with DeepSpeed ZeRO Stage 3 in a 2D grid. On 8 GPUs you might run sp_size=4 with dp_shard_size=2, getting both sequence length extension and model parameter sharding simultaneously. For fine-tuning large models on long documents, this is probably the configuration worth reaching for first.
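To make the 2D grid concrete, here is one way the 8 ranks could be laid out — the exact rank ordering below is my assumption for illustration, not Accelerate's documented mapping:

```python
# Sketch of 8 ranks on a 2D (dp_shard, sp) grid; rank ordering is assumed.
n_gpus, sp_size, dp_shard_size = 8, 4, 2
assert sp_size * dp_shard_size == n_gpus

grid = [[dp * sp_size + sp for sp in range(sp_size)] for dp in range(dp_shard_size)]
# Each row is one SP group (ranks that jointly process a sequence);
# each column crosses the ZeRO-3 shard groups (ranks that shard parameters).
for dp, row in enumerate(grid):
    print(f"dp_shard {dp}: sp group ranks {row}")
```

Every sequence is split four ways for attention, while parameters and optimizer state are sharded across the two replicas — the two axes compose without interfering.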

Why This Matters Now

A year ago, training on 64K+ token sequences required either expensive custom infrastructure or accepting that your fine-tune would be on truncated data. Ulysses landing in mainstream tooling — with proper loss equivalence validation and integration into the Trainer APIs most people already use — removes a real barrier. Document-length and book-length contexts are no longer a hardware problem if you have a few GPUs to throw at it.

The benchmark that sticks with me: 3.7x throughput at 64K tokens on 4 H100s, with identical training dynamics. That’s not a research result you have to reproduce yourself. It’s in the Accelerate config today.
