Million-Token Training Without the Memory Wall: How Ulysses Sequence Parallelism Works

Source: huggingface

Training on long sequences has always been an exercise in fighting your hardware. Attention is quadratic—double your context length, quadruple your memory. FlashAttention helped a lot by making memory linear in sequence length, but that only buys you so much runway. Past ~32K tokens on a single GPU, you’re hitting walls regardless.

Ulysses Sequence Parallelism, now integrated into HuggingFace’s Accelerate ecosystem, is a practical solution to this problem. The HuggingFace writeup covers both the theory and the integration story, and it’s worth understanding why the approach works so well.

The Core Idea

Instead of each GPU processing the full sequence, Ulysses splits the sequence across P GPUs. Each GPU handles n/P tokens. But here’s the clever part: attention needs full sequence context to compute correctly. The trick is an all-to-all communication pattern.

  1. Each GPU computes Q, K, V projections for its token slice
  2. An all-to-all redistributes data so each GPU holds all positions but only a subset of attention heads
  3. Each GPU runs full attention on its head subset (using FlashAttention)
  4. Another all-to-all reverses the redistribution
  5. Output projection continues on the original token shards
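The five steps above can be simulated on a single machine. The sketch below emulates the two all-to-alls with numpy splits and concatenations (assumed stand-ins for `torch.distributed.all_to_all` on real hardware) and checks that per-head-shard attention reproduces full attention exactly:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(q, k, v):
    # q, k, v: [heads, seq, head_dim]
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

P, n, h, d = 4, 16, 8, 8  # toy sizes: GPUs, seq len, heads, head dim
rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((h, n, d)) for _ in range(3))

# Reference: full attention on one "device".
ref = attention(q, k, v)

def all_to_all(shards, axis_split, axis_concat):
    # Emulates all-to-all across P ranks: each rank splits its tensor into
    # P pieces and receives one piece from every rank, concatenated.
    pieces = [np.split(s, P, axis=axis_split) for s in shards]
    return [np.concatenate([pieces[src][dst] for src in range(P)],
                           axis=axis_concat) for dst in range(P)]

# Step 1: each rank starts with a token shard [h, n/P, d] of Q, K, V.
q_sh, k_sh, v_sh = (np.split(x, P, axis=1) for x in (q, k, v))

# Step 2: all-to-all regroups token shards into head shards [h/P, n, d],
# so each rank now sees the FULL sequence for a subset of heads.
q_hs = all_to_all(q_sh, axis_split=0, axis_concat=1)
k_hs = all_to_all(k_sh, axis_split=0, axis_concat=1)
v_hs = all_to_all(v_sh, axis_split=0, axis_concat=1)

# Step 3: full attention per rank on its head subset.
out_hs = [attention(qr, kr, vr) for qr, kr, vr in zip(q_hs, k_hs, v_hs)]

# Step 4: the reverse all-to-all restores token shards [h, n/P, d].
out_sh = all_to_all(out_hs, axis_split=1, axis_concat=0)

# Step 5: the output projection would run on these token shards.
out = np.concatenate(out_sh, axis=1)
assert np.allclose(out, ref)  # identical to single-device attention
```

The final assertion is the whole point: because heads are independent, sharding them changes nothing about the result.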

You’re essentially trading sequence completeness for head completeness during the attention step itself. The math works out cleanly because attention heads are independent—you can compute them in any order or on any subset of compute.

Why This Beats Ring Attention

Ring Attention is the other common approach: pass K and V around a ring of GPUs while computing partial attention locally. The communication cost there is O(n·d) per GPU through P-1 sequential transfers.

Ulysses does two all-to-all operations at O(n·d/P) per GPU. That’s P times less communication volume, and all-to-all uses full bisection bandwidth rather than sequential point-to-point transfers. The gap widens as you add GPUs.
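A back-of-envelope comparison makes the gap concrete. The constant factors below are rough assumptions (K and V circulating in the ring; Q/K/V in the first all-to-all plus the output in the second), not measured numbers, but the scaling matches the O(n·d) vs O(n·d/P) argument:

```python
n = 1_000_000   # sequence length (tokens)
d = 4096        # hidden size
P = 8           # GPUs in the SP group
bytes_per = 2   # bf16 activations

# Ring attention: each GPU streams its K and V shards through P-1 steps,
# so it moves roughly the full n*d volume per circulated tensor.
ring_gb = 2 * (P - 1) * (n // P) * d * bytes_per / 1e9

# Ulysses: two all-to-alls move Q/K/V out and the attention output back;
# each rank exchanges about (P-1)/P of its local n*d/P volume per tensor.
ulysses_gb = 4 * (P - 1) / P * (n // P) * d * bytes_per / 1e9

print(f"ring: {ring_gb:.1f} GB/GPU, "
      f"ulysses: {ulysses_gb:.1f} GB/GPU, "
      f"ratio: {ring_gb / ulysses_gb:.1f}x")
```

With these assumed constants the per-GPU ratio works out to P/2, and it keeps growing as the SP group gets larger, on top of the bandwidth advantage of all-to-all over ring hops.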

What the Numbers Look Like

The benchmarks are on Qwen3-4B with DeepSpeed ZeRO-3 on H100s. The baseline is a single GPU using 22.4 GB at 8K tokens. With SP=4 you can push to 96K tokens at 66 GB—about 3.3x less memory per GPU at equivalent sequence length, and 12x longer sequences than the baseline configuration.

Throughput actually improves at longer sequences: attention’s quadratic compute grows faster than the communication overhead, so each GPU spends proportionally more time on useful work. At 64K tokens they report 3.7x the tokens/sec of the 8K baseline.

Integration

The HuggingFace integration makes this relatively approachable:

parallelism_config = ParallelismConfig(
    sp_backend="deepspeed",
    sp_size=4,
    sp_handler=DeepSpeedSequenceParallelConfig(
        sp_seq_length_is_variable=True,
        sp_attn_implementation="flash_attention_2",
    ),
)

A few hard requirements: sequence length must be divisible by sp_size, the attention head count must divide evenly across the GPUs in your SP group (since heads are sharded whole), and Flash Attention 2 or 3 is non-negotiable (standard attention won’t work).
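These constraints are cheap to verify before launching a run. The helper below is hypothetical—it is not part of the Accelerate API—but mirrors the divisibility rules just described:

```python
def check_sp_config(seq_len: int, num_heads: int, sp_size: int) -> None:
    """Hypothetical pre-flight check for Ulysses SP constraints."""
    if seq_len % sp_size != 0:
        # Every rank must hold an equal token shard of n/P tokens.
        raise ValueError(
            f"seq_len {seq_len} is not divisible by sp_size {sp_size}")
    if num_heads % sp_size != 0:
        # Heads are sharded whole across the SP group during attention.
        raise ValueError(
            f"num_heads {num_heads} must be divisible by sp_size {sp_size}")

check_sp_config(seq_len=96_000, num_heads=32, sp_size=4)  # passes silently
```

Padding the batch to a multiple of sp_size is the usual workaround when raw sequence lengths don’t divide evenly.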

Why This Matters

Million-token context isn’t exotic anymore—document analysis, long-form code understanding, and retrieval-heavy workloads all benefit from it. The barrier has been hardware cost. Ulysses brings that training regime into reach for teams with 4-8 H100s rather than requiring a cluster. That’s a meaningful shift in who can train models capable of reasoning over genuinely large contexts.
