
Splitting Attention Across GPUs: How Ulysses Makes Million-Token Training Tractable

Source: Hugging Face

Training a transformer on a million-token context sounds like a hardware problem. It is, but it’s also an algorithm problem — and Ulysses Sequence Parallelism is a clean solution to both at once.

The fundamental issue is that attention scales O(n²) in compute and memory with sequence length. On a single 80GB H100, you hit a wall somewhere around 8K tokens when training a 4B parameter model under ZeRO-3. Anything longer and you’re OOM. This rules out entire categories of useful training: processing full documents, long codebases, extended reasoning chains.
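To see where that wall comes from, here is a back-of-envelope estimate of the attention score matrices alone, as naive (pre-FlashAttention) attention would materialize them. The head count and dtype are illustrative assumptions, not the actual Qwen3-4B configuration:

```python
# Rough memory estimate for naively materialized attention scores,
# illustrating the O(n^2) growth. Head count and bf16 dtype are
# illustrative assumptions, not a specific model's configuration.
def attn_score_bytes(seq_len, num_heads=32, bytes_per_el=2):
    # One (seq_len x seq_len) score matrix per head.
    return num_heads * seq_len * seq_len * bytes_per_el

for n in (8_192, 65_536, 1_048_576):
    gib = attn_score_bytes(n) / 2**30
    print(f"{n:>9} tokens -> {gib:>8,.0f} GiB of scores")
# 8K tokens is ~4 GiB of scores; 64K is ~256 GiB; 1M is ~65,536 GiB.
```

FlashAttention avoids materializing these matrices, but activations, KV tensors, and the quadratic compute remain, which is why sharding the sequence itself becomes necessary.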

The Head Locality Insight

Ulysses works by exploiting a property of multi-head attention that’s easy to overlook: attention heads are fully independent of each other. You don’t need all heads on the same GPU — you just need all sequence positions for the heads you’re computing.

Here’s the trick:

  1. Shard the input sequence across P GPUs (each GPU holds seq_len / P tokens)
  2. Compute Q, K, V projections locally on each GPU’s chunk
  3. Run an all-to-all collective to redistribute: now each GPU holds all sequence positions but only num_heads / P heads
  4. Compute full attention for those heads using FlashAttention
  5. Another all-to-all to reverse the redistribution
  6. Output projection back on the local sequence chunk

Two all-to-all operations per attention layer. That’s the entire overhead. Communication volume is O(n·d/P) per GPU — P times cheaper than Ring Attention, which does P-1 sequential point-to-point transfers across the ring.
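The shape bookkeeping of those six steps can be simulated in a single process with NumPy. The sizes and tensor names here are illustrative, not the DeepSpeed implementation; the point is that the two redistributions are exact inverses:

```python
import numpy as np

# Illustrative sizes, not a real model configuration.
P, seq_len, num_heads, head_dim = 4, 16, 8, 2

# Full Q tensor for reference: (seq_len, num_heads, head_dim)
q = np.arange(seq_len * num_heads * head_dim, dtype=np.float32).reshape(
    seq_len, num_heads, head_dim)

# Steps 1-2: each rank holds a sequence shard of shape (seq_len/P, heads, dim).
shards = np.split(q, P, axis=0)

# Step 3: all-to-all — each rank ends up with the FULL sequence
# for its num_heads/P slice of heads.
after = []
for rank in range(P):
    head_slice = slice(rank * num_heads // P, (rank + 1) * num_heads // P)
    after.append(np.concatenate([s[:, head_slice] for s in shards], axis=0))

# Each rank now sees all positions but only num_heads/P heads,
# so full attention (step 4) can run locally per head.
assert after[0].shape == (seq_len, num_heads // P, head_dim)

# Step 5: the reverse all-to-all restores the sequence-sharded layout.
back = [np.concatenate([a[rank * seq_len // P:(rank + 1) * seq_len // P]
                        for a in after], axis=1)
        for rank in range(P)]
assert all(np.array_equal(b, s) for b, s in zip(back, shards))
```

In a real run each rank only holds its own shard and the redistribution happens via `torch.distributed.all_to_all`; this single-process version just verifies the round trip is lossless.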

The Numbers Are Hard to Argue With

On 4× H100s training Qwen3-4B, the results from the HuggingFace post are striking:

| Setup            | Sequence Length | Throughput   |
|------------------|-----------------|--------------|
| Baseline (1 GPU) | 8K              | 3,633 tok/s  |
| SP=4             | 32K             | 7,733 tok/s  |
| SP=4             | 64K             | 13,396 tok/s |

That 3.7x throughput improvement at 64K tokens isn’t magic — it’s because attention compute grows quadratically but communication grows linearly, so longer sequences increasingly amortize the all-to-all overhead. The longer your sequences, the better Ulysses looks relative to alternatives.
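A back-of-envelope sketch of that amortization (the hidden size is an illustrative assumption): per GPU, attention compute scales like n²·d/P while all-to-all volume scales like n·d/P, so their ratio grows linearly with n.

```python
# Compute-to-communication ratio per GPU, in arbitrary units:
# attention FLOPs ~ n^2 * d / P, all-to-all bytes ~ n * d / P.
# The ratio is ~n, so longer sequences increasingly hide the collective.
d, P = 2560, 4  # hidden size d is an illustrative assumption
for n in (8_192, 32_768, 65_536):
    flops = n * n * d / P
    comm = n * d / P
    print(f"n={n:>6}: compute/comm ratio ~ {flops / comm:,.0f}")
```

Doubling the sequence length doubles the compute available to overlap with each byte communicated, which is exactly the trend the throughput numbers show.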

And critically, the loss curves match. A/B testing against a standard data-parallel baseline showed mean absolute loss differences of 0.005. The trick to getting this right is scaling gradient accumulation steps with SP degree (GAS = SP) so the effective token count per optimizer step stays constant.
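A simplified model of that accounting, holding sequence length and micro-batch size fixed for comparison (a hypothetical helper, not Trainer code): the SP ranks jointly process one sequence per micro-step, so scaling GAS by SP restores the baseline token budget per optimizer step.

```python
# Simplified token accounting for one optimizer step (hypothetical helper).
# GPUs in an SP group jointly process a single sequence, so the number of
# independent sequences per micro-step is n_gpus / sp, not n_gpus.
def tokens_per_step(n_gpus, sp, seq_len, gas, micro_batch=1):
    dp_groups = n_gpus // sp
    return dp_groups * micro_batch * seq_len * gas

baseline = tokens_per_step(n_gpus=4, sp=1, seq_len=8192, gas=1)
with_sp = tokens_per_step(n_gpus=4, sp=4, seq_len=8192, gas=4)
assert baseline == with_sp  # GAS = SP keeps tokens/step constant
```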

What This Looks Like in Practice

The HuggingFace integration is what makes this actually usable rather than a research curiosity. With the transformers Trainer:

parallelism_config = ParallelismConfig(
    sp_backend="deepspeed",
    sp_size=4,
    sp_handler=DeepSpeedSequenceParallelConfig(
        sp_attn_implementation="flash_attention_2",
    ),
)

The Trainer handles sequence sharding, loss aggregation across ranks, and batch accounting automatically. The main constraints to keep in mind: seq_len % sp_size == 0, num_heads >= sp_size, and you’ll want Flash Attention 2 or 3 (not SDPA).
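Those constraints are cheap to check up front. A small sanity-check sketch (a hypothetical helper, not part of the transformers API):

```python
# Sanity-check the SP constraints listed above before launching a run.
# Hypothetical helper, not part of the transformers API.
def check_sp_config(seq_len, num_heads, sp_size):
    if seq_len % sp_size != 0:
        raise ValueError(
            f"seq_len ({seq_len}) must be divisible by sp_size ({sp_size})")
    if num_heads < sp_size:
        raise ValueError(
            f"need num_heads >= sp_size, got {num_heads} < {sp_size}")

check_sp_config(seq_len=65_536, num_heads=32, sp_size=4)  # passes silently
```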

Worth Paying Attention To

Sequence parallelism has existed in research for a while, but accessible implementations with verified loss equivalence and clean HuggingFace integration change who can actually use it. If you’re training models where context length is a real constraint — document understanding, long-form generation, RAG with many passages — Ulysses is now the pragmatic choice if you have NVLink or InfiniBand interconnects.

The one caveat: it currently requires DeepSpeed as the backend. If you’re on FSDP2, Ring Attention is still your path. But for DeepSpeed users, there’s not much reason to leave 3-4x throughput gains on the table.
