
Million-Token Training Without the Memory Wall: A Look at Ulysses Sequence Parallelism

Source: huggingface

Training language models on long contexts has always been the expensive cousin of the AI workload family. Attention is O(n²) in memory — double the sequence length, quadruple the memory. Want to fine-tune on 96K token documents? Good luck fitting that on a single GPU.
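The quadratic blow-up is easy to see with a back-of-envelope calculation. A minimal sketch (mine, not from the post) of the raw attention score matrix size for one head at bf16 precision — FlashAttention avoids materializing this matrix, but the quadratic trend is exactly what drives long-context memory pressure:

```python
# Memory of the naive n x n attention score matrix for a single head,
# batch size 1, bf16 (2 bytes per element). Illustrative arithmetic only.
def score_matrix_gib(seq_len: int, bytes_per_elem: int = 2) -> float:
    return seq_len * seq_len * bytes_per_elem / 2**30

for n in (8_192, 16_384, 96_000):
    print(f"{n:>6} tokens -> {score_matrix_gib(n):7.2f} GiB")

# Doubling the sequence length quadruples the score-matrix memory:
assert score_matrix_gib(16_384) == 4 * score_matrix_gib(8_192)
```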

The HuggingFace Ulysses Sequence Parallelism post is worth reading carefully if you do any long-context training. The core idea comes from DeepSpeed’s Ulysses paper: instead of replicating the full sequence on every GPU, you shard it across devices and use all-to-all communication to handle attention.

How the Sharding Actually Works

Here’s what makes Ulysses clever. Standard data parallelism gives every GPU the full sequence. Ulysses instead:

  1. Splits the sequence across P GPUs along the sequence dimension
  2. Each GPU computes Q/K/V projections for its local chunk
  3. An all-to-all transpose redistributes so each GPU holds all sequence positions for a subset of attention heads
  4. Standard FlashAttention runs locally per GPU
  5. Another all-to-all reverses the redistribution
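The steps above can be sketched on CPU with NumPy. This is a toy simulation of the sequence-shard-to-head-shard all-to-all (my own illustration, with made-up sizes, not DeepSpeed's implementation): P=4 "GPUs", sequence length n=8, H=4 heads, head dim d=2.

```python
import numpy as np

P, n, H, d = 4, 8, 4, 2                        # toy sizes for illustration
rng = np.random.default_rng(0)
full = rng.standard_normal((n, H, d))          # the logical full tensor

# Steps 1-2: each rank holds a contiguous sequence shard of shape [n/P, H, d]
seq_shards = [full[r * n // P:(r + 1) * n // P] for r in range(P)]

# Step 3: all-to-all -- rank r keeps head slice [r*H/P:(r+1)*H/P] from every
# peer's sequence shard, ending up with all positions for its subset of heads.
def all_to_all(shards):
    out = []
    for r in range(P):                          # receiving rank
        chunks = [s[:, r * H // P:(r + 1) * H // P] for s in shards]
        out.append(np.concatenate(chunks, axis=0))   # shape [n, H/P, d]
    return out

head_shards = all_to_all(seq_shards)
assert head_shards[0].shape == (n, H // P, d)
# Step 4 would run standard attention per rank over the full sequence for
# its heads; step 5 is the same all-to-all in reverse.
```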

The communication cost is O(n·d/P) per GPU — P times less than Ring Attention’s sequential P2P transfers. That’s not a small difference at scale.
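To make the comparison concrete, here is a rough per-GPU volume count (my hedged arithmetic, ignoring constants like the separate Q/K/V all-to-alls): Ulysses exchanges about n·d/P elements per all-to-all, while a ring scheme forwards each GPU's local K/V block through P−1 sequential hops.

```python
# Rough per-GPU communication volume in elements; constants are hand-waved.
def ulysses_volume(n: int, d: int, P: int) -> float:
    return n * d / P                       # one all-to-all over n*d, split P ways

def ring_volume(n: int, d: int, P: int) -> int:
    return (P - 1) * (n // P) * d          # local block forwarded P-1 times

n, d, P = 65_536, 4_096, 4
print(ring_volume(n, d, P) / ulysses_volume(n, d, P))  # prints 3.0
```

With these numbers the ring moves roughly P−1 ≈ P times more data per GPU, which matches the post's scaling claim.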

The Numbers

On 4× H100s, the benchmarks are striking:

| Config         | Seq Length | Peak Memory |
|----------------|------------|-------------|
| 1 GPU baseline | 8K         | 22.4 GB     |
| SP=4           | 8K         | 22.8 GB     |
| SP=4           | 64K        | 50.5 GB     |
| SP=4           | 96K        | 66.0 GB     |

Same 4-GPU setup, 12x longer sequences. Throughput also scales well — at 64K tokens you’re getting 3.7x the tokens per second compared to the single-GPU 8K baseline, because you can pack more useful compute into each step.

Critically, training dynamics stay equivalent to data parallelism. The mean absolute loss difference on matched token budgets is 0.0054. You’re not sacrificing convergence quality.

Integration is Refreshingly Simple

What actually makes this practical is the HuggingFace integration. With accelerate>=1.12 and deepspeed>=0.18.1, it’s a config change:

```python
from accelerate.utils import ParallelismConfig, DeepSpeedSequenceParallelConfig

parallelism_config = ParallelismConfig(
    sp_backend="deepspeed",
    sp_size=4,
    sp_handler=DeepSpeedSequenceParallelConfig(
        sp_seq_length_is_variable=True,
        sp_attn_implementation="flash_attention_2",
    ),
)
```

Pass that into Accelerator, TrainingArguments, or SFTConfig and you’re done. The two constraints to remember: sequence length must be divisible by sp_size, and you need at least as many attention heads as GPUs in your SP group.
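Those two constraints are cheap to validate up front. A hypothetical pre-flight check (the function name and shape are mine, not part of the accelerate API):

```python
# Validate the two Ulysses constraints the post lists before launching:
# sequence length divisible by sp_size, and at least sp_size attention heads.
def check_ulysses_config(seq_len: int, num_heads: int, sp_size: int) -> bool:
    if seq_len % sp_size != 0:
        raise ValueError(f"seq_len {seq_len} not divisible by sp_size {sp_size}")
    if num_heads < sp_size:
        raise ValueError(f"need >= {sp_size} attention heads, got {num_heads}")
    return True

check_ulysses_config(seq_len=96_000, num_heads=32, sp_size=4)  # passes
```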

When to Reach for This

If your sequences fit on a single GPU, don’t bother — the overhead isn’t worth it. But if you’re doing book-length document understanding, large codebase analysis, or multi-document RAG fine-tuning, this is the tool. The post also covers combining Ulysses with DeepSpeed ZeRO-3 for large models and 2D parallelism (mixing SP and DP across a GPU pool), which is where production training rigs would actually land.

My takeaway: the hard distributed systems work is done. The interesting question now is what training tasks actually benefit from 96K+ token contexts that weren’t practical before.
