DeepSpeed-Ulysses keeps communication volume constant for sequence-parallel attention when sequence length and device count scale together, delivering 2.5x faster training on 4x longer sequences than prior SOTA.
CoRR , volume =
2 Pith papers cite this work. Polarity classification is still indexing.
2
Pith papers citing it
representative citing papers
Hidden layer distillation yields systematic perplexity gains over logit KD in LLM pre-training but does not consistently improve downstream performance.
citing papers explorer
-
DeepSpeed Ulysses: System Optimizations for Enabling Training of Extreme Long Sequence Transformer Models
DeepSpeed-Ulysses keeps communication volume constant for sequence-parallel attention when sequence length and device count scale together, delivering 2.5x faster training on 4x longer sequences than prior SOTA.
-
A Study on Hidden Layer Distillation for Large Language Model Pre-Training
Hidden layer distillation yields systematic perplexity gains over logit KD in LLM pre-training but does not consistently improve downstream performance.