Pith · machine review for the scientific record

arxiv: 2309.14509 · v2 · submitted 2023-09-25 · 💻 cs.LG · cs.CL · cs.DC

Recognition: 2 Lean theorem links

DeepSpeed Ulysses: System Optimizations for Enabling Training of Extreme Long Sequence Transformer Models

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 01:02 UTC · model grok-4.3

classification: 💻 cs.LG · cs.CL · cs.DC
keywords: sequence parallelism · long sequence training · Transformer optimization · all-to-all communication · LLM scaling · DeepSpeed · attention computation · system optimizations

The pith

DeepSpeed-Ulysses trains Transformer models with 4x longer sequences 2.5x faster than prior methods by keeping communication volume constant through sequence partitioning and all-to-all attention.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces DeepSpeed-Ulysses to overcome memory and communication limits in training large language models on extremely long input sequences. It partitions the input along the sequence dimension across devices and uses an efficient all-to-all collective to re-partition queries, keys, and values by attention head, so each device computes full-sequence attention for its share of heads. Theoretical analysis shows communication volume stays constant when sequence length and device count grow together, unlike earlier sequence-parallel approaches whose overhead rises with length. Experiments demonstrate the method runs 2.5 times faster while supporting four times longer sequences than the previous state-of-the-art baseline. This directly addresses the practical need for long-context models in applications such as document understanding and extended dialogue.

Core claim

DeepSpeed-Ulysses partitions the input sequence across devices and performs attention via an efficient all-to-all collective communication pattern. This design yields constant communication volume when sequence length scales proportionally with the number of devices, whereas prior sequence-parallel methods incur growing communication cost. The approach is portable across Transformer architectures and delivers 2.5 times faster training at four times the sequence length relative to the strongest existing baseline.
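To make the constant-volume claim concrete, here is a back-of-envelope count in our own notation (sequence length S, hidden size h, element size b bytes, N devices); a sketch, not the paper's exact derivation. Each device holds an (S/N) x h activation shard, and an all-to-all over N devices sends (N-1)/N of it:

\[
V_{\text{dev}} \;=\; \frac{N-1}{N}\cdot\frac{S h b}{N} \;\approx\; \frac{S h b}{N},
\qquad
S = kN \;\Longrightarrow\; V_{\text{dev}} \approx k h b \quad\text{(constant)}.
\]

By contrast, an all-gather-style sequence-parallel scheme delivers the full S x h tensor to every device, so \(V_{\text{dev}} \approx S h b = k N h b\), which grows linearly with N.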

What carries the argument

Sequence-dimension partitioning paired with all-to-all collective communication for attention, which holds communication volume fixed as both sequence length and device count increase together.
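The layout swap behind that sentence can be simulated in a few lines. The sketch below is a single-process, illustrative rendering of the Ulysses pattern with helper names of our own invention; it is not the DeepSpeed API, and a real implementation would use torch.distributed all-to-all collectives across ranks.

import torch

# Illustrative sizes: N "devices", sequence length S, H heads, head dim d.
N, S, H, d = 4, 16, 8, 32
assert S % N == 0 and H % N == 0

def all_to_all_seq_to_head(x_shards):
    """Simulate the pre-attention all-to-all.

    Input:  list of N tensors, each [S/N, H, d]  (sequence-partitioned)
    Output: list of N tensors, each [S, H/N, d]  (head-partitioned)
    Each device ends up with the full sequence for H/N of the heads.
    """
    out = []
    for dst in range(N):
        heads = slice(dst * H // N, (dst + 1) * H // N)
        out.append(torch.cat([s[:, heads, :] for s in x_shards], dim=0))
    return out

def all_to_all_head_to_seq(y_shards):
    """Simulate the post-attention all-to-all (inverse layout swap)."""
    out = []
    for dst in range(N):
        seq = slice(dst * S // N, (dst + 1) * S // N)
        out.append(torch.cat([s[seq, :, :] for s in y_shards], dim=1))
    return out

# Sequence-partitioned Q, K, V shards, one per simulated device.
q = [torch.randn(S // N, H, d) for _ in range(N)]
k = [torch.randn(S // N, H, d) for _ in range(N)]
v = [torch.randn(S // N, H, d) for _ in range(N)]

# After the first all-to-all, each device holds full-sequence Q, K, V
# for its local head group, so ordinary dense attention runs unchanged.
qh, kh, vh = (all_to_all_seq_to_head(t) for t in (q, k, v))
ctx = []
for qi, ki, vi in zip(qh, kh, vh):
    scores = torch.einsum("shd,thd->hst", qi, ki) / d ** 0.5
    ctx.append(torch.einsum("hst,thd->shd", scores.softmax(dim=-1), vi))

# The second all-to-all restores sequence partitioning for the MLP block.
out = all_to_all_head_to_seq(ctx)
assert out[0].shape == (S // N, H, d)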

If this is right

  • Sequence length can be increased by adding devices without a proportional rise in communication overhead.
  • Long-context applications become feasible on current hardware without model-quality trade-offs.
  • The method can be combined with existing data, tensor, and pipeline parallelism for still larger models.
  • Training throughput improves for any Transformer that uses self-attention on long inputs.
  • Memory per device for activations drops because only local sequence segments are stored during the forward pass (see the rough arithmetic after this list).
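The last bullet is easy to quantify; the sizes below are illustrative assumptions of ours, not figures from the paper:

# Rough per-device footprint of one sequence x hidden activation tensor in
# fp16, with and without Ulysses-style sequence partitioning.
S, h, N, bytes_fp16 = 1_048_576, 4096, 64, 2  # 1M tokens, 4K hidden, 64 GPUs
full = S * h * bytes_fp16      # unpartitioned: one 8 GiB tensor per device
local = full // N              # sequence-partitioned: ~128 MiB per device
print(f"{full / 2**30:.1f} GiB -> {local / 2**20:.0f} MiB per device")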

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Models could process entire books or hours-long conversations in a single forward pass once hardware supports the proportional device scaling.
  • Energy cost per token may fall at scale because communication no longer grows with length.
  • The same partitioning pattern could extend to other attention variants such as sparse or linear attention without redesigning the collective.
  • Integration with existing DeepSpeed features would allow users to enable long-sequence training with a single configuration flag.

Load-bearing premise

All-to-all communication remains efficient and does not become the dominant bottleneck at extreme scale, while splitting the sequence across devices preserves final model quality without extra adjustments.

What would settle it

A scaling experiment on thousands of GPUs showing all-to-all time overtaking compute time, or a quality comparison where partitioned training produces measurably lower accuracy or perplexity than an equivalent non-partitioned run at the same sequence length.

Original abstract

Computation in a typical Transformer-based large language model (LLM) can be characterized by batch size, hidden dimension, number of layers, and sequence length. Until now, system works for accelerating LLM training have focused on the first three dimensions: data parallelism for batch size, tensor parallelism for hidden size and pipeline parallelism for model depth or layers. These widely studied forms of parallelism are not targeted or optimized for long sequence Transformer models. Given practical application needs for long sequence LLM, renewed attentions are being drawn to sequence parallelism. However, existing works in sequence parallelism are constrained by memory-communication inefficiency, limiting their scalability to long sequence large models. In this work, we introduce DeepSpeed-Ulysses, a novel, portable and effective methodology for enabling highly efficient and scalable LLM training with extremely long sequence length. DeepSpeed-Ulysses at its core partitions input data along the sequence dimension and employs an efficient all-to-all collective communication for attention computation. Theoretical communication analysis shows that whereas other methods incur communication overhead as sequence length increases, DeepSpeed-Ulysses maintains constant communication volume when sequence length and compute devices are increased proportionally. Furthermore, experimental evaluations show that DeepSpeed-Ulysses trains 2.5x faster with 4x longer sequence length than the existing method SOTA baseline.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper introduces DeepSpeed-Ulysses, a sequence-parallelism technique for Transformer training that partitions input sequences across devices and uses an all-to-all collective to compute attention. It presents a theoretical communication-volume analysis showing that, unlike prior sequence-parallel methods, communication volume remains constant when sequence length S scales proportionally with the number of devices N (keeping per-device sequence length fixed). Empirical results claim a 2.5x training speedup with 4x longer sequences relative to the prior SOTA baseline.

Significance. If the central claims hold, the work is significant because it directly targets the sequence-length dimension that existing data, tensor, and pipeline parallelism do not optimize. The constant-volume communication result is a clean theoretical contribution that follows from standard collective properties, and the reported speedup is a concrete, falsifiable performance claim against an external baseline. These elements together address a practical bottleneck for long-context LLMs.

major comments (1)
  1. [Communication analysis (abstract and §3)] The argument correctly shows constant volume under S ∝ N scaling because local sequence length per device is fixed, but it does not address whether all-to-all latency remains constant as the number of peers N grows (startup overhead, network contention, and collective implementation costs can increase with N even at fixed message size). This is load-bearing for the scalability claim at extreme scales (e.g., N=1024); a cost-model sketch follows the minor comments below.
minor comments (2)
  1. [Abstract and experimental evaluation] The abstract and experimental section should explicitly state the exact baseline implementation, model sizes, hardware configuration, and whether error bars or multiple runs are reported for the 2.5x speedup figure.
  2. [Method section] Notation for sequence length S, hidden dimension, and device count N should be introduced consistently in the first paragraph of the method section to aid readability.
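To make the major comment concrete, here is a standard alpha-beta (Hockney) cost sketch for a pairwise-exchange all-to-all; the constants are illustrative assumptions of ours, not measurements from the paper. Under S ∝ N the bytes each device sends stay fixed, but the per-message startup term scales with the number of peers:

# Alpha-beta cost model for a pairwise-exchange all-to-all under S ∝ N
# scaling: the local shard has fixed size, yet each device still exchanges
# a (shrinking) message with every one of the N-1 peers.

def all_to_all_time(n_devices, local_bytes, alpha=2e-6, beta=1 / 25e9):
    """Return (startup_term, bandwidth_term) in seconds.

    alpha: per-message startup latency; beta: seconds per byte.
    Each device sends local_bytes / n_devices to each of n_devices - 1 peers.
    """
    per_peer = local_bytes / n_devices
    startup = (n_devices - 1) * alpha
    bandwidth = (n_devices - 1) * beta * per_peer
    return startup, bandwidth

local = 4 * 8192 * 4096  # fixed fp32 shard: 8K local tokens x 4K hidden
for n in (8, 64, 512, 4096):
    s, b = all_to_all_time(n, local)
    print(f"N={n:5d}  startup={s * 1e3:6.2f} ms  bandwidth={b * 1e3:6.2f} ms")

The bandwidth term stays near-constant while the startup term grows linearly with N, overtaking it in the low thousands of devices under these assumed constants; this is exactly the regime the report flags.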

Simulated Author's Rebuttal

1 response · 1 unresolved

We thank the referee for the positive assessment and recommendation of minor revision. We address the single major comment below.

Point-by-point responses
  1. Referee: [Communication analysis (abstract and §3)] The argument correctly shows constant volume under S ∝ N scaling because local sequence length per device is fixed, but it does not address whether all-to-all latency remains constant as the number of peers N grows (startup overhead, network contention, and collective implementation costs can increase with N even at fixed message size). This is load-bearing for the scalability claim at extreme scales (e.g., N=1024).

    Authors: We agree with the referee that our analysis in the abstract and Section 3 addresses only communication volume. Under S ∝ N scaling with fixed per-device sequence length, the all-to-all message size per link stays constant, so volume does not grow. The analysis does not claim or prove that end-to-end latency remains constant; factors such as collective startup overhead, network contention, and implementation costs can indeed increase with N even at fixed message size. We will revise the manuscript to explicitly limit the scope of the theoretical claim to volume and to add a short discussion acknowledging these latency considerations at large N. Our empirical speedups are measured at the scales we evaluated; we do not assert that the same gains will hold without further tuning at N=1024.

    revision: partial

Standing simulated objections (unresolved)
  • Direct measurement or modeling of all-to-all latency at N=1024; our experiments do not reach that scale.

Circularity Check

0 steps flagged

No circularity; communication analysis derives directly from collective properties

Full rationale

The paper's central theoretical result is that DeepSpeed-Ulysses keeps all-to-all communication volume constant when sequence length S scales proportionally with device count N (local sequence length per device remains fixed). This follows immediately from the definition of sequence partitioning plus the standard semantics of all-to-all collectives; no parameter is fitted and then renamed as a prediction, no self-citation supplies a uniqueness theorem, and no ansatz is smuggled in. The reported 2.5x speedup is an external empirical measurement against a published baseline rather than a quantity forced by the paper's own inputs. The derivation chain is therefore self-contained and does not reduce to its own assumptions by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The work rests on standard distributed-computing primitives (all-to-all collectives) and the usual assumptions of data-parallel training; no new fitted parameters, axioms, or invented entities are introduced.

pith-pipeline@v0.9.0 · 5559 in / 1107 out tokens · 38522 ms · 2026-05-13T01:02:09.384868+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • Foundation.DimensionForcing eight_tick_forces_D3 · tag: unclear

    Relation between the paper passage and the cited Recognition theorem:

    "DeepSpeed-Ulysses at its core partitions input data along the sequence dimension and employs an efficient all-to-all collective communication for attention computation. Theoretical communication analysis shows that whereas other methods incur communication overhead as sequence length increases, DeepSpeed-Ulysses maintains constant communication volume when sequence length and compute devices are increased proportionally."

  • Foundation.HierarchyEmergence hierarchy_emergence_forces_phi · tag: unclear

    Relation between the paper passage and the cited Recognition theorem:

    "DeepSpeed-Ulysses trains 2.5x faster with 4x longer sequence length than the existing method SOTA baseline."

What do these tags mean?

  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 26 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. RULER: What's the Real Context Size of Your Long-Context Language Models?

    cs.CL 2024-04 accept novelty 8.0

    RULER shows most long-context LMs drop sharply in performance on complex tasks as length and difficulty increase, with only half maintaining results at 32K tokens.

  2. CausalCine: Real-Time Autoregressive Generation for Multi-Shot Video Narratives

    cs.CV 2026-05 unverdicted novelty 7.0

    CausalCine enables real-time causal autoregressive multi-shot video generation via multi-shot training, content-aware memory routing for coherence, and distillation to few-step inference.

  3. Autopoiesis: A Self-Evolving System Paradigm for LLM Serving Under Runtime Dynamics

    cs.DC 2026-04 unverdicted novelty 7.0

    Autopoiesis uses LLM-driven program synthesis to evolve serving policies online during deployment, delivering up to 53% and average 34% gains over prior LLM serving systems under runtime dynamics.

  4. Ring Attention with Blockwise Transformers for Near-Infinite Context

    cs.CL 2023-10 unverdicted novelty 7.0

    Ring Attention uses blockwise computation and ring communication to let Transformers process sequences up to device-count times longer than prior memory-efficient methods.

  5. ChunkFlow: Communication-Aware Chunked Prefetching for Layerwise Offloading in Distributed Diffusion Transformer Inference

    cs.DC 2026-05 unverdicted novelty 6.0

    ChunkFlow achieves up to 1.28x step-time speedup and up to 49% lower peak GPU memory for DiT inference by using a first-order model to guide communication-aware chunked prefetching.

  6. MegaScale-Omni: A Hyper-Scale, Workload-Resilient System for MultiModal LLM Training in Production

    cs.DC 2026-05 unverdicted novelty 6.0

    MegaScale-Omni delivers 1.27x-7.57x higher throughput for dynamic multimodal LLM training by decoupling encoder and LLM parallelism, using unified colocation, and applying adaptive workload balancing.

  7. Priming: Hybrid State Space Models From Pre-trained Transformers

    cs.LG 2026-05 unverdicted novelty 6.0

    Priming transfers knowledge from pre-trained Transformers to hybrid SSM-attention models, recovering performance with minimal additional tokens and showing Gated KalmaNet outperforming Mamba-2 on long-context reasonin...

  8. Unifying Sparse Attention with Hierarchical Memory for Scalable Long-Context LLM Serving

    cs.LG 2026-04 unverdicted novelty 6.0

    SPIN co-designs sparse attention with hierarchical memory to achieve 1.66-5.66x higher throughput, 7-9x lower TTFT, and up to 58% lower TPOT than vLLM and original sparse implementations.

  9. CommFuse: Hiding Tail Latency via Communication Decomposition and Fusion for Distributed LLM Training

    cs.LG 2026-04 unverdicted novelty 6.0

    CommFuse eliminates tail latency in communication-computation overlap for distributed LLM training by decomposing collective operations into P2P communications and fusing them with fine-grained computation scheduling.

  10. Hive: A Multi-Agent Infrastructure for Algorithm- and Task-Level Scaling

    cs.AI 2026-04 unverdicted novelty 6.0

    Hive is a multi-agent infrastructure with a logits cache for reducing cross-path redundancy in sampling and agent-aware scheduling for better compute and KV-cache allocation, shown to deliver 1.11x-1.76x speedups and ...

  11. CoCoDiff: Optimizing Collective Communications for Distributed Diffusion Transformer Inference Under Ulysses Sequence Parallelism

    cs.DC 2026-04 unverdicted novelty 6.0

    CoCoDiff achieves 3.6x average and 8.4x peak speedup for distributed DiT inference on up to 96 GPU tiles via tile-aware all-to-all, V-first scheduling, and selective V communication.

  12. Geometric Context Transformer for Streaming 3D Reconstruction

    cs.CV 2026-04 unverdicted novelty 6.0

    LingBot-Map is a streaming 3D reconstruction model built on a geometric context transformer that combines anchor context, pose-reference window, and trajectory memory to deliver accurate, drift-resistant results at 20...

  13. OmniShow: Unifying Multimodal Conditions for Human-Object Interaction Video Generation

    cs.CV 2026-04 unverdicted novelty 6.0

    OmniShow unifies text, image, audio, and pose conditions into an end-to-end model for high-quality human-object interaction video generation and introduces the HOIVG-Bench benchmark, claiming state-of-the-art results.

  14. LPM 1.0: Video-based Character Performance Model

    cs.CV 2026-04 unverdicted novelty 6.0

    LPM 1.0 generates infinite-length, identity-stable, real-time audio-visual conversational performances for single characters using a distilled causal diffusion transformer and a new benchmark.

  15. LSRM: High-Fidelity Object-Centric Reconstruction via Scaled Context Windows

    cs.CV 2026-04 conditional novelty 6.0

    LSRM scales transformer context windows with native sparse attention and geometric routing to deliver high-fidelity feed-forward 3D reconstruction and inverse rendering that approaches dense optimization quality.

  16. DeepStack: Scalable and Accurate Design Space Exploration for Distributed 3D-Stacked AI Accelerators

    cs.AR 2026-04 conditional novelty 6.0

    DeepStack introduces a fast performance model and hierarchical search method for co-optimizing 3D DRAM stacking, interconnects, and distributed scheduling in AI accelerators, delivering up to 9.5x throughput gains ove...

  17. GENSERVE: Efficient Co-Serving of Heterogeneous Diffusion Model Workloads

    cs.DC 2026-04 unverdicted novelty 6.0

    GENSERVE improves SLO attainment by up to 44% for co-serving heterogeneous T2I and T2V diffusion workloads via step-level preemption, elastic parallelism, and joint scheduling.

  18. MAGI-1: Autoregressive Video Generation at Scale

    cs.CV 2025-05 unverdicted novelty 6.0

    MAGI-1 is a 24B-parameter autoregressive video world model that predicts denoised frame chunks sequentially with increasing noise to enable causal, scalable, streaming generation up to 4M token contexts.

  19. An Efficient Hybrid Sparse Attention with CPU-GPU Parallelism for Long-Context Inference

    cs.LG 2026-05 unverdicted novelty 5.0

    Fluxion achieves 1.5x-3.7x speedup in long-context LLM inference with CPU KV caches while limiting accuracy degradation to at most 0.26 relative to full attention.

  20. ResiHP: Taming LLM Training Failures with Dynamic Hybrid Parallelism

    cs.DC 2026-05 unverdicted novelty 5.0

    ResiHP improves LLM training throughput by 1.04-4.39x under hardware failures by using a workload-aware execution time predictor to avoid false failure detections and a scheduler that dynamically changes parallelism g...

  21. Visual Generation in the New Era: An Evolution from Atomic Mapping to Agentic World Modeling

    cs.CV 2026-04 unverdicted novelty 5.0

    Visual generation models are evolving from passive renderers to interactive agentic world modelers, but current systems lack spatial reasoning, temporal consistency, and causal understanding, with evaluations overemph...

  22. SpikingBrain2.0: Brain-Inspired Foundation Models for Efficient Long-Context and Cross-Platform Inference

    cs.LG 2026-04 unverdicted novelty 5.0

    SpikingBrain2.0 is a 5B hybrid spiking-Transformer that recovers most base model performance while delivering 10x TTFT speedup at 4M context and supporting over 10M tokens on limited GPUs via dual sparse attention and...

  23. ResiHP: Taming LLM Training Failures with Dynamic Hybrid Parallelism

    cs.DC 2026-05 unverdicted novelty 4.0

    ResiHP introduces a workload-aware failure detector and dynamic scheduler for hybrid-parallel LLM training that achieves 1.04-4.39x higher throughput than prior resilient systems under failures on a 256-GPU cluster.

  24. Cross-Layer Energy Analysis of Multimodal Training on Grace Hopper Superchips

    cs.DC 2026-05 unverdicted novelty 4.0

    On Grace Hopper superchips, energy efficiency during multimodal training is governed by data movement and overlap rather than compute utilization, and runtime-optimal configurations are not always energy-optimal.

  25. Seedance 1.0: Exploring the Boundaries of Video Generation Models

    cs.CV 2025-06 unverdicted novelty 4.0

    Seedance 1.0 generates 5-second 1080p videos in about 41 seconds with claimed superior motion quality, prompt adherence, and multi-shot consistency compared to prior models.

  26. Cosmos World Foundation Model Platform for Physical AI

    cs.CV 2025-01 unverdicted novelty 3.0

    The Cosmos platform supplies open-source pre-trained world models and supporting tools for building fine-tunable digital world simulations to train Physical AI.
