Pith · machine review for the scientific record

arxiv: 2309.14509 · v2 · submitted 2023-09-25 · 💻 cs.LG · cs.CL · cs.DC

Recognition: 2 Lean theorem links

DeepSpeed Ulysses: System Optimizations for Enabling Training of Extreme Long Sequence Transformer Models

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 01:02 UTC · model grok-4.3

classification: 💻 cs.LG · cs.CL · cs.DC
keywords: sequence parallelism · long sequence training · Transformer optimization · all-to-all communication · LLM scaling · DeepSpeed · attention computation · system optimizations

The pith

DeepSpeed-Ulysses trains Transformer models with 4x longer sequences 2.5x faster than prior methods by keeping communication volume constant through sequence partitioning and all-to-all attention.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces DeepSpeed-Ulysses to overcome memory and communication limits in training large language models on extremely long input sequences. It partitions the input along the sequence dimension across devices and uses an efficient all-to-all collective to re-partition queries, keys, and values by attention head, so each device computes full-sequence attention for its share of heads. Theoretical analysis shows communication volume stays constant when sequence length and device count grow together, unlike earlier sequence-parallel approaches whose overhead rises with length. Experiments demonstrate the method runs 2.5 times faster while supporting four times longer sequences than the previous state-of-the-art baseline. This directly addresses the practical need for long-context models in applications such as document understanding and extended dialogue.

Core claim

DeepSpeed-Ulysses partitions the input sequence across devices and performs attention via an efficient all-to-all collective communication pattern. This design yields constant communication volume when sequence length scales proportionally with the number of devices, whereas prior sequence-parallel methods incur growing communication cost. The approach is portable across Transformer architectures and delivers 2.5 times faster training at four times the sequence length relative to the strongest existing baseline.
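To make the constant-volume claim concrete, here is a back-of-envelope count in our own notation (sequence length S, hidden size h, element size b bytes, N devices); a sketch, not the paper's exact derivation. Each device holds an (S/N) x h activation shard, and an all-to-all over N devices sends (N-1)/N of it:

\[
V_{\text{dev}} \;=\; \frac{N-1}{N}\cdot\frac{S h b}{N} \;\approx\; \frac{S h b}{N},
\qquad
S = kN \;\Longrightarrow\; V_{\text{dev}} \approx k h b \quad\text{(constant)}.
\]

By contrast, an all-gather-style sequence-parallel scheme delivers the full S x h tensor to every device, so \(V_{\text{dev}} \approx S h b = k N h b\), which grows linearly with N.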

What carries the argument

Sequence-dimension partitioning paired with all-to-all collective communication for attention, which holds communication volume fixed as both sequence length and device count increase together.
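The layout swap behind that sentence can be simulated in a few lines. The sketch below is a single-process, illustrative rendering of the Ulysses pattern with helper names of our own invention; it is not the DeepSpeed API, and a real implementation would use torch.distributed all-to-all collectives across ranks.

import torch

# Illustrative sizes: N "devices", sequence length S, H heads, head dim d.
N, S, H, d = 4, 16, 8, 32
assert S % N == 0 and H % N == 0

def all_to_all_seq_to_head(x_shards):
    """Simulate the pre-attention all-to-all.

    Input:  list of N tensors, each [S/N, H, d]  (sequence-partitioned)
    Output: list of N tensors, each [S, H/N, d]  (head-partitioned)
    Each device ends up with the full sequence for H/N of the heads.
    """
    out = []
    for dst in range(N):
        heads = slice(dst * H // N, (dst + 1) * H // N)
        out.append(torch.cat([s[:, heads, :] for s in x_shards], dim=0))
    return out

def all_to_all_head_to_seq(y_shards):
    """Simulate the post-attention all-to-all (inverse layout swap)."""
    out = []
    for dst in range(N):
        seq = slice(dst * S // N, (dst + 1) * S // N)
        out.append(torch.cat([s[seq, :, :] for s in y_shards], dim=1))
    return out

# Sequence-partitioned Q, K, V shards, one per simulated device.
q = [torch.randn(S // N, H, d) for _ in range(N)]
k = [torch.randn(S // N, H, d) for _ in range(N)]
v = [torch.randn(S // N, H, d) for _ in range(N)]

# After the first all-to-all, each device holds full-sequence Q, K, V
# for its local head group, so ordinary dense attention runs unchanged.
qh, kh, vh = (all_to_all_seq_to_head(t) for t in (q, k, v))
ctx = []
for qi, ki, vi in zip(qh, kh, vh):
    scores = torch.einsum("shd,thd->hst", qi, ki) / d ** 0.5
    ctx.append(torch.einsum("hst,thd->shd", scores.softmax(dim=-1), vi))

# The second all-to-all restores sequence partitioning for the MLP block.
out = all_to_all_head_to_seq(ctx)
assert out[0].shape == (S // N, H, d)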

If this is right

  • Sequence length can be increased by adding devices without a proportional rise in communication overhead.
  • Long-context applications become feasible on current hardware without model-quality trade-offs.
  • The method can be combined with existing data, tensor, and pipeline parallelism for still larger models.
  • Training throughput improves for any Transformer that uses self-attention on long inputs.
  • Memory per device for activations drops because only local sequence segments are stored during the forward pass (see the rough arithmetic after this list).
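The last bullet is easy to quantify; the sizes below are illustrative assumptions of ours, not figures from the paper:

# Rough per-device footprint of one sequence x hidden activation tensor in
# fp16, with and without Ulysses-style sequence partitioning.
S, h, N, bytes_fp16 = 1_048_576, 4096, 64, 2  # 1M tokens, 4K hidden, 64 GPUs
full = S * h * bytes_fp16      # unpartitioned: one 8 GiB tensor per device
local = full // N              # sequence-partitioned: ~128 MiB per device
print(f"{full / 2**30:.1f} GiB -> {local / 2**20:.0f} MiB per device")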

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Models could process entire books or hours-long conversations in a single forward pass once hardware supports the proportional device scaling.
  • Energy cost per token may fall at scale because communication no longer grows with length.
  • The same partitioning pattern could extend to other attention variants such as sparse or linear attention without redesigning the collective.
  • Integration with existing DeepSpeed features would allow users to enable long-sequence training with a single configuration flag.

Load-bearing premise

All-to-all communication remains efficient and does not become the dominant bottleneck at extreme scale, while splitting the sequence across devices preserves final model quality without extra adjustments.

What would settle it

A scaling experiment on thousands of GPUs showing all-to-all time overtaking compute time, or a quality comparison where partitioned training produces measurably lower accuracy or perplexity than an equivalent non-partitioned run at the same sequence length.

Original abstract

Computation in a typical Transformer-based large language model (LLM) can be characterized by batch size, hidden dimension, number of layers, and sequence length. Until now, system works for accelerating LLM training have focused on the first three dimensions: data parallelism for batch size, tensor parallelism for hidden size and pipeline parallelism for model depth or layers. These widely studied forms of parallelism are not targeted or optimized for long sequence Transformer models. Given practical application needs for long sequence LLM, renewed attentions are being drawn to sequence parallelism. However, existing works in sequence parallelism are constrained by memory-communication inefficiency, limiting their scalability to long sequence large models. In this work, we introduce DeepSpeed-Ulysses, a novel, portable and effective methodology for enabling highly efficient and scalable LLM training with extremely long sequence length. DeepSpeed-Ulysses at its core partitions input data along the sequence dimension and employs an efficient all-to-all collective communication for attention computation. Theoretical communication analysis shows that whereas other methods incur communication overhead as sequence length increases, DeepSpeed-Ulysses maintains constant communication volume when sequence length and compute devices are increased proportionally. Furthermore, experimental evaluations show that DeepSpeed-Ulysses trains 2.5x faster with 4x longer sequence length than the existing method SOTA baseline.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper introduces DeepSpeed-Ulysses, a sequence-parallelism technique for Transformer training that partitions input sequences across devices and uses an all-to-all collective to compute attention. It presents a theoretical communication-volume analysis showing that, unlike prior sequence-parallel methods, communication volume remains constant when sequence length S scales proportionally with the number of devices N (keeping per-device sequence length fixed). Empirical results claim a 2.5x training speedup with 4x longer sequences relative to the prior SOTA baseline.

Significance. If the central claims hold, the work is significant because it directly targets the sequence-length dimension that existing data, tensor, and pipeline parallelism do not optimize. The constant-volume communication result is a clean theoretical contribution that follows from standard collective properties, and the reported speedup is a concrete, falsifiable performance claim against an external baseline. These elements together address a practical bottleneck for long-context LLMs.

major comments (1)
  1. [Communication analysis (abstract and §3)] The argument correctly shows constant volume under S ∝ N scaling because local sequence length per device is fixed, but it does not address whether all-to-all latency remains constant as the number of peers N grows (startup overhead, network contention, and collective implementation costs can increase with N even at fixed message size). This is load-bearing for the scalability claim at extreme scales (e.g., N=1024); a cost-model sketch follows the minor comments below.
minor comments (2)
  1. [Abstract and experimental evaluation] The abstract and experimental section should explicitly state the exact baseline implementation, model sizes, hardware configuration, and whether error bars or multiple runs are reported for the 2.5x speedup figure.
  2. [Method section] Notation for sequence length S, hidden dimension, and device count N should be introduced consistently in the first paragraph of the method section to aid readability.
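To make the major comment concrete, here is a standard alpha-beta (Hockney) cost sketch for a pairwise-exchange all-to-all; the constants are illustrative assumptions of ours, not measurements from the paper. Under S ∝ N the bytes each device sends stay fixed, but the per-message startup term scales with the number of peers:

# Alpha-beta cost model for a pairwise-exchange all-to-all under S ∝ N
# scaling: the local shard has fixed size, yet each device still exchanges
# a (shrinking) message with every one of the N-1 peers.

def all_to_all_time(n_devices, local_bytes, alpha=2e-6, beta=1 / 25e9):
    """Return (startup_term, bandwidth_term) in seconds.

    alpha: per-message startup latency; beta: seconds per byte.
    Each device sends local_bytes / n_devices to each of n_devices - 1 peers.
    """
    per_peer = local_bytes / n_devices
    startup = (n_devices - 1) * alpha
    bandwidth = (n_devices - 1) * beta * per_peer
    return startup, bandwidth

local = 4 * 8192 * 4096  # fixed fp32 shard: 8K local tokens x 4K hidden
for n in (8, 64, 512, 4096):
    s, b = all_to_all_time(n, local)
    print(f"N={n:5d}  startup={s * 1e3:6.2f} ms  bandwidth={b * 1e3:6.2f} ms")

The bandwidth term stays near-constant while the startup term grows linearly with N, overtaking it in the low thousands of devices under these assumed constants; this is exactly the regime the report flags.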

Simulated Author's Rebuttal

1 response · 1 unresolved

We thank the referee for the positive assessment and recommendation of minor revision. We address the single major comment below.

Point-by-point responses
  1. Referee: [Communication analysis (abstract and §3)] The argument correctly shows constant volume under S ∝ N scaling because local sequence length per device is fixed, but it does not address whether all-to-all latency remains constant as the number of peers N grows (startup overhead, network contention, and collective implementation costs can increase with N even at fixed message size). This is load-bearing for the scalability claim at extreme scales (e.g., N=1024).

    Authors: We agree with the referee that our analysis in the abstract and Section 3 addresses only communication volume. Under S ∝ N scaling with fixed per-device sequence length, the all-to-all message size per link stays constant, so volume does not grow. The analysis does not claim or prove that end-to-end latency remains constant; factors such as collective startup overhead, network contention, and implementation costs can indeed increase with N even at fixed message size. We will revise the manuscript to explicitly limit the scope of the theoretical claim to volume and to add a short discussion acknowledging these latency considerations at large N. Our empirical speedups are measured at the scales we evaluated; we do not assert that the same gains will hold without further tuning at N=1024.

    revision: partial

Standing simulated objections (unresolved)
  • Direct measurement or modeling of all-to-all latency at N=1024; our experiments do not reach that scale.

Circularity Check

0 steps flagged

No circularity; communication analysis derives directly from collective properties

Full rationale

The paper's central theoretical result is that DeepSpeed-Ulysses keeps all-to-all communication volume constant when sequence length S scales proportionally with device count N (local sequence length per device remains fixed). This follows immediately from the definition of sequence partitioning plus the standard semantics of all-to-all collectives; no parameter is fitted and then renamed as a prediction, no self-citation supplies a uniqueness theorem, and no ansatz is smuggled in. The reported 2.5x speedup is an external empirical measurement against a published baseline rather than a quantity forced by the paper's own inputs. The derivation chain is therefore self-contained and does not reduce to its own assumptions by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The work rests on standard distributed-computing primitives (all-to-all collectives) and the usual assumptions of data-parallel training; no new fitted parameters, axioms, or invented entities are introduced.

pith-pipeline@v0.9.0 · 5559 in / 1107 out tokens · 38522 ms · 2026-05-13T01:02:09.384868+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • Foundation.DimensionForcing eight_tick_forces_D3 · tag: unclear

    Relation between the paper passage and the cited Recognition theorem:

    "DeepSpeed-Ulysses at its core partitions input data along the sequence dimension and employs an efficient all-to-all collective communication for attention computation. Theoretical communication analysis shows that whereas other methods incur communication overhead as sequence length increases, DeepSpeed-Ulysses maintains constant communication volume when sequence length and compute devices are increased proportionally."

  • Foundation.HierarchyEmergence hierarchy_emergence_forces_phi · tag: unclear

    Relation between the paper passage and the cited Recognition theorem:

    "DeepSpeed-Ulysses trains 2.5x faster with 4x longer sequence length than the existing method SOTA baseline."

What do these tags mean?

  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 26 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. RULER: What's the Real Context Size of Your Long-Context Language Models?

    cs.CL 2024-04 accept novelty 8.0

    RULER shows most long-context LMs drop sharply in performance on complex tasks as length and difficulty increase, with only half maintaining results at 32K tokens.

  2. CausalCine: Real-Time Autoregressive Generation for Multi-Shot Video Narratives

    cs.CV 2026-05 unverdicted novelty 7.0

    CausalCine enables real-time causal autoregressive multi-shot video generation via multi-shot training, content-aware memory routing for coherence, and distillation to few-step inference.

  3. Autopoiesis: A Self-Evolving System Paradigm for LLM Serving Under Runtime Dynamics

    cs.DC 2026-04 unverdicted novelty 7.0

    Autopoiesis uses LLM-driven program synthesis to evolve serving policies online during deployment, delivering up to 53% and average 34% gains over prior LLM serving systems under runtime dynamics.

  4. Ring Attention with Blockwise Transformers for Near-Infinite Context

    cs.CL 2023-10 unverdicted novelty 7.0

    Ring Attention uses blockwise computation and ring communication to let Transformers process sequences up to device-count times longer than prior memory-efficient methods.

  5. ChunkFlow: Communication-Aware Chunked Prefetching for Layerwise Offloading in Distributed Diffusion Transformer Inference

    cs.DC 2026-05 unverdicted novelty 6.0

    ChunkFlow achieves up to 1.28x step-time speedup and up to 49% lower peak GPU memory for DiT inference by using a first-order model to guide communication-aware chunked prefetching.

  6. MegaScale-Omni: A Hyper-Scale, Workload-Resilient System for MultiModal LLM Training in Production

    cs.DC 2026-05 unverdicted novelty 6.0

    MegaScale-Omni delivers 1.27x-7.57x higher throughput for dynamic multimodal LLM training by decoupling encoder and LLM parallelism, using unified colocation, and applying adaptive workload balancing.

  7. Priming: Hybrid State Space Models From Pre-trained Transformers

    cs.LG 2026-05 unverdicted novelty 6.0

    Priming transfers knowledge from pre-trained Transformers to hybrid SSM-attention models, recovering performance with minimal additional tokens and showing Gated KalmaNet outperforming Mamba-2 on long-context reasonin...

  8. Unifying Sparse Attention with Hierarchical Memory for Scalable Long-Context LLM Serving

    cs.LG 2026-04 unverdicted novelty 6.0

    SPIN co-designs sparse attention with hierarchical memory to achieve 1.66-5.66x higher throughput, 7-9x lower TTFT, and up to 58% lower TPOT than vLLM and original sparse implementations.

  9. CommFuse: Hiding Tail Latency via Communication Decomposition and Fusion for Distributed LLM Training

    cs.LG 2026-04 unverdicted novelty 6.0

    CommFuse eliminates tail latency in communication-computation overlap for distributed LLM training by decomposing collective operations into P2P communications and fusing them with fine-grained computation scheduling.

  10. Hive: A Multi-Agent Infrastructure for Algorithm- and Task-Level Scaling

    cs.AI 2026-04 unverdicted novelty 6.0

    Hive is a multi-agent infrastructure with a logits cache for reducing cross-path redundancy in sampling and agent-aware scheduling for better compute and KV-cache allocation, shown to deliver 1.11x-1.76x speedups and ...

  11. CoCoDiff: Optimizing Collective Communications for Distributed Diffusion Transformer Inference Under Ulysses Sequence Parallelism

    cs.DC 2026-04 unverdicted novelty 6.0

    CoCoDiff achieves 3.6x average and 8.4x peak speedup for distributed DiT inference on up to 96 GPU tiles via tile-aware all-to-all, V-first scheduling, and selective V communication.

  12. Geometric Context Transformer for Streaming 3D Reconstruction

    cs.CV 2026-04 unverdicted novelty 6.0

    LingBot-Map is a streaming 3D reconstruction model built on a geometric context transformer that combines anchor context, pose-reference window, and trajectory memory to deliver accurate, drift-resistant results at 20...

  13. OmniShow: Unifying Multimodal Conditions for Human-Object Interaction Video Generation

    cs.CV 2026-04 unverdicted novelty 6.0

    OmniShow unifies text, image, audio, and pose conditions into an end-to-end model for high-quality human-object interaction video generation and introduces the HOIVG-Bench benchmark, claiming state-of-the-art results.

  14. LPM 1.0: Video-based Character Performance Model

    cs.CV 2026-04 unverdicted novelty 6.0

    LPM 1.0 generates infinite-length, identity-stable, real-time audio-visual conversational performances for single characters using a distilled causal diffusion transformer and a new benchmark.

  15. LSRM: High-Fidelity Object-Centric Reconstruction via Scaled Context Windows

    cs.CV 2026-04 conditional novelty 6.0

    LSRM scales transformer context windows with native sparse attention and geometric routing to deliver high-fidelity feed-forward 3D reconstruction and inverse rendering that approaches dense optimization quality.

  16. DeepStack: Scalable and Accurate Design Space Exploration for Distributed 3D-Stacked AI Accelerators

    cs.AR 2026-04 conditional novelty 6.0

    DeepStack introduces a fast performance model and hierarchical search method for co-optimizing 3D DRAM stacking, interconnects, and distributed scheduling in AI accelerators, delivering up to 9.5x throughput gains ove...

  17. GENSERVE: Efficient Co-Serving of Heterogeneous Diffusion Model Workloads

    cs.DC 2026-04 unverdicted novelty 6.0

    GENSERVE improves SLO attainment by up to 44% for co-serving heterogeneous T2I and T2V diffusion workloads via step-level preemption, elastic parallelism, and joint scheduling.

  18. MAGI-1: Autoregressive Video Generation at Scale

    cs.CV 2025-05 unverdicted novelty 6.0

    MAGI-1 is a 24B-parameter autoregressive video world model that predicts denoised frame chunks sequentially with increasing noise to enable causal, scalable, streaming generation up to 4M token contexts.

  19. An Efficient Hybrid Sparse Attention with CPU-GPU Parallelism for Long-Context Inference

    cs.LG 2026-05 unverdicted novelty 5.0

    Fluxion achieves 1.5x-3.7x speedup in long-context LLM inference with CPU KV caches while limiting accuracy degradation to at most 0.26 relative to full attention.

  20. ResiHP: Taming LLM Training Failures with Dynamic Hybrid Parallelism

    cs.DC 2026-05 unverdicted novelty 5.0

    ResiHP improves LLM training throughput by 1.04-4.39x under hardware failures by using a workload-aware execution time predictor to avoid false failure detections and a scheduler that dynamically changes parallelism g...

  21. Visual Generation in the New Era: An Evolution from Atomic Mapping to Agentic World Modeling

    cs.CV 2026-04 unverdicted novelty 5.0

    Visual generation models are evolving from passive renderers to interactive agentic world modelers, but current systems lack spatial reasoning, temporal consistency, and causal understanding, with evaluations overemph...

  22. SpikingBrain2.0: Brain-Inspired Foundation Models for Efficient Long-Context and Cross-Platform Inference

    cs.LG 2026-04 unverdicted novelty 5.0

    SpikingBrain2.0 is a 5B hybrid spiking-Transformer that recovers most base model performance while delivering 10x TTFT speedup at 4M context and supporting over 10M tokens on limited GPUs via dual sparse attention and...

  23. ResiHP: Taming LLM Training Failures with Dynamic Hybrid Parallelism

    cs.DC 2026-05 unverdicted novelty 4.0

    ResiHP introduces a workload-aware failure detector and dynamic scheduler for hybrid-parallel LLM training that achieves 1.04-4.39x higher throughput than prior resilient systems under failures on a 256-GPU cluster.

  24. Cross-Layer Energy Analysis of Multimodal Training on Grace Hopper Superchips

    cs.DC 2026-05 unverdicted novelty 4.0

    On Grace Hopper superchips, energy efficiency during multimodal training is governed by data movement and overlap rather than compute utilization, and runtime-optimal configurations are not always energy-optimal.

  25. Seedance 1.0: Exploring the Boundaries of Video Generation Models

    cs.CV 2025-06 unverdicted novelty 4.0

    Seedance 1.0 generates 5-second 1080p videos in about 41 seconds with claimed superior motion quality, prompt adherence, and multi-shot consistency compared to prior models.

  26. Cosmos World Foundation Model Platform for Physical AI

    cs.CV 2025-01 unverdicted novelty 3.0

    The Cosmos platform supplies open-source pre-trained world models and supporting tools for building fine-tunable digital world simulations to train Physical AI.
