arXiv: 2412.12636 · v3 · pith:2OKGTL5Hnew · submitted 2024-12-17 · 💻 cs.DC · cs.AI · cs.LG · cs.PF

TrainMover: An Interruption-Resilient Runtime for ML Training

Pith reviewed 2026-05-23 07:10 UTC · model grok-4.3

classification 💻 cs.DC · cs.AI · cs.LG · cs.PF
keywords ML training runtime · interruption resilience · elastic computing · standby machines · GPU cluster fault tolerance · communication group setup · LLM training

The pith

TrainMover recovers large-scale ML training from interruptions in about 20 seconds at the 1024-GPU scale, using elastic and standby machines with zero memory overhead.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large ML training jobs on GPU clusters face frequent interruptions from hardware and software failures and from management events, and existing checkpoint-restart or runtime-reconfiguration methods respond with long restarts. TrainMover keeps training alive by drawing on a pool of elastic and standby machines that take over quickly when nodes drop out. The system sets up communication groups in two phases using only delta changes, warms up new nodes without any cross-node messages, and lets any standby assume any role after a failure. At 1024 GPUs the measured downtime stays near 20 seconds across the tested interruption types. The paper projects a 55 percent drop in wasted GPU hours relative to the best alternative, which amounts to 1.4 million GPU-hours saved per week at 64K GPUs.

Core claim

TrainMover achieves around 20 seconds of downtime when handling various interruptions at the 1024-GPU scale. It does so by combining elastic and standby machines with three techniques: two-phase delta-based communication group setup, communication-free sandboxed warmup, and a general standby design that supports recovery from any role. The paper projects a 55 percent reduction in wasted GPU hours at the 64K-GPU scale.

What carries the argument

General standby design that enables failure recovery from any role, backed by two-phase delta-based communication group setup and communication-free sandboxed warmup.
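To make the "delta" idea concrete, here is a minimal sketch of how one might identify which communication groups are touched when a single machine is replaced. It is an illustration only, not the paper's algorithm: the rank layout (`rank = d * PP * TP + p * TP + t`), the group naming, and the helper names `build_groups` and `delta_groups` are all assumptions introduced here, and the (TP4, PP8, DP3) configuration is borrowed from the experiment settings quoted in the Figure 15 excerpt.

```python
from itertools import product


def build_groups(tp, pp, dp):
    """Return {group_name: frozenset(ranks)} for a TP x PP x DP layout.

    Assumed (hypothetical) rank layout: rank = d * pp * tp + p * tp + t.
    """
    def rank(d, p, t):
        return d * pp * tp + p * tp + t

    groups = {}
    # Tensor-parallel groups: fix (d, p), vary t.
    for d, p in product(range(dp), range(pp)):
        groups[f"tp:d{d}p{p}"] = frozenset(rank(d, p, t) for t in range(tp))
    # Pipeline-parallel groups: fix (d, t), vary p.
    for d, t in product(range(dp), range(tp)):
        groups[f"pp:d{d}t{t}"] = frozenset(rank(d, p, t) for p in range(pp))
    # Data-parallel groups: fix (p, t), vary d.
    for p, t in product(range(pp), range(tp)):
        groups[f"dp:p{p}t{t}"] = frozenset(rank(d, p, t) for d in range(dp))
    return groups


def delta_groups(groups, replaced_ranks):
    """Names of the groups containing at least one replaced rank."""
    replaced = set(replaced_ranks)
    return {name for name, members in groups.items() if members & replaced}


groups = build_groups(tp=4, pp=8, dp=3)   # 96 ranks in total
delta = delta_groups(groups, range(8))    # replace one 8-GPU machine (ranks 0-7)
print(len(groups), len(delta))
```

Under this assumed layout, only 14 of the 68 groups contain a rank from the replaced machine, so a delta-based setup would leave the other 54 untouched. How TrainMover splits that work across its two phases is not described on this page.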

If this is right

  • Interruptions at 1024-GPU scale incur only around 20 seconds of downtime.
  • Wasted GPU hours are projected to drop by 55 percent relative to the best alternative.
  • At 64K-GPU scale the projected saving is 1.4 million GPU-hours per week.
  • No extra memory is required on the active training nodes.
  • The same standby pool works for hardware failures, software anomalies, and management events.
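The two projected figures in this list can be cross-checked with simple arithmetic. The sketch below rests on assumptions that this page does not state: that the 55 percent reduction and the 1.4 million GPU-hours refer to the same baseline (the best alternative), that "64K" means 65,536 GPUs, and that the cluster runs all 168 hours of the week.

```python
GPUS = 64 * 1024            # assumption: "64K" = 65,536 GPUs
HOURS_PER_WEEK = 7 * 24     # assumption: the cluster is busy all week

saved = 1.4e6               # GPU-hours saved per week (from the abstract)
reduction = 0.55            # fractional drop in wasted GPU hours (from the abstract)

# If saved = reduction * baseline_wasted, the baseline follows directly.
baseline_wasted = saved / reduction
remaining_wasted = baseline_wasted - saved
total_capacity = GPUS * HOURS_PER_WEEK

baseline_fraction = baseline_wasted / total_capacity
remaining_fraction = remaining_wasted / total_capacity

print(f"baseline wasted:  {baseline_wasted:,.0f} GPU-hours/week")
print(f"remaining wasted: {remaining_wasted:,.0f} GPU-hours/week")
print(f"share of weekly capacity: {baseline_fraction:.1%} -> {remaining_fraction:.1%}")
```

Under these assumptions the best alternative would waste roughly 2.5 million GPU-hours per week, a little under a quarter of the cluster's weekly capacity, and TrainMover would bring that down to roughly 1.1 million.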

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • If clusters routinely provision extra standby capacity, operators could shift from over-provisioning for worst-case restarts to steady-state elastic pools.
  • The communication-free warmup technique could reduce coordination overhead in other distributed workloads that need fast node replacement.
  • At scales beyond 64K GPUs the weekly savings would grow linearly with cluster size if the per-interruption cost remains constant.
  • Real-world traces that include rarer or correlated failures would test whether the 20-second bound holds outside the paper's controlled experiments.

Load-bearing premise

The design assumes elastic and standby machines are always available in the cluster and that real interruptions match the types and rates tested at 1024 GPUs.

What would settle it

A run at 1024 GPUs where TrainMover produces measured downtime well above 20 seconds on any of the interruption types evaluated in the paper would falsify the downtime claim.

Figures

Figures reproduced from arXiv: 2412.12636 by Aditya Akella, ChonLam Lao, Dennis Cai, Ennan Zhai, Jiamin Cao, Jiangfei Duan, Jiaqi Gao, Jingren Zhou, Minlan Yu, Pengcheng Zhang, Yichi Xu, Yong Li, Yu Guan, Zhengping Qian, Zhilong Zheng, Zhipeng Zhang.

Figure 2: GPU memory utilization across machine scales.
Adjacent paper text: Network failures including optical module or switch failures, and network congestion. Network architects provide enough redundancy in the network [27] to guarantee connectivity between the GPU servers. However, a failed network component still slows down the training job. Nearly half of the large-scale LLM training jobs are slowed down by network events such…
Figure 4: Time breakdown of NCCL setup components with and without the CUDA_VISIBLE_DEVICES flag.
Adjacent paper text: …in Oobleck and 21 seconds in Parcae. Adding a new machine in these systems (+1) incurs similar overhead as the restart phase of stop-reschedule-restart schemes: the new machine must initialize NCCL groups and all the system components across the software and hardware stack. This takes over 100 seconds in Oobleck and mor…
Figure 5: CCL Migration workflow.
Adjacent paper text: …subsequently enters the main loop, it can seamlessly continue training without incurring downtime overhead. 5.1 Today’s Lazy Initialization. Lazy initialization is a critical aspect of distributed training that helps boost performance through runtime optimization, especially in Python-based frameworks like PyTorch. Deferred until the first forward pass, the system lazily constructs …
Figure 6: Sandbox lazy initialization workflow.
Adjacent paper text: The sandbox operates in two phases: record and replay. The record phase captures one (or few) valid iteration and stores the outputs of communication operations. When migration occurs, joiners are placed into the sandbox, where the replay phase replays the valid communication results recorded from the record phase to trigger their lazy initialization. Shield the Migra…
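The record and replay phases described for Figure 6 can be illustrated with a toy sketch. Everything here is hypothetical: the class names `RecordingComm`, `ReplayComm`, and `Worker`, and the `fake_all_reduce` stand-in, are introduced only for illustration. The actual system hooks into PyTorch and NCCL in ways this page does not show.

```python
class RecordingComm:
    """Record phase: forwards to a real collective and logs each output."""

    def __init__(self, real_all_reduce):
        self._real_all_reduce = real_all_reduce
        self.log = []

    def all_reduce(self, values):
        result = self._real_all_reduce(values)
        self.log.append(list(result))  # keep a copy of the output
        return result


class ReplayComm:
    """Replay phase: serves logged outputs in order and contacts no peer."""

    def __init__(self, log):
        self._pending = iter(log)

    def all_reduce(self, values):
        return list(next(self._pending))


class Worker:
    """Toy worker whose expensive setup is deferred to its first step."""

    def __init__(self):
        self.initialized = False

    def step(self, comm, grads):
        if not self.initialized:
            self.initialized = True  # stands in for lazy first-iteration setup
        return comm.all_reduce(grads)


def fake_all_reduce(values):
    # Hypothetical stand-in for a sum across two identical peers.
    return [2 * v for v in values]


# Record one valid iteration on a healthy worker.
recorder = RecordingComm(fake_all_reduce)
Worker().step(recorder, [1.0, 2.0])

# Replay it for a joiner in the sandbox; no live communicator is involved.
joiner = Worker()
replayed = joiner.step(ReplayComm(recorder.log), [0.0, 0.0])
print(joiner.initialized, replayed)
```

The joiner ends up initialized after a step that never touched a live communicator, which is the property the sandbox relies on. The replayed values only need to be valid enough to drive the first iteration. They are not the joiner's real gradients.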
Figure 7: Machine downtime with different models and parallel settings.
Adjacent paper text: Baseline. Our primary baseline is Megatron-LM, which includes built-in checkpointing for saving and loading models. We consider the following checkpointing approaches: Megatron-LM Per-iteration: A per-iteration checkpointing system that assumes checkpoint saving is cost-free and can always be overlapped within a single iteration [39]. It also as…
Figure 8: Downtime with different bandwidth per GPU.
Adjacent paper text: …all machines to reboot and retrieve checkpoints from remote storage, causing 138 seconds overhead. Migration at Scale. Figures 10a and 10b evaluate TrainMover’s performance on 3 to 8 AWS p4d.24xlarge instances during a training job. In each run, one machine is migrated to a new instance. Given the 40 GB memory of AWS p4d.24xlarge GPUs, half the capacity of our prim…
Figure 9: Downtime varying different migration scale.
Recoverable plot labels: panels (a) Without flag and (b) With flag; x-axis Total Machines (3 to 8); y-axis Downtime (s), 0 to 200; series TrainMover and Megatron-LM.
Figure 12: GPT-20B model with 20% slowdown starting at the 75th iteration.
Recoverable plot labels: x-axis Straggler Occurs at Iteration (0 to 100); y-axis Normalized Training Efficiency (0.5 to 0.9).
Figure 13: Straggler occurs at different iteration.
Recoverable plot labels: y-axis Downtime (s), 0 to 250; series TrainMover (w/ standby, hit), TrainMover (w/ standby, miss), TrainMover (w/o standby), Megatron-LM, Oobleck, and Parcae; annotation "Oobleck, Parcae Unsupported".
Figure 15: Network traffic and memory usage timeline during Migration.
Adjacent paper text: A. DETAIL MODEL PROFILE SETTING FOR EXPERIMENTS. Models are tested with various tensor parallel (TP), data parallel (DP), and pipeline parallel (PP) configurations, including (TP1, PP8, DP3), (TP4, PP8, DP3), and (TP8, PP8, DP3). Default profiles are: GPT-Medium and GPT-2.7B use TP1, PP8, DP3, a global batch size of 96, and a microbatch size of 2…
Figure 16: Time and memory cost timeline for different NCCL design decision.
Adjacent paper text: …from 71GB to 77GB, because many new NCCL groups (e.g., DP/TP/PP groups) must be initialized, and there are two sets of groups (frontend and backend) existing simultaneously. TrainMover’s NCCL design achieves zero memory overhead and only a small downtime around the 10th iteration, when migration completes. This is because the second stage o…
Original abstract

Large-scale ML training jobs are frequently interrupted by hardware and software anomalies, failures, and management events. Existing solutions like checkpoint-restart or runtime reconfiguration suffer from long downtimes and degraded performance. We present TrainMover, a resilient LLM training runtime that leverages elastic and standby machines to handle interruptions with minimal downtime and zero memory overhead. To achieve these goals, TrainMover introduces three key techniques: two-phase, delta-based communication group setup; communication-free sandboxed warmup; and general standby design that enables failure recovery from any role. Our evaluation shows that TrainMover consistently achieves around 20 seconds of downtime when handling various interruptions at the 1024-GPU scale. TrainMover is projected to reduce wasted GPU hours by 55% compared to the best alternative, saving 1.4 million GPU-hours per week at the 64K-GPU scale.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper presents TrainMover, a runtime for large-scale LLM training that handles hardware/software interruptions using elastic and standby machines. It introduces three techniques—two-phase delta-based communication group setup, communication-free sandboxed warmup, and general standby design for any-role recovery—to achieve minimal downtime and zero memory overhead. Evaluation claims consistent ~20s downtime at 1024-GPU scale for various interruptions, with a projection of 55% reduction in wasted GPU hours (saving 1.4M GPU-hours/week) at 64K-GPU scale compared to the best alternative.

Significance. If the empirical results and scaling projection hold, TrainMover could meaningfully reduce compute waste in production-scale training clusters by shortening recovery times from interruptions. The work's strength lies in its implemented system and concrete measurements rather than purely theoretical claims; however, the large-scale projection is central to its practical impact.

major comments (2)
  1. [Evaluation] Evaluation section: the 55% reduction and 1.4 million GPU-hour weekly savings at 64K-GPU scale are extrapolated from 1024-GPU experiments. No scaling curves, runs at intermediate or target scale, or analytic bounds on two-phase group setup / all-reduce / broadcast latency versus communicator size are provided, so the linear-extrapolation assumption for coordination costs remains untested and load-bearing for the headline claim.
  2. [Abstract] Abstract and Evaluation: concrete numbers (~20s downtime, 55% savings) are stated without visible experimental details on baselines, number of trials, variance, or error bars. This makes it impossible to assess whether the 20s figure is robust or sensitive to the tested interruption types/frequencies.
minor comments (2)
  1. [Design] The design assumes ready availability of elastic and standby machines; this premise should be explicitly stated as a scope limitation or validated under realistic cluster policies.
  2. [Techniques] Notation for the two-phase delta-based setup and sandboxed warmup could be clarified with a small diagram or pseudocode to aid reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for the constructive feedback on the evaluation and presentation of results. We address each major comment below and outline revisions to strengthen the manuscript's transparency regarding experimental details and scaling assumptions.

Point-by-point responses
  1. Referee: [Evaluation] Evaluation section: the 55% reduction and 1.4 million GPU-hour weekly savings at 64K-GPU scale are extrapolated from 1024-GPU experiments. No scaling curves, runs at intermediate or target scale, or analytic bounds on two-phase group setup / all-reduce / broadcast latency versus communicator size are provided, so the linear-extrapolation assumption for coordination costs remains untested and load-bearing for the headline claim.

    Authors: We agree that the 55% reduction and the associated savings projection at 64K-GPU scale rely on extrapolation from 1024-GPU results, without intermediate scaling data or explicit analytic bounds on the two-phase group setup and collective operation latencies. This assumption is indeed load-bearing for the headline impact claim. In the revised manuscript, we will add a new subsection discussing the scaling properties of the delta-based communication setup, including qualitative analysis and any available models for how coordination overhead grows with communicator size, along with explicit statements of the linear extrapolation assumptions used for the GPU-hour savings estimate. We note that full-scale experiments at 64K GPUs are not feasible with our resources. revision: partial

  2. Referee: [Abstract] Abstract and Evaluation: concrete numbers (~20s downtime, 55% savings) are stated without visible experimental details on baselines, number of trials, variance, or error bars. This makes it impossible to assess whether the 20s figure is robust or sensitive to the tested interruption types/frequencies.

    Authors: The Evaluation section details the baselines, interruption types, and scenarios used to obtain the ~20s downtime and 55% savings figures. To improve accessibility and allow assessment of robustness, we will revise both the abstract and Evaluation section to explicitly report the number of trials, include variance or error bars for the downtime measurements, and clarify sensitivity to interruption types. This will make the experimental support for the stated numbers more transparent without altering the core claims. revision: yes

Circularity Check

0 steps flagged

No circularity; empirical system evaluation with extrapolation

full rationale

The paper presents an implemented runtime (TrainMover) with three described techniques and reports direct empirical measurements of ~20s downtime at 1024-GPU scale. The 55% savings and 1.4M GPU-hour projection at 64K scale is an extrapolation from those measurements, but no derivation chain, equations, fitted parameters, or self-citations are shown that reduce the claims to inputs by construction. No self-definitional steps, fitted-input predictions, or load-bearing self-citations appear in the provided text. The work is self-contained as an engineering artifact whose claims rest on implementation and testing rather than any circular mathematical reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Systems paper with no mathematical free parameters or invented entities; relies on standard domain assumptions about cluster resources.

axioms (1)
  • domain assumption: Elastic and standby machines are available on demand in the target cluster environment.
    Central to the zero-overhead recovery design and the 20-second claim.



Reference graph

Works this paper leans on

44 extracted references · 44 canonical work pages · 4 internal anchors

  1. [1] Balance of power: A full-stack approach to power and thermal fluctuations in ML infrastructure. https://cloud.google.com/blog/topics/systems/mitigating-power-and-thermal-fluctuations-in-ml-infrastructure, 2024

  2. [2] Maintaining large-scale AI capacity at Meta. https://engineering.fb.com/2024/06/12/production-engineering/maintaining-large-scale-ai-capacity-meta/, 2024

  3. [3] Megatron-LM Github Repository. https://github.com/NVIDIA/Megatron-LM, 2024

  4. [4] Amazon EC2 Capacity Blocks for ML pricing. https://aws.amazon.com/ec2/capacityblocks/pricing/, 2025

  5. [5] Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin...

  6. [6] Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin...

  7. [7] Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, Parker Schuh, Kensen Shi, Sasha Tsvyashchenko, Joshua Maynez, Abhishek Rao, Parker Barnes, Yi Tay, Noam Shazeer, Vinodkumar Prabhakaran, Emily Reif, Nan Du, Ben Hutchinson, Reiner Pope, James Bradb...

  8. [8] Yangtao Deng, Xiang Shi, Zhuo Jiang, Xingjian Zhang, Lei Zhang, Zhang Zhang, Bo Li, Zuquan Song, Hang Zhu, Gaohong Liu, Fuliang Li, Shuguang Wang, Haibin Lin, Jianxi Ye, and Minlan Yu. Minder: Faulty machine detection for large-scale distributed model training, 2024

  9. [9] Jianbo Dong, Kun Qian, Pengcheng Zhang, Zhilong Zheng, Liang Chen, Fei Feng, Yikai Zhu, Gang Lu, Zhihui Ren, Xue Li, et al. Evolution of aegis: Fault diagnosis for ai model training cloud service in production (experience track)

  10. [10] Jiangfei Duan, Ziang Song, Xupeng Miao, Xiaoli Xi, Dahua Lin, Harry Xu, Minjia Zhang, and Zhihao Jia. Parcae: Proactive, Liveput-Optimized DNN training on preemptible instances. In 21st USENIX Symposium on Networked Systems Design and Implementation (NSDI 24), pages 1121–1139, Santa Clara, CA, April 2024. USENIX Association

  11. [11] Assaf Eisenman, Kiran Kumar Matam, Steven Ingram, Dheevatsa Mudigere, Raghuraman Krishnamoorthi, Krishnakumar Nair, Misha Smelyanskiy, and Murali Annavaram. Check-N-Run: a checkpointing system for training deep learning recommendation models. In 19th USENIX Symposium on Networked Systems Design and Implementation (NSDI 22), pages 929–943, Renton, WA,...

  12. [12] Swapnil Gandhi, Mark Zhao, Athinagoras Skiadopoulos, and Christos Kozyrakis. Recycle: Resilient training of large dnns using pipeline adaptation. In Proceedings of the ACM SIGOPS 30th Symposium on Operating Systems Principles, SOSP ’24, page 211–228, New York, NY, USA, 2024. Association for Computing Machinery

  13. [13] Hao Ge, Fangcheng Fu, Haoyang Li, Xuanyu Wang, Sheng Lin, Yujie Wang, Xiaonan Nie, Hailin Zhang, Xupeng Miao, and Bin Cui. Enabling parallelism hot switching for efficient training of large language models. In Proceedings of the ACM SIGOPS 30th Symposium on Operating Systems Principles, SOSP ’24, page 178–194, New York, NY, USA,

  14. [14] Association for Computing Machinery

  15. [15] Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, Amy Yang, Angela Fan, Anirudh Goyal, Anthony Hartshorn, Aobo Yang, Archi Mitra, Archie Sravankumar, Artem Korenev, Arthur Hinsvark, Arun Rao, Aston Zhang, Aurelien Rodriguez, Austen Gregerson, A...

  16. [16] Tanmaey Gupta, Sanjeev Krishnan, Rituraj Kumar, Abhishek Vijeev, Bhargav Gulavani, Nipun Kwatra, Ramachandran Ramjee, and Muthian Sivathanu. Just-in-time checkpointing: Low cost error recovery from deep learning training failures. In Proceedings of the Nineteenth European Conference on Computer Systems, EuroSys ’24, page 1110–1125, New York, NY, USA,...

  17. [17] Insu Jang, Zhenning Yang, Zhen Zhang, Xin Jin, and Mosharaf Chowdhury. Oobleck: Resilient distributed training of large models using pipeline templates. In Proceedings of the 29th Symposium on Operating Systems Principles, SOSP ’23, page 382–395, New York, NY, USA, 2023. Association for Computing Machinery

  18. [18] Ziheng Jiang, Haibin Lin, Yinmin Zhong, Qi Huang, Yangrui Chen, Zhi Zhang, Yanghua Peng, Xiang Li, Cong Xie, Shibiao Nong, Yulu Jia, Sun He, Hongmin Chen, Zhihao Bai, Qi Hou, Shipeng Yan, Ding Zhou, Yiyao Sheng, Zhuo Jiang, Haohan Xu, Haoran Wei, Zhang Zhang, Pengfei Nie, Leqi Zou, Sida Zhao, Liang Xiang, Zherui Liu, Zhe Li, Xiaoying Jia, Jianxi Ye, Xin J... MegaScale: Scaling large language model training to more than 10,000 GPUs

  19. [19] Apostolos Kokolis, Michael Kuchnik, John Hoffman, Adithya Kumar, Parth Malani, Faye Ma, Zachary DeVito, Shubho Sengupta, Kalyan Saladi, and Carole-Jean Wu. Revisiting reliability in large-scale machine learning research clusters, 2024

  20. [20] Oleksii Kuchaiev, Jason Li, Huyen Nguyen, Oleksii Hrinchuk, Ryan Leary, Boris Ginsburg, Samuel Kriman, Stanislav Beliaev, Vitaly Lavrukhin, Jack Cook, Patrice Castonguay, Mariya Popova, Jocelyn Huang, and Jonathan M. Cohen. Nemo: a toolkit for building ai applications using neural modules, 2019

  21. [21] Abhishek Vijaya Kumar, Arjun Devraj, Darius Bunandar, and Rachee Singh. A case for server-scale photonic connectivity. In Proceedings of the 23rd ACM Workshop on Hot Topics in Networks, HotNets ’24, page 290–299, New York, NY, USA, 2024. Association for Computing Machinery

  22. [22] Shenggui Li, Hongxin Liu, Zhengda Bian, Jiarui Fang, Haichen Huang, Yuliang Liu, Boxiang Wang, and Yang You. Colossal-ai: A unified deep learning system for large-scale parallel training. In Proceedings of the 52nd International Conference on Parallel Processing, ICPP ’23, page 766–775, New York, NY, USA, 2023. Association for Computing Machinery

  23. [23] Shengwei Li, Zhiquan Lai, Yanqi Hao, Weijie Liu, Keshi Ge, Xiaoge Deng, Dongsheng Li, and Kai Lu. Automated tensor model parallelism with overlapped communication for efficient foundation model training, 2023

  24. [24] Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. Pointer sentinel mixture models, 2016

  25. [25] Deepak Narayanan, Aaron Harlap, Amar Phanishayee, Vivek Seshadri, Nikhil R. Devanur, Gregory R. Ganger, Phillip B. Gibbons, and Matei Zaharia. Pipedream: Generalized pipeline parallelism for dnn training. In Proceedings of the 27th ACM Symposium on Operating Systems Principles, SOSP ’19, page 1–15, New York, NY, USA, 2019. Association for Computing Machinery

  26. [26] Deepak Narayanan, Mohammad Shoeybi, Jared Casper, Patrick LeGresley, Mostofa Patwary, Vijay Korthikanti, Dmitri Vainbrand, Prethvi Kashinkunti, Julie Bernauer, Bryan Catanzaro, Amar Phanishayee, and Matei Zaharia. Efficient large-scale language model training on gpu clusters using megatron-lm. In Proceedings of the International Conference for High Perf...

  27. [27] Maxim Naumov, Dheevatsa Mudigere, Hao-Jun Michael Shi, Jianyu Huang, Narayanan Sundaraman, Jongsoo Park, Xiaodong Wang, Udit Gupta, Carole-Jean Wu, Alisson G Azzolini, et al. Deep learning recommendation model for personalization and recommendation systems. arXiv preprint arXiv:1906.00091, 2019

  28. [28] Kun Qian, Yongqing Xi, Jiamin Cao, Jiaqi Gao, Yichi Xu, Yu Guan, Binzhang Fu, Xuemei Shi, Fangbo Zhu, Rui Miao, Chao Wang, Peng Wang, Pengcheng Zhang, Xianlong Zeng, Eddie Ruan, Zhiping Yao, Ennan Zhai, and Dennis Cai. Alibaba hpn: A data center network for large language model training. In Proceedings of the ACM SIGCOMM 2024 Conference, ACM SIGCOMM ’24, ...

  29. [29] Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever, et al. Improving language understanding by generative pre-training. 2018

  30. [30] Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. 2019

  31. [31] Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He. Zero: memory optimizations toward training trillion parameter models. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC ’20. IEEE Press, 2020

  32. [32] Samyam Rajbhandari, Olatunji Ruwase, Jeff Rasley, Shaden Smith, and Yuxiong He. Zero-infinity: breaking the gpu memory wall for extreme scale deep learning. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC ’21, New York, NY, USA, 2021. Association for Computing Machinery

  33. [33] Jeff Rasley, Samyam Rajbhandari, Olatunji Ruwase, and Yuxiong He. Deepspeed: System optimizations enable training deep learning models with over 100 billion parameters. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD ’20, page 3505–3506, New York, NY, USA, 2020. Association for Computing Machinery

  34. [34] Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. Megatron-lm: Training multi-billion parameter language models using model parallelism. arXiv preprint arXiv:1909.08053, 2019

  35. [35] John Thorpe, Pengzhan Zhao, Jonathan Eyolfson, Yifan Qiao, Zhihao Jia, Minjia Zhang, Ravi Netravali, and Guoqing Harry Xu. Bamboo: Making preemptible instances resilient for affordable training of large DNNs. In 20th USENIX Symposium on Networked Systems Design and Implementation (NSDI 23), pages 497–513, Boston, MA, April 2023. USENIX Association

  36. [36] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023

  37. [37] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023

  38. [38] Abhishek Verma, Luis Pedrosa, Madhukar Korupolu, David Oppenheimer, Eric Tune, and John Wilkes. Large-scale cluster management at google with borg. In Proceedings of the Tenth European Conference on Computer Systems, EuroSys ’15, New York, NY, USA, 2015. Association for Computing Machinery

  39. [39] Borui Wan, Mingji Han, Yiyao Sheng, Yanghua Peng, Haibin Lin, Mofan Zhang, Zhichao Lai, Menghan Yu, Junda Zhang, Zuquan Song, Xin Liu, and Chuan Wu. Bytecheckpoint: A unified checkpointing system for large foundation model development, 2024

  40. [40] Zhuang Wang, Zhen Jia, Shuai Zheng, Zhen Zhang, Xinwei Fu, T. S. Eugene Ng, and Yida Wang. Gemini: Fast failure recovery in distributed training with in-memory checkpoints. In Proceedings of the 29th Symposium on Operating Systems Principles, SOSP ’23, page 364–381, New York, NY, USA, 2023. Association for Computing Machinery

  41. [41] Tianyuan Wu, Wei Wang, Yinghao Yu, Siran Yang, Wenchao Wu, Qinkai Duan, Guodong Yang, Jiamang Wang, Lin Qu, and Liping Zhang. Falcon: Pinpointing and mitigating stragglers for large-scale hybrid-parallel training, 2024

  42. [42] Yongji Wu, Yechen Xu, Jingrong Chen, Zhaodong Wang, Ying Zhang, Matthew Lentz, and Danyang Zhuo. Mccs: A service-based approach to collective communication for multi-tenant cloud. In Proceedings of the ACM SIGCOMM 2024 Conference, ACM SIGCOMM ’24, page 679–690, New York, NY, USA, 2024. Association for Computing Machinery

  43. [43] Zhanghao Wu, Wei-Lin Chiang, Ziming Mao, Zongheng Yang, Eric Friedman, Scott Shenker, and Ion Stoica. Can’t be late: Optimizing spot instance savings under deadlines. In 21st USENIX Symposium on Networked Systems Design and Implementation (NSDI 24), pages 185–203, Santa Clara, CA, April 2024. USENIX Association

  44. [44] Xiaoyang Zhao, Zhe Zhang, and Chuan Wu. Adapcc: Making collective communication in distributed machine learning adaptive. In 2024 IEEE 44th International Conference on Distributed Computing Systems (ICDCS), pages 25–35, 2024.