TrainMover: An Interruption-Resilient Runtime for ML Training
Pith reviewed 2026-05-23 07:10 UTC · model grok-4.3
The pith
TrainMover recovers large-scale ML training from interruptions in about 20 seconds by using standby machines, with zero memory overhead.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
TrainMover achieves around 20 seconds of downtime when handling various interruptions at the 1024-GPU scale. It does so by combining elastic and standby machines with three techniques: two-phase, delta-based communication group setup; communication-free sandboxed warmup; and a general standby design that supports recovery from any role. The paper projects a 55 percent reduction in wasted GPU hours at larger scales.
What carries the argument
A general standby design that enables failure recovery from any role, backed by two-phase, delta-based communication group setup and communication-free sandboxed warmup.
If this is right
- Interruptions at 1024-GPU scale incur only around 20 seconds of downtime.
- Wasted GPU hours are projected to drop by 55 percent relative to the best alternative.
- At the 64K-GPU scale the system is projected to save 1.4 million GPU-hours per week.
- No extra memory is required on the active training nodes.
- The same standby pool works for hardware failures, software anomalies, and management events.
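As a rough sanity check on the projected figures in the list above, the sketch below works out what the two headline numbers imply when taken together. It assumes that the 1.4 million GPU-hours saved per week is exactly the 55 percent reduction, that "64K" means 65,536 GPUs, and that the cluster runs all 168 hours of the week. None of these assumptions is confirmed by the summary, and all variable names are illustrative.

```python
# Back-of-the-envelope check. Each constant is either quoted from the summary
# or an explicitly labeled assumption.
GPUS = 64 * 1024            # assumption: "64K" means 65,536 GPUs
HOURS_PER_WEEK = 7 * 24     # assumption: the cluster is in use around the clock
SAVED_GPU_HOURS = 1.4e6     # projected weekly savings, from the summary
REDUCTION = 0.55            # projected reduction in wasted GPU hours, from the summary

# Total GPU-hours the cluster offers in one week.
capacity = GPUS * HOURS_PER_WEEK

# Assumption: the savings equal 55% of the best alternative's weekly waste.
baseline_waste = SAVED_GPU_HOURS / REDUCTION
remaining_waste = baseline_waste - SAVED_GPU_HOURS

print(f"weekly capacity:         {capacity:,.0f} GPU-hours")
print(f"implied baseline waste:  {baseline_waste:,.0f} GPU-hours "
      f"({baseline_waste / capacity:.1%} of capacity)")
print(f"implied remaining waste: {remaining_waste:,.0f} GPU-hours "
      f"({remaining_waste / capacity:.1%} of capacity)")
```

Under these assumptions the best alternative would waste roughly 2.5 million GPU-hours per week, about 23 percent of capacity, and TrainMover would still waste roughly 10 percent. A reader can use this to judge whether the assumed interruption rate behind the projection is plausible.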
Where Pith is reading between the lines
- If clusters routinely provision extra standby capacity, operators could shift from over-provisioning for worst-case restarts to steady-state elastic pools.
- The communication-free warmup technique could reduce coordination overhead in other distributed workloads that need fast node replacement.
- At scales beyond 64K GPUs the weekly savings would grow linearly with cluster size if the per-interruption cost remains constant.
- Real-world traces that include rarer or correlated failures would test whether the 20-second bound holds outside the paper's controlled experiments.
Load-bearing premise
The design assumes elastic and standby machines are always available in the cluster and that real interruptions match the types and rates tested at 1024 GPUs.
What would settle it
A run at 1024 GPUs where TrainMover produces measured downtime well above 20 seconds on any of the interruption types evaluated in the paper would falsify the downtime claim.
Original abstract
Large-scale ML training jobs are frequently interrupted by hardware and software anomalies, failures, and management events. Existing solutions like checkpoint-restart or runtime reconfiguration suffer from long downtimes and degraded performance. We present TrainMover, a resilient LLM training runtime that leverages elastic and standby machines to handle interruptions with minimal downtime and zero memory overhead. To achieve these goals, TrainMover introduces three key techniques: two-phase, delta-based communication group setup; communication-free sandboxed warmup; and general standby design that enables failure recovery from any role. Our evaluation shows that TrainMover consistently achieves around 20 seconds of downtime when handling various interruptions at the 1024-GPU scale. TrainMover is projected to reduce wasted GPU hours by 55% compared to the best alternative, saving 1.4 million GPU-hours per week at the 64K-GPU scale.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents TrainMover, a runtime for large-scale LLM training that handles hardware/software interruptions using elastic and standby machines. It introduces three techniques—two-phase delta-based communication group setup, communication-free sandboxed warmup, and general standby design for any-role recovery—to achieve minimal downtime and zero memory overhead. Evaluation claims consistent ~20s downtime at 1024-GPU scale for various interruptions, with a projection of 55% reduction in wasted GPU hours (saving 1.4M GPU-hours/week) at 64K-GPU scale compared to the best alternative.
Significance. If the empirical results and scaling projection hold, TrainMover could meaningfully reduce compute waste in production-scale training clusters by shortening recovery times from interruptions. The work's strength lies in its implemented system and concrete measurements rather than purely theoretical claims; however, the large-scale projection is central to its practical impact.
major comments (2)
- [Evaluation] Evaluation section: the 55% reduction and 1.4 million GPU-hour weekly savings at 64K-GPU scale are extrapolated from 1024-GPU experiments. No scaling curves, runs at intermediate or target scale, or analytic bounds on two-phase group setup / all-reduce / broadcast latency versus communicator size are provided, so the linear-extrapolation assumption for coordination costs remains untested and load-bearing for the headline claim.
- [Abstract] Abstract and Evaluation: concrete numbers (~20s downtime, 55% savings) are stated without visible experimental details on baselines, number of trials, variance, or error bars. This makes it impossible to assess whether the 20s figure is robust or sensitive to the tested interruption types/frequencies.
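The first major comment can be made concrete with a toy sensitivity model. The sketch below is not from the paper: it assumes, purely for illustration, that per-interruption downtime grows by a fixed fraction `alpha` for every doubling of the GPU count beyond the measured 1024-GPU point. The function name, the model form, and the `alpha` values are all hypothetical.

```python
import math

REF_GPUS = 1024          # scale of the reported experiments
REF_DOWNTIME_S = 20.0    # "around 20 seconds", from the summary


def downtime_at_scale(gpus: int, alpha: float) -> float:
    """Hypothetical model: downtime grows by a fraction alpha per doubling of GPUs.

    alpha = 0 reproduces the constant-cost assumption behind the projection.
    """
    doublings = math.log2(gpus / REF_GPUS)
    return REF_DOWNTIME_S * (1.0 + alpha * doublings)


if __name__ == "__main__":
    target = 64 * 1024   # assumption: "64K" means 65,536 GPUs
    for alpha in (0.0, 0.1, 0.5):
        seconds = downtime_at_scale(target, alpha)
        print(f"alpha={alpha:.1f}: {seconds:.0f} s per interruption")
```

With `alpha = 0` the downtime stays at 20 seconds, while `alpha = 0.5` would quadruple it to 80 seconds at the target scale. Scaling curves or analytic bounds of the kind the referee requests would pin down which regime applies.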
minor comments (2)
- [Design] The design assumes ready availability of elastic and standby machines; this premise should be explicitly stated as a scope limitation or validated under realistic cluster policies.
- [Techniques] Notation for the two-phase delta-based setup and sandboxed warmup could be clarified with a small diagram or pseudocode to aid reproducibility.
Simulated Author's Rebuttal
Thank you for the constructive feedback on the evaluation and presentation of results. We address each major comment below and outline revisions to strengthen the manuscript's transparency regarding experimental details and scaling assumptions.
- Referee: [Evaluation] Evaluation section: the 55% reduction and 1.4 million GPU-hour weekly savings at 64K-GPU scale are extrapolated from 1024-GPU experiments. No scaling curves, runs at intermediate or target scale, or analytic bounds on two-phase group setup / all-reduce / broadcast latency versus communicator size are provided, so the linear-extrapolation assumption for coordination costs remains untested and load-bearing for the headline claim.
Authors: We agree that the 55% reduction and associated savings projection at 64K-GPU scale relies on extrapolation from 1024-GPU results without intermediate scaling data or explicit analytic bounds on the two-phase group setup and collective operation latencies. This assumption is indeed load-bearing for the headline impact claim. In the revised manuscript, we will add a new subsection discussing the scaling properties of the delta-based communication setup, including qualitative analysis and any available models for how coordination overhead grows with communicator size, along with explicit statements of the linear extrapolation assumptions used for the GPU-hour savings estimate. We note that full-scale experiments at 64K GPUs are not feasible with our resources.
revision: partial
- Referee: [Abstract] Abstract and Evaluation: concrete numbers (~20s downtime, 55% savings) are stated without visible experimental details on baselines, number of trials, variance, or error bars. This makes it impossible to assess whether the 20s figure is robust or sensitive to the tested interruption types/frequencies.
Authors: The Evaluation section details the baselines, interruption types, and scenarios used to obtain the ~20s downtime and 55% savings figures. To improve accessibility and allow assessment of robustness, we will revise both the abstract and Evaluation section to explicitly report the number of trials, include variance or error bars for the downtime measurements, and clarify sensitivity to interruption types. This will make the experimental support for the stated numbers more transparent without altering the core claims.
revision: yes
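The reporting promised in the second response (number of trials, variance, error bars) could take a shape like the following minimal sketch. The downtime samples are hypothetical placeholders, not measurements from the paper, and the 95 percent interval uses a normal approximation that understates the uncertainty for so few trials.

```python
import math
import statistics

# Hypothetical downtime samples in seconds; NOT measurements from the paper.
downtimes_s = [19.2, 20.5, 21.1, 18.7, 20.0, 22.3, 19.8, 20.4]

n = len(downtimes_s)
mean = statistics.mean(downtimes_s)
stdev = statistics.stdev(downtimes_s)        # sample standard deviation (n - 1)
half_width = 1.96 * stdev / math.sqrt(n)     # normal-approximation 95% interval

print(f"trials: {n}")
print(f"downtime: {mean:.2f} s +/- {half_width:.2f} s (95% CI, normal approx.)")
print(f"sample stdev: {stdev:.2f} s, max: {max(downtimes_s):.1f} s")
```

Reporting the maximum alongside the mean matters here because the falsification test hinges on whether any evaluated interruption type produces downtime well above 20 seconds.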
Circularity Check
No circularity; empirical system evaluation with extrapolation
The paper presents an implemented runtime (TrainMover) with three described techniques and reports direct empirical measurements of ~20s downtime at 1024-GPU scale. The 55% savings and 1.4M GPU-hour projection at 64K scale is an extrapolation from those measurements, but no derivation chain, equations, fitted parameters, or self-citations are shown that reduce the claims to inputs by construction. No self-definitional steps, fitted-input predictions, or load-bearing self-citations appear in the provided text. The work is self-contained as an engineering artifact whose claims rest on implementation and testing rather than any circular mathematical reduction.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Elastic and standby machines are available on demand in the target cluster environment.