
arxiv: 2605.22014 · v1 · pith:EWLGIN7Rnew · submitted 2026-05-21 · 💻 cs.DC

LiveR: Fine-Grained Elasticity via Live Reconfiguration for Model Training

Pith reviewed 2026-05-22 04:23 UTC · model grok-4.3

classification 💻 cs.DC
keywords: elastic training · live reconfiguration · LLM training · distributed systems · model parallelism · volatile resources · GPU clusters

The pith

LiveR replaces checkpoint restarts with live model state handoff to enable fast elasticity in large model training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large model training often runs on cheap but unstable GPU resources that can disappear with little notice, yet changing the number of workers currently requires stopping everything to save state and reload it. The paper shows that training can instead continue while a new configuration is prepared in the background and the model state is streamed directly to new workers. The state is adjusted on the fly to fit the new mix of parallelism settings before a quick switch happens. If this holds, jobs could use low-cost volatile capacity without losing most of their progress to long pauses.

Core claim

LiveR performs a live, bounded-memory handoff between mixed-parallel training worlds for elastic LLM training. While the current configuration keeps training, the system asynchronously prepares the target world, bootstraps added workers in isolation, streams model state directly over high-bandwidth links, and reshapes it online across the tensor, pipeline, and data parallel dimensions. A lightweight commit then switches execution to the new configuration without stop-and-restart.
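The control flow of such a handoff can be sketched in a few lines of Python. This is a toy model under stated assumptions, not LiveR's implementation: the class `ElasticTrainer`, its methods, and the dictionary used to stand in for a "world" are all hypothetical, and the background thread only imitates worker bootstrap and state streaming.

```python
import threading


class ElasticTrainer:
    """Toy model of a live handoff (all names are hypothetical).

    Training keeps running on the active world while a shadow world is
    prepared in the background; the commit is a reference swap that is
    applied only at an iteration boundary.
    """

    def __init__(self, world_size):
        self.active = {"world_size": world_size, "generation": 0}
        self.shadow = None
        self._ready = threading.Event()
        self.step = 0

    def _prepare_shadow(self, new_world_size):
        # Stand-in for bootstrapping new workers, building process
        # groups, and streaming resharded state to the target world.
        self.shadow = {
            "world_size": new_world_size,
            "generation": self.active["generation"] + 1,
        }
        self._ready.set()

    def request_resize(self, new_world_size):
        """Start background preparation and return the worker thread."""
        self._ready.clear()
        worker = threading.Thread(
            target=self._prepare_shadow, args=(new_world_size,)
        )
        worker.start()
        return worker

    def train_step(self):
        """Run one iteration; commit first if the shadow world is ready."""
        if self._ready.is_set() and self.shadow is not None:
            self.active, self.shadow = self.shadow, None
            self._ready.clear()
        self.step += 1
        return self.active["world_size"]
```

Calling `train_step()` before the background thread finishes simply continues on the old world, which is the property the core claim depends on: preparation stays off the critical path, and only the swap happens between iterations.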

What carries the argument

The live handoff that streams and reshapes model state across parallel dimensions while the original training continues.

If this is right

  • Reconfiguration time falls from minutes to seconds.
  • Reconfiguration runs 14 to 23 times faster than checkpoint and restart methods.
  • Steady-state overhead stays low.
  • Training goodput reaches up to 99 percent under volatile resource conditions.
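To see why seconds-scale reconfiguration matters for goodput, a back-of-the-envelope calculation helps. The sketch below assumes goodput is the fraction of wall-clock time spent on productive training, which may differ from the paper's exact definition, and every number in the usage note is hypothetical rather than taken from the paper.

```python
def goodput(window_seconds, num_reconfigs, downtime_per_reconfig):
    """Fraction of a time window spent training, assuming each
    reconfiguration costs a fixed downtime and nothing else is lost."""
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    lost = num_reconfigs * downtime_per_reconfig
    # Clamp at zero in case the downtime exceeds the window.
    return max(0.0, 1.0 - lost / window_seconds)
```

With six resize events in a one-hour window, `goodput(3600, 6, 5)` and `goodput(3600, 6, 90)` compare a hypothetical 5-second switch against a hypothetical 90-second restart; the gap between the two widens as resize events become more frequent.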

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same handoff idea could support dynamic scaling in other distributed workloads that move large state.
  • Cluster schedulers could trigger more frequent resource changes if live reconfiguration becomes reliable.
  • Lower-bandwidth networks might require extra buffering or compression to keep the approach viable.

Load-bearing premise

Model state can be streamed and reshaped to a new parallel setup without data corruption or loss of training correctness during the handoff.

What would settle it

Run repeated resource additions and removals during training and check whether each switch completes in seconds with no accuracy loss compared to a non-elastic run.
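One way to make that check concrete is to compare per-step loss traces from an elastic run and a static run. The helper names and the tolerance below are assumptions chosen for illustration; the paper's own validation procedure is not described on this page.

```python
def max_relative_gap(static_losses, elastic_losses):
    """Largest per-step relative difference between two loss traces."""
    if not static_losses or len(static_losses) != len(elastic_losses):
        raise ValueError("traces must be non-empty and cover the same steps")
    return max(
        abs(e - s) / max(abs(s), 1e-12)
        for s, e in zip(static_losses, elastic_losses)
    )


def traces_match(static_losses, elastic_losses, tol=1e-3):
    """True if the elastic run stays within `tol` of the static run."""
    return max_relative_gap(static_losses, elastic_losses) <= tol
```

A strict check would set `tol` to zero and demand bitwise-identical traces, but changing the parallel layout can reorder floating-point reductions, so a small tolerance is a reasonable default for a first pass.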

Figures

Figures reproduced from arXiv: 2605.22014 by Haoyuan Liu, Kairui Zhou, Qinwei Yang, Shengkai Lin, Shizhen Zhao, Shuyao Qi, Wei Zhang.

Figure 1: Sources of Resource Volatility. (a) Spot instance availability fluctuates rapidly; static jobs fail when capacity drops below requirements, while elastic jobs adapt. (b) Cluster schedulers leave fragmented idle resources; rigid gang-scheduled jobs cannot utilize them, whereas elastic jobs harvest scattered GPUs.
Figure 2: Reconfiguration sequence comparison. Stop-and-Restart (left) incurs minutes of downtime due to storage I/O and …
Figure 3: LiveR architecture overview. The foreground plane executes the training loop using the Active World’s process groups; the background plane handles Shadow World initialization and Streaming Resharding state transfer. The Atomic Switch is a sub-second metadata swap; no second model copy is ever instantiated.
Figure 4: Generation state machine for safe world transitions.
Figure 5: Intersection-based transfer planning for TP reshard
Figure 6: End-to-end evaluation on the 32-GPU A800 testbed.
Figure 7: Training efficiency across three volatility regimes
Figure 8: Cumulative wasted GPU-hours for a single training …
Figure 9: Training loss and gradient norm trace the static …
Figure 11: Simulated reconfiguration time for a 70B param …
Original abstract

To reduce user costs and maximize cluster utilization, large model training increasingly leverages volatile but inexpensive GPU capacity, such as spot instances and reclaimable resources in shared clusters. Yet, capitalizing on these economic benefits requires jobs to adapt within the short warning windows that many such environments provide. Existing elastic training systems still treat reconfiguration as stop-and-restart: they externalize distributed state through checkpoints, rebuild the distributed runtime on a new topology, and restart training, turning each resize event into a storage-heavy recovery procedure that incurs substantial downtime from checkpoint I/O, process restart, CUDA initialization, and communicator setup. We present LiveR, a live reconfiguration runtime for elastic LLM training that replaces storage-backed restart with a live, bounded-memory handoff between mixed-parallel training worlds. While the current world continues training, LiveR asynchronously prepares the target world, bootstraps newly added workers in isolation to keep heavyweight initialization off the critical path, and streams model state directly over high-bandwidth interconnects while reshaping it online across tensor, pipeline, and data parallel dimensions. Once the target world is ready, LiveR performs a lightweight commit that switches training to the new configuration without stop-and-restart on the live path. We implement LiveR atop Megatron-LM and PyTorch and evaluate it end-to-end on a multi-node GPU cluster. Across diverse reconfiguration scenarios, LiveR reduces downtime from minutes to seconds, accelerates reconfiguration by 14×-23× over checkpoint/restart baselines, incurs minimal steady-state overhead, and sustains up to 99% training goodput under volatile-resource conditions, making volatile low-cost GPU capacity far more practical for LLM training.
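The abstract's "reshaping it online" and the Figure 5 caption's "intersection-based transfer planning" suggest that each source shard is cut against each target shard and only the overlapping slices are sent. The sketch below illustrates that idea for one evenly split tensor dimension. It is a guess at the general technique, not the paper's algorithm, and `shard_range` and `plan_transfers` are hypothetical names.

```python
def shard_range(dim, parts, rank):
    """Half-open [start, stop) slice owned by `rank` when `dim`
    elements are split evenly across `parts` ranks."""
    if dim % parts != 0:
        raise ValueError("dim must divide evenly across ranks")
    size = dim // parts
    return rank * size, (rank + 1) * size


def plan_transfers(dim, old_tp, new_tp):
    """List (src_rank, dst_rank, start, stop) for every non-empty
    overlap between a source shard and a target shard."""
    plan = []
    for src in range(old_tp):
        s0, s1 = shard_range(dim, old_tp, src)
        for dst in range(new_tp):
            d0, d1 = shard_range(dim, new_tp, dst)
            lo, hi = max(s0, d0), min(s1, d1)
            if lo < hi:  # skip pairs whose slices do not overlap
                plan.append((src, dst, lo, hi))
    return plan
```

For example, `plan_transfers(12, 2, 3)` moves a 12-element dimension from two shards to three and yields four slices: `(0, 0, 0, 4)`, `(0, 1, 4, 6)`, `(1, 1, 6, 8)`, and `(1, 2, 8, 12)`. Every element appears in exactly one slice, so each target rank can assemble its shard without any rank holding the full tensor.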

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The manuscript presents LiveR, a live reconfiguration runtime for elastic LLM training on volatile GPU resources. It enables fine-grained elasticity by performing a live handoff of model state between mixed-parallel training worlds without stop-and-restart, using asynchronous preparation, isolated bootstrapping of new workers, and online streaming and reshaping of state across tensor, pipeline, and data parallel dimensions. End-to-end evaluation on a multi-node cluster demonstrates downtime reduction from minutes to seconds, 14×-23× acceleration over checkpoint/restart, minimal steady-state overhead, and up to 99% training goodput.

Significance. If the central claims hold, this work has high significance for distributed systems and ML training, as it addresses a key barrier to using cost-effective but volatile resources like spot instances for large-scale model training. The provision of an end-to-end implementation atop Megatron-LM and PyTorch with multi-node evaluation and quantified performance improvements strengthens the contribution.

major comments (1)
  1. [Abstract] The description of the live handoff mechanism lacks any mention of data integrity checks, such as checksums or validation steps after the commit, which is essential to substantiate the claim that the handoff preserves exact weights, optimizer state, and computation semantics without corruption or divergence. This is load-bearing for the reported goodput and correctness under reconfiguration.
minor comments (1)
  1. Consider adding a short sentence in the abstract or introduction clarifying the exact reconfiguration scenarios (e.g., specific changes in tensor/pipeline/data parallelism) used in the multi-node evaluation to improve reproducibility context.

Simulated Author's Rebuttal

1 response · 0 unresolved

We are grateful to the referee for highlighting the importance of data integrity in our live handoff mechanism. We respond to the major comment as follows and will update the manuscript to address the concern.

Point-by-point responses
  1. Referee: [Abstract] The description of the live handoff mechanism lacks any mention of data integrity checks, such as checksums or validation steps after the commit, which is essential to substantiate the claim that the handoff preserves exact weights, optimizer state, and computation semantics without corruption or divergence. This is load-bearing for the reported goodput and correctness under reconfiguration.

    Authors: We agree with the referee that the abstract would benefit from explicitly addressing data integrity to support our claims of exact state preservation. LiveR performs the handoff by streaming reshaped model state directly over the interconnect while the source world continues training. The target world is prepared asynchronously, and the commit is a lightweight switch after all state has been received. Since the transfer uses reliable, ordered delivery provided by the distributed runtime (PyTorch distributed with NCCL), bit-level integrity is maintained without explicit checksums in the current implementation. To make this clear, we will revise the abstract to note that the mechanism ensures preservation of exact weights, optimizer state, and semantics through direct streaming and post-reconfiguration synchronization. Additionally, we will add a short paragraph in the system design section describing the absence of corruption in our evaluations and the reliance on the underlying reliable transport.

    revision: yes
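The kind of post-commit validation the referee asks for could look like the sketch below, which hashes the logical tensor before and after resharding. According to the simulated rebuttal, the current implementation has no explicit checksums, so this is purely a hypothetical add-on; `digest_of_shards` is an invented helper, and plain Python lists stand in for GPU tensors.

```python
import hashlib
import struct


def digest_of_shards(shards):
    """SHA-256 over the logical tensor formed by concatenating shards
    in rank order. Each shard is a list of floats."""
    h = hashlib.sha256()
    for shard in shards:
        # Little-endian float64 with no padding, so feeding shards one
        # at a time hashes the same bytes as the concatenated tensor.
        h.update(struct.pack(f"<{len(shard)}d", *shard))
    return h.hexdigest()
```

Because the hash is fed incrementally, the digest depends only on the concatenated values and not on where the shard boundaries fall. A two-way split and a three-way split of the same weights therefore produce the same digest, while any changed value produces a different one.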

Circularity Check

0 steps flagged

No circularity: empirical systems evaluation with direct measurements

full rationale

This is an empirical systems paper describing the design, implementation, and benchmarking of LiveR for live reconfiguration in elastic LLM training. All central claims (seconds-scale downtime, 14-23x speedup over checkpoint/restart, 99% goodput) are obtained via direct wall-clock measurements on a multi-node GPU cluster using Megatron-LM and PyTorch. No mathematical derivations, equations, fitted parameters, or first-principles predictions exist that could reduce to inputs by construction. The paper contains no self-citation load-bearing steps, uniqueness theorems, or ansatzes; results follow from implementation details and experimental runs rather than any self-referential logic. The evaluation is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claims rest on assumptions about cluster hardware and the feasibility of online state reshaping rather than new mathematical axioms or fitted parameters.

axioms (1)
  • domain assumption: High-bandwidth interconnects allow direct streaming of model state with low latency and no corruption during online reshaping.
    Invoked to support the live handoff mechanism described in the abstract.
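How much this assumption matters can be gauged with a rough transfer-time estimate. The function below is a simple size-over-bandwidth calculation that ignores optimizer state, protocol overhead, and any overlap with training; the parameter names are hypothetical and the numbers in the usage note are illustrative, not measurements from the paper.

```python
def transfer_seconds(num_params, bytes_per_param, bandwidth_bytes_per_s):
    """Lower-bound time to move `num_params` values over a link with
    the given aggregate bandwidth."""
    if bandwidth_bytes_per_s <= 0:
        raise ValueError("bandwidth must be positive")
    return num_params * bytes_per_param / bandwidth_bytes_per_s
```

As an illustration, `transfer_seconds(70e9, 2, 100e9)` estimates the time to move 70 billion two-byte weights over an assumed 100 GB/s of aggregate bandwidth, and `transfer_seconds(70e9, 2, 10e9)` shows the same transfer at a tenth of that bandwidth. The tenfold difference is why slower networks might need the buffering or compression mentioned in the editorial extensions above.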

pith-pipeline@v0.9.0 · 5852 in / 1203 out tokens · 34973 ms · 2026-05-22T04:23:03.811875+00:00 · methodology

