LiveR: Fine-Grained Elasticity via Live Reconfiguration for Model Training
Pith reviewed 2026-05-22 04:23 UTC · model grok-4.3
The pith
LiveR replaces checkpoint restarts with live model state handoff to enable fast elasticity in large model training.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
LiveR performs a live bounded-memory handoff between mixed-parallel training worlds for elastic LLM training. While the current configuration keeps training, the system asynchronously prepares the target world, bootstraps added workers in isolation, streams model state directly over high-bandwidth links, and reshapes it online across tensor, pipeline, and data parallel dimensions before a lightweight commit switches execution without stop-and-restart.
What carries the argument
The live handoff that streams and reshapes model state across parallel dimensions while the original training continues.
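To make the "reshape online" step concrete, here is a minimal sketch of re-sharding one weight along a single tensor-parallel dimension. It is an illustration under stated assumptions, not LiveR's implementation: the function names (`shard_bounds`, `plan_transfers`, `reshard`) are hypothetical, a weight is modeled as a plain Python list of rows, shards are assumed contiguous and evenly divisible, and the network transfer is replaced by a local copy.

```python
def shard_bounds(num_rows, tp_degree):
    """Return [(start, end)] row ranges for an even, contiguous split."""
    if num_rows % tp_degree != 0:
        raise ValueError("num_rows must divide evenly by tp_degree")
    size = num_rows // tp_degree
    return [(rank * size, (rank + 1) * size) for rank in range(tp_degree)]


def plan_transfers(num_rows, src_tp, dst_tp):
    """List the (src_rank, dst_rank, start, end) row slices that must move."""
    plan = []
    for src, (s0, s1) in enumerate(shard_bounds(num_rows, src_tp)):
        for dst, (d0, d1) in enumerate(shard_bounds(num_rows, dst_tp)):
            lo, hi = max(s0, d0), min(s1, d1)
            if lo < hi:  # the two shards overlap on rows [lo, hi)
                plan.append((src, dst, lo, hi))
    return plan


def reshard(src_shards, dst_tp):
    """Rebuild the shards for a new tensor-parallel degree (local stand-in
    for the streamed transfer; assumes equally sized source shards)."""
    num_rows = sum(len(shard) for shard in src_shards)
    src_bounds = shard_bounds(num_rows, len(src_shards))
    dst_bounds = shard_bounds(num_rows, dst_tp)
    dst_shards = [[None] * (d1 - d0) for d0, d1 in dst_bounds]
    for src, dst, lo, hi in plan_transfers(num_rows, len(src_shards), dst_tp):
        s0 = src_bounds[src][0]
        d0 = dst_bounds[dst][0]
        # Copy the overlapping rows into their position in the target shard.
        dst_shards[dst][lo - d0:hi - d0] = src_shards[src][lo - s0:hi - s0]
    return dst_shards
```

For an 8-row weight moving from 2-way to 4-way tensor parallelism, `plan_transfers(8, 2, 4)` yields four slices, each source rank feeding two target ranks. A real system would also have to cover pipeline and data parallel dimensions and optimizer state, which this sketch leaves out.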
If this is right
- Reconfiguration time falls from minutes to seconds.
- Reconfiguration runs 14 to 23 times faster than checkpoint and restart methods.
- Steady-state overhead stays low.
- Training goodput reaches up to 99 percent under volatile resource conditions.
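A rough way to see how reconfiguration downtime feeds into goodput is the arithmetic below. The definition used (useful training time divided by wall-clock time) is a common one and is an assumption here, since this summary does not give the paper's exact definition. The function name and all numbers are hypothetical and are not results from the paper.

```python
def goodput(window_s, num_reconfigs, downtime_per_reconfig_s):
    """Fraction of a wall-clock window spent on useful training.

    Assumes the only lost time is the downtime of each reconfiguration.
    """
    lost_s = num_reconfigs * downtime_per_reconfig_s
    if lost_s > window_s:
        raise ValueError("downtime exceeds the observation window")
    return (window_s - lost_s) / window_s


# Hypothetical one-hour window with six resize events.
minutes_scale = goodput(3600, 6, 180)  # three minutes lost per event
seconds_scale = goodput(3600, 6, 10)   # ten seconds lost per event
```

With these made-up inputs, minutes-scale downtime leaves 70 percent of the hour for training, while seconds-scale downtime leaves a little over 98 percent. The gap widens as resize events become more frequent.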
Where Pith is reading between the lines
- The same handoff idea could support dynamic scaling in other distributed workloads that move large state.
- Cluster schedulers could trigger more frequent resource changes if live reconfiguration becomes reliable.
- Lower-bandwidth networks might require extra buffering or compression to keep the approach viable.
Load-bearing premise
Model state can be streamed and reshaped to a new parallel setup without data corruption or loss of training correctness during the handoff.
What would settle it
Run repeated resource additions and removals during training and check whether each switch completes in seconds with no accuracy loss compared to a non-elastic run.
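The check described above could be scripted roughly as follows. This is a sketch under assumptions: the function name, the loss tolerance, and the switch-time threshold are all hypothetical, and per-step losses from the non-elastic and elastic runs are assumed to be logged at the same steps.

```python
def check_elastic_run(baseline_losses, elastic_losses, switch_seconds,
                      loss_tol=1e-6, max_switch_s=10.0):
    """Compare an elastic run against a non-elastic baseline.

    loss_tol and max_switch_s are illustrative thresholds, not values
    taken from the paper.
    """
    if len(baseline_losses) != len(elastic_losses):
        raise ValueError("runs must log the same number of steps")
    # Largest per-step divergence between the two loss curves.
    worst_gap = max(
        (abs(b - e) for b, e in zip(baseline_losses, elastic_losses)),
        default=0.0,
    )
    # Indices of reconfigurations that took longer than the threshold.
    slow = [i for i, s in enumerate(switch_seconds) if s > max_switch_s]
    return {
        "worst_loss_gap": worst_gap,
        "slow_switches": slow,
        "passed": worst_gap <= loss_tol and not slow,
    }
```

An exact-match tolerance is only meaningful if the run is otherwise deterministic. Changing the parallel layout can reorder floating-point reductions, so a looser tolerance may be the more realistic choice.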
Original abstract
To reduce user costs and maximize cluster utilization, large model training increasingly leverages volatile but inexpensive GPU capacity, such as spot instances and reclaimable resources in shared clusters. Yet, capitalizing on these economic benefits requires jobs to adapt within the short warning windows that many such environments provide. Existing elastic training systems still treat reconfiguration as stop-and-restart: they externalize distributed state through checkpoints, rebuild the distributed runtime on a new topology, and restart training, turning each resize event into a storage-heavy recovery procedure that incurs substantial downtime from checkpoint I/O, process restart, CUDA initialization, and communicator setup. We present LiveR, a live reconfiguration runtime for elastic LLM training that replaces storage-backed restart with a live, bounded-memory handoff between mixed-parallel training worlds. While the current world continues training, LiveR asynchronously prepares the target world, bootstraps newly added workers in isolation to keep heavyweight initialization off the critical path, and streams model state directly over high-bandwidth interconnects while reshaping it online across tensor, pipeline, and data parallel dimensions. Once the target world is ready, LiveR performs a lightweight commit that switches training to the new configuration without stop-and-restart on the live path. We implement LiveR atop Megatron-LM and PyTorch and evaluate it end-to-end on a multi-node GPU cluster. Across diverse reconfiguration scenarios, LiveR reduces downtime from minutes to seconds, accelerates reconfiguration by 14×-23× over checkpoint/restart baselines, incurs minimal steady-state overhead, and sustains up to 99% training goodput under volatile-resource conditions, making volatile low-cost GPU capacity far more practical for LLM training.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents LiveR, a live reconfiguration runtime for elastic LLM training on volatile GPU resources. It enables fine-grained elasticity by performing a live handoff of model state between mixed-parallel training worlds without stop-and-restart, using asynchronous preparation, isolated bootstrapping of new workers, and online streaming and reshaping of state across tensor, pipeline, and data parallel dimensions. End-to-end evaluation on a multi-node cluster demonstrates downtime reduction from minutes to seconds, 14×-23× acceleration over checkpoint/restart, minimal steady-state overhead, and up to 99% training goodput.
Significance. If the central claims hold, this work has high significance for distributed systems and ML training, as it addresses a key barrier to using cost-effective but volatile resources like spot instances for large-scale model training. The provision of an end-to-end implementation atop Megatron-LM and PyTorch with multi-node evaluation and quantified performance improvements strengthens the contribution.
major comments (1)
- [Abstract] The description of the live handoff mechanism lacks any mention of data integrity checks, such as checksums or validation steps after the commit, which are essential to substantiate the claim that the handoff preserves exact weights, optimizer state, and computation semantics without corruption or divergence. This is load-bearing for the reported goodput and correctness under reconfiguration.
minor comments (1)
- Consider adding a short sentence in the abstract or introduction clarifying the exact reconfiguration scenarios (e.g., specific changes in tensor/pipeline/data parallelism) used in the multi-node evaluation to improve reproducibility context.
Simulated Author's Rebuttal
We are grateful to the referee for highlighting the importance of data integrity in our live handoff mechanism. We respond to the major comment as follows and will update the manuscript to address the concern.
- Referee: [Abstract] The description of the live handoff mechanism lacks any mention of data integrity checks, such as checksums or validation steps after the commit, which are essential to substantiate the claim that the handoff preserves exact weights, optimizer state, and computation semantics without corruption or divergence. This is load-bearing for the reported goodput and correctness under reconfiguration.
Authors: We agree with the referee that the abstract would benefit from explicitly addressing data integrity to support our claims of exact state preservation. LiveR performs the handoff by streaming reshaped model state directly over the interconnect while the source world continues training. The target world is prepared asynchronously, and the commit is a lightweight switch after all state has been received. Since the transfer uses reliable, ordered delivery provided by the distributed runtime (PyTorch distributed with NCCL), bit-level integrity is maintained without explicit checksums in the current implementation. To make this clear, we will revise the abstract to note that the mechanism ensures preservation of exact weights, optimizer state, and semantics through direct streaming and post-reconfiguration synchronization. Additionally, we will add a short paragraph in the system design section describing the absence of corruption in our evaluations and the reliance on the underlying reliable transport.
Revision: yes
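For context, an explicit integrity check of the kind the referee asks for might look like the sketch below. This is not part of LiveR: the authors state that the current implementation has no explicit checksums. The helper names (`pack_slice`, `tag_slice`, `verify_slice`) are hypothetical, the choice of SHA-256 is an assumption, and a transferred state slice is modeled as float32 values serialized to bytes.

```python
import hashlib
import struct


def pack_slice(values):
    """Serialize a slice of floats as little-endian float32 bytes."""
    return struct.pack(f"<{len(values)}f", *values)


def tag_slice(payload):
    """Sender side: pair the payload with its SHA-256 digest."""
    return payload, hashlib.sha256(payload).hexdigest()


def verify_slice(payload, expected_digest):
    """Receiver side: recompute the digest and reject mismatches."""
    actual = hashlib.sha256(payload).hexdigest()
    if actual != expected_digest:
        raise ValueError("state slice failed integrity check")
    return payload
```

Digests are computed per transferred slice rather than per source shard, because the target layout differs from the source layout after reshaping. Hashing every slice adds compute to the transfer path, so whether it fits inside a short warning window would need to be measured.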
Circularity Check
No circularity: empirical systems evaluation with direct measurements
This is an empirical systems paper describing the design, implementation, and benchmarking of LiveR for live reconfiguration in elastic LLM training. All central claims (seconds-scale downtime, 14-23x speedup over checkpoint/restart, 99% goodput) are obtained via direct wall-clock measurements on a multi-node GPU cluster using Megatron-LM and PyTorch. No mathematical derivations, equations, fitted parameters, or first-principles predictions exist that could reduce to inputs by construction. The paper contains no self-citation load-bearing steps, uniqueness theorems, or ansatzes; results follow from implementation details and experimental runs rather than any self-referential logic. The evaluation is self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: High-bandwidth interconnects allow direct streaming of model state with low latency and no corruption during online reshaping.
A Appendix
This appendi...
- The staging buffer B, allocated once before the loop (Algorithm 1, line 13) and reused across all layers.
- The assembled shard for layer ℓ, which is written directly into the pre-allocated parameter storage of the new configuration (not additional memory, since this storage is required for training regardless).
On a source rank, no additional memory is allocated: the source reads slices from existing parameter storage and sends them via ISend (lines 9–10), wh...
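The bounded-memory pattern described in this fragment can be sketched as follows. Algorithm 1 is not reproduced in this excerpt, so this is an illustration under assumptions rather than the paper's algorithm: the function name `handoff_layers` is hypothetical, parameter storage is modeled as pre-allocated `bytearray` objects, and the ISend/receive pair is replaced by a plain byte copy into the staging buffer.

```python
def handoff_layers(incoming, param_storage, buffer_size):
    """Assemble each layer's shard through one reusable staging buffer.

    incoming: per layer, a list of (offset, payload_bytes) slices.
    param_storage: per layer, a pre-allocated bytearray for the new layout.
    Returns the largest number of bytes staged at any one time.
    """
    staging = bytearray(buffer_size)  # allocated once, before the loop
    view = memoryview(staging)
    peak = 0
    for layer, slices in enumerate(incoming):
        target = param_storage[layer]
        for offset, payload in slices:
            n = len(payload)
            if n > buffer_size:
                raise ValueError("slice larger than staging buffer")
            if offset + n > len(target):
                raise ValueError("slice does not fit parameter storage")
            view[:n] = payload                    # stand-in for a receive
            target[offset:offset + n] = view[:n]  # write into final storage
            peak = max(peak, n)
    return peak
```

Because the buffer is reused for every layer, extra memory on a receiving rank is bounded by `buffer_size` regardless of model depth. The assembled shards land in storage that training needs anyway.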