TrainMover: An Interruption-Resilient Runtime for ML Training

Aditya Akella; ChonLam Lao; Dennis Cai; Ennan Zhai; Jiamin Cao; Jiangfei Duan; Jiaqi Gao; Jingren Zhou; Minlan Yu; Pengcheng Zhang

arxiv: 2412.12636 · v3 · pith:2OKGTL5Hnew · submitted 2024-12-17 · 💻 cs.DC · cs.AI · cs.LG · cs.PF


ChonLam Lao, Jiaqi Gao, Jiamin Cao, Zhipeng Zhang, Pengcheng Zhang, Jiangfei Duan, Zhilong Zheng, Yu Guan, Yichi Xu, Yong Li, Zhengping Qian, Aditya Akella, Minlan Yu, Ennan Zhai, Dennis Cai, Jingren Zhou


classification 💻 cs.DC · cs.AI · cs.LG · cs.PF

keywords trainmover · runtime · training · downtime · interruptions · scale · standby · achieve



Large-scale ML training jobs are frequently interrupted by hardware and software anomalies, failures, and management events. Existing solutions like checkpoint-restart or runtime reconfiguration suffer from long downtimes and degraded performance. We present TrainMover, a resilient LLM training runtime that leverages elastic and standby machines to handle interruptions with minimal downtime and zero memory overhead. To achieve these goals, TrainMover introduces three key techniques: two-phase, delta-based communication group setup; communication-free sandboxed warmup; and general standby design that enables failure recovery from any role. Our evaluation shows that TrainMover consistently achieves around 20 seconds of downtime when handling various interruptions at the 1024-GPU scale. TrainMover is projected to reduce wasted GPU hours by 55% compared to the best alternative, saving 1.4 million GPU-hours per week at the 64K-GPU scale.
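The two projected figures in the abstract (a 55% reduction and 1.4 million GPU-hours saved per week) can be cross-checked with a rough back-of-envelope calculation. The sketch below is not from the paper: it assumes "64K" means 65,536 GPUs, that the cluster runs all 168 hours of the week, and that the 1.4 million GPU-hours saved is exactly the 55% reduction. All names are hypothetical.

```python
# Back-of-envelope check of the abstract's projection.
# Assumptions (not stated in the source): "64K" = 65,536 GPUs,
# a full 168-hour week, and saved hours == 55% of the baseline waste.

GPUS = 64 * 1024            # assumed reading of "64K-GPU scale"
HOURS_PER_WEEK = 7 * 24     # 168 hours
SAVED_GPU_HOURS = 1.4e6     # from the abstract
REDUCTION = 0.55            # from the abstract


def implied_waste(saved: float, reduction: float) -> tuple[float, float]:
    """Return (baseline_wasted, remaining_wasted) implied by a relative saving."""
    baseline = saved / reduction    # waste under the best alternative
    remaining = baseline - saved    # waste left after the reduction
    return baseline, remaining


total_gpu_hours = GPUS * HOURS_PER_WEEK
baseline_wasted, remaining_wasted = implied_waste(SAVED_GPU_HOURS, REDUCTION)

print(f"total GPU-hours per week:       {total_gpu_hours:,}")
print(f"implied baseline waste:         {baseline_wasted:,.0f}")
print(f"implied remaining waste:        {remaining_wasted:,.0f}")
print(f"baseline waste share of total:  {baseline_wasted / total_gpu_hours:.1%}")
```

Under these assumptions, the best alternative would waste roughly 2.5 million GPU-hours per week, a little under a quarter of the cluster's weekly capacity. These are inferences from the abstract's two numbers, not figures reported by the authors.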
