arxiv: 2412.12636 · v3 · pith:2OKGTL5Hnew · submitted 2024-12-17 · 💻 cs.DC · cs.AI · cs.LG · cs.PF

TrainMover: An Interruption-Resilient Runtime for ML Training

classification 💻 cs.DC · cs.AI · cs.LG · cs.PF
keywords trainmover · runtime · training · downtime · interruptions · scale · standby · achieve

Large-scale ML training jobs are frequently interrupted by hardware and software anomalies, failures, and management events. Existing solutions like checkpoint-restart or runtime reconfiguration suffer from long downtimes and degraded performance. We present TrainMover, a resilient LLM training runtime that leverages elastic and standby machines to handle interruptions with minimal downtime and zero memory overhead. To achieve these goals, TrainMover introduces three key techniques: two-phase, delta-based communication group setup; communication-free sandboxed warmup; and general standby design that enables failure recovery from any role. Our evaluation shows that TrainMover consistently achieves around 20 seconds of downtime when handling various interruptions at the 1024-GPU scale. TrainMover is projected to reduce wasted GPU hours by 55% compared to the best alternative, saving 1.4 million GPU-hours per week at the 64K-GPU scale.
