DeadPool achieves zero-overhead checkpointing during error-free LLM training and hot-swapping recovery in under 40 seconds by replacing failed nodes without terminating the job.
Analysis of large-scale multi-tenant gpu clusters for dnn training workloads,
1 Pith paper cite this work. Polarity classification is still indexing.
1
Pith paper citing it
fields
cs.LG 1years
2026 1verdicts
UNVERDICTED 1representative citing papers
citing papers explorer
-
DeadPool: Resilient LLM Training with Hot-Swapping via Zero-Overhead Checkpoint
DeadPool achieves zero-overhead checkpointing during error-free LLM training and hot-swapping recovery in under 40 seconds by replacing failed nodes without terminating the job.