Oobleck: Resilient distributed training of large models using pipeline templates
1 Pith paper cites this work. Polarity classification is still indexing.
Citation-role summary: background 1
Citation-polarity summary
Fields: cs.DC 1
Years: 2026 1
Verdicts: UNVERDICTED 1
Roles: background 1
Polarities: background 1
Representative citing papers:
From Detection to Recovery: Operational Analysis on LLM Pre-training with 504 GPUs
A production-scale empirical study of a 63-node, 504-GPU cluster reports that failure detection needs multiple signals, checkpoint bandwidth utilization is low, node exclusions are heavy-tailed, and auto-retry chains achieve 2.7x higher success.