Production logs from a 504-GPU LLM training cluster show 100% failure detection via multi-metric analysis, NFS saturation limiting bandwidth to 1.4-10.4% of link speed, and auto-retry achieving 33.3% success versus 12.5% manual.
Topology-aware gpu scheduling for learning workloads in cloud environments
1 Pith paper cite this work. Polarity classification is still indexing.
1
Pith paper citing it
citation-role summary
background 1
citation-polarity summary
fields
cs.DC 1years
2026 1verdicts
UNVERDICTED 1roles
background 1polarities
background 1representative citing papers
citing papers explorer
-
From Detection to Recovery: Operational Analysis on LLM Pre-training with 504 GPUs
Production logs from a 504-GPU LLM training cluster show 100% failure detection via multi-metric analysis, NFS saturation limiting bandwidth to 1.4-10.4% of link speed, and auto-retry achieving 33.3% success versus 12.5% manual.