PRISM: Probabilistic Runtime Insights and Scalable Performance Modeling for Large-Scale Distributed Training
1 Pith paper cites this work. Polarity classification is still being indexed.
abstract
Large model training beyond tens of thousands of GPUs is uncharted territory. At such scales, disruptions to the training process are not a matter of if but of when: a stochastic process that degrades training productivity. Dynamic runtime variation will become increasingly frequent as training scales up and as GPUs are operated in increasingly power-limited and thermally stressed environments. At the 64,000+ GPU scale, we already observe 9% GPU time variability for frontier foundation model training. Motivated by our analysis and the large design space around performance variability, we present PRISM, a performance modeling framework that captures the stochastic nature of large-scale distributed training. The core of PRISM is a statistical method that quantifies probabilistic guarantees on training time. Using PRISM, we explore the design and optimization space of distributed training, enabling principled, variability-aware decisions that improve performance and system efficiency at scale.
fields
cs.DC (1)
years
2026 (1)
verdicts
UNVERDICTED (1)
representative citing papers
citing papers explorer
- ScaleAcross Explorer: Exploring Communication Optimization for Scale-Across AI Model Training
ScaleAcross Explorer jointly optimizes three design dimensions for scale-across training and reports up to 64.62% speedups over production baselines and 37.59% over prior art in testbed and simulation experiments.