PRISM: Probabilistic Runtime Insights and Scalable Performance Modeling for Large-Scale Distributed Training

2025 · cs.DC · arXiv 2510.15596

1 Pith paper cites this work. Polarity classification is still indexing.

abstract

Large model training beyond tens of thousands of GPUs is uncharted territory. At such scales, disruptions to the training process are not a matter of if, but a matter of when -- a stochastic process degrading training productivity. Dynamic runtime variation will become increasingly frequent as training scales up and as GPUs are operated in increasingly power-limited and thermally stressed environments. At the 64,000+ GPU scale, we already observe 9% GPU time variability for frontier foundation model training. Motivated by our analysis and the large design space around performance variability, we present PRISM -- a performance modeling framework that captures the stochastic nature of large-scale distributed training. The core of PRISM is a statistical method that quantifies probabilistic guarantees on training time. Using PRISM, we explore the design and optimization space of distributed training, enabling principled, variability-aware decisions that improve performance and system efficiency at scale.
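
To make "probabilistic guarantees on training time" concrete, here is a minimal Monte Carlo sketch. It is not PRISM's statistical method, which the abstract does not describe. It is a toy model under stated assumptions: training is synchronous, so each step takes as long as its slowest worker, and each worker's slowdown is drawn from a half-normal distribution. The names `simulate_training_time` and `time_bound`, and every parameter value, are hypothetical.

```python
import math
import random


def simulate_training_time(num_workers, num_steps, base_step_s, jitter, rng):
    """Simulate one run; each synchronous step waits for its slowest worker.

    Assumption: a worker's step time is base_step_s * (1 + |N(0, jitter)|),
    i.e. workers can only be slower than nominal, never faster.
    """
    total = 0.0
    for _ in range(num_steps):
        slowest = max(
            base_step_s * (1.0 + abs(rng.gauss(0.0, jitter)))
            for _ in range(num_workers)
        )
        total += slowest
    return total


def time_bound(samples, confidence):
    """Return the smallest sampled time t with empirical P(T <= t) >= confidence."""
    ordered = sorted(samples)
    index = max(0, math.ceil(confidence * len(ordered)) - 1)
    return ordered[index]


rng = random.Random(0)  # fixed seed so the sketch is reproducible
runs = [
    simulate_training_time(
        num_workers=32, num_steps=100, base_step_s=1.0, jitter=0.05, rng=rng
    )
    for _ in range(200)
]

p50 = time_bound(runs, 0.50)
p95 = time_bound(runs, 0.95)
print(f"median simulated time: {p50:.2f} s")
print(f"95% of simulated runs finish within: {p95:.2f} s")
```

In this toy model the bound is an empirical quantile over simulated runs. Raising `num_workers` increases the expected per-step maximum, which is one simple way to see why variability matters more as training scales up.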

representative citing papers

ScaleAcross Explorer: Exploring Communication Optimization for Scale-Across AI Model Training

cs.DC · 2026-05-23 · unverdicted · novelty 4.0

ScaleAcross Explorer jointly optimizes three design dimensions for scale-across training and reports up to 64.62% speedups over production baselines and 37.59% over prior art in testbed and simulation experiments.

citing papers explorer

ScaleAcross Explorer: Exploring Communication Optimization for Scale-Across AI Model Training

cs.DC · 2026-05-23 · unverdicted · none · ref 9 · internal anchor
