
arxiv: 2510.15596 · v2 · submitted 2025-10-17 · 💻 cs.DC

PRISM: Probabilistic Runtime Insights and Scalable Performance Modeling for Large-Scale Distributed Training

Pith reviewed 2026-05-18 06:24 UTC · model grok-4.3

classification 💻 cs.DC
keywords distributed training · performance modeling · probabilistic guarantees · runtime variability · large-scale systems · stochastic processes · GPU training · system efficiency

The pith

PRISM supplies a statistical method to quantify probabilistic guarantees on training duration for distributed systems at 64,000+ GPU scales.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large model training at extreme scales turns runtime disruptions into a regular stochastic process rather than rare events, with 9% GPU time variability already measured at 64,000 GPUs. PRISM introduces a performance modeling framework whose central statistical technique captures this variability and produces explicit probability statements about how long a full training run will take. The work matters because it replaces reliance on average-case estimates with variability-aware analysis that supports concrete choices about hardware configurations, power limits, and thermal management. A reader following the argument would conclude that design decisions can now incorporate quantified risk of delays instead of treating them as unmanageable noise.

Core claim

PRISM is a performance modeling framework that captures the stochastic nature of large-scale distributed training through a statistical method that quantifies probabilistic guarantees on training time. At the 64,000+ GPU scale the authors observe 9% GPU time variability, and they use the framework to explore the design and optimization space, enabling principled, variability-aware decisions that improve performance and system efficiency.

What carries the argument

A statistical method inside PRISM that treats runtime variations as a stochastic process and computes the probability that training completes within a given duration.
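
The abstract does not name the statistical method, so the following is only a minimal Monte Carlo sketch of the general idea, not PRISM's technique. It assumes, purely for illustration, that each training step takes a nominal time multiplied by an independent lognormal slowdown factor. The function names, the lognormal choice, and every parameter value are hypothetical.

```python
import random


def simulate_total_time(num_steps, base_step_s, sigma, rng):
    """Return one sampled run time: the sum of noisy per-step times.

    Assumption (not from the paper): each step's slowdown factor is an
    independent draw from a lognormal distribution with log-scale sigma.
    """
    return sum(base_step_s * rng.lognormvariate(0.0, sigma)
               for _ in range(num_steps))


def completion_probability(deadline_s, num_steps, base_step_s, sigma,
                           trials=500, seed=0):
    """Estimate P(total run time <= deadline_s) by Monte Carlo sampling."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        total = simulate_total_time(num_steps, base_step_s, sigma, rng)
        if total <= deadline_s:
            hits += 1
    return hits / trials


# Hypothetical numbers: 200 steps of nominally 10 s each.
for deadline in (2000.0, 2010.0, 2030.0):
    p = completion_probability(deadline, num_steps=200,
                               base_step_s=10.0, sigma=0.09)
    print(f"P(finish within {deadline:.0f} s) ~ {p:.2f}")
```

Sweeping the deadline traces out an empirical distribution of completion time. A realistic model would replace the independent lognormal draws with distributions fitted to traces and would account for correlated delays.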

If this is right

  • Designers can rank system configurations by the probability they meet a target training deadline rather than by average speed alone.
  • Power and thermal limits can be set with explicit risk budgets for unexpected slowdowns.
  • Optimization loops gain the ability to trade peak throughput against reduced tail latency in completion times.
  • Productivity forecasts for frontier models include confidence bands instead of single-point estimates.
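
The first consequence above can be illustrated with a small sketch that ranks configurations by their empirical probability of meeting a deadline rather than by mean completion time. The configuration names and the completion-time samples are invented for illustration and do not come from the paper.

```python
from statistics import mean


def on_time_probability(samples, deadline):
    """Fraction of sampled completion times that meet the deadline."""
    return sum(t <= deadline for t in samples) / len(samples)


def rank_configs(samples_by_config, deadline):
    """Order configuration names from most to least likely to finish on time."""
    return sorted(
        samples_by_config,
        key=lambda name: on_time_probability(samples_by_config[name], deadline),
        reverse=True,
    )


# Hypothetical completion-time samples, in days, for two configurations.
samples = {
    "fast_but_spiky": [28, 29, 29, 30, 38, 40],
    "slower_but_steady": [32, 32, 33, 33, 34, 34],
}

deadline_days = 35
best_by_mean = min(samples, key=lambda name: mean(samples[name]))
best_by_risk = rank_configs(samples, deadline_days)[0]
print("lowest mean time:", best_by_mean)
print("most likely on time:", best_by_risk)
```

In this toy data the configuration with the lower mean misses the deadline in two of six samples, so the two ranking criteria disagree.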

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Hardware procurement and cluster sizing decisions may begin to require variability statistics alongside peak FLOPS ratings.
  • Resource schedulers could adopt similar probabilistic models to allocate jobs with known risk of overrunning their time slots.
  • The same statistical lens might extend to other variable large-scale workloads such as scientific simulations on heterogeneous clusters.

Load-bearing premise

The 9% GPU time variability measured at 64,000 GPUs remains representative of the underlying stochastic process at still larger scales and under different power or thermal constraints.

What would settle it

Collect training-time distributions from a run at 128,000 GPUs or in a distinctly more power-limited regime and test whether the observed probabilities fall inside the intervals predicted by a model fitted only on the 64,000-GPU data.
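
One way such a test could be scored is a coverage check: build a nominal 90% interval from the fitting data only, then measure how often held-out observations land inside it. The sketch below uses synthetic Gaussian data as a stand-in for real traces; the variable names, distributions, and spreads are assumptions chosen for illustration, not measurements.

```python
import random
from statistics import quantiles


def central_interval(fit_samples):
    """Empirical 5th-to-95th percentile interval of the fitting data."""
    cuts = quantiles(fit_samples, n=20)  # 19 cut points: 5%, 10%, ..., 95%
    return cuts[0], cuts[-1]


def coverage(interval, held_out):
    """Fraction of held-out observations that fall inside the interval."""
    lo, hi = interval
    return sum(lo <= x <= hi for x in held_out) / len(held_out)


rng = random.Random(0)
# Synthetic stand-ins for normalized step times (hypothetical values).
fit_data = [rng.gauss(1.0, 0.05) for _ in range(2000)]      # fitting regime
same_regime = [rng.gauss(1.0, 0.05) for _ in range(2000)]   # same process
wider_regime = [rng.gauss(1.0, 0.10) for _ in range(2000)]  # more variable

interval = central_interval(fit_data)
cov_same = coverage(interval, same_regime)
cov_wider = coverage(interval, wider_regime)
print(f"coverage, same regime:  {cov_same:.2f}")
print(f"coverage, wider regime: {cov_wider:.2f}")
```

Coverage close to the nominal 90% on the new regime would support the premise; coverage well below it would indicate that the process changed with scale or operating conditions.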

Figures

Figures reproduced from arXiv: 2510.15596 by Alicia Golden, Carole-Jean Wu, David Brooks, Gu-Yeon Wei, Michael Kuchnik, Samuel Hsia, Zachary DeVito.

Figure 1: Distribution of compute and communication time for …
Figure 2: Taxonomy of Latency Variation. There are a multitude …
Figure 4: A significant source of computation latency variability …
Figure 5: Communication primitive distributions. PyTorch communication collective AllReduce profiled on 1 and 8 8xH200 …
Figure 6: Observed variation impacts training step time across 64K+ GPU job.
Figure 7: Overview of PRISM Framework. Given a model architecture, parallelism strategies and training hardware specification, …
Figure 8: PRISM Validation on real-world training jobs.
Figure 10: CDF of normalized slowdown comparing sensitivity …
Figure 9: The ordering of slow nodes can have a significant …
Figure 12: RTT distribution across datacenters based on their …
Figure 11: Impact of individual kernel variation (x-axis) on …
Figure 13: Sweeping cross-datacenter bandwidth, with distance …
Original abstract

Large model training beyond tens of thousands of GPUs is an uncharted territory. At such scales, disruptions to the training process are not a matter of if, but a matter of when -- a stochastic process degrading training productivity. Dynamic runtime variation will become increasingly more frequent as training scales up and as GPUs are operated in increasingly power-limited and thermally-stressed environments. At the 64,000+ GPU scale, we already observe 9% GPU time variability for frontier foundation model training. Motivated by our analysis and the large design space around performance variability, we present PRISM -- a performance modeling framework that captures the stochastic nature of large-scale distributed training. The core of PRISM is a statistical method that quantifies probabilistic guarantees on training time. Using PRISM, we explore the design and optimization space of distributed training, enabling principled, variability-aware decisions that improve performance and system efficiency at scale.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces PRISM, a performance modeling framework for large-scale distributed training of foundation models. It reports an observed 9% GPU time variability at the 64,000+ GPU scale due to stochastic disruptions that increase with scale and power/thermal limits, and presents a statistical method as the core of PRISM to quantify probabilistic guarantees on training time. This enables exploration of the design and optimization space for variability-aware decisions that improve performance and system efficiency.

Significance. If the statistical method can be shown to deliver accurate, generalizable probabilistic bounds, the work would address a timely and practically important challenge in scaling distributed training beyond current frontiers. The explicit observation of 9% variability at 64k GPUs provides a concrete starting point, and the emphasis on variability-aware optimization could influence system design practices. The framework's focus on stochastic processes rather than deterministic models is a conceptual strength.

major comments (2)
  1. [Abstract] The central claim that 'the core of PRISM is a statistical method that quantifies probabilistic guarantees on training time' is presented without any derivation, model equations, distribution assumptions, validation data, error bars, or baseline comparisons. This is load-bearing because the probabilistic guarantees are the primary technical contribution enabling the claimed variability-aware decisions.
  2. [Abstract / motivation section] The manuscript relies on the 9% GPU-time variability observed at 64k GPUs as the basis for the statistical model, but provides no analysis or tests demonstrating that the underlying stochastic process (distribution family, tail behavior, or correlation structure) remains invariant at larger scales or under intensified power/thermal constraints. If variance grows super-linearly or new throttling-induced correlations appear, the fitted guarantees would be mis-calibrated.
minor comments (1)
  1. [Abstract] The abstract would be clearer if it briefly named the statistical technique (e.g., specific distribution family or fitting procedure) rather than referring only to 'a statistical method'.
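
To make the concern in the second major comment concrete, consider a simplified setting that is assumed here for illustration and is not the paper's model: n per-step delays X_1, ..., X_n, each with variance σ² and a common pairwise correlation ρ. The variance of their sum is then:

```latex
\operatorname{Var}\!\left(\sum_{i=1}^{n} X_i\right)
  = n\sigma^2 + n(n-1)\rho\sigma^2
  = n\sigma^2\bigl[1 + (n-1)\rho\bigr]
```

With ρ = 0 the variance grows linearly in n, whereas any fixed ρ > 0 makes it grow quadratically. Guarantees fitted under an independence assumption would therefore understate the spread if throttling introduces positive correlation.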

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments highlight opportunities to strengthen the presentation of our statistical method and clarify the scope of our empirical observations. We address each point below and have prepared revisions accordingly.

Point-by-point responses
  1. Referee: [Abstract] The central claim that 'the core of PRISM is a statistical method that quantifies probabilistic guarantees on training time' is presented without any derivation, model equations, distribution assumptions, validation data, error bars, or baseline comparisons. This is load-bearing because the probabilistic guarantees are the primary technical contribution enabling the claimed variability-aware decisions.

    Authors: The abstract is a concise overview of the contribution. The full manuscript (Sections 3–5) contains the derivation of the statistical method, model equations, distribution assumptions fitted to runtime traces, validation data with error bars, and comparisons against deterministic baselines. To make the central claim more self-contained in the abstract, we have added a brief reference to the method's empirical foundation and validation approach while directing readers to the detailed sections. revision: yes

  2. Referee: [Abstract / motivation section] The manuscript relies on the 9% GPU-time variability observed at 64k GPUs as the basis for the statistical model, but provides no analysis or tests demonstrating that the underlying stochastic process (distribution family, tail behavior, or correlation structure) remains invariant at larger scales or under intensified power/thermal constraints. If variance grows super-linearly or new throttling-induced correlations appear, the fitted guarantees would be mis-calibrated.

    Authors: The 9% variability figure is an empirical observation from our 64k-GPU runs, which we use to fit and validate the model parameters, including distribution family and correlation structure at that scale. We have added a dedicated limitations subsection that explicitly discusses the assumption of scale invariance, the risk of mis-calibration if variance grows super-linearly or new power/thermal correlations emerge, and the modular nature of the framework that permits recalibration with new traces. We cannot provide direct empirical tests at scales substantially beyond 64k GPUs. revision: partial

standing simulated objections not resolved
  • Empirical verification of stochastic process invariance (distribution family, tail behavior, correlation structure) at scales significantly larger than 64k GPUs under intensified power/thermal constraints

Circularity Check

0 steps flagged

No significant circularity in PRISM derivation chain

Full rationale

The paper's core claim rests on a statistical method for probabilistic guarantees derived from observed 9% GPU time variability at 64k-GPU scale. No equations, self-definitional reductions, fitted inputs renamed as predictions, or load-bearing self-citations are present in the abstract or described framework. The modeling starts from external empirical observations and applies standard statistical quantification without reducing outputs to inputs by construction. This is the common honest finding for papers whose central contribution is empirical modeling rather than closed-form derivation.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

Only the abstract is available, so the ledger is necessarily incomplete. The framework appears to rest on the domain assumption that runtime disruptions form a stationary statistical process whose parameters can be estimated from limited-scale observations and then extrapolated.

free parameters (1)
  • variability distribution parameters
    Statistical model must fit parameters to the observed 9% GPU-time variability; exact form and fitting procedure not stated.
axioms (1)
  • domain assumption: Runtime variability in large-scale training can be captured by a statistical distribution that yields usable probabilistic guarantees.
    This assumption underpins the entire claim that PRISM quantifies guarantees on training time.

pith-pipeline@v0.9.0 · 5701 in / 1368 out tokens · 52011 ms · 2026-05-18T06:24:55.582352+00:00 · methodology

