
arxiv: 2606.20374 · v1 · pith:V3ANHHQJnew · submitted 2026-06-18 · 💻 cs.DC

ARGUS: Production-Scale Tracing and Performance Diagnosis for over 10,000-GPU Clusters

Pith reviewed 2026-06-26 15:39 UTC · model grok-4.3

classification 💻 cs.DC
keywords GPU cluster tracing · performance diagnosis · LLM training · distributed systems · low-overhead monitoring · fail-slow detection · kernel event compression · production observability

The pith

ARGUS provides always-on fine-grained tracing for over 10,000-GPU clusters at under 2% overhead.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents ARGUS as a tracing system designed for large-scale LLM training on production GPU clusters. It establishes that decomposing observations into CPU call stacks, framework semantics, and GPU kernel executions allows continuous data collection without the high costs of existing profilers. This approach includes a unified pipeline that compresses kernel events dramatically while enabling progressive diagnosis to isolate issues at different levels. Deployment results over six months demonstrate its use for detecting slowdowns and optimizing performance in real clusters. A sympathetic reader would care because current monitoring either lacks the needed detail or imposes overheads that prevent always-on use at this scale.

Core claim

ARGUS decomposes observation along the training call hierarchy into CPU call stacks, framework semantics, and GPU kernel execution, with always-on collection under a combined overhead of less than 2%. It builds a unified data pipeline and compresses raw kernel events by approximately 3,700x from 10 MB to 2.7 KB per rank per step. Its progressive diagnosis framework automatically isolates anomalous windows, straggler ranks, and degraded kernels through iteration-time, phase-level, and kernel-level analysis. Deployed for over six months on a 10,000+ GPU production cluster, ARGUS has supported continuous fail-slow detection and performance optimization across anomalies such as compute stragglers, link degradation, pipeline-bubble amplification, FlashAttention JIT stalls, and compute stragglers masked by communication symptoms.
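As a quick sanity check on the compression figure, the sketch below recomputes the ratio from the two stated sizes and extrapolates the per-step volume to a 10,000-rank job. The decimal unit convention (1 MB = 1,000 KB) and the rank count are assumptions made for illustration; the paper's own accounting may differ.

```python
# Sanity check of the reported ~3,700x compression (10 MB -> 2.7 KB per rank per step).
# Assumption: decimal units (1 MB = 1000 KB); the paper may use binary units.
RAW_KB_PER_RANK_STEP = 10 * 1000        # 10 MB expressed in KB
COMPRESSED_KB_PER_RANK_STEP = 2.7
RANKS = 10_000                          # illustrative rank count, not a paper figure

ratio = RAW_KB_PER_RANK_STEP / COMPRESSED_KB_PER_RANK_STEP
raw_gb_per_step = RAW_KB_PER_RANK_STEP * RANKS / 1e6                 # KB -> GB
compressed_mb_per_step = COMPRESSED_KB_PER_RANK_STEP * RANKS / 1e3   # KB -> MB

print(f"compression ratio ~ {ratio:.0f}x")
print(f"raw volume per step: {raw_gb_per_step:.0f} GB")
print(f"compressed volume per step: {compressed_mb_per_step:.0f} MB")
```

Under these assumptions the raw stream would be on the order of 100 GB per training step across the job, against roughly 27 MB after compression, which is the gap that makes always-on collection plausible.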

What carries the argument

Hierarchical decomposition of training observations into CPU stacks, framework semantics, and GPU kernels, paired with a unified compression pipeline and staged diagnosis.

If this is right

  • Continuous fail-slow detection becomes practical without interrupting training runs.
  • Kernel-level insights allow targeted fixes for issues like stragglers and communication degradation.
  • Progressive analysis at iteration, phase, and kernel levels narrows down problems automatically.
  • Case studies confirm coverage of common anomalies including masked stragglers and JIT stalls.
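
The staged narrowing described in the bullets above can be pictured as a three-step funnel. The sketch below is a minimal illustration under stated assumptions: the median-ratio rule, the 1.5 threshold, the function names, and the data layout are all hypothetical stand-ins, not the detection algorithms ARGUS uses.

```python
from statistics import median

def anomalous_steps(iter_times, ratio=1.5):
    """Stage 1: steps whose iteration time exceeds ratio x the median."""
    base = median(iter_times)
    return [i for i, t in enumerate(iter_times) if t > ratio * base]

def straggler_ranks(phase_time_by_rank, ratio=1.5):
    """Stage 2: ranks whose phase duration exceeds ratio x the cross-rank median."""
    base = median(phase_time_by_rank.values())
    return sorted(r for r, t in phase_time_by_rank.items() if t > ratio * base)

def degraded_kernels(kernel_time, reference_time, ratio=1.5):
    """Stage 3: kernels on one rank that are slower than on a reference rank."""
    return sorted(k for k, t in kernel_time.items() if t > ratio * reference_time[k])

# Hypothetical data: step 3 is slow, rank 2 is the straggler, its GEMM is degraded.
iter_times = [1.00, 1.02, 0.99, 2.10, 1.01]
phase_times_at_step3 = {0: 0.40, 1: 0.41, 2: 1.50, 3: 0.39}
kernels_rank2 = {"gemm": 900.0, "layernorm": 52.0}
kernels_rank0 = {"gemm": 300.0, "layernorm": 50.0}

steps = anomalous_steps(iter_times)
ranks = straggler_ranks(phase_times_at_step3)
kernels = degraded_kernels(kernels_rank2, kernels_rank0)
print(steps, ranks, kernels)
```

Each stage only runs on what the previous stage flagged, so the expensive kernel-level comparison touches one rank and one time window rather than the whole job.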

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The compression ratios could support scaling the same system to clusters several times larger than 10,000 GPUs.
  • Integration of ARGUS data with existing coarse monitors might eliminate the need for separate low-detail tools.
  • Wider adoption could shorten debugging cycles for new model architectures by surfacing issues earlier in training.

Load-bearing premise

The hierarchical breakdown of CPU, framework, and kernel layers plus the compression and diagnosis pipeline will identify real production root causes without excessive overhead or loss of accuracy.

What would settle it

A production training run in which a documented performance anomaly occurs but ARGUS reports either more than 2% overhead or fails to flag the affected ranks and kernels.

Figures

Figures reproduced from arXiv: 2606.20374 by Clavis Chen, Jiasheng Zhou, Key Zhang, Leyi Ye, Longbin Zeng, Qinwei Yang, Ray Ying, Ruiming Lu.

Figure 1
Figure 1: Fail-slow in a 4096-GPU training job. Performance diagnosis for large-scale training encompasses two complementary aspects: localizing fail-slow faults and identifying performance optimization opportunities. Unlike fail-stop failures that halt execution, fail-slow refers to performance degradation in any component (such as GPU hardware, communication fabric, or host-side software) that drags down the enti…
Figure 2
Figure 2: The hierarchical structure of training execution. […] achieve fine granularity, always-on operation, and real-time cross-rank analysis at 10,000-GPU scale. Building such a system faces two core challenges. First, there is an inherent tension between fine-grained observation and low overhead. Comprehensive observation of the full training execution introduces significant runtime overhead. This not only slows t…
Figure 3
Figure 3: Overall architecture of ARGUS. Diagnosis workflow: speed vs. depth. Performing fine-grained search across all ranks, all kernels, and all time windows is prohibitively expensive; outputting only anomalous time intervals cannot guide remediation. ARGUS chooses progressive diagnosis: multiple detection levels run in parallel, each covering a different granularity, from detecting anomalous time periods, to si…
Figure 5
Figure 5: Repetitive kernel execution patterns. […] for subsequent deep analysis. Second, it performs online statistical compression on kernel traces (§5.2), writing compressed structured summaries to Metric Storage for real-time cross-rank comparison. The metrics path handles directly quantifiable observation results (phase duration, iteration time), writing them to Metric Storage via the Prometheus Remote Write proto…
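
To make the idea of online statistical compression concrete, the sketch below folds a stream of raw kernel events into one fixed-size summary per kernel name. It uses Welford's streaming algorithm as a stand-in; the class name, the event tuple layout, and the choice of summary fields are assumptions, and the excerpt does not say which statistics ARGUS actually stores.

```python
from collections import defaultdict

class RunningStats:
    """Streaming count/mean/variance (Welford's algorithm) plus min and max."""
    def __init__(self):
        self.n, self.mean, self.m2 = 0, 0.0, 0.0
        self.min, self.max = float("inf"), float("-inf")

    def add(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)
        self.min, self.max = min(self.min, x), max(self.max, x)

    def summary(self):
        var = self.m2 / self.n if self.n else 0.0   # population variance
        return {"count": self.n, "mean": self.mean, "std": var ** 0.5,
                "min": self.min, "max": self.max}

def compress_step(events):
    """Fold one step's (kernel_name, duration_us) events into per-kernel summaries."""
    stats = defaultdict(RunningStats)
    for name, duration_us in events:
        stats[name].add(duration_us)
    return {name: s.summary() for name, s in stats.items()}

# Hypothetical events: a repeated GEMM and a repeated all-reduce within one step.
events = [("gemm", 100.0), ("allreduce", 400.0), ("gemm", 102.0),
          ("gemm", 98.0), ("allreduce", 420.0)]
summaries = compress_step(events)
print(summaries["gemm"])
```

Because each event is consumed once and discarded, memory stays proportional to the number of distinct kernels rather than the number of launches, which is what repetitive execution patterns make cheap.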
Figure 6
Figure 6: Visualization of KDE-based clustering. […] mask positional and stream-level differences, leading to false positives in anomaly detection. Therefore, ARGUS first clusters kernel durations to identify each mode, and then extracts statistics separately for each mode. Clustering method. The online clustering algorithm must satisfy three constraints. First, it must not require pre-specifying the number of clusters…
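
One way to cluster durations without fixing the number of clusters in advance is to split the samples at local minima of a kernel density estimate. The sketch below illustrates that idea only; the Gaussian kernel, the rule-of-thumb bandwidth, the grid size, and the function names are assumptions, and the excerpt does not confirm ARGUS's exact procedure.

```python
import math
from statistics import stdev

def kde_modes(durations, grid_points=200):
    """Split 1-D samples into modes at local minima of a Gaussian KDE."""
    n = len(durations)
    sigma = stdev(durations) if n > 1 else 0.0
    if sigma == 0.0:
        return [sorted(durations)]            # degenerate input: a single mode
    h = 1.06 * sigma * n ** (-1 / 5)          # rule-of-thumb bandwidth (assumption)
    lo, hi = min(durations), max(durations)
    grid = [lo + (hi - lo) * i / (grid_points - 1) for i in range(grid_points)]
    density = [sum(math.exp(-0.5 * ((g - x) / h) ** 2) for x in durations)
               for g in grid]                 # unnormalized; only the shape matters
    # Interior local minima of the density become cluster boundaries.
    cuts = [grid[i] for i in range(1, grid_points - 1)
            if density[i] < density[i - 1] and density[i] <= density[i + 1]]
    modes, remaining = [], sorted(durations)
    for cut in cuts:
        modes.append([x for x in remaining if x < cut])
        remaining = [x for x in remaining if x >= cut]
    modes.append(remaining)
    return [m for m in modes if m]

def mode_stats(mode):
    """Per-mode summary of a sorted sample list: count, mean, min, max."""
    return {"count": len(mode), "mean": sum(mode) / len(mode),
            "min": mode[0], "max": mode[-1]}

# Hypothetical durations (microseconds) for one kernel with two execution modes.
samples = [98, 99, 100, 101, 102, 100, 99, 101, 495, 500, 505, 498, 502, 500]
modes = kde_modes(samples)
stats = [mode_stats(m) for m in modes]
print(stats)
```

Summarizing each mode separately keeps a fast and a slow invocation pattern of the same kernel from being averaged into one misleading mean.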
Figure 7
Figure 7: Kernel statistics anomaly detection workflow. 6.1 Iteration-level Detection (L1) and Phase-level Attribution (L2). L1 continuously collects each rank's iteration time series and runs two complementary anomaly detection algorithms: sliding-window ratio-gated jitter detection for short-term fluctuations and spikes, and full-scan change-point detection for step-wise regression. Together they classify iterat…
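
The two L1 detectors target different shapes of anomaly, which a small example makes visible. The sketch below is a simplified reading of the two ideas; the window size, the 1.3 gate, the minimum segment length, the shift threshold, and the function names are assumptions rather than values from the paper.

```python
from statistics import mean, median

def jitter_spikes(series, window=5, gate=1.3):
    """Flag indices whose value exceeds gate x the median of the preceding window."""
    return [i for i in range(window, len(series))
            if series[i] > gate * median(series[i - window:i])]

def change_point(series, min_seg=3, min_shift=0.1):
    """Scan every split point and return the index with the largest relative
    mean shift, or None if no split shifts the mean by more than min_shift."""
    best_idx, best_shift = None, min_shift
    for k in range(min_seg, len(series) - min_seg + 1):
        before, after = mean(series[:k]), mean(series[k:])
        shift = abs(after - before) / before
        if shift > best_shift:
            best_idx, best_shift = k, shift
    return best_idx

# Hypothetical iteration times (seconds): one isolated spike vs. a sustained regression.
spiky = [1.0, 1.01, 0.99, 1.0, 1.02, 1.0, 1.8, 1.01, 1.0, 0.99]
regressed = [1.0] * 6 + [1.25] * 6

spike_idx = jitter_spikes(spiky)
regression_idx = change_point(regressed)
missed_by_jitter = jitter_spikes(regressed)
print(spike_idx, regression_idx, missed_by_jitter)
```

In this toy data the 25% step regression never crosses the spike gate, so only the change-point scan reports it, which is the sense in which the two detectors are complementary.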
Figure 8
Figure 8: Training time under different profiling configurations. […] the target training process via the CUDA_INJECTION64_PATH environment variable. The CUDA runtime automatically loads this library during initialization, requiring no modifications to the training code or launch scripts. Framework semantics instrumentation is provided as a Python package. The training framework enables it by calling the exposed API at cr…
Figure 9
Figure 9: Resident Set Size (RSS) over time. […] adds approximately 1%–2%, and all three combined remain within 2%. As model size grows and GPU computation dominates, this overhead further diminishes. In contrast, PyTorch Profiler inflates iteration time by 20%–44% and eventually triggers out-of-memory failures due to unbounded trace accumulation, making it entirely impractical for production use. nsys fails to comple…
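
The overhead percentages quoted here are relative increases in iteration time over an uninstrumented baseline. The sketch below shows that calculation and a check against the stated 2% bound; the timings and configuration labels are hypothetical and are not measurements from the paper.

```python
def overhead_pct(baseline_s, traced_s):
    """Relative iteration-time overhead of a profiling configuration, in percent."""
    return 100.0 * (traced_s - baseline_s) / baseline_s

# Hypothetical mean iteration times in seconds; not measurements from the paper.
baseline = 10.0
configs = {"always-on tracing": 10.15, "heavyweight profiler": 13.0}
BUDGET_PCT = 2.0   # the combined-overhead bound stated in the paper

report = {name: round(overhead_pct(baseline, t), 2) for name, t in configs.items()}
within_budget = {name: pct <= BUDGET_PCT for name, pct in report.items()}
print(report, within_budget)
```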
Figure 12
Figure 12: Case 2: L4 Perfetto trace of communication kernels. Rank 7 shows longer EDP-internal ReduceScatter and AllGather operations, illustrating network degradation in its own EDP group. [Spilled figure labels: PP Stage 0 to PP Stage 3, Straggler, bubble.]
Figure 10
Figure 10: Case 1: Grafana heatmap of per-rank maximum operator duration. The x-axis is the DP replica index, and the y-axis is the TP index. DP replicas 656 and 657 are outliers across both TP indices, with more than 150× degradation on compute-only operators. [Spilled panel residue, W1 distance over ranks r0, r7, r8, r15: r0/r7 379.2k, r0/r8 17.6k, r0/r15 376.7k, r7/r8 395.3k, r7/r15 18.4k, r8/r15 393.0k; color scale 0K to 320K. (a)…]
Figure 11
Figure 11: Case 2: W1 distance matrices for three communication kernels. Ranks 0 and 8 belong to one EDP group; ranks 7 and 15 belong to another. Intra-group distances are small (17–23k), while inter-group distances are orders of magnitude larger (376k–2.73M), revealing systematic communication degradation in the EDP group containing ranks 7 and 15. […] than synchronization delay. After excluding the affected nodes, tra…
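
For readers unfamiliar with the W1 (Wasserstein-1) distance, in one dimension it equals the area between the two empirical cumulative distribution functions. The sketch below computes it that way and builds a small rank-by-rank matrix; the sample durations and the function names are hypothetical, chosen only to mimic the two-group structure the caption describes.

```python
def w1_distance(a, b):
    """Wasserstein-1 distance between two empirical 1-D distributions,
    computed as the integral of |F_a - F_b| over the merged support."""
    a, b = sorted(a), sorted(b)
    points = sorted(set(a) | set(b))
    total, ia, ib = 0.0, 0, 0
    for left, right in zip(points, points[1:]):
        while ia < len(a) and a[ia] <= left:
            ia += 1                      # ia / len(a) is F_a on [left, right)
        while ib < len(b) and b[ib] <= left:
            ib += 1                      # ib / len(b) is F_b on [left, right)
        total += abs(ia / len(a) - ib / len(b)) * (right - left)
    return total

def distance_matrix(samples_by_rank):
    """Pairwise W1 distances keyed by (rank_i, rank_j)."""
    ranks = sorted(samples_by_rank)
    return {(i, j): w1_distance(samples_by_rank[i], samples_by_rank[j])
            for i in ranks for j in ranks}

# Hypothetical kernel durations: ranks 0 and 8 are healthy, ranks 7 and 15 are slow.
samples_by_rank = {0: [100, 102, 104], 8: [101, 103, 105],
                   7: [500, 510, 520], 15: [505, 515, 525]}
matrix = distance_matrix(samples_by_rank)
print(matrix[(0, 8)], matrix[(0, 7)], matrix[(7, 15)])
```

Comparing whole distributions rather than means lets a group-level shift stand out even when individual samples overlap, which is the pattern of small intra-group and large inter-group distances reported in the figure.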
Figure 15
Figure 15: Case 4: L4 Perfetto trace of an anomalous step. In the PP group containing rank 688, backward-compute-mb7 becomes about 40× longer than normal. Sparse kernel launches indicate host-side blocking rather than GPU computation. 9.4 Case 4: FlashAttention JIT Compilation. This case occurred in a 4,096-GPU VLM job with TP=4, PP=4, and EP=8. Training exhibited frequent iteration time spikes, with some steps exper…
Figure 16
Figure 16: Case 5: Heatmap of per-rank max duration. The x-axis is the DP replica index, and the y-axis is the PP stage index. (a) MLP shows extreme degradation at PP=7, DP=272–279 (ranks 10352–10359, ∼5.7×). (b) The affected EP group spans DP replicas 256–287 and shows shorter ReduceScatter durations because compute stragglers delay its entry into DP-level communication. […] artifacts persist across process restarts;…

Original abstract

Large-scale LLM training requires always-on, fine-grained observability for effective performance diagnosis at scale. Coarse resource monitors alone cannot localize root causes, and fine-grained profilers incur prohibitive (5%-30%) overheads and massive trace volumes, making always-on deployment impractical in large production clusters. We propose ARGUS, a low-overhead, fine-grained, always-on tracing and real-time analysis system for training workloads in 10,000+ GPU-scale production clusters. ARGUS decomposes observation along the training call hierarchy into CPU call stacks, framework semantics, and GPU kernel execution, with always-on collection under a combined overhead of less than 2%. It builds a unified data pipeline and compresses raw kernel events by approximately 3,700x from 10 MB to 2.7 KB per rank per step. Its progressive diagnosis framework automatically isolates anomalous windows, straggler ranks, and degraded kernels through iteration-time, phase-level, and kernel-level analysis. Deployed for over six months on a 10,000+ GPU production cluster, ARGUS has supported continuous fail-slow detection and performance optimization. Our case studies further demonstrate its effectiveness across representative anomalies, including compute stragglers, link degradation, pipeline-bubble amplification, FlashAttention JIT stalls, and compute stragglers masked by communication symptoms.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper presents ARGUS, a tracing and performance diagnosis system for LLM training on 10,000+ GPU clusters. It claims always-on fine-grained observability via hierarchical decomposition into CPU call stacks, framework semantics, and GPU kernels, achieving <2% combined overhead and ~3700x compression (10 MB to 2.7 KB per rank per step). A progressive diagnosis pipeline isolates anomalies at iteration, phase, and kernel levels. The system was deployed for over six months in production, supporting fail-slow detection and optimization, with case studies on compute stragglers, link degradation, pipeline bubbles, FlashAttention JIT stalls, and masked stragglers.

Significance. If the overhead, compression, and diagnostic fidelity claims hold under production workloads, ARGUS would represent a meaningful advance in enabling continuous, fine-grained monitoring for large-scale distributed training where existing profilers are too costly. The reported six-month deployment on a real 10k+ GPU cluster and the listed case studies provide practical evidence of utility beyond synthetic benchmarks.

major comments (2)
  1. [Abstract] Abstract: The central claim that the 3700x compression and hierarchical decomposition (CPU stacks / framework semantics / GPU kernels) plus progressive diagnosis pipeline 'automatically isolates' root causes without material loss of fidelity is load-bearing but unsupported by any quantitative metric (e.g., root-cause recall against raw traces, false-positive rates, or comparison to unaggregated baselines). The five case studies are listed but supply no precision, recall, or error-bar data.
  2. [Abstract] Abstract (deployment paragraph): The statement that ARGUS 'has supported continuous fail-slow detection and performance optimization' for six months is presented without any aggregate statistics on detected anomalies, false-alarm rates, or before/after performance improvements across the cluster, making it impossible to assess whether the system meets its diagnostic goals at scale.
minor comments (1)
  1. [Abstract] The abstract repeatedly uses 'stragglers' in two separate case studies without clarifying whether these are distinct phenomena or a duplication.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback highlighting the need for stronger quantitative support in the abstract. We address each major comment below and commit to revisions that add the requested metrics where feasible.

Point-by-point responses
  1. Referee: [Abstract] Abstract: The central claim that the 3700x compression and hierarchical decomposition (CPU stacks / framework semantics / GPU kernels) plus progressive diagnosis pipeline 'automatically isolates' root causes without material loss of fidelity is load-bearing but unsupported by any quantitative metric (e.g., root-cause recall against raw traces, false-positive rates, or comparison to unaggregated baselines). The five case studies are listed but supply no precision, recall, or error-bar data.

    Authors: We agree that the abstract's claims on automatic root-cause isolation would be strengthened by quantitative metrics. The case studies provide concrete demonstrations of the progressive diagnosis pipeline isolating issues at different levels, but they are presented qualitatively. In the revision we will add a dedicated evaluation subsection that reports root-cause recall and false-positive rates by comparing ARGUS outputs against manually labeled ground truth from the raw traces in the five case studies, along with error bars where multiple runs are available. revision: yes

  2. Referee: [Abstract] Abstract (deployment paragraph): The statement that ARGUS 'has supported continuous fail-slow detection and performance optimization' for six months is presented without any aggregate statistics on detected anomalies, false-alarm rates, or before/after performance improvements across the cluster, making it impossible to assess whether the system meets its diagnostic goals at scale.

    Authors: The six-month deployment claim is currently stated qualitatively based on operational experience. We acknowledge that aggregate statistics would allow readers to better evaluate impact at scale. In the revised manuscript we will include available deployment data, such as the total number of anomalies flagged, observed false-alarm incidents, and documented performance gains from optimizations informed by ARGUS. revision: yes

Circularity Check

0 steps flagged

No circularity: systems description with no derivations or self-referential predictions

full rationale

The paper is a systems-engineering description of ARGUS (architecture, hierarchical decomposition into CPU stacks/framework/GPU kernels, compression pipeline, progressive diagnosis, and six-month deployment on a 10k+ GPU cluster). It reports measured overhead (<2%), compression ratio (~3700x), and case studies but contains no equations, fitted parameters, uniqueness theorems, or predictions that reduce to their own inputs by construction. No self-citation load-bearing steps appear in the provided text. The central claims rest on empirical deployment and qualitative case studies rather than any mathematical derivation chain, so no circularity is present.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an engineering systems paper; the abstract contains no mathematical free parameters, axioms, or postulated entities.

pith-pipeline@v0.9.1-grok · 5791 in / 1081 out tokens · 26196 ms · 2026-06-26T15:39:19.580253+00:00 · methodology

