pith. machine review for the scientific record.

arxiv: 2605.06544 · v1 · submitted 2026-05-07 · 💻 cs.DC · cs.NI

Recognition: unknown

CCL-Bench 1.0: A Trace-Based Benchmark for LLM Infrastructure

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 05:04 UTC · model grok-4.3

classification 💻 cs.DC cs.NI
keywords LLM infrastructure · trace-based benchmarking · execution traces · compute-communication overlap · training frameworks · interconnect bandwidth · performance metrics · parallelization efficiency

The pith

Trace-based benchmarking for LLM infrastructure reveals that higher compute-communication overlap can coincide with longer training steps and that framework choice can create up to 3x performance gaps on identical hardware.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces CCL-Bench to record reusable execution traces, workload cards, and launch scripts for every ML workload instead of publishing only aggregate performance numbers. A toolkit then derives fine-grained metrics on compute efficiency, memory use, and communication overlap from these traces. Readers would care because this evidence supports three specific observations that summary benchmarks cannot explain: a high compute-communication overlap can coincide with longer step times and so mask inefficient parallelization, TPU interconnect bandwidth increases deliver larger gains than equivalent GPU increases on small and medium workloads, and best-tuned configurations across frameworks can differ by up to 3x on the same hardware. The approach packages the data so the community can extend the analysis to new workloads and configurations.

Core claim

CCL-Bench records an execution trace, a YAML workload card, and launch scripts for each contributed data point, then applies a community-extensible toolkit to compute fine-grained compute, memory, and communication efficiency metrics from the trace. This evidence shows that higher compute-communication overlap can coincide with longer training step time and reveal inefficient parallelization choices, that doubling TPU interconnect bandwidth yields substantially higher end-to-end step-time improvement than doubling GPU interconnect bandwidth on small and medium workloads, and that the best-tuned configuration on one training framework can run up to 3x slower than the best-tuned configuration on a peer framework on identical hardware.
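
As a concrete picture of what one such data point contains, the sketch below models an evidence package in Python. The field names and example values are illustrative assumptions, not the paper's actual workload-card schema or YAML keys.

```python
# Minimal sketch of a CCL-Bench-style evidence package: trace + workload card +
# launch scripts. All field names and values here are illustrative assumptions,
# not the paper's actual schema.
from dataclasses import dataclass, field


@dataclass
class WorkloadCard:
    """Hypothetical workload card: what computation was run."""
    model: str              # e.g. "DeepSeek-V3-16B"
    phase: str              # "training" or "inference"
    batch_size: int
    sequence_length: int
    quantization: str = "bf16"


@dataclass
class EvidencePackage:
    """One contributed data point: the reusable evidence behind a benchmark number."""
    card: WorkloadCard
    trace_path: str                                   # recorded execution trace
    launch_scripts: list[str] = field(default_factory=list)
    parallelism: dict[str, int] = field(default_factory=dict)  # e.g. {"TP": 4, "DP": 2}


# Hypothetical entry loosely mirroring the WL5 MoE training run in Figure 3(b).
entry = EvidencePackage(
    card=WorkloadCard(model="DeepSeek-V3-16B", phase="training",
                      batch_size=128, sequence_length=1024),
    trace_path="traces/wl5_ep4.json",
    launch_scripts=["launch/wl5_ep4.sh"],
    parallelism={"TP": 4, "DP": 2, "PP": 1, "EP": 4},
)
print(entry.card.model, entry.parallelism)
```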

What carries the argument

The toolkit that processes execution traces to compute fine-grained compute, memory, and communication efficiency metrics from reusable trace packages.
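
What that means in practice: walk the trace and measure how much communication is hidden under compute, alongside the step time itself. Below is a toy sketch of that one metric, assuming a simplified event format and a single non-overlapping communication stream; the real toolkit consumes actual profiler traces, which this is not.

```python
# Toy computation of compute-communication overlap from a simplified trace.
# Assumes communication intervals do not overlap one another (one comm stream);
# the event format is an illustrative stand-in for real profiler traces.
from dataclasses import dataclass


@dataclass
class Event:
    kind: str     # "compute" or "comm"
    start: float  # seconds since step start
    end: float


def _union_length(intervals: list[tuple[float, float]]) -> float:
    """Total time covered by a set of (start, end) intervals."""
    total, last_end = 0.0, float("-inf")
    for s, e in sorted(intervals):
        s = max(s, last_end)
        if e > s:
            total += e - s
            last_end = e
    return total


def overlap_fraction(events: list[Event]) -> float:
    """Fraction of communication time that is hidden under compute."""
    comm = [(e.start, e.end) for e in events if e.kind == "comm"]
    comp = [(e.start, e.end) for e in events if e.kind == "compute"]
    comm_total = _union_length(comm)
    if comm_total == 0.0:
        return 0.0
    hidden = sum(
        _union_length([(max(cs, ps), min(ce, pe))
                       for ps, pe in comp if min(ce, pe) > max(cs, ps)])
        for cs, ce in comm
    )
    return hidden / comm_total


# A step where the single collective is fully hidden, yet the step still takes 8 s:
step = [Event("compute", 0.0, 8.0), Event("comm", 2.0, 5.0)]
step_time = max(e.end for e in step) - min(e.start for e in step)
print(overlap_fraction(step), step_time)  # 1.0, 8.0
```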

If this is right

  • Higher compute-communication overlap can indicate inefficient parallelization when it coincides with longer step times.
  • Doubling interconnect bandwidth produces larger step-time gains on TPUs than on GPUs for small and medium workloads (a toy version of the underlying what-if estimate is sketched after this list).
  • Training frameworks can differ by up to 3x in end-to-end performance even after best tuning on identical hardware.
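
The bandwidth comparison in the second bullet rests on a what-if estimate: replay the trace with one resource doubled and see how much step time falls. The sketch below is a deliberately crude stand-in for that estimate, assuming exposed communication scales inversely with bandwidth and compute is untouched; the paper does this properly by replaying traces in Astra-Sim.

```python
# Crude what-if estimate of the utility of doubling interconnect bandwidth.
# Assumption: only exposed (non-overlapped) communication shrinks with bandwidth,
# and it shrinks linearly; compute time is unchanged. Purely illustrative.
def step_time(compute_s: float, comm_s: float, overlap_frac: float) -> float:
    """Step time = compute plus the communication not hidden under it."""
    return compute_s + comm_s * (1.0 - overlap_frac)


def doubling_utility(compute_s: float, comm_s: float, overlap_frac: float) -> float:
    """Relative step-time improvement if the interconnect were 2x faster."""
    before = step_time(compute_s, comm_s, overlap_frac)
    after = step_time(compute_s, comm_s / 2.0, overlap_frac)
    return (before - after) / before


# Hypothetical numbers: a small, communication-bound workload gains far more from
# a bandwidth doubling than a large, compute-bound one, which is the shape of the
# TPU-vs-GPU contrast the paper reports for small and medium workloads.
print(doubling_utility(compute_s=1.0, comm_s=2.0, overlap_frac=0.2))  # ~0.31
print(doubling_utility(compute_s=4.0, comm_s=0.5, overlap_frac=0.8))  # ~0.01
```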

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Infrastructure teams could adopt trace collection as a standard practice to diagnose parallelization issues before scaling to larger workloads.
  • Comparisons across hardware generations might become more reliable if benchmarks always ship the underlying traces rather than derived numbers alone.
  • Automated tools that suggest parallelism plans from trace patterns could reduce reliance on manual tuning.

Load-bearing premise

The collected execution traces and the toolkit's metric computations faithfully represent true performance characteristics without bias from tracing overhead or workload selection.
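
That premise is checkable rather than merely assumable: run the same workload with tracing off and on, and compare step times. A minimal sketch of such a sanity check, with hypothetical measurements and an illustrative 2% threshold:

```python
# Minimal tracing-overhead check: compare median step time with and without
# tracing enabled. The numbers and the 2% threshold are illustrative assumptions.
from statistics import median


def tracing_overhead(untraced_steps: list[float], traced_steps: list[float]) -> float:
    """Relative slowdown of the median step time when tracing is on."""
    base = median(untraced_steps)
    return (median(traced_steps) - base) / base


untraced = [1.02, 1.01, 1.03, 1.02]   # hypothetical seconds per step, tracing off
traced = [1.04, 1.05, 1.03, 1.05]     # hypothetical seconds per step, tracing on
overhead = tracing_overhead(untraced, traced)
print(f"tracing overhead: {overhead:.1%}")
if overhead > 0.02:
    print("warning: tracing overhead may bias trace-derived metrics")
```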

What would settle it

A replication on the same workloads, using an independent tracing method, that finds the three reported performance relationships either fail to hold or reverse.

Figures

Figures reproduced from arXiv: 2605.06544 by Abhishek Vijaya Kumar, Arjun Devraj, Atharv Sonwane, Bhaskar Kataria, Byungsoo Oh, Emaad Manzoor, Eric Ding, Jelena Gvero, Kaiwen Guo, Lindsey Bowen, Rachee Singh.

Figure 1
Figure 1: CCL-Bench overview: a standardized trace plus workload card recording each run, a metric toolkit computing fine-grained metrics, and downstream analysis and optimization plug-ins.
Figure 2
Figure 2: LLM infrastructure evaluated by CCL-Bench. The workload defines what computation must happen; the infrastructure (hardware platform and software) determines how it is executed.
Figure 3
Figure 3: (a) MFU vs. step time and compute-communication overlap vs. step time on A100 GPUs (Perlmutter); each point is one CCL-Bench entry. (b) Step-time and communication traffic-volume breakdown for WL5 DeepSeek-V3-16B EP=4 and EP=8 MoE training runs (TP=4, DP=2, PP=1).
Figure 4
Figure 4: CCL-Bench trace-driven what-if pipeline: empirical traces feed a network simulator (Astra-Sim [76]) that estimates step time and the utility of doubling a hardware resource.
Figure 5
Figure 5: Utility metrics for scale-out bandwidth, scale-up/ICI bandwidth, and scale-up domain size on Perlmutter A100 (Slingshot) and TPU v6e (Torus). WL6 and WL7 could not be run on TPUs due to resource limits.
Figure 6
Figure 6: CCL-Search loop. update_policy is an LLM call; the other steps are local/tool execution.
Figure 7
Figure 7: CCL-Search results. Bottom panels show explored TP, DP, PP, micro-batch size, and activation-checkpointing choices; grayed-out boxes indicate failed runs. (a) Step-time objective: CCL-Search reduces step time by up to 8× on TorchTitan and 19× on Megatron-LM within 15 iterations using Score = −avg_step_time. (b) Composite objective: CCL-Search finds the Pareto frontier using Score = w × T0/T + (1 − w) × N0/N.
Figure 8
Figure 8: CCL-Bench user interface. The interface provides a dashboard for performance ranking, …
Figure 9
Figure 9: Example cross-system comparisons on Qwen3-4B (WL1), Llama-3.1-8B (WL2), and …
Figure 10
Figure 10: Example cross-system comparisons on Qwen3-4B (WL1), Llama-3.1-8B (WL2), and …
Figure 11
Figure 11: Supplementary TPU v6e tensor-parallel sweep for Llama-3.1-8B batch-128/input-1024 …
Figure 12
Figure 12: GPU vs. TPU holistic comparison across WL1–WL5. Each bar is the average over all …
Figure 13
Figure 13: Cluster architecture sweep for WL7 (TP=4, PP=8, DP=8, EP=32) DeepSeek-V3-236B, …
Figure 15
Figure 15: Composite objective: Score = w × T0/T + (1 − w) × N0/N. WL5 DeepSeek-V3-16B on Perlmutter. This run uses a different seed policy compared to …
read the original abstract

Evaluative claims about LLM infrastructure -- ``workload X is fastest on hardware Y with software Z'' -- depend on a complex configuration space spanning hardware accelerators, interconnect bandwidth, software frameworks, parallelism plans, and communication libraries. Current infrastructure evaluation benchmarks publish a small set of end-to-end numbers that do not explain why one configuration outperforms another. We present CCL-Bench, a trace-based benchmark that addresses the limitations of existing benchmarks by recording reusable evidence for every ML workload. Each contributed data point in CCL-Bench packages an execution trace, a YAML workload card, and the launch scripts. We have developed a community-extensible toolkit to compute fine-grained compute, memory, and communication efficiency metrics from this evidence. Using CCL-Bench, we surface three claims that summary-statistic benchmarks cannot support: (i) higher compute-communication overlap can coincide with longer training step time and reveal inefficient parallelization choices, (ii) doubling TPU interconnect bandwidth yields a much higher end-to-end improvement in step time than doubling GPU interconnect bandwidth on small and medium workloads, and (iii) the best-tuned configuration on one training framework can run up to 3$\times$ slower than the best-tuned configuration on a peer framework on identical hardware.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces CCL-Bench 1.0, a trace-based benchmark for LLM infrastructure evaluation. Each data point consists of an execution trace, a YAML workload card, and launch scripts. A community-extensible toolkit computes fine-grained compute, memory, and communication efficiency metrics from these traces. The authors use the benchmark to surface three claims not supportable by summary-statistic benchmarks: (i) higher compute-communication overlap can coincide with longer training step time and indicate inefficient parallelization, (ii) doubling TPU interconnect bandwidth yields substantially higher step-time improvement than doubling GPU interconnect bandwidth on small and medium workloads, and (iii) the best-tuned configuration on one training framework can be up to 3× slower than the best-tuned configuration on a peer framework on identical hardware.

Significance. If the traces are made publicly available, the toolkit is shown to be extensible, and the metrics are validated, CCL-Bench could meaningfully advance reproducible evaluation of distributed LLM training by exposing why one configuration outperforms another rather than reporting only end-to-end numbers. The packaging of reusable evidence (trace + card + script) directly addresses a documented limitation of existing infrastructure benchmarks.

major comments (2)
  1. [Abstract] Abstract: the three claims are presented as observations enabled by CCL-Bench, yet no workloads, configurations, quantitative results, or error bars are supplied to support them. This absence prevents assessment of whether the evidence actually sustains the claims, especially claim (iii) on the 3× slowdown.
  2. [Benchmark Design] Benchmark design and toolkit description: the manuscript provides no details on how the fine-grained metrics are computed from traces, no validation against ground-truth performance counters, and no analysis of tracing overhead or workload-selection bias. These omissions directly affect the weakest assumption that the collected traces faithfully represent true performance characteristics.
minor comments (1)
  1. [Abstract] Abstract: the LaTeX fragment “3$\times$” should be rendered consistently as “3×” in the final PDF; check for similar formatting issues in any tables or figures that report speedups.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed review. We address each major comment below and describe the revisions planned for the next version of the manuscript.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the three claims are presented as observations enabled by CCL-Bench, yet no workloads, configurations, quantitative results, or error bars are supplied to support them. This absence prevents assessment of whether the evidence actually sustains the claims, especially claim (iii) on the 3× slowdown.

    Authors: We agree that the abstract, in its current form, presents the three claims at a high level without the supporting quantitative details. The revised manuscript will incorporate concise quantitative summaries directly into the abstract (e.g., the specific workloads, hardware configurations, measured step-time deltas with error bars for claim (ii), and the exact frameworks, hardware, and workload yielding the 3× difference for claim (iii)). The full supporting data, including all workloads, configurations, and error bars, will also be expanded and clearly cross-referenced in the evaluation section. revision: yes

  2. Referee: [Benchmark Design] Benchmark design and toolkit description: the manuscript provides no details on how the fine-grained metrics are computed from traces, no validation against ground-truth performance counters, and no analysis of tracing overhead or workload-selection bias. These omissions directly affect the weakest assumption that the collected traces faithfully represent true performance characteristics.

    Authors: We acknowledge that the current manuscript lacks these details. The revised version will add a dedicated subsection describing the exact formulas, algorithms, and trace-parsing logic used to compute each fine-grained metric (compute, memory, and communication efficiency). We will also include a validation subsection that compares toolkit-derived metrics against ground-truth hardware performance counters, report measured tracing overhead as a percentage of step time on representative workloads, and add an analysis of workload-selection criteria together with a discussion of potential biases and how they were mitigated. revision: yes

Circularity Check

0 steps flagged

No significant circularity: empirical benchmark with no derivations or self-referential predictions

full rationale

The paper introduces CCL-Bench as a trace-collection and metric-computation toolkit for LLM infrastructure evaluation. Its central claims are direct observations computed from collected execution traces on specific workloads and hardware configurations; no equations, fitted parameters, or predictions are derived from prior results within the paper. The three surfaced claims (overlap vs. step time, TPU vs. GPU bandwidth scaling, framework tuning differences) are presented as empirical findings enabled by the benchmark rather than as universally quantified theorems. No self-citation chains, ansatzes, or uniqueness theorems are invoked to justify the methodology or results. The work is self-contained as a reusable artifact whose validity rests on external reproducibility of the traces and scripts, not on internal reduction to its own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The contribution rests on the assumption that traces capture execution events with sufficient fidelity for metric computation; no free parameters or new entities are introduced.

axioms (1)
  • domain assumption: Execution traces accurately record all relevant compute, memory, and communication events without significant overhead or omission.
    Invoked when the toolkit derives fine-grained efficiency metrics from the evidence packages.

pith-pipeline@v0.9.0 · 5562 in / 1300 out tokens · 84738 ms · 2026-05-08T05:04:50.754141+00:00 · methodology


Reference graph

Works this paper leans on

93 extracted references · 15 canonical work pages · 1 internal anchor

  1. [1]

    Fathom: Reference workloads for modern deep learning methods

    Robert Adolf, Saketh Rama, Brandon Reagen, Gu-Yeon Wei, and David Brooks. Fathom: Reference workloads for modern deep learning methods. In2016 IEEE International Symposium on Workload Characterization (IISWC), pages 1–10. IEEE, 2016

  2. [2]

    Vidur: A large-scale simulation framework for llm inference.Proceedings of Machine Learning and Systems, 6:351–366, 2024

    Amey Agrawal, Nitin Kedia, Jayashree Mohan, Ashish Panwar, Nipun Kwatra, Bhargav S Gulavani, Ramachandran Ramjee, and Alexey Tumanov. Vidur: A large-scale simulation framework for llm inference.Proceedings of Machine Learning and Systems, 6:351–366, 2024

  3. [3]

    MaxText: A Simple, Performant and Scalable JAX LLM

    AI Hypercomputer. MaxText: A Simple, Performant and Scalable JAX LLM. https://github.com/AI-Hypercomputer/maxtext, 2026. Accessed: 2026-05-05

  4. [4]

    Amazon web services (aws)

    Amazon Web Services. Amazon web services (aws). https://aws.amazon.com/, 2026. Accessed: 2026-05-06

  5. [5]

    Accelerating generative AI: How AMD Instinct GPUs delivered breakthrough efficiency and scalability in MLPerf Inference v5.1

    AMD. Accelerating generative AI: How AMD Instinct GPUs delivered breakthrough efficiency and scalability in MLPerf Inference v5.1. https://www.amd.com/en/blogs/2025/accelerating-generative-ai-how-instinct-gpus-delivered.html, 2025. Accessed 2026-04-26

  6. [6]

    Accelerating AI training: How AMD Instinct MI350 Series GPUs delivered breakthrough performance and efficiency in MLPerf Training v5.1

    AMD. Accelerating AI training: How AMD Instinct MI350 Series GPUs delivered breakthrough performance and efficiency in MLPerf Training v5.1. https://www.amd.com/en/blogs/2025/ accelerating-ai-training.html, 2025. Accessed 2026-04-26

  7. [7]

    Jason Ansel, Edward Yang, Horace He, Natalia Gimelshein, Animesh Jain, Michael Voznesensky, Bin Bao, Peter Bell, David Berard, Evgeni Burovski, Geeta Chauhan, Anjali Chourdia, Will Constable, Alban Desmaison, Zachary DeVito, Elias Ellison, Will Feng, Jiong Gong, Michael Gschwind, Brian Hirsh, Sherlock Huang, Kshiteej Kalambarkar, Laurent Kirsch, Michael ...

  8. [8]

    doi: 10.1145/3620665.3640366

  9. [9]

    LLM Leaderboard: Comparison of over 100 AI Models

    Artificial Analysis. LLM Leaderboard: Comparison of over 100 AI Models. https:// artificialanalysis.ai/leaderboards/models, 2026. Accessed: 2026-05-05

  10. [10]

    vTrain: A simulation framework for evaluating cost-effective and compute-optimal large language model training

    Jehyeon Bang, Yujeong Choi, Myeongwoo Kim, Yongdeok Kim, and Minsoo Rhu. vTrain: A simulation framework for evaluating cost-effective and compute-optimal large language model training. In2024 57th IEEE/ACM International Symposium on Microarchitecture (MICRO), pages 153–167. IEEE, 2024. doi: 10.1109/MICRO61859.2024.00021

  11. [11]

    Jahs-bench-201: A foundation for research on joint architecture and hyperparameter search.Advances in Neural Information Processing Systems, 35:38788–38802, 2022

    Archit Bansal, Danny Stoll, Maciej Janowski, Arber Zela, and Frank Hutter. Jahs-bench-201: A foundation for research on joint architecture and hyperparameter search.Advances in Neural Information Processing Systems, 35:38788–38802, 2022

  12. [12]

    Barbarians at the gate: How AI is upending systems research

    Audrey Cheng, Shu Liu, Melissa Pan, Zhifei Li, Bowen Wang, Alex Krentsel, Tian Xia, Mert Cemri, Jongseok Park, Shuo Yang, et al. Barbarians at the gate: How ai is upending systems research.arXiv preprint arXiv:2510.06189, 2025

  13. [13]

    Llmservingsim: A hw/sw co-simulation infrastructure for llm inference serving at scale

    Jaehong Cho, Minsu Kim, Hyunmin Choi, Guseul Heo, and Jongse Park. Llmservingsim: A hw/sw co-simulation infrastructure for llm inference serving at scale. In2024 IEEE International Symposium on Workload Characterization (IISWC), pages 15–29. IEEE, 2024. 10

  14. [14]

    Palm: Scaling language modeling with pathways.Journal of machine learning research, 24(240):1–113, 2023

    Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. Palm: Scaling language modeling with pathways.Journal of machine learning research, 24(240):1–113, 2023

  15. [15]

    Jae-Won Chung, Jeff J Ma, Ruofan Wu, Jiachen Liu, Oh Jun Kweon, Yuxuan Xia, Zhiyu Wu, and Mosharaf Chowdhury. The ml. energy benchmark: Toward automated inference energy measurement and optimization.arXiv preprint arXiv:2505.06371, 2025

  16. [16]

    DAWNBench: An end-to-end deep learning benchmark and competition

    Cody Coleman, Deepak Narayanan, Daniel Kang, Tian Zhao, Jian Zhang, Luigi Nardi, Peter Bailis, Kunle Olukotun, Christopher Ré, and Matei Zaharia. DAWNBench: An end-to-end deep learning benchmark and competition. InNIPS ML Systems Workshop, 2017

  17. [17]

    Efficient allreduce with stragglers, 2025

    Arjun Devraj, Eric Ding, Abhishek Vijaya Kumar, Robert Kleinberg, and Rachee Singh. Efficient allreduce with stragglers, 2025. URLhttps://arxiv.org/abs/2505.23523

  18. [18]

    Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity.Journal of Machine Learning Research, 23(120):1–39, 2022

    William Fedus, Barret Zoph, and Noam Shazeer. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity.Journal of Machine Learning Research, 23(120):1–39, 2022

  19. [19]

    Distributed training of large language models on aws trainium

    Xinwei Fu, Zhen Zhang, Haozheng Fan, Guangtai Huang, Mohammad El-Shabani, Randy Huang, Rahul Solanki, Fei Wu, Ron Diamant, and Yida Wang. Distributed training of large language models on aws trainium. InProceedings of the 2024 ACM Symposium on Cloud Computing, pages 961–976, 2024

  20. [20]

    Introducing Cloud TPU v5p and AI Hypercomputer

    Google Cloud. Introducing Cloud TPU v5p and AI Hypercomputer. https://cloud.google.com/blog/products/ai-machine-learning/introducing-cloud-tpu-v5p-and-ai-hypercomputer, 2023. Accessed 2026-04-26

  21. [21]

    From LLMs to image generation: Accelerate inference workloads with AI Hypercomputer

    Google Cloud. From LLMs to image generation: Accelerate inference workloads with AI Hypercomputer. https://cloud.google.com/blog/products/compute/ ai-hypercomputer-inference-updates-for-google-cloud-tpu-and-gpu , 2025. Accessed 2026-04-26

  22. [22]

    Tensor processing units (tpus)

    Google Cloud. Tensor processing units (tpus). https://cloud.google.com/tpu?hl=en, 2026. Ac- cessed: 2026-05-06

  23. [23]

    Cloud TPU v6e

    Google Cloud. Cloud TPU v6e. https://cloud.google.com/tpu/docs/v6e, 2026. Accessed: 2026- 05-06

  24. [24]

    TicTac: Accelerating distributed deep learning with communication scheduling

    Sayed Hadi Hashemi, Sangeetha Abdu Jyothi, and Roy H. Campbell. TicTac: Accelerating distributed deep learning with communication scheduling.arXiv preprint arXiv:1803.03288, 2018. URL https: //arxiv.org/abs/1803.03288

  25. [25]

    LLM-Perf Leaderboard

    Hugging Face Optimum Team. LLM-Perf Leaderboard. https://huggingface.co/spaces/optimum/ llm-perf-leaderboard, 2024. Accessed: 2026-05-05

  26. [26]

    MSCCL++: Rethinking GPU communication abstractions for AI inference

    Changho Hwang, Peng Cheng, Roshan Dathathri, Abhinav Jangda, Saeed Maleki, Madan Musuvathi, Olli Saarikivi, Aashaka Shah, Ziyue Yang, Binyang Li, Caio Rocha, Qinghua Zhou, Mahdieh Ghazimirsaeed, Sreevatsa Anantharamu, and Jithin Jose. MSCCL++: Rethinking GPU communication abstractions for AI inference. InACM International Conference on Architectural Suppo...

  27. [27]

    Beyond data and model parallelism for deep neural networks

    Zhihao Jia, Matei Zaharia, and Alex Aiken. Beyond data and model parallelism for deep neural networks. InMLSys, 2019

  28. [28]

    Moe-cap: Benchmarking cost, accuracy and performance of sparse mixture-of-experts systems.arXiv preprint arXiv:2412.07067, 2024

    Yinsicheng Jiang, Yao Fu, Yeqi Huang, Ping Nie, Zhan Lu, Leyang Xue, Congjie He, Man-Kit Sit, Jilong Xue, Li Dong, et al. Moe-cap: Benchmarking cost, accuracy and performance of sparse mixture-of-experts systems.arXiv preprint arXiv:2412.07067, 2024

  29. [29]

    Norman P. Jouppi, George Kurian, Sheng Li, Peter Ma, Rahul Nagarajan, Lifeng Nai, Nishant Patil, Suvinay Subramanian, Andy Swing, Brian Towles, Cliff Young, Xiang Zhou, Zongwei Zhou, and David Patterson. TPU v4: An optically reconfigurable supercomputer for machine learning with hardware support for embeddings. InProceedings of the 50th Annual Internation...

  30. [30]

    doi: 10.1145/3579371.3589350

  31. [31]

    Technology-driven, highly-scalable dragonfly topology.ACM SIGARCH Computer Architecture News, 36(3):77–88, 2008

    John Kim, Wiliam J Dally, Steve Scott, and Dennis Abts. Technology-driven, highly-scalable dragonfly topology.ACM SIGARCH Computer Architecture News, 36(3):77–88, 2008

  32. [32]

    Efficient memory management for large language model serving with pagedattention

    Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. InProceedings of the 29th symposium on operating systems principles, pages 611–626, 2023. 11

  33. [33]

    XSP: Across- stack profiling and analysis of machine learning models on GPUs

    Cheng Li, Abdul Dakkak, Jinjun Xiong, Wei Wei, Lingjie Xu, and Wen-Mei Hwu. XSP: Across- stack profiling and analysis of machine learning models on GPUs. InProceedings of the 34th IEEE International Parallel and Distributed Processing Symposium (IPDPS), pages 326–327. IEEE, 2020. doi: 10.1109/IPDPS47924.2020.00042

  34. [34]

    Lumos: Efficient performance modeling and estimation for large-scale llm training.Proceedings of Machine Learning and Systems, 7, 2025

    Mingyu Liang, Hiwot T Kassa, Wenyin Fu, Brian Coutinho, Louis Feng, and Christina Delimitrou. Lumos: Efficient performance modeling and estimation for large-scale llm training.Proceedings of Machine Learning and Systems, 7, 2025

  35. [35]

    TorchTitan: One-stop PyTorch native solution for production ready LLM pre-training

    Wanchao Liang, Tianyu Liu, Less Wright, Will Constable, Andrew Gu, Chien-Chin Huang, Iris Zhang, Wei Feng, Howard Huang, Junjie Wang, et al. Torchtitan: One-stop pytorch native solution for production ready llm pre-training.arXiv preprint arXiv:2410.06511, 2024

  36. [36]

    Mlperf training benchmark.Proceedings of Machine Learning and Systems, 2:336–349, 2020

    Peter Mattson, Christine Cheng, Gregory Diamos, Cody Coleman, Paulius Micikevicius, David Patterson, Hanlin Tang, Gu-Yeon Wei, Peter Bailis, Victor Bittorf, et al. Mlperf training benchmark.Proceedings of Machine Learning and Systems, 2:336–349, 2020

  37. [37]

    MTIA v1: Meta’s first-generation AI inference accelerator

    Meta. MTIA v1: Meta’s first-generation AI inference accelerator. https://ai.meta.com/blog/ meta-training-inference-accelerator-AI-MTIA/, 2023. Accessed: 2026-05-04

  38. [38]

    Holistic trace analysis

    Meta Platforms, Inc. Holistic trace analysis. https://hta.readthedocs.io/en/latest/, 2024. Accessed 2026-04-22

  39. [39]

    Galvatron: Efficient transformer training over multiple GPUs using automatic parallelism

    Xupeng Miao, Yujie Wang, Youhe Jiang, Chunan Shi, Xiaonan Nie, Hailin Zhang, and Bin Cui. Galvatron: Efficient transformer training over multiple GPUs using automatic parallelism. InVLDB, 2023

  40. [40]

    MSCCL++: A GPU-driven communication stack for scalable AI applications

    Microsoft. MSCCL++: A GPU-driven communication stack for scalable AI applications. https: //github.com/microsoft/mscclpp, 2026. Accessed 2026-04-26

  41. [41]

    Microsoft azure.https://azure.microsoft.com/en-us, 2026

    Microsoft. Microsoft azure.https://azure.microsoft.com/en-us, 2026. Accessed: 2026-05-06

  42. [42]

    MLCommons. Chakra. https://mlcommons.org/working-groups/research/chakra/, 2026. Ac- cessed 2026-04-22

  43. [43]

    Efficient large-scale language model training on gpu clusters using megatron-lm

    Deepak Narayanan, Mohammad Shoeybi, Jared Casper, Patrick LeGresley, Mostofa Patwary, Vijay Anand Korthikanti, Dmitri Vainbrand, Prethvi Kashinkunti, Julie Bernauer, Bryan Catanzaro, Amar Phanishayee, and Matei Zaharia. Efficient large-scale language model training on gpu clusters using megatron-lm. In Proceedings of the International Conference for High ...

  44. [44]

    Perlmutter architecture

    National Energy Research Scientific Computing Center. Perlmutter architecture. https://docs.nersc. gov/systems/perlmutter/architecture/, 2026. Accessed: 2026-05-06

  45. [45]

    H100 GPUs Set Standard for Gen AI in Debut MLPerf Benchmark

    NVIDIA. H100 GPUs Set Standard for Gen AI in Debut MLPerf Benchmark. https://blogs.nvidia. com/blog/generative-ai-debut-mlperf/, 2023. Accessed 2026-04-26

  46. [46]

    NVIDIA Blackwell takes pole position in latest MLPerf inference results

    NVIDIA. NVIDIA Blackwell takes pole position in latest MLPerf inference results. https://blogs. nvidia.com/blog/blackwell-mlperf-inference/, 2025. Accessed 2026-04-26

  47. [47]

    NVIDIA Blackwell Ultra sets the bar in new MLPerf inference benchmark

    NVIDIA. NVIDIA Blackwell Ultra sets the bar in new MLPerf inference benchmark. https://blogs. nvidia.com/blog/mlperf-inference-blackwell-ultra/, 2025. Accessed 2026-04-26

  48. [48]

    Understanding NCCL tuning to accelerate GPU-to- GPU communication

    NVIDIA. Understanding NCCL tuning to accelerate GPU-to- GPU communication. https://developer.nvidia.com/blog/ understanding-nccl-tuning-to-accelerate-gpu-to-gpu-communication/ , 2025. Accessed 2026-04-26

  49. [49]

    NVIDIA wins every MLPerf Training v5.1 benchmark

    NVIDIA. NVIDIA wins every MLPerf Training v5.1 benchmark. https://blogs.nvidia.com/blog/ mlperf-training-benchmark-blackwell-ultra/, 2025. Accessed 2026-04-26

  50. [50]

    Nvidia collective communication library (NCCL) documentation

    NVIDIA. Nvidia collective communication library (NCCL) documentation. https://docs.nvidia. com/deeplearning/nccl/user-guide/docs/, 2026. Accessed: 2026-05-04

  51. [51]

    NCCL Tests.https://github.com/NVIDIA/nccl-tests, 2026

    NVIDIA. NCCL Tests.https://github.com/NVIDIA/nccl-tests, 2026. Accessed 2026-04-26

  52. [52]

    NVIDIA Nsight Systems

    NVIDIA. NVIDIA Nsight Systems. https://docs.nvidia.com/nsight-systems/, 2026. Accessed: 2026-05-06

  53. [53]

    NVIDIA NVLink and NVLink Switch

    NVIDIA. NVIDIA NVLink and NVLink Switch. https://www.nvidia.com/en-us/data-center/ nvlink/, 2026. Accessed: 2026-05-06. 12

  54. [54]

    XLA Operation Semantics

    OpenXLA Project. XLA Operation Semantics. https://openxla.org/xla/operation_semantics,

  55. [55]

    Accessed: 2026-05-06

  56. [56]

    Profiling jax computations with xprof

    OpenXLA Project. Profiling jax computations with xprof. https://openxla.org/xprof/jax_ profiling, 2026. Accessed: 2026-05-05

  57. [57]

    Splitwise: Efficient generative llm inference using phase splitting

    Pratyush Patel, Esha Choukse, Chaojie Zhang, Aashaka Shah, Íñigo Goiri, Saeed Maleki, and Ricardo Bianchini. Splitwise: Efficient generative llm inference using phase splitting. In2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA), pages 118–132. IEEE, 2024

  58. [58]

    Accelerating PyTorch with CUDA Graphs

    PyTorch. Accelerating PyTorch with CUDA Graphs. https://pytorch.org/blog/ accelerating-pytorch-with-cuda-graphs/, 2021. Accessed: 2026-05-05

  59. [59]

    Libkineto readme

    PyTorch Foundation. Libkineto readme. https://github.com/pytorch/kineto/blob/main/ libkineto/README.md, 2026. Accessed 2026-04-23

  60. [60]

    Alibaba hpn: A data center network for large language model training

    Kun Qian, Yongqing Xi, Jiamin Cao, Jiaqi Gao, Yichi Xu, Yu Guan, Binzhang Fu, Xuemei Shi, Fangbo Zhu, Rui Miao, et al. Alibaba hpn: A data center network for large language model training. InProceedings of the ACM SIGCOMM 2024 Conference, pages 691–706, 2024

  61. [61]

    Zero: Memory optimizations toward training trillion parameter models

    Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He. Zero: Memory optimizations toward training trillion parameter models. InSC20: international conference for high performance computing, networking, storage and analysis, pages 1–16. IEEE, 2020

  62. [62]

    Zero-infinity: Breaking the gpu memory wall for extreme scale deep learning

    Samyam Rajbhandari, Olatunji Ruwase, Jeff Rasley, Shaden Smith, and Yuxiong He. Zero-infinity: Breaking the gpu memory wall for extreme scale deep learning. InProceedings of the international conference for high performance computing, networking, storage and analysis, pages 1–14, 2021

  63. [63]

    MLPerf inference benchmark

    Vijay Janapa Reddi, Christine Cheng, David Kanter, Peter Mattson, Guenther Schmuelling, Carole-Jean Wu, Brian Anderson, Maximilien Breez, et al. MLPerf inference benchmark. InISCA, 2020

  64. [64]

    ZeRO-Offload: Democratizing Billion-Scale model training

    Jie Ren, Samyam Rajbhandari, Reza Yazdani Aminabadi, Olatunji Ruwase, Shuangyan Yang, Minjia Zhang, Dong Li, and Yuxiong He. ZeRO-Offload: Democratizing Billion-Scale model training. In2021 USENIX Annual Technical Conference (USENIX ATC 21), pages 551–564. USENIX Association, July 2021. ISBN 978-1-939133-23-6. URL https://www.usenix.org/conference/atc21/p...

  65. [65]

    Xla: Compiling machine learning for peak performance

    Amit Sabne. Xla: Compiling machine learning for peak performance. 2020

  66. [66]

    InferenceX: Inference benchmarking

    SemiAnalysis. InferenceX: Inference benchmarking. https://inferencex.semianalysis.com/ inference, 2026. Accessed 2026-04-26

  67. [67]

    SGLang bench serving guide

    SGLang Team. SGLang bench serving guide. https://github.com/sgl-project/sglang/blob/ main/docs/developer_guide/bench_serving.md, 2026. Accessed 2026-04-26

  68. [68]

    SGLang documentation.https://docs.sglang.io/, 2026

    SGLang Team. SGLang documentation.https://docs.sglang.io/, 2026. Accessed 2026-04-26

  69. [69]

    Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism

    Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. Megatron-lm: Training multi-billion parameter language models using model parallelism.arXiv preprint arXiv:1909.08053, 2019

  70. [70]

    Dynamollm: Designing llm inference clusters for performance and energy efficiency

    Jovan Stojkovic, Chaojie Zhang, Íñigo Goiri, Josep Torrellas, and Esha Choukse. Dynamollm: Designing llm inference clusters for performance and energy efficiency. In2025 IEEE International Symposium on High Performance Computer Architecture (HPCA), pages 1348–1362. IEEE, 2025

  71. [71]

    Triton: an intermediate language and compiler for tiled neural network computations

    Philippe Tillet, Hsiang-Tsung Kung, and David Cox. Triton: an intermediate language and compiler for tiled neural network computations. InProceedings of the 3rd ACM SIGPLAN International Workshop on Machine Learning and Programming Languages, pages 10–19, 2019

  72. [72]

    Ultra Accelerator Link (UALink) 200G 1.0 Specification

    UALink Consortium. Ultra Accelerator Link (UALink) 200G 1.0 Specification. https:// ualinkconsortium.org/specification/, 2026. Accessed: 2026-05-06

  73. [73]

    vLLM Bench Latency

    vLLM Project. vLLM Bench Latency. https://docs.vllm.ai/en/stable/cli/bench/latency/. Accessed: 2026-05-05

  74. [74]

    vLLM v0.6.0: 2.7x throughput improvement and 5x latency reduction

    vLLM Team. vLLM v0.6.0: 2.7x throughput improvement and 5x latency reduction. https://vllm.ai/ blog/perf-update, 2024. Accessed 2026-04-26. 13

  75. [75]

    vLLM benchmarks

    vLLM Team. vLLM benchmarks. https://docs.vllm.ai/en/v0.14.1/api/vllm/benchmarks/,

  76. [76]

    SimAI: Unifying architecture design and performance tuning for large-scale large language model training with scalability and precision.NSDI, 2025

    Xizheng Wang et al. SimAI: Unifying architecture design and performance tuning for large-scale large language model training with scalability and precision.NSDI, 2025

  77. [77]

    A systematic methodology for analysis of deep learning hardware and software platforms.Proceedings of Machine Learning and Systems, 2:30–43, 2020

    Yu Wang, Gu-Yeon Wei, and David Brooks. A systematic methodology for analysis of deep learning hardware and software platforms.Proceedings of Machine Learning and Systems, 2:30–43, 2020

  78. [78]

    Burstgpt: A real-world workload dataset to optimize llm serving systems

    Yuxin Wang, Yuhan Chen, Zeyu Li, Xueze Kang, Yuchu Fang, Yeju Zhou, Yang Zheng, Zhenheng Tang, Xin He, Rui Guo, et al. Burstgpt: A real-world workload dataset to optimize llm serving systems. In Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V . 2, pages 5831–5841, 2025

  79. [79]

    ASTRA-sim2.0: Modeling hierarchical networks and disaggregated systems for large-model training at scale.arXiv preprint arXiv:2303.14006, 2023

    William Won, Taekyung Shi, Dhruvit Ajith, Saeed Sudarshan, Adarsh Ravichandran, and Tushar Krishna. ASTRA-sim2.0: Modeling hierarchical networks and disaggregated systems for large-model training at scale.arXiv preprint arXiv:2303.14006, 2023

  80. [80]

    LLMCompass: Enabling efficient hardware design for large language model inference

    Hengrui Zhang, August Ning, Rohan Baskar Prabhakar, and David Wentzlaff. LLMCompass: Enabling efficient hardware design for large language model inference. In2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA), pages 1080–1096. IEEE, 2024. doi: 10.1109/ISCA59077. 2024.00082

Showing first 80 references.