pith. machine review for the scientific record.

arxiv: 2605.11333 · v1 · submitted 2026-05-11 · 💻 cs.DC · cs.LG · cs.PF

Recognition: no theorem link

MLCommons Chakra: Advancing Performance Benchmarking and Co-design using Standardized Execution Traces


Pith reviewed 2026-05-13 01:24 UTC · model grok-4.3

classification 💻 cs.DC · cs.LG · cs.PF

keywords Chakra execution trace · performance benchmarking · distributed AI workloads · SW-HW co-design · execution traces · MLCommons · graph-based representation · simulators and emulators

The pith

Chakra establishes a standardized graph-based execution trace format to represent distributed AI workloads for benchmarking and co-design.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Chakra as an open ecosystem designed to support observation, reproduction, and optimization of distributed machine learning workloads. Its central element is the Chakra execution trace, a portable graph representation that records compute, memory, and communication operations together with dependencies, timing, and resource limits. This approach addresses the need for consistent methods to analyze production AI systems amid rapid hardware and software changes. By supplying supporting tools for trace collection, generation, and use in simulators and emulators, the work seeks to enable broader interoperability and more efficient software-hardware co-design.

Core claim

The core component of Chakra is an open and interoperable graph-based representation of distributed AI/ML workloads, called Chakra execution trace (ET). These ETs represent key operations, such as compute, memory, and communication, data and control dependencies, timing, and resource constraints. Chakra includes a complementary set of tools and capabilities to enable the collection, analysis, generation, and adoption of Chakra ETs by a broad range of simulators, emulators, and replay tools. Analysis of Chakra ETs collected on production AI clusters and real-world case studies demonstrate its value, with adoption by MLCommons and contributions from industry partners.

What carries the argument

The Chakra execution trace (ET), a graph-based representation that encodes distributed AI/ML workloads through operations, dependencies, timing, and constraints.
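As a deliberately simplified illustration of what such a graph carries, the sketch below models ET-like nodes in Python and computes a dependency-weighted critical path over recorded durations. The field names (`op_type`, `duration_us`, `deps`) are illustrative assumptions for exposition, not the actual Chakra ET schema.

```python
from dataclasses import dataclass, field

# Toy stand-in for an execution-trace node; fields are assumed for
# illustration and do not reproduce the real Chakra ET schema.
@dataclass
class ETNode:
    node_id: int
    op_type: str           # e.g. "compute", "memory", "communication"
    duration_us: float     # recorded timing for this operation
    deps: list = field(default_factory=list)  # parent dependency node ids

def critical_path_us(nodes):
    """Longest dependency-weighted path: a lower bound on trace makespan."""
    by_id = {n.node_id: n for n in nodes}
    memo = {}
    def finish(nid):
        if nid not in memo:
            n = by_id[nid]
            start = max((finish(d) for d in n.deps), default=0.0)
            memo[nid] = start + n.duration_us
        return memo[nid]
    return max(finish(n.node_id) for n in nodes)

trace = [
    ETNode(0, "compute", 120.0),
    ETNode(1, "memory", 40.0, deps=[0]),
    ETNode(2, "communication", 300.0, deps=[0]),
    ETNode(3, "compute", 80.0, deps=[1, 2]),
]
print(critical_path_us(trace))  # 120 + 300 + 80 = 500.0
```

Under these assumptions, the critical path bounds execution time no matter how much parallelism a replaying system has, which is the kind of question a simulator consuming ETs would ask of a trace.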

If this is right

  • Traces collected from production AI clusters can be analyzed to understand workload behavior.
  • Case studies can validate practical benefits in performance optimization.
  • Multiple simulators and replay tools can integrate the format for consistent benchmarking.
  • Industry partners can contribute to and use a shared representation for co-design efforts.
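As a concrete example of the first point, a basic analysis pass over a trace might aggregate recorded time per operation class, in the spirit of the execution-time breakdowns the paper reports. The record keys here are assumptions for illustration, not Chakra's actual fields.

```python
from collections import defaultdict

# Hypothetical trace records; the keys are stand-ins, not Chakra ET fields.
trace = [
    {"op_type": "compute", "duration_us": 120.0},
    {"op_type": "communication", "duration_us": 300.0},
    {"op_type": "memory", "duration_us": 40.0},
    {"op_type": "communication", "duration_us": 150.0},
]

def time_breakdown(records):
    """Sum recorded durations per operation class, as a trace analysis might."""
    totals = defaultdict(float)
    for r in records:
        totals[r["op_type"]] += r["duration_us"]
    return dict(totals)

print(time_breakdown(trace))
# {'compute': 120.0, 'communication': 450.0, 'memory': 40.0}
```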

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • A common trace format could reduce redundant work when moving performance models between different simulation platforms.
  • Patterns visible in standardized traces might identify recurring bottlenecks to guide targeted hardware improvements.
  • The approach could extend to non-AI distributed workloads if the graph structure proves sufficiently general.
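A minimal sketch of the interoperability idea behind the first bullet: two tools that agree on a serialized trace layout can exchange workloads losslessly. The JSON layout and field names below are stand-in assumptions, not Chakra's actual schema.

```python
import json
from dataclasses import dataclass, asdict, field

# Toy trace node; fields are illustrative, not the Chakra ET schema.
@dataclass
class ETNode:
    node_id: int
    op_type: str
    duration_us: float
    deps: list = field(default_factory=list)

def dump_trace(nodes, path):
    """Write a trace in a shared on-disk layout any conforming tool can read."""
    with open(path, "w") as f:
        json.dump([asdict(n) for n in nodes], f)

def load_trace(path):
    """Reload the trace, e.g. in a different simulator or replay tool."""
    with open(path) as f:
        return [ETNode(**d) for d in json.load(f)]

trace = [ETNode(0, "compute", 120.0), ETNode(1, "communication", 300.0, deps=[0])]
dump_trace(trace, "trace.json")
assert load_trace("trace.json") == trace  # lossless round trip
```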

Load-bearing premise

The assumption that simulators, emulators, and replay tools will adopt the Chakra ET format and tools to deliver meaningful improvements in benchmarking and co-design.

What would settle it

Demonstration that leading simulation tools continue to rely on proprietary formats and produce equivalent or superior co-design results without using Chakra ETs.

Figures

Figures reproduced from arXiv: 2605.11333 by Andy Balogh, Ashwin Ramachandran, Bradford M. Beckmann, Brian Coutinho, Changhai Man, Dan Mihailescu, David Kanter, Hanjiang Wu, Huan Xu, Jinsun Yoo, Joongun Park, Josh Ladd, Louis Feng, Mehryar Garakani, Phio Tian, Puneet Sharma, Saeed Rashidi, Sanshan Gao, Sheng Fu, Spandan More, Srinivas Sridharan, Taekyung Heo, Tushar Krishna, Vijay Janapa Reddi, Vinay Ramakrishnaiah, William Won, Winston Liu, Ziwei Li.

Figure 1: AI system SW-HW co-design flow.

Figure 2: Chakra Infrastructure Overview.

Figure 3.

Figure 5: Chakra ET visualization example.

Figure 6: Normalized execution time breakdown across workloads for traces collected on the system mentioned in Sec. 5. For each workload, measured performance from Kineto (left) and performance via trace reconstruction through Chakra (right).

Figure 7: Total collective communication runtime comparison at 400 Gb/s and 100 Gb/s InfiniBand, measured on training Mixtral-8×22B with 32 GPUs (four HGX-8×H200 nodes, TP/SP=4, EP=8) and a global batch size of 32.

Figure 8: GPU memory utilization for different LLM models during one training step. Traces are aligned relative to the start of each epoch.

Figure 9: Compute characteristics of the Mixtral-8×22B Chakra trace. (a) Most compute kernels complete within 2–102 µs. (b) The majority of nodes have 10–500 parent data dependencies.

Figure 10: Bus bandwidth per iteration when (a) All-Reduce, (b) All-to-All, and (c) mixing All-to-All and All-Reduce in one time span.

Figure 11: Mixing collectives results (CDF).

Figure 12: Communication time for different network topologies and bandwidths with the Mixtral 8×7B target.

Figure 14: Distribution of token routing among two expert-parallel ranks for each model layer. The input has six tokens and the model used is Mixtral 8×7B with 32 layers.

Figure 15: Runtime breakdown of the KV cache transfer for inferencing Llama3-8B between one prefill and one decode GPU. The captured trace denotes the per-layer (32 layers for Llama3-8B) send and receive latency between the two GPUs.
read the original abstract

The fast pace of artificial intelligence (AI) innovation demands an agile methodology for observation, reproduction and optimization of distributed machine learning (ML) workload behavior in production AI systems and enables efficient software-hardware (SW-HW) co-design for future systems. We present Chakra, an open and portable ecosystem for performance benchmarking and co-design. The core component of Chakra is an open and interoperable graph-based representation of distributed AI/ML workloads, called Chakra execution trace (ET). These ETs represent key operations, such as compute, memory, and communication, data and control dependencies, timing, and resource constraints. Additionally, Chakra includes a complementary set of tools and capabilities to enable the collection, analysis, generation, and adoption of Chakra ETs by a broad range of simulators, emulators, and replay tools. We present analysis of Chakra ETs collected on production AI clusters and demonstrate value via real-world case studies. Chakra has been adopted by MLCommons and has active contributions and engagement across the industry, including but not limited to NVIDIA, AMD, Meta, Keysight, HPE, and Scala, to name a few.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces Chakra, an open ecosystem for performance benchmarking and co-design of distributed AI/ML workloads. The core is the Chakra execution trace (ET), a graph-based representation capturing compute, memory, communication operations, dependencies, timing, and resource constraints. It provides tools for ET collection, analysis, generation, and integration with simulators, emulators, and replay tools. The paper includes analysis of production traces from AI clusters and real-world case studies, noting adoption by MLCommons and contributions from NVIDIA, AMD, Meta, and other industry players.

Significance. Should the standardized ET format achieve broad adoption, it would offer a significant advance by enabling consistent representation of complex distributed workloads, facilitating cross-tool interoperability, and supporting more effective SW-HW co-design for AI systems. The involvement of MLCommons and multiple vendors strengthens the potential for impact in the field.

major comments (2)
  1. The central claim that Chakra advances benchmarking and co-design relies on the adoption and use by simulators and emulators, yet the manuscript provides no quantitative interoperability test results or adoption metrics to demonstrate this (see Abstract and the section on tools and adoption).
  2. The real-world case studies are mentioned but lack specific quantitative before-and-after metrics on improvements in benchmarking accuracy or co-design outcomes attributable to the Chakra ET format (see the case studies section).
minor comments (2)
  1. The definition of the ET graph structure (nodes, edges, attributes for timing and constraints) would benefit from a formal specification or concrete example diagram in the main text for clarity.
  2. Ensure the related work section includes comparisons to prior execution trace formats such as those used in existing ML performance tools.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment point by point below, with indications of planned revisions to the next version of the paper.

read point-by-point responses
  1. Referee: The central claim that Chakra advances benchmarking and co-design relies on the adoption and use by simulators and emulators, yet the manuscript provides no quantitative interoperability test results or adoption metrics to demonstrate this (see Abstract and the section on tools and adoption).

    Authors: We acknowledge that the manuscript currently presents adoption primarily through qualitative statements regarding MLCommons standardization and contributions from partners including NVIDIA, AMD, Meta, Keysight, HPE, and Scala. No numerical metrics on the number of integrations or results from formal interoperability tests (e.g., cross-simulator fidelity comparisons) are included. In the revised manuscript we will expand the tools and adoption section with all currently available quantitative indicators of usage and will add descriptions of interoperability validation performed during tool development. Comprehensive cross-tool benchmark suites remain an area of active community development. revision: partial

  2. Referee: The real-world case studies are mentioned but lack specific quantitative before-and-after metrics on improvements in benchmarking accuracy or co-design outcomes attributable to the Chakra ET format (see the case studies section).

    Authors: The case studies section illustrates how Chakra ETs were collected from production clusters and used to drive analysis and co-design decisions. We agree that explicit before-and-after quantitative comparisons would strengthen the claims. In the revision we will augment the case studies with available quantitative results from the trace analyses, including measured improvements in bottleneck identification and any co-design outcomes that can be directly attributed to the standardized representation. revision: yes

Circularity Check

0 steps flagged

No circularity: new format definition with no derivations or self-referential reductions

full rationale

The manuscript introduces Chakra as a new open graph-based execution trace format and supporting ecosystem for AI/ML workload representation. No equations, fitted parameters, predictions, or first-principles derivations appear in the provided text or abstract. The core claim is definitional (the ET format captures ops, dependencies, timing, and constraints) and ecosystem-oriented (tools for collection/analysis plus asserted industry adoption). No step reduces by construction to prior inputs, self-citations, or fitted quantities; the paper is self-contained as a standards proposal without internal logical loops.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the utility of a newly defined graph-based trace format and its adoption by simulators and industry; no numerical free parameters are introduced.

axioms (1)
  • domain assumption Graph-based structures can faithfully capture compute, memory, communication operations, dependencies, timing, and resource constraints in distributed ML systems.
    This is the foundational premise for defining the Chakra ET representation.
invented entities (1)
  • Chakra execution trace (ET) no independent evidence
    purpose: Portable, interoperable graph representation of distributed AI/ML workload behavior including operations, dependencies, timing, and constraints.
    Newly introduced format intended to enable collection, analysis, and use across simulators and emulators.

pith-pipeline@v0.9.0 · 5621 in / 1372 out tokens · 56571 ms · 2026-05-13T01:24:54.212830+00:00 · methodology

discussion (0)

