pith. machine review for the scientific record.

arxiv: 2604.12090 · v1 · submitted 2026-04-13 · 💻 cs.DC

Recognition: unknown

Evaluating Cross-Architecture Performance Modeling of Distributed ML Workloads Using StableHLO

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 14:58 UTC · model grok-4.3

classification 💻 cs.DC
keywords: StableHLO · performance modeling · distributed ML · cross-architecture · GPU · TPU · simulators · workload representation

The pith

StableHLO can serve as a single portable representation for predicting distributed ML workload performance across GPUs, TPUs, and varied simulators.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether StableHLO offers a common way to express machine learning workloads so that their performance can be modeled on different accelerator types without rewriting the description each time. It feeds the same workload description into analytical models, profiling-based estimators, and full simulators to compare outcomes on GPUs versus TPUs. The evaluation covers distributed matrix multiplication kernels, ResNet models, and large language model training, showing that relative performance differences between architectures are captured with errors small enough for early design work. The approach also exposes where current simulators lose accuracy on certain workloads. The results support using one StableHLO form for repeated evaluations during ML system planning.
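To make the workflow concrete, the sketch below shows how a JAX-defined workload can be lowered to StableHLO text that downstream predictors could consume. This is an illustrative stand-in, not the authors' tooling: the stub function, the shapes, and the assumption that a recent JAX release emits StableHLO from jit(...).lower(...).as_text() are editorial additions, not taken from the paper.

    # Sketch: obtaining a StableHLO text module from a JAX workload so that one
    # representation can feed analytical, profiling-based, and simulator-driven
    # predictors. The workload here is a placeholder, not one of the paper's models.
    import jax
    import jax.numpy as jnp

    def train_step_stub(w, x):
        # Stand-in for a real training step (e.g., one GEMM plus activation).
        return jnp.tanh(x @ w).sum()

    w = jnp.zeros((1024, 4096), dtype=jnp.bfloat16)
    x = jnp.zeros((8, 1024), dtype=jnp.bfloat16)

    lowered = jax.jit(train_step_stub).lower(w, x)
    stablehlo_text = lowered.as_text()       # StableHLO-MLIR module as plain text
    print(stablehlo_text.splitlines()[0])    # module header line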

Core claim

The study establishes a StableHLO-based simulation methodology that maps a single workload representation onto multiple performance models, spanning analytical, profiling-based, and simulator-driven predictors. Using this methodology, workloads are evaluated across GPUs and TPUs without requiring access to scaled-out physical systems, enabling systematic comparison across modeling fidelities. An empirical evaluation covering distributed GEMM kernels, ResNet, and large language model training workloads demonstrates that StableHLO preserves relative performance trends across architectures and fidelities, while exposing accuracy trade-offs and simulator limitations. Across evaluated scenarios, prediction errors remain within practical bounds for early-stage design exploration, and the methodology reveals fidelity-dependent limitations in existing GPU simulators.

What carries the argument

StableHLO dialect as a unified, portable representation of ML operations that feeds into multiple separate performance predictors.
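For a sense of what the lowest-fidelity end of that predictor stack looks like, here is a minimal roofline-style analytical estimate for a single GEMM. It is a sketch under placeholder peak-compute and bandwidth numbers, not the paper's analytical model; in the actual methodology the operator dimensions would come from stablehlo.dot_general shapes rather than being hard-coded.

    # Sketch: roofline-style analytical predictor fed by operator shapes that a
    # StableHLO walker would extract. Peak rates below are placeholders, not
    # parameters from the paper.
    PEAK_FLOPS = 275e12   # assumed accelerator peak, FLOP/s
    PEAK_BW    = 1.6e12   # assumed HBM bandwidth, bytes/s

    def gemm_cost(m, k, n, bytes_per_elem=2):
        flops = 2 * m * k * n
        bytes_moved = bytes_per_elem * (m * k + k * n + m * n)
        return flops, bytes_moved

    def roofline_latency(flops, bytes_moved):
        # Bounded by whichever resource saturates first.
        return max(flops / PEAK_FLOPS, bytes_moved / PEAK_BW)

    flops, traffic = gemm_cost(8192, 8192, 8192)
    print(f"predicted GEMM latency ~ {roofline_latency(flops, traffic) * 1e3:.2f} ms")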

If this is right

  • Workloads can be compared across GPUs and TPUs using one description and without building full hardware clusters.
  • Different prediction methods can be applied to the same StableHLO form to reveal accuracy differences.
  • Simulator shortcomings become visible when the same workload runs through multiple modeling paths.
  • Reusable workflows emerge for checking performance during early ML system design stages.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Teams could iterate on hardware choices faster by testing many configurations from one workload file before any physical runs.
  • Cross-checking outputs from different simulators against each other might expose and reduce individual tool biases.
  • The method could extend naturally to additional accelerator types once their simulators accept StableHLO input.
  • Adoption might lower the barrier for comparing new distributed training strategies across mixed hardware environments.

Load-bearing premise

Converting workloads into StableHLO keeps the performance traits that matter for predicting relative differences across architectures and simulators.

What would settle it

Direct measurements on physical multi-GPU and multi-TPU clusters that show the StableHLO-based models fail to match actual relative speedups or efficiencies for the tested workloads.

Figures

Figures reproduced from arXiv: 2604.12090 by Abubakr Nada, Arjun Singh, Changhai Man, Debjyoti Bhattacharjee, James Myers, Jonas Svedas, Nathan Laubeuf, Ryan Harvey, Tushar Krishna.

Figure 1: Simplified ML stack overview: from model specification …
Figure 2: Overview of the StableHLO-based evaluation methodology for distributed ML performance modeling. Workloads are …
Figure 3: Fragment of StableHLO-MLIR illustrating a matrix …
Figure 4: Compiler pathways enabled by StableHLO-MLIR.
Figure 5: TPUv3-8 system diagram and key parameters.
Figure 6: Training-step latency estimates of single training …
Figure 7: Training-step latency estimate for ResNet variants on a …
Figure 9: Llama-2 scale-out simulation results using ASTRA-sim …
Figure 10: Operator-level benchmarking of matrix multiplication …
Figure 12: IREE–Accel-Sim integration flow used to evaluate …
Original abstract

Predicting the performance of large-scale distributed machine learning (ML) workloads across multiple accelerator architectures remains a central challenge in ML system design. Existing GPU and TPU focused simulators are typically architecture-specific, while distributed training simulators rely on workload-specific analytical models or costly post-execution traces, limiting portability and cross-platform comparison. This work evaluates whether MLIR's StableHLO dialect can serve as a unified workload representation for cross-architecture and cross-fidelity performance modeling of distributed ML workloads. The study establishes a StableHLO-based simulation methodology that maps a single workload representation onto multiple performance models, spanning analytical, profiling-based, and simulator-driven predictors. Using this methodology, workloads are evaluated across GPUs and TPUs without requiring access to scaled-out physical systems, enabling systematic comparison across modeling fidelities. An empirical evaluation covering distributed GEMM kernels, ResNet, and large language model training workloads demonstrates that StableHLO preserves relative performance trends across architectures and fidelities, while exposing accuracy trade-offs and simulator limitations. Across evaluated scenarios, prediction errors remain within practical bounds for early-stage design exploration, and the methodology reveals fidelity-dependent limitations in existing GPU simulators. These results indicate that StableHLO provides a viable foundation for unified, distributed ML performance modeling across accelerator architectures and simulators, supporting reusable evaluation workflows and cross-validation throughout the ML system design process.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper evaluates MLIR's StableHLO dialect as a unified representation for cross-architecture performance modeling of distributed ML workloads. It maps workloads (distributed GEMM, ResNet, LLM training) to StableHLO and applies them to analytical, profiling-based, and simulator-driven predictors spanning GPUs and TPUs. Empirical results show preserved relative performance trends across architectures and modeling fidelities, with prediction errors within practical bounds for early design, while exposing limitations in existing GPU simulators. The authors conclude that StableHLO enables reusable evaluation workflows and cross-validation in ML system design.

Significance. If substantiated, this provides a meaningful advance for portable performance modeling in distributed ML systems by reducing reliance on architecture-specific tools. The empirical breadth across workloads and fidelities, plus identification of simulator limitations, offers practical value for early-stage design exploration. The use of a standard dialect supports reproducibility and could standardize cross-architecture comparisons in the ML systems community.

major comments (2)
  1. [Methodology] The central claim that StableHLO mappings preserve relative performance trends without losing distributed execution details (communication patterns, memory hierarchies, scaling) is load-bearing for the viability conclusion. The methodology section must explicitly describe how StableHLO represents collectives and tensor layouts, with direct comparisons to native simulator inputs, to demonstrate that observed trend preservation is not an artifact of the selected workloads.
  2. [Empirical Evaluation] The abstract states that prediction errors remain within practical bounds, but the evaluation section must report concrete error metrics (e.g., MAPE or absolute relative error per workload), simulator configurations, data exclusion rules, and per-scenario results (including any tables comparing predictions to ground truth). Without these, the support for the 'practical bounds for early-stage design' claim cannot be verified.
minor comments (2)
  1. Define all performance predictor notations and fidelity levels in a dedicated table or glossary for clarity across the results.
  2. Include an appendix with full simulator versions, StableHLO dialect commit, and workload mapping scripts to support reproducibility.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

Thank you for the opportunity to revise our manuscript based on the referee's insightful comments. We address each major comment below and have made revisions to improve clarity and substantiation of our claims.

Point-by-point responses
  1. Referee: [Methodology] The central claim that StableHLO mappings preserve relative performance trends without losing distributed execution details (communication patterns, memory hierarchies, scaling) is load-bearing for the viability conclusion. The methodology section must explicitly describe how StableHLO represents collectives and tensor layouts, with direct comparisons to native simulator inputs, to demonstrate that observed trend preservation is not an artifact of the selected workloads.

    Authors: We agree with this assessment and have revised the Methodology section to provide a detailed description of StableHLO's representation of collectives and tensor layouts. Specifically, we now include explanations of how operations like AllReduce and AllGather are mapped, along with tensor sharding and layout specifications. We also added direct comparisons between StableHLO-derived inputs and native simulator configurations for the GPU and TPU models, using the distributed GEMM and ResNet workloads as examples. These additions confirm that the trend preservation is not an artifact but stems from faithful representation of distributed aspects. revision: yes

  2. Referee: [Empirical Evaluation] The abstract states that prediction errors remain within practical bounds, but the evaluation section must report concrete error metrics (e.g., MAPE or absolute relative error per workload), simulator configurations, data exclusion rules, and per-scenario results (including any tables comparing predictions to ground truth). Without these, the support for the 'practical bounds for early-stage design' claim cannot be verified.

    Authors: We have updated the Empirical Evaluation section to include the requested details. Concrete metrics such as Mean Absolute Percentage Error (MAPE) are now reported for each workload (distributed GEMM, ResNet, LLM training) across architectures and modeling fidelities. Simulator configurations, including versions and key parameters, are specified, along with data exclusion rules (e.g., outlier handling). New tables have been added showing per-scenario prediction vs. ground truth comparisons. These revisions provide verifiable support for the practical bounds claim. revision: yes
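The rebuttal's error reporting centers on MAPE; for reference, the standard definition is sketched below. The latency values are hypothetical and not taken from the paper's artifact.

    # Mean absolute percentage error over per-scenario latencies (standard definition).
    def mape(measured, predicted):
        pairs = list(zip(measured, predicted))
        return 100.0 * sum(abs((m - p) / m) for m, p in pairs) / len(pairs)

    # Hypothetical measured vs. predicted step latencies (ms) for one workload.
    print(f"{mape([12.0, 30.5, 41.0], [11.1, 33.0, 39.2]):.1f} %")  # -> 6.7 %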

Circularity Check

0 steps flagged

No circularity in empirical evaluation of StableHLO mappings

Full rationale

The paper presents an empirical methodology that maps distributed ML workloads (GEMM, ResNet, LLM training) to StableHLO and evaluates preservation of relative performance trends across GPU/TPU simulators and analytical models. Claims rest on direct experimental comparisons of prediction errors and trend fidelity against external benchmarks, without any derivations, equations, fitted parameters redefined as predictions, or self-citation chains that reduce the central viability conclusion to its own inputs by construction. The evaluation is self-contained against measured data and cross-fidelity checks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper rests on standard assumptions about workload representability in StableHLO and the fidelity of existing performance models; no new free parameters, axioms, or invented entities are introduced in the abstract.

pith-pipeline@v0.9.0 · 5570 in / 1078 out tokens · 42364 ms · 2026-05-10T14:58:01.026755+00:00 · methodology

discussion (0)

