
arxiv: 2606.26633 · v1 · pith:LXGPDC66new · submitted 2026-06-25 · 💻 cs.DC

Simulating Unified Tensor Resharding in heterogeneous AI systems

Pith reviewed 2026-06-26 03:56 UTC · model grok-4.3

classification 💻 cs.DC
keywords heterogeneous AI training · LLM simulator · distributed training · tensor parallelism · pipeline parallelism · workload partitioning · collective communication · tensor resharding

The pith

Xsim simulates heterogeneous LLM training and predicts times with under 5% error.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Existing simulators for distributed AI training assume uniform compute and network resources across all devices. Real deployments are increasingly heterogeneous because multimodal and MoE models exploit mixed hardware, public clouds offer limited homogeneous capacity, and large enterprises run geographically distributed infrastructure. Xsim adds heterogeneity-aware features: non-uniform workload partitioning, customized ring construction for collectives, and reusable abstractions for pipeline parallelism and tensor resharding. It plugs into the NS-3 and htsim network simulators and reports under 5% training-time error on most heterogeneous data-parallel and tensor-parallel configurations. If the predictions hold, teams can evaluate deployment plans on mixed hardware without repeated physical runs.

Core claim

Xsim is a heterogeneity-aware simulator for distributed LLM training. It supports:

  • load balancing through non-uniform workload partitioning across heterogeneous device groups;
  • heterogeneity-aware collective communication via customized ring construction and chunk partitioning;
  • reusable abstractions for emerging pipeline-parallel algorithms and non-uniform tensor resharding;
  • flexible inputs for custom device groups and parallelism mappings;
  • pluggable integration with NS-3 and htsim.

It delivers training-time predictions with less than 5% error across most heterogeneous data-parallel and tensor-parallel configurations and around 2% error when modeling pipeline-parallel communication.
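As a rough illustration of what non-uniform workload partitioning can mean in practice, the sketch below splits a global batch across device groups in proportion to an assumed relative throughput. The helper name `partition_batch`, the proportional rule, the largest-remainder rounding, and the throughput numbers are all assumptions made for this example; the summary does not say which partitioning policy Xsim applies.

```python
from typing import Dict


def partition_batch(global_batch: int, throughput: Dict[str, float]) -> Dict[str, int]:
    """Split `global_batch` samples across groups in proportion to throughput.

    Hypothetical helper (not part of Xsim). Largest-remainder rounding keeps
    the integer shares summing exactly to `global_batch`.
    """
    if global_batch < 0 or not throughput or any(t <= 0 for t in throughput.values()):
        raise ValueError("need a non-negative batch and positive throughputs")
    total = sum(throughput.values())
    exact = {g: global_batch * t / total for g, t in throughput.items()}
    shares = {g: int(x) for g, x in exact.items()}  # floor of each exact share
    leftover = global_batch - sum(shares.values())
    # Give the remaining samples to the groups with the largest fractional part.
    by_fraction = sorted(exact, key=lambda g: exact[g] - shares[g], reverse=True)
    for g in by_fraction[:leftover]:
        shares[g] += 1
    return shares


if __name__ == "__main__":
    # Assumed relative throughputs, not measured values.
    print(partition_batch(64, {"fast_group": 3.0, "slow_group": 1.0}))
```

With a 3:1 throughput ratio the faster group receives three quarters of the batch, which is the load-balancing intent the claim describes.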

What carries the argument

Xsim simulator and its abstractions for non-uniform workload partitioning, customized ring construction, and chunk partitioning that enable heterogeneity-aware collective communication and tensor resharding.
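To make the resharding problem concrete, the sketch below uses the case from the Figure 2 caption: a 12-element tensor moving from a stage with TP = 6 to a stage with TP = 4. It computes which index ranges each source rank would send to each destination rank if every overlapping pair exchanged data directly. This is a generic overlap computation written for illustration, assuming contiguous, evenly sized shards; it is not the LCM-based multi-ring method of Xsim, nor a reproduction of HetAuto or AlpaComm, and the name `p2p_reshard_plan` is hypothetical.

```python
from math import gcd
from typing import List, Tuple

# (src_rank, dst_rank, start, stop) with `stop` exclusive
Transfer = Tuple[int, int, int, int]


def p2p_reshard_plan(num_elements: int, src_tp: int, dst_tp: int) -> List[Transfer]:
    """List direct transfers needed to move evenly sharded data between TP degrees."""
    if num_elements % src_tp or num_elements % dst_tp:
        raise ValueError("sketch assumes the tensor divides evenly on both sides")
    src_size = num_elements // src_tp
    dst_size = num_elements // dst_tp
    plan: List[Transfer] = []
    for src in range(src_tp):
        s_lo, s_hi = src * src_size, (src + 1) * src_size
        for dst in range(dst_tp):
            d_lo, d_hi = dst * dst_size, (dst + 1) * dst_size
            lo, hi = max(s_lo, d_lo), min(s_hi, d_hi)
            if lo < hi:  # the two shards share at least one element
                plan.append((src, dst, lo, hi))
    return plan


if __name__ == "__main__":
    for src, dst, lo, hi in p2p_reshard_plan(12, 6, 4):
        print(f"src {src} -> dst {dst}: elements [{lo}, {hi})")
    print("gcd:", gcd(6, 4), "lcm:", 6 * 4 // gcd(6, 4))
```

For this case the plan contains eight transfers, and LCM(6, 4) = 12 is the finest chunk granularity at which source and destination shard boundaries align.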

If this is right

  • Training time can be predicted within 5% error for heterogeneous data-parallel and tensor-parallel setups.
  • Pipeline-parallel communication modeling reaches around 2% error.
  • Metrics such as pipeline bubble time and straggler waiting time become available for inspection.
  • Deployment plans with custom device groups and device-to-parallelism mappings can be specified and evaluated.
  • Users can trade simulation fidelity for speed by choosing between NS-3 and htsim backends.
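The straggler waiting time mentioned in the list above can be pictured with a small sketch: at a synchronous barrier, each group idles until the slowest group finishes. The definition used here (slowest finish time minus own finish time) is a common convention and an assumption on our part; the summary does not state how Xsim computes the metric, and the timings are invented.

```python
from typing import Dict


def straggler_wait_ms(compute_ms: Dict[str, float]) -> Dict[str, float]:
    """Per-group idle time at a synchronization barrier (assumed definition)."""
    if not compute_ms:
        return {}
    slowest = max(compute_ms.values())
    # Every group waits for the slowest one; the slowest itself waits 0 ms.
    return {group: slowest - t for group, t in compute_ms.items()}


if __name__ == "__main__":
    # Made-up per-iteration compute times for two device groups.
    print(straggler_wait_ms({"h100_group": 80.0, "a100_group": 120.0}))
```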

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Teams could iterate on mixed-hardware allocation strategies entirely in simulation before committing physical resources.
  • The same abstractions might be reused to compare alternative ring constructions or chunk sizes without new code.
  • Applying the simulator to MoE or multimodal workloads that deliberately exploit device differences would test its generality.
  • Comparing simulated versus measured times on a geographically distributed cluster would check whether network heterogeneity is captured at scale.

Load-bearing premise

The implemented abstractions for non-uniform workload partitioning, customized ring construction, and chunk partitioning will produce predictions that match actual hardware behavior.

What would settle it

Run Xsim on a specific heterogeneous data-parallel or tensor-parallel configuration, measure the real training time on the corresponding hardware cluster, and observe whether the absolute percentage error exceeds 5%.
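The check described above reduces to one formula, absolute percentage error = |simulated − measured| / measured × 100. A minimal sketch follows; the function names and the sample timings are placeholders, and the 5% default simply mirrors the threshold quoted in the claim.

```python
def abs_pct_error(simulated_ms: float, measured_ms: float) -> float:
    """Absolute percentage error of a simulated time against a measured one."""
    if measured_ms <= 0:
        raise ValueError("measured time must be positive")
    return abs(simulated_ms - measured_ms) / measured_ms * 100.0


def within_claim(simulated_ms: float, measured_ms: float, threshold_pct: float = 5.0) -> bool:
    """True when the error stays below the claimed threshold."""
    return abs_pct_error(simulated_ms, measured_ms) < threshold_pct


if __name__ == "__main__":
    # Placeholder timings, not results from the paper.
    print(within_claim(simulated_ms=1030.0, measured_ms=1000.0))
    print(within_claim(simulated_ms=1080.0, measured_ms=1000.0))
```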

Figures

Figures reproduced from arXiv: 2606.26633 by Abed Mohammad Kamaluddin, Kushal Mitra, Meet Dadhania, Praveen Tammana, Rinku Shah, Rohan Sudhir Basugade, Satananda Burla, Sayantan Dasgupta, Sumit Kumar.

Figure 2. Comparison of SOTA tensor resharding approaches for transferring a 12-element global tensor from a source stage (TP = 6) to a destination stage (TP = 4). (a) HetAuto employs a 3-phase hierarchical approach (Gather → P2P → Scatter) that groups devices into two virtual clusters, each governed by GCD(6, 4) = 2, and routes traffic through leaders. (b) AlpaComm establishes direct point-to-point connections…
Figure 4. Heterogeneity-aware multi-ring LCM-based resharding in Xsim for non-uniform layer partitioning across high-compute (blue) and low-compute (red) device groups with layer-aware DP groups.
… that can generate and map multiple workloads, rather than broadcasting a single static workload across the cluster. Xsim's Asymmetric Workload Generator (SUTRA_AWG) ingests heterogeneity-aware framework and model parame…
Figure 5. Overview of contributions to htsim implementation.
… communication groups and chunking derived using the LCM-based synchronization algorithms. After local computation completes, devices in the DP Barrier Group (DBG) trigger gradient synchronization across data-parallel replicas. Together, the PBG and DBG abstractions enable faithful simulation of pipeline execution, inter-stage communication, and heteroge…
Figure 6. Training time per iteration for Llama 7B on a heterogeneous cluster shows that Xsim closely matches real hardware with <5% error, while SimAI incurs large errors due to explicit heterogeneity modeling.
[Plot: panels Llama 7B and Llama 13B; x-axis: cluster configuration (C9, C11, C12, C10); y-axis: training time per iteration (ms, log scale); series: Hexiscale, Real, Xsim.]
Figure 9. Comparison of isolated scale-up TP collective communication time between Xsim (using NS-3 and htsim backends) and real cluster of 8× H200 NVLink nodes with an average error of 5.5%.
[Plot: panels Llama 7B, Llama 13B, Llama 70B; x-axis: layers for DP communication (grad_gather, grad_param); y-axis: collective communication time (ns) …]
Figure 12. Total training time & exposed PP communication time comparison across asymmetric topology pairs comparing different re-sharding technique algorithms. H100 × 6 → A100 × 4, H100 × 8 → A100 × 1, and H100 × 4 → A100 × 4.
Figure 13. Input Specification for the example deployment configuration.
… processes these parameters along with the device group configurations to generate a workload file per device group. Each workload file contains the training iteration structure: a header specifying the tensor parallel degree, pipeline parallel stages, gradient accumulation steps and embedding parameters. The header is then followed by wor…
Figure 14. Heterogeneity-aware multi-ring LCM-based resharding in Xsim for non-uniform layer partitioning across high-compute (blue) and low-compute (red) device groups with layer-aware DP groups.
Step 1: Which Device Groups participate? For the layer range [1, 15], the sweep-line algorithm has already determined that DG0 and DG2 will be synchronizing gradients. Algorithm 2 therefore takes these 2 Device Groups as t…
Figure 15. Training time per iteration for Llama 7B and 13B across homogeneous cluster sizes shows that Xsim closely matches SimAI, with a relative error of 0.1–2.2%.
[Plot: panel Llama 7B; x-axis: cluster size (8, 16, 128, 512); y-axis: simulation runtime per iteration (hours); series: SimAI, Xsim.]
Figure 18. GPU idle time and training time per iteration across cluster configurations (C13–C15).
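The Figure 14 excerpt refers to a sweep-line pass that decides which device groups synchronize gradients for a given layer range. One way to picture such a pass is an interval sweep over layer boundaries that tracks which groups currently hold the layers being swept. The sketch below is a generic version of that idea, not the paper's algorithm. The layer assignments and the groups DG1 and DG3 are hypothetical, chosen only so that layers [1, 15] map to DG0 and DG2 as in the caption.

```python
from typing import Dict, List, Tuple

Interval = Tuple[int, int]  # inclusive layer range


def layer_sync_groups(layer_ranges: Dict[str, Interval]) -> List[Tuple[Interval, List[str]]]:
    """Sweep over layer boundaries and report which groups hold each layer interval."""
    events = []
    for group, (lo, hi) in layer_ranges.items():
        if lo > hi:
            raise ValueError(f"empty layer range for {group}")
        events.append((lo, 1, group))      # group becomes active at layer lo
        events.append((hi + 1, 0, group))  # group stops being active after layer hi
    events.sort()

    active = set()
    result: List[Tuple[Interval, List[str]]] = []
    prev = None
    for pos, kind, group in events:
        # Close the interval [prev, pos - 1] before applying the event at pos.
        if prev is not None and pos > prev and active:
            result.append(((prev, pos - 1), sorted(active)))
        if kind == 1:
            active.add(group)
        else:
            active.discard(group)
        prev = pos
    return result


if __name__ == "__main__":
    # Hypothetical layer assignments for a 32-layer model.
    ranges = {"DG0": (1, 15), "DG1": (16, 32), "DG2": (1, 20), "DG3": (21, 32)}
    for interval, groups in layer_sync_groups(ranges):
        print(interval, groups)
```

Each reported interval is a maximal run of layers held by the same set of groups, so the groups listed for an interval are the ones that would need to exchange gradients for those layers.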
Original abstract

State-of-the-art AI training simulators assume homogeneous compute and network infrastructure. However, real-world training infrastructure is becoming increasingly heterogeneous since: (a) Model architectures such as multimodal and MoE exploit heterogeneity to improve device utilization, (b) Public cloud platforms often provide limited availability of homogeneous hardware due to fast hardware evolution, and (c) Large enterprises frequently deploy geographically distributed infrastructure that is both diverse and heterogeneous. In this paper, we present Xsim, a heterogeneity-aware simulator for distributed LLM training. Xsim supports: (i) Load balancing through non-uniform workload partitioning across heterogeneous device groups, (ii) Heterogeneity-aware collective communication via customized ring construction and chunk partitioning, (iii) Reusable heterogeneity-aware abstractions for emerging pipeline-parallel algorithms and non-uniform tensor resharding technique, (iv) Flexible input abstractions for specifying deployment plans with custom device groups and custom device-to-parallelism mappings, and (v) Pluggable integration with NS-3 and htsim, allowing users to trade off simulation fidelity for performance and scalability. Our evaluation demonstrates that Xsim accurately predicts training time for real-world heterogeneous deployments, with an error of less than 5% across most heterogeneous data-parallel/tensor-parallel configurations and around 2% error with pipeline-parallel communication modeling. We expose actionable metrics such as pipeline bubble time and straggler waiting time.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The manuscript introduces Xsim, a heterogeneity-aware simulator for distributed LLM training. It supports non-uniform workload partitioning across device groups, heterogeneity-aware collective communication via customized ring construction and chunk partitioning, reusable abstractions for pipeline-parallel algorithms and non-uniform tensor resharding, flexible input for custom device-to-parallelism mappings, and pluggable backends (NS-3, htsim). The central claim is that Xsim predicts training times for real heterogeneous deployments with <5% error on most data-parallel/tensor-parallel configurations and ~2% error on pipeline-parallel communication modeling, while exposing metrics such as pipeline bubble time and straggler waiting time.

Significance. If the accuracy claims are substantiated with concrete heterogeneous hardware validation, the simulator would address a practical gap in modeling real-world AI training infrastructure that is increasingly heterogeneous due to cloud availability, multimodal/MoE models, and geo-distributed deployments. The pluggable network backends and reusable abstractions for emerging parallelism patterns are potentially useful contributions to the distributed systems simulation literature.

major comments (1)
  1. [Abstract] Abstract: the claim that Xsim 'accurately predicts training time ... with an error of less than 5% across most heterogeneous data-parallel/tensor-parallel configurations' is load-bearing for the paper's contribution, yet the abstract (and the provided manuscript excerpt) supplies no list of tested device mixes, no description of ground-truth runtime collection on real heterogeneous clusters, no indication of how link bandwidths or device performance were measured, and no discussion of whether the NS-3/htsim backends were calibrated against the same hardware. This prevents assessment of whether the reported errors reflect actual hardware match or only internal consistency checks.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for highlighting the need for greater clarity in the abstract regarding our accuracy claims. We address the comment below and will revise the manuscript to strengthen the presentation of our validation methodology.

  1. Referee: [Abstract] Abstract: the claim that Xsim 'accurately predicts training time ... with an error of less than 5% across most heterogeneous data-parallel/tensor-parallel configurations' is load-bearing for the paper's contribution, yet the abstract (and the provided manuscript excerpt) supplies no list of tested device mixes, no description of ground-truth runtime collection on real heterogeneous clusters, no indication of how link bandwidths or device performance were measured, and no discussion of whether the NS-3/htsim backends were calibrated against the same hardware. This prevents assessment of whether the reported errors reflect actual hardware match or only internal consistency checks.

    Authors: We agree that the abstract, constrained by length, omits key details needed to substantiate the central accuracy claim. The full manuscript's evaluation section describes the tested device mixes, ground-truth collection on physical heterogeneous clusters, bandwidth and performance measurements, and backend calibration against the same hardware. To directly address the concern and allow readers to evaluate the claims immediately, we will revise the abstract to include a concise statement on the validation approach and hardware configurations used. This change will clarify that the reported errors derive from hardware comparisons rather than solely internal checks.

    Revision: yes

Circularity Check

0 steps flagged

No circularity: simulator validation relies on external hardware comparison, not self-referential fitting or definitions


The paper introduces Xsim as a new simulator implementing heterogeneity-aware abstractions (non-uniform partitioning, customized rings, chunking, pluggable NS-3/htsim backends) and reports empirical prediction accuracy (<5% error on data/tensor-parallel configs, ~2% on pipeline) against real heterogeneous deployments. No equations, fitted parameters, or self-citations are presented that would make the accuracy claim equivalent to its inputs by construction. The load-bearing step is an external benchmark comparison, which is independent of the simulator's internal definitions. This matches the default case of a systems tool paper whose central claim does not reduce to tautology.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities can be identified from the abstract alone.



Reference graph

Works this paper leans on

81 extracted references · 15 canonical work pages · 5 internal anchors

  1. [1]

    Meta AI. 2025. The LLaMA 4 Herd: The Beginning of a New Era of Natively Multimodal AI Innovation.https://ai .meta.com/blog/ llama-4-multimodal-intelligence/Accessed from Meta AI Blog on 2026-02-06. 12

  2. [2]

    Daiyaan Arfeen, Dheevatsa Mudigere, Ankit More, Bhargava Gopireddy, Ahmet Inci, and Gregory R Ganger. 2025. Nonuniform- tensor-parallelism: Mitigating gpu failure impact for scaled-up llm training.arXiv preprint arXiv:2504.06095(2025)

  3. [3]

    ascentoptics. 2025. InfiniBand vs RoCE network fabrics: RDMA inter- connect comparisons.https://ascentoptics .com/blog/infiniband-vs- roce-which-is-better-suited-for-ai-data-center-networks/InfiniBand delivers engineered lossless RDMA fabrics with ultra-low latency and high throughput (up to 400 800 Gbps+ per port), while RoCE (espe- cially RoCEv2) brings...

  4. [4]

    Songyuan Bai, Hao Zheng, Chen Tian, Xiaoliang Wang, Chang Liu, Xin Jin, Fu Xiao, Qiao Xiang, Wanchun Dou, and Guihai Chen. 2024. Unison: a parallel-efficient and user-transparent network simulation kernel. InProceedings of the Nineteenth European Conference on Com- puter Systems. 115–131

  5. [5]

    Jehyeon Bang, Yujeong Choi, Myeongwoo Kim, Yongdeok Kim, and Minsoo Rhu. 2024. vtrain: A simulation framework for evaluating cost-effective and compute-optimal large language model training. In2024 57th IEEE/ACM International Symposium on Microarchitecture (MICRO). IEEE, 153–167

  6. [6]

    Jaehong Cho, Minsu Kim, Hyunmin Choi, Guseul Heo, and Jongse Park

  7. [7]

    In2024 IEEE International Symposium on Workload Characterization (IISWC)

    LLMServingSim: A HW/SW Co-Simulation Infrastructure for LLM Inference Serving at Scale. In2024 IEEE International Symposium on Workload Characterization (IISWC). IEEE, 15–29

  8. [8]

    2021.Apsara Conference 2021 | Alibaba Cloud Re- leased the Fourth-Generation SHENLONG Architecture.https:// www.alibabacloud.com/blog/598193Accessed: 2026-02-06

    Alibaba Cloud. 2021.Apsara Conference 2021 | Alibaba Cloud Re- leased the Fourth-Generation SHENLONG Architecture.https:// www.alibabacloud.com/blog/598193Accessed: 2026-02-06

  9. [9]

    2026.Heterogeneous Comput- ing.https://www .alibabacloud.com/en/product/ heterogeneous_computing?_p_lcAccessed: 2026-02-06

    Alibaba Cloud. 2026.Heterogeneous Comput- ing.https://www .alibabacloud.com/en/product/ heterogeneous_computing?_p_lcAccessed: 2026-02-06

  10. [10]

    2026.GPU networking overview.https: //docs.cloud.google.com/ai-hypercomputer/docs/networking- overviewAccessed: 2026-02-06

    Google Cloud. 2026.GPU networking overview.https: //docs.cloud.google.com/ai-hypercomputer/docs/networking- overviewAccessed: 2026-02-06

  11. [11]

    ConnectX-7 Datasheet.https: //www.nvidia.com/content/dam/en-zz/Solutions/networking/ ethernet-adapters/connectx-7-datasheet-Final .pdf

    NVIDIA Corporation. ConnectX-7 Datasheet.https: //www.nvidia.com/content/dam/en-zz/Solutions/networking/ ethernet-adapters/connectx-7-datasheet-Final .pdf. Accessed: 2026-02-06

  12. [12]

    2018.NCCL Developer Guide: Collec- tive Communication Primitives (Version 2.0.5)

    NVIDIA Corporation. 2018.NCCL Developer Guide: Collec- tive Communication Primitives (Version 2.0.5). NVIDIA Corpora- tion.https://docs .nvidia.com/deeplearning/nccl/archives/nccl_205/ nccl-developer-guide/index.html

  13. [13]

    NVIDIA Corporation. 2026. NVIDIA DGX B200: The Foundation for Your AI Factory.https://www .nvidia.com/en-in/data-center/dgx- b200/. Accessed: 2026-06-11

  14. [14]

    NVIDIA Corporation. 2026. NVIDIA H200 GPU.https:// www.nvidia.com/en-in/data-center/h200/. Accessed: 2026-06-10

  15. [15]

    HTSim Network Simulator.https://github .com/ Broadcom/csg-htsimAccessed from GitHub repository

    Broadcom CSG. HTSim Network Simulator.https://github .com/ Broadcom/csg-htsimAccessed from GitHub repository. Accessed: 2026-02-06

  16. [16]

    2025.Set Up a gRPC Service — Ray Serve gRPC Guide.https://docs .ray.io/en/latest/serve/advanced-guides/grpc- guide.htmlAccessed: 7 Feb 2026

    Ray Documentation. 2025.Set Up a gRPC Service — Ray Serve gRPC Guide.https://docs .ray.io/en/latest/serve/advanced-guides/grpc- guide.htmlAccessed: 7 Feb 2026

  17. [17]

    Meta Engineering. 2023. Arcadia: An end-to-end AI system performance simulator.https://engineering .fb.com/2023/09/07/ data-infrastructure/arcadia-end-to-end-ai-system-performance- simulator/Meta Engineering Blog Accessed: 2026-02-06

  18. [18]

    Yicheng Feng, Yuetao Chen, Kaiwen Chen, Jingzong Li, Tianyuan Wu, Peng Cheng, Chuan Wu, Wei Wang, Tsung-Yi Ho, and Hong Xu

  19. [19]

    Echo: Simulating Distributed Training At Scale.arXiv preprint arXiv:2412.12487(2024)

  20. [20]

    Adithya Gangidi, Rui Miao, Shengbao Zheng, Sai Jayesh Bondu, Guilherme Goes, Hany Morsy, Rohit Puri, Mohammad Riftadi, Ashmitha Jeevaraj Shetty, Jingyi Yang, et al . 2024. Rdma over eth- ernet for distributed training at meta scale. InProceedings of the ACM SIGCOMM 2024 Conference. 57–70

  21. [21]

    Amir Gholami, Zhewei Yao, Sehoon Kim, Coleman Hooper, Michael W Mahoney, and Kurt Keutzer. 2024. AI and memory wall.IEEE Micro (2024)

  22. [22]

    aliyun/aicb at d9b4f5cd7d9d34a80cfbb0389831a16c7fe3ed7b.https://github .com/ aliyun/aicb/tree/d9b4f5cd7d9d34a80cfbb0389831a16c7fe3ed7b Accessed: 2026-02-06

    Alibaba Group. aliyun/aicb at d9b4f5cd7d9d34a80cfbb0389831a16c7fe3ed7b.https://github .com/ aliyun/aicb/tree/d9b4f5cd7d9d34a80cfbb0389831a16c7fe3ed7b Accessed: 2026-02-06

  23. [23]

    Fei Gui, Kaihui Gao, Li Chen, Dan Li, Vincent Liu, Ran Zhang, Hong- bing Yang, and Dian Xiong. 2025. Accelerating Design Space Explo- ration for {LLM} Training Systems with Multi-experiment Parallel Simulation. In22nd USENIX Symposium on Networked Systems Design and Implementation (NSDI 25). 473–488

  24. [24]

    Runsheng Benson Guo, Utkarsh Anand, Arthur Chen, and Khuzaima Daudjee. 2024. Cephalo: Harnessing Heterogeneous GPU Clusters for Training Transformer Models.arXiv preprint arXiv:2411.01075(2024)

  25. [25]

    supercomputer- scale

    Tom’s Hardware. 2025.Microsoft deploys world’s first “supercomputer- scale” GB300 NVL72 Azure cluster — 4,608 GB300 GPUs linked together to form a single, unified accelerator capable of 1.44 PFLOPS of inference.https://www .tomshardware.com/tech-industry/artificial- intelligence/microsoft-deploys-worlds-first-supercomputer-scale- gb300-nvl72-azure-cluster...

  26. [26]

    Aaron Harlap, Deepak Narayanan, Amar Phanishayee, Vivek Seshadri, Nikhil Devanur, Greg Ganger, and Phil Gibbons. 2018. Pipedream: Fast and efficient pipeline parallel dnn training.arXiv preprint arXiv:1806.03377(2018)

  27. [27]

    Chenyang Hei, Jiayi Li, Jiamin Cao, Chengxi Gao, Xiuzhu Sha, Tongrui Liu, Dengke Zhang, Ennan Zhai, and Xingwei Wang. 2026. HeteCCL: Synthesizing Near-Optimal Collective Communication Schedules for Heterogeneous GPU Clusters. In23rd USENIX Symposium on Net- worked Systems Design and Implementation (NSDI 26). USENIX Associ- ation, Renton, WA, 2533–2551.htt...

  28. [28]

    Myeongjae Jeon, Shivaram Venkataraman, Amar Phanishayee, Junjie Qian, Wencong Xiao, and Fan Yang. 2019. Analysis of Large-Scale Multi-Tenant GPU Clusters for DNN Training Workloads. In2019 USENIX Annual Technical Conference (USENIX ATC 19). USENIX Asso- ciation, Renton, WA, 947–960.https://www .usenix.org/conference/ atc19/presentation/jeon

  29. [29]

    Xianyan Jia, Le Jiang, Ang Wang, Wencong Xiao, Ziji Shi, Jie Zhang, Xinyuan Li, Langshi Chen, Yong Li, Zhen Zheng, et al. 2022. Whale: Efficient giant model training over heterogeneous {GPUs }. In2022 USENIX Annual Technical Conference (USENIX ATC 22). 673–688

  30. [30]

    1994.Implementation of a Sweep Line Algorithm for the Straight Line Segment Intersection Problem

    Stefan Näher Kurt Mehlhorn. 1994.Implementation of a Sweep Line Algorithm for the Straight Line Segment Intersection Problem. MAX- PLANCK-INSTITUT••FUR INFORMATIK.https://pure .mpg.de/rest/ items/item_1834220_3/component/file_2035159/content

  31. [31]

    Seonho Lee, Amar Phanishayee, and Divya Mahajan. 2025. Forecast- ing GPU performance for deep learning training and inference. In Proceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 1. 493–508

  32. [32]

    Wenkai Li, Ran Shu, Peng Zhang, and Yongqiang Xiong. 2025. Nüwa: Efficient Generative Control Plane for AI Network Simulation. In Proceedings of the 9th Asia-Pacific Workshop on Networking. 121–127

  33. [33]

    Ji Liu, Zhihua Wu, Danlei Feng, Minxu Zhang, Xinxuan Wu, Xuefeng Yao, Dianhai Yu, Yanjun Ma, Feng Zhao, and Dejing Dou. 2023. Heterps: Distributed deep learning with reinforcement learning based sched- uling in heterogeneous environments.Future Generation Computer Systems148 (2023), 106–117. 13

  34. [34]

    Fei Long, Kaihui Gao, Li Chen, Dan Li, Yiwei Zhang, Fei Gui, Yitao Xing, Wenjia Wei, and Bingyang Liu. 2026. Supercharging Packet-level Network Simulation of Large Model Training via Memoization and Fast-Forwarding.arXiv preprint arXiv:2602.10615(2026)

  35. [35]

    Yixuan Mei, Yonghao Zhuang, Xupeng Miao, Juncheng Yang, Zhihao Jia, and Rashmi Vinayak. 2025. Helix: Serving Large Language Models over Heterogeneous GPUs and Network via Max-Flow. InProceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 1. 586–602

  36. [36]

    Meta’s Infrastructure Evolution and the Advent of AI.https://engineering .fb.com/2025/09/29/data-infrastructure/metas- infrastructure-evolution-and-the-advent-of-ai/

    Meta. Meta’s Infrastructure Evolution and the Advent of AI.https://engineering .fb.com/2025/09/29/data-infrastructure/metas- infrastructure-evolution-and-the-advent-of-ai/. Accessed: 2026-02- 06

  37. [37]

    Zizhao Mo, Huanle Xu, and Chengzhong Xu. 2024. Heet: Accelerating elastic training in heterogeneous deep learning clusters. InProceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2. 499–513

  38. [38]

    Deepak Narayanan, Mohammad Shoeybi, Jared Casper, Patrick LeGres- ley, Mostofa Patwary, Vijay Korthikanti, Dmitri Vainbrand, and Bryan Catanzaro. 2021.Scaling Language Model Training to a Trillion Pa- rameters Using Megatron.https://developer .nvidia.com/blog/scaling- language-model-training-to-a-trillion-parameters-using-megatron/ Accessed: 2026-02-06

  39. [39]

    Deepak Narayanan, Mohammad Shoeybi, Jared Casper, Patrick LeGres- ley, Mostofa Patwary, Vijay Korthikanti, Dmitri Vainbrand, Prethvi Kashinkunti, Julie Bernauer, Bryan Catanzaro, et al. 2021. Efficient large-scale language model training on gpu clusters using megatron- lm. InProceedings of the international conference for high performance computing, netwo...

  40. [40]

    RDMA over Converged Ethernet (RoCE) v2

    NVIDIA Networking. RDMA over Converged Ethernet (RoCE) v2. https://docs.nvidia.com/doca/archive/2-10-0/rdma-over-converged- ethernet/index.htmlAccessed: 2026-02-06

  41. [41]

    Chengyi Nie, Jessica Maghakian, and Zhenhua Liu. 2024. Cannikin: Optimal Adaptive Distributed DNN Training over Heterogeneous Clusters. InProceedings of the 25th International Middleware Conference. 299–312

  42. [42]

    DGX SuperPOD Reference Architecture: DGX H100

    NVIDIA. DGX SuperPOD Reference Architecture: DGX H100. https://docs.nvidia.com/https:/docs.nvidia.com/dgx-superpod- reference-architecture-dgx-h100.pdf. Accessed: 2026-02-06

  43. [43]

    NVIDIA A100 TENSOR CORE GPU

    Nvidia. NVIDIA A100 TENSOR CORE GPU. https://www.nvidia.com/content/dam/en-zz/Solutions/Data- Center/a100/pdf/nvidia-a100-datasheet-us-nvidia-1758950-r4- web.pdf. Accessed: 2026-02-06

  44. [44]

    NVIDIA H100 Tensor Core GPU

    NVIDIA. NVIDIA H100 Tensor Core GPU. https://www.nvidia.com/en- in/data-center/h100/. Accessed: 2026-02-06

  45. [45]

    2020.{HetPipe}: Enabling large {DNN} training on (whimpy) heterogeneous {GPU } clusters through integration of pipelined model parallelism and data parallelism

    Jay H Park, Gyeongchan Yun, M Yi Chang, Nguyen T Nguyen, Seung- min Lee, Jaesik Choi, Sam H Noh, and Young-ri Choi. 2020.{HetPipe}: Enabling large {DNN} training on (whimpy) heterogeneous {GPU } clusters through integration of pipelined model parallelism and data parallelism. In2020 USENIX Annual Technical Conference (USENIX ATC 20). 307–321

  46. [46]

    Guicheng Qi, Junwei Su, Liqi Yang, Tao Li, Tingwen Xie, Yerui Sun, Yuchen Xie, and Chuan Wu. 2026. HetAuto: Cross-Cluster Auto- Parallelism for Heterogeneous Distributed Training. InProceedings of the 21st European Conference on Computer Systems. 759–779

  47. [47]

    Yicheng Qian, Ran Shu, Rui Ma, Yang Wang, Derek Chiou, Nadeen Gebara, Luca Piccolboni, Miriam Leeser, and Yongqiang Xiong. 2025. Miniature: Fast AI Supercomputer Networks Simulation on FPGAs. In Proceedings of the 9th Asia-Pacific Workshop on Networking. 114–120

  48. [48]

    Jianxing Qin, Jingrong Chen, Xinhao Kong, Yongji Wu, Tianjun Yuan, Liang Luo, Zhaodong Wang, Ying Zhang, Tingjun Chen, Alvin R Lebeck, et al. 2025. Phantora: Maximizing Code Reuse in Simulation- based Machine Learning System Performance Estimation.arXiv preprint arXiv:2505.01616(2025)

  49. [49]

    Jeff Rasley, Samyam Rajbhandari, Olatunji Ruwase, and Yuxiong He

  50. [50]

    InProceedings of the 26th ACM SIGKDD international conference on knowledge discovery & data mining

    Deepspeed: System optimizations enable training deep learning models with over 100 billion parameters. InProceedings of the 26th ACM SIGKDD international conference on knowledge discovery & data mining. 3505–3506

  51. [51]

    sagemaker.https://aws .amazon.com/blogs/machine- learning/improve-price-performance-of-your-model-training- using-amazon-sagemaker-heterogeneous-clusters/

    sagemaker. sagemaker.https://aws .amazon.com/blogs/machine- learning/improve-price-performance-of-your-model-training- using-amazon-sagemaker-heterogeneous-clusters/. Accessed: 2026-02-06

  52. [52]

    Amazon Web Services. 2025. Amazon Virtual Private Cloud (VPC) Overview.https://docs .aws.amazon.com/vpc/latest/userguide/what- is-amazon-vpc.htmlAccessed: 2026-02-06

  53. [53]

    Siyuan Shen, Tommaso Bonato, Zhiyi Hu, Pasquale Jordan, Tiancheng Chen, and Torsten Hoefler. 2025. Atlahs: An application-centric net- work simulator toolchain for ai, hpc, and distributed storage. InPro- ceedings of the International Conference for High Performance Comput- ing, Networking, Storage and Analysis. 349–367

  54. [54]

    Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. 2019. Megatron-lm: Training multi- billion parameter language models using model parallelism.arXiv preprint arXiv:1909.08053(2019)

  55. [55]

    Min Si, Pavan Balaji, Yongzhou Chen, Ching-Hsiang Chu, Adi Gangidi, Saif Hasan, Subodh Iyengar, Dan Johnson, Bingzhe Liu, Regina Ren, Ashmitha Jeevaraj Shetty, Greg Steinbrecher, Yulun Wang, Bruce Wu, Xinfeng Xie, Jingyi Yang, Mingran Yang, Kenny Yu, Minlan Yu, Cen Zhao, Wes Bland, Denis Boyda, Suman Gumudavelli, Prashanth Kannan, Cristian Lumezanu, Rui M...

  56. [56]

    Arjun Singhvi, Nandita Dukkipati, Prashant Chandra, Hassan MG Wassel, Naveen Kr Sharma, Anthony Rebello, Henry Schuh, Praveen Kumar, Behnam Montazeri, Neelesh Bansod, et al . 2025. Falcon: A reliable, low latency hardware transport. InProceedings of the ACM SIGCOMM 2025 Conference. 248–263

  57. [57]

    Srinivas Sridharan, Taekyung Heo, Louis Feng, Zhaodong Wang, Matt Bergeron, Wenyin Fu, Shengbao Zheng, Brian Coutinho, Saeed Rashidi, Changhai Man, et al. 2023. Chakra: Advancing performance bench- marking and co-design using standardized execution traces.arXiv preprint arXiv:2305.14516(2023)

  58. [58]

    Foteini Strati, Zhendong Zhang, George Manos, Ixeia Sánchez Périz, Qinghao Hu, Tiancheng Chen, Berk Buzcu, Song Han, Pamela Delgado, and Ana Klimovic. 2025. Sailor: Automating distributed training over dynamic, heterogeneous, and geo-distributed clusters. InProceedings of the ACM SIGOPS 31st Symposium on Operating Systems Principles. 204–220

  59. [59]

    Yinan Tang, Tongtong Yuan, Fang Cao, Li Wang, Zhenhua Guo, Yaqian Zhao, and Rengang Li. 2024. Simulating llm training in cxl-based het- erogeneous computing cluster. InIEEE INFOCOM 2024-IEEE Conference on Computer Communications Workshops (INFOCOM WKSHPS). IEEE, 1–6

  60. [60]

    Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Alma- hairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. 2023. Llama 2: Open foundation and fine-tuned chat models.arXiv preprint arXiv:2307.09288(2023)

  61. [61]

    Taegeon Um, Byungsoo Oh, Minyoung Kang, Woo-Yeon Lee, Goeun Kim, Dongseob Kim, Youngtaek Kim, Mohd Muzzammil, and Myeongjae Jeon. 2024. Metis: Fast Automatic Distributed Training on Heterogeneous GPUs. In 2024 USENIX Annual Technical Conference (USENIX ATC 24). 563–578

  62. [62]

    Wenyi Wang, Zheng Wu, Yanmeng Wang, Haolin Mao, Lei Han, Gaogang Xie, and Fu Xiao. 2026. HyGra: Accelerating Network-State Simulation for LLM Training in DCNs via Adaptive Packet-Flow Granularity. arXiv preprint arXiv:2603.12671 (2026)

  63. [63]

    Xizheng Wang, Qingxu Li, Yichi Xu, Gang Lu, Dan Li, Li Chen, Heyang Zhou, Linkang Zheng, Sen Zhang, Yikai Zhu, et al. 2025. SimAI: Unifying Architecture Design and Performance Tuning for Large-Scale Large Language Model Training with Scalability and Precision. In 22nd USENIX Symposium on Networked Systems Design and Implementation (NSDI 25). 541–558

  64. [64]

    Qizhen Weng, Wencong Xiao, Yinghao Yu, Wei Wang, Cheng Wang, Jian He, Yong Li, Liping Zhang, Wei Lin, and Yu Ding. 2022. MLaaS in the Wild: Workload Analysis and Scheduling in Large-Scale Heterogeneous GPU Clusters. In 19th USENIX Symposium on Networked Systems Design and Implementation (NSDI 22). USENIX Association, Renton, WA, 945–960. https://www.usen...

  65. [65]

    William Won, Taekyung Heo, Saeed Rashidi, Srinivas Sridharan, Sudarshan Srinivasan, and Tushar Krishna. 2023. Astra-sim2.0: Modeling hierarchical networks and disaggregated systems for large-model training at scale. In 2023 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS). IEEE, 283–294

  66. [66]

    Ruilong Wu, Xinjiao Li, Yisu Wang, Xinyu Chen, and Dirk Kutscher. Rethinking Dynamic Networks and Heterogeneous Computing with Automatic Parallelization. In Proceedings of the 9th Asia-Pacific Workshop on Networking. 164–171

  68. [68]

    Yongji Wu, Xueshen Liu, Shuowei Jin, Ceyu Xu, Feng Qian, Z Morley Mao, Matthew Lentz, Danyang Zhuo, and Ion Stoica. 2025. HeterMoE: Efficient Training of Mixture-of-Experts Models on Heterogeneous GPUs. arXiv preprint arXiv:2504.03871 (2025)

  69. [69]

    Ran Yan, Youhe Jiang, Xiaonan Nie, Fangcheng Fu, Bin Cui, and Binhang Yuan. 2024. HexiScale: Accommodating Large Language Model Training over Heterogeneous Environment. arXiv preprint arXiv:2409.01143 (2024)

  70. [70]

    Ran Yan, Youhe Jiang, Wangcheng Tao, Xiaonan Nie, Bin Cui, and Binhang Yuan. 2024. Flashflex: Accommodating large language model training over heterogeneous environment. arXiv e-prints (2024), arXiv–2409

  71. [71]

    Shengyuan Ye, Liekang Zeng, Xiaowen Chu, Guoliang Xing, and Xu Chen. 2024. Asteroid: Resource-efficient hybrid pipeline parallelism for collaborative DNN training on heterogeneous edge devices. In Proceedings of the 30th Annual International Conference on Mobile Computing and Networking. 312–326

  72. [72]

    Xiaodong Yi, Shiwei Zhang, Ziyue Luo, Guoping Long, Lansong Diao, Chuan Wu, Zhen Zheng, Jun Yang, and Wei Lin. 2020. Optimizing distributed training deployment in heterogeneous GPU clusters. In Proceedings of the 16th International Conference on emerging Networking EXperiments and Technologies. 93–107

  73. [73]

    Xiaofei Yue, Fangming Zhao, Fulun Ye, Jiongchi Yu, Zhaoxuan Li, Tingting Li, Ziming Zhao, and Jianwei Yin. 2026. HeteroSim: Towards High-Fidelity Heterogeneous LLM Training Simulation on GPUs. In Proceedings of the ACM Web Conference 2026. 5189–5197

  74. [74]

    Jinghui Zhang, Geng Niu, Qiangsheng Dai, Haorui Li, Zhihua Wu, Fang Dong, and Zhiang Wu. 2023. PipePar: Enabling fast DNN pipeline parallel training in heterogeneous GPU clusters. Neurocomputing 555 (2023), 126661

  75. [75]

    Shiwei Zhang, Lansong Diao, Chuan Wu, Zongyan Cao, Siyu Wang, and Wei Lin. 2024. Hap: Spmd dnn training on heterogeneous gpu clusters with automated program synthesis. In Proceedings of the Nineteenth European Conference on Computer Systems. 524–541

  76. [76]

    Zili Zhang, Yinmin Zhong, Yimin Jiang, Hanpeng Hu, Jianjian Sun, Zheng Ge, Yibo Zhu, Daxin Jiang, and Xin Jin. 2025. DistTrain: Addressing Model and Data Heterogeneity with Disaggregated Training for Multimodal Large Language Models. In Proceedings of the ACM SIGCOMM 2025 Conference. 24–38

  77. [77]

    Liangyu Zhao, Saeed Maleki, Yuanhong Wang, Zezhou Wang, Ziyue Yang, Hossein Pourreza, and Arvind Krishnamurthy. 2024. ForestColl: throughput-optimal collective communications on heterogeneous network fabrics. arXiv preprint arXiv:2402.06787 (2024)

  78. [78]

    Yanli Zhao, Andrew Gu, Rohan Varma, Liang Luo, Chien-Chin Huang, Min Xu, Less Wright, Hamid Shojanazeri, Myle Ott, Sam Shleifer, et al. 2023. PyTorch FSDP: Experiences on scaling fully sharded data parallel. arXiv preprint arXiv:2304.11277 (2023)

  80. [80]

    Yonghao Zhuang, Lianmin Zheng, Zhuohan Li, Eric Xing, Qirong Ho, Joseph Gonzalez, Ion Stoica, Hao Zhang, and Hexu Zhao. 2023. On optimizing the communication of model parallelism. Proceedings of Machine Learning and Systems 5 (2023), 526–540.

A Input Specification Sample

The simulation framework mainly works with three input specifications which define our ...
