arxiv: 2409.01143 · v3 · pith:YZU5KN4Inew · submitted 2024-09-02 · 💻 cs.DC

HexiScale: Facilitating Large Language Model Training over Heterogeneous Hardware

Pith reviewed 2026-05-23 21:19 UTC · model grok-4.3

classification 💻 cs.DC
keywords heterogeneous GPUs · LLM training · model parallelism · asymmetric allocation · graph partitioning · distributed systems · resource utilization · constrained optimization

The pith

HexiScale enables LLM training on mixed GPUs by asymmetrically partitioning computations in data, pipeline and tensor parallelism, matching homogeneous performance while delivering 1.5 to 2.4 times higher throughput than prior heterogeneous baselines.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents HexiScale, a system that trains large language models across clusters of different GPUs instead of requiring identical high-performance units. It supports uneven splits of the work in data parallelism, pipeline parallelism and tensor parallelism, then casts the assignment of those splits to specific GPUs as a constrained optimization problem. An efficient hierarchical graph partitioning algorithm solves the problem so that faster GPUs receive proportionally more computation while communication costs stay manageable. If the approach holds, organizations could train models using whatever accelerators they already own rather than purchasing uniform fleets, without sacrificing speed relative to equal-total-FLOPS homogeneous clusters. Experiments on 7B to 30B models report that the system reaches parity with homogeneous baselines of matching theoretical compute and exceeds existing heterogeneous methods by 1.5× to 2.4× in throughput.

Core claim

HexiScale supports asymmetric partition of training computations across heterogeneous GPUs in the scope of data-, pipeline-, and tensor model parallelism. It formalizes the allocation as a constrained optimization problem and solves it with a hierarchical graph partitioning algorithm that fully leverages available computational power, yielding throughput comparable to state-of-the-art homogeneous baselines on equal-FLOPS GPU sets and 1.5× to 2.4× higher throughput than state-of-the-art heterogeneous baselines on the same mixed clusters for models ranging from 7B to 30B parameters.

What carries the argument

The hierarchical graph partitioning algorithm that solves the constrained optimization problem for asymmetric allocation of training computations across heterogeneous GPUs while controlling communication and synchronization overheads.
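
As a rough illustration of what partitioning GPUs over a bandwidth graph can look like, the sketch below merges GPUs joined by fast links into clusters, then greedily spreads the clusters over k groups so that aggregate speed stays balanced. It is a toy under stated assumptions, not the paper's algorithm: the function names, the bandwidth threshold, the relative speed values, and the greedy balancing rule are all hypothetical, and no refinement step is included.

```python
def coarsen(num_gpus, bandwidth, threshold):
    """Merge GPUs joined by links at or above `threshold` into clusters."""
    parent = list(range(num_gpus))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (a, b), bw in bandwidth.items():
        if bw >= threshold:
            parent[find(a)] = find(b)

    clusters = {}
    for gpu in range(num_gpus):
        clusters.setdefault(find(gpu), []).append(gpu)
    return sorted(clusters.values())


def partition(clusters, speed, k):
    """Greedily place clusters into k groups, balancing aggregate speed."""
    groups = [[] for _ in range(k)]
    load = [0.0] * k

    def weight(cluster):
        return sum(speed[g] for g in cluster)

    for cluster in sorted(clusters, key=weight, reverse=True):
        i = load.index(min(load))      # currently lightest group
        groups[i].extend(cluster)      # expand the cluster back into GPU ids
        load[i] += weight(cluster)
    return groups, load


# Hypothetical cluster: three 2-GPU machines, fast links inside a machine,
# slow links between machines; speeds are made-up relative numbers.
links = {(0, 1): 300, (2, 3): 300, (4, 5): 300, (1, 2): 10, (3, 4): 10}
speed = [3.0, 3.0, 1.0, 1.0, 1.0, 1.0]
clusters = coarsen(6, links, threshold=100)
groups, load = partition(clusters, speed, k=2)
print(groups, load)
```

With these made-up inputs the two fast GPUs end up in one group and the four slower GPUs in the other, so no group is split across a slow link. A real planner would also have to weigh memory limits and the communication cost between groups, which this sketch ignores.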

If this is right

  • LLM training becomes possible on clusters containing mixed GPU generations or vendors without requiring replacement of the entire set.
  • Total cluster utilization rises because every GPU receives a share of work proportional to its speed rather than being limited by the slowest device.
  • The same allocation method produces consistent speedups across model scales from 7B to 30B parameters.
  • Heterogeneous clusters can deliver training throughput comparable to that of homogeneous clusters with identical aggregate floating-point capacity.
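
The proportional-share idea in the bullets above can be sketched as follows. This is a minimal illustration, assuming each GPU's speed is summarized by one relative number and that all layers cost the same; `split_layers`, `slowest_stage_time`, and the speed values are hypothetical and do not come from the paper.

```python
def split_layers(num_layers, gpu_speed):
    """Assign whole layers to GPUs in proportion to their relative speed.

    Largest-remainder rounding keeps the total equal to num_layers.
    """
    total = sum(gpu_speed)
    ideal = [num_layers * s / total for s in gpu_speed]
    layers = [int(x) for x in ideal]            # floor of each ideal share
    leftover = num_layers - sum(layers)
    # Give the leftover layers to the GPUs with the largest fractional parts.
    by_fraction = sorted(range(len(ideal)),
                         key=lambda i: ideal[i] - layers[i],
                         reverse=True)
    for i in by_fraction[:leftover]:
        layers[i] += 1
    return layers


def slowest_stage_time(layers, gpu_speed):
    """Time of the slowest stage, in arbitrary units (layers / speed)."""
    return max(n / s for n, s in zip(layers, gpu_speed))


speeds = [3.0, 3.0, 1.0, 1.0]                   # made-up relative speeds
proportional = split_layers(32, speeds)         # [12, 12, 4, 4]
uniform = [8, 8, 8, 8]
print(slowest_stage_time(proportional, speeds)) # 4.0
print(slowest_stage_time(uniform, speeds))      # 8.0
```

In this toy case the uniform split is bound by the slow GPUs (8 time units per step), while the proportional split halves the slowest-stage time. Real stages also differ in memory footprint and communication volume, so a single speed number is only a first approximation.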

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Operators could incrementally add newer GPUs to an existing cluster and still obtain near-linear scaling without re-purchasing the older units.
  • The optimization formulation might be adapted to other distributed workloads such as distributed inference or scientific simulation codes that already use multiple parallelism styles.
  • Dynamic re-partitioning could be added later so the system reacts automatically when GPUs are added, removed, or experience thermal throttling.

Load-bearing premise

The hierarchical graph partitioning algorithm can solve the allocation problem fast enough and with low enough communication overhead that the gains from better compute distribution are not erased by extra synchronization costs.

What would settle it

Measure end-to-end training throughput of HexiScale on a heterogeneous GPU cluster against a homogeneous cluster whose total theoretical FLOPS match; if throughput falls materially below the homogeneous case or fails to exceed other heterogeneous systems by at least 1.5×, the central performance claims do not hold.
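
The test described above can be written down as a small check over measured throughputs. This is a sketch only: the function name, the 10% parity tolerance standing in for "materially below", and the sample numbers are assumptions made for illustration, not values from the paper.

```python
def claims_hold(hetero_tput, homo_tput, baseline_tputs,
                parity_tolerance=0.10, min_speedup=1.5):
    """Check both performance claims against measured throughputs.

    hetero_tput:    throughput of the system under test on the mixed cluster
    homo_tput:      throughput on an equal-FLOPS homogeneous cluster
    baseline_tputs: {name: throughput} of other heterogeneous systems
                    measured on the same mixed cluster
    """
    # Claim (i): no material shortfall against the homogeneous case.
    parity = hetero_tput >= (1.0 - parity_tolerance) * homo_tput
    # Claim (ii): at least `min_speedup` over every heterogeneous baseline.
    speedups = {name: hetero_tput / t for name, t in baseline_tputs.items()}
    beats_all = all(s >= min_speedup for s in speedups.values())
    return parity and beats_all, speedups


# Made-up measurements in the same unit (for example, tokens per second).
ok, speedups = claims_hold(95.0, 100.0, {"baseline_a": 50.0, "baseline_b": 40.0})
print(ok, speedups)
```

The tolerance is the judgment call here: the criterion only says "materially below", so whoever runs the comparison has to fix that number before measuring.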

Figures

Figures reproduced from arXiv: 2409.01143 by Bin Cui, Binhang Yuan, Fangcheng Fu, Ran Yan, Xiaonan Nie, Youhe Jiang.

Figure 1
Figure 1: Case study on comparing the state-of-the-art training system Megatron and HexiScale. Both systems run their optimal parallel strategies on the given three machines. Plan 3 is a potential good configuration. We fine-tune this plan by maximizing the number of transformer layers that use high intra-machine bandwidth for data parallel communication.
Figure 2
Figure 2: First phase: the global graph is partitioned into three groups of GPUs by four steps: (i)-coarsen, (ii)-partition, (iii)-project, and (iv)-refine. GPUs in the global graph are divided into three groups which will be constructed as three pipelines. The key insight of the first phase algorithm is to partition the GPUs into multiple groups, with each group forming a separate pipeline.
Figure 3
Figure 3: Second phase: each pipeline is created in three steps. (i) GPUs with high bandwidth connections are grouped by graph partition. (ii) Intra-group strategy is searched separately for each machine, i.e. GPUs in the same machine. (iii) Pipeline stage order is determined by permuting all intra-group strategies by a top-𝜏 greedy search algorithm.
Figure 4
Figure 4: End-to-end experiments of HexiScale compared with other systems under various experimental settings with Llama-2 (7B) and Llama-2 (13B) models.
Figure 5
Figure 5: End-to-end experiments of HexiScale compared with other systems under various experimental settings with Llama (30B) model.
Figure 6
Figure 6: Breakdown experiments of HexiScale with Llama2 (7B), Llama2 (13B), and Llama (30B) models under heterogeneous setting 1 and 3.
Figure 7
Figure 7: Breakdown of end-to-end time across different heterogeneous experimental settings and models. We benchmark the per-batch communication time, computation time, and pipeline bubble time for HexiScale and Galvatron.
Figure 8
Figure 8: Convergence comparison of the proposed search strategy and random graph partition with Llama-2 (7B) (left) and (30B) (right) models, where both run 20 times.
Figure 11
Figure 11: Latency breakdown of HexiScale and Metis in heterogeneous setting 3 with Llama (30B) model. In heterogeneous setting 3, we also compare HexiScale with Metis, one of the state-of-the-art heterogeneous training systems [48], to demonstrate the superior performance of HexiScale.
Figure 10
Figure 10: HexiScale vs. Metis and Galvatron.
Original abstract

Training large language models (LLMs) is a computationally intensive task, which is typically conducted in data centers with homogeneous high-performance GPUs. In this paper, we explore an alternative approach by deploying training computations across heterogeneous GPUs to enable better flexibility and efficiency for heterogeneous resource utilization. Toward this end, we propose a novel system, HexiScale, that can flexibly support asymmetric partition of training computations in the scope of data-, pipeline-, and tensor model parallelism. We further formalize the allocation of asymmetric partitioned training computations over a set of heterogeneous GPUs as a constrained optimization problem and propose an efficient hierarchical graph partitioning algorithm. Our approach effectively allocates training computations across heterogeneous GPUs, fully leveraging the available computational power. We compare the performance of HexiScale with state-of-the-art homogeneous and heterogeneous training systems. When training LLMs at different scales (from 7B to 30B), empirical results demonstrate that: (i) compared to state-of-the-art homogeneous baselines running over homogeneous GPUs, HexiScale achieves similar performance when running over heterogeneous GPUs with the same theoretical FLOPS; (ii) compared to state-of-the-art heterogeneous baselines running on the same heterogeneous clusters, HexiScale delivers $1.5\times$ to $2.4\times$ higher throughput.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper proposes HexiScale, a system for LLM training on heterogeneous GPUs that supports asymmetric partitioning across data, pipeline, and tensor parallelism. It formalizes the allocation as a constrained optimization problem and solves it with a hierarchical graph partitioning algorithm. For 7B–30B models, it claims performance comparable to homogeneous baselines on equal-FLOPS heterogeneous hardware and 1.5×–2.4× higher throughput than state-of-the-art heterogeneous baselines on the same clusters.

Significance. If the empirical claims hold, the work could meaningfully advance distributed training by enabling efficient use of mixed GPU resources, reducing reliance on uniform high-end clusters. The hierarchical partitioning approach addresses a practical optimization challenge in heterogeneous settings.

major comments (1)
  1. [Abstract] Abstract: The central claims of comparable performance to homogeneous baselines and 1.5×–2.4× throughput gains over heterogeneous baselines are stated without any visible implementation details, benchmark configurations, error bars, ablation studies, or experimental methodology. These details are load-bearing for verifying the throughput results.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the review and the opportunity to respond. We address the single major comment below.

Point-by-point responses
  1. Referee: [Abstract] Abstract: The central claims of comparable performance to homogeneous baselines and 1.5×–2.4× throughput gains over heterogeneous baselines are stated without any visible implementation details, benchmark configurations, error bars, ablation studies, or experimental methodology. These details are load-bearing for verifying the throughput results.

    Authors: We agree that the abstract itself is a concise summary and does not contain implementation details, benchmark configurations, error bars, ablation studies, or experimental methodology; this is by design given length constraints. The full manuscript provides these elements in Section 3 (system architecture and asymmetric parallelism), Section 4 (hierarchical graph partitioning algorithm and optimization formulation), and especially Section 5 (evaluation), which details the hardware clusters, model sizes (7B–30B), baseline systems, throughput measurements with error bars, ablation studies on partitioning strategies, and full experimental methodology. The abstract claims are therefore supported by the body of the paper rather than standing alone.
    Revision: no

Circularity Check

0 steps flagged

No significant circularity; claims are empirical comparisons

full rationale

The paper presents HexiScale as a system that formalizes asymmetric allocation as a constrained optimization problem solved by a hierarchical graph partitioning algorithm, then reports empirical throughput measurements against homogeneous and heterogeneous baselines for 7B-30B models. No equations, fitted parameters, or predictions are described that reduce to inputs by construction. No self-citations are invoked as load-bearing uniqueness theorems. The central claims rest on direct experimental comparisons rather than any derivation chain that could exhibit circularity. The provided text contains no internal reductions of the form 'prediction equals fit by definition.'

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Review performed on abstract only; the main unstated premise is that the proposed optimization and partitioning approach incurs acceptable overhead on real hardware.

axioms (1)
  • domain assumption: Asymmetric partitioning of data, pipeline, and tensor parallelism can be performed without prohibitive communication costs on heterogeneous GPUs.
    Implicit in the design of HexiScale as described in the abstract.

pith-pipeline@v0.9.0 · 5766 in / 1192 out tokens · 21857 ms · 2026-05-23T21:19:04.677033+00:00 · methodology

Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Autopoiesis: A Self-Evolving System Paradigm for LLM Serving Under Runtime Dynamics

    cs.DC · 2026-04 · unverdicted · novelty 7.0

    Autopoiesis uses LLM-driven program synthesis to evolve serving policies online during deployment, delivering up to 53% and average 34% gains over prior LLM serving systems under runtime dynamics.

  2. HexiSeq: Accommodating Long Context Training of LLMs over Heterogeneous Hardware

    cs.DC · 2026-05 · unverdicted · novelty 6.0

    HexiSeq optimizes sequence and head partitioning across mixed GPUs to improve long-context LLM training throughput by up to 1.72x in simulations.

  3. HARP: Orchestrating Automated Parallel Training on Heterogeneous GPU Clusters

    cs.DC · 2025-09 · unverdicted · novelty 6.0

    HARP provides a fine-grained inter-operator parallel planner and a heterogeneity-aware 1F1B scheduler that together improve training throughput by 1.3x-1.6x on mixed GPU clusters compared with current homogeneous-orie...

  4. Rethinking Data Curation in LLM Training: Online Reweighting Offers Better Generalization than Offline Methods

    cs.LG · 2026-04 · unverdicted · novelty 5.0

    ADAPT is an online reweighting framework for LLM training that outperforms offline data selection and mixing methods in cross-benchmark generalization under equal compute.

Reference graph

Works this paper leans on

64 extracted references · 64 canonical work pages · cited by 4 Pith papers · 7 internal anchors

  1. [1]

    Xin Ai, Qiange Wang, Chunyu Cao, Yanfeng Zhang, Chaoyi Chen, Hao Yuan, Yu Gu, and Ge Yu. 2024. NeutronOrch: Rethinking Sample-Based GNN Training under CPU-GPU Heterogeneous Environments. Proceedings of the VLDB Endowment 17, 8 (2024), 1995–2008

  2. [2]

    Amazon. 2024. Amazon EC2 Instance types. https://aws.amazon.com/ec2/instance-types/

  3. [3]

    Anthropic. 2024. The Claude 3 Model Family: Opus, Sonnet, Haiku. https://www-cdn.anthropic.com/de8ba9b01c9ab7cbabf5c33b80b7bbc618857627/Model_Card_Claude_3.pdf

  4. [4]

    Rishi Bommasani, Drew A Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, et al. 2021. On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258 (2021)

  5. [5]

    Zhenkun Cai, Xiao Yan, Kaihao Ma, Yidi Wu, Yuzhen Huang, James Cheng, Teng Su, and Fan Yu. 2021. Tensoropt: Exploring the tradeoffs in distributed dnn training with auto-parallelism. IEEE Transactions on Parallel and Distributed Systems 33, 8 (2021), 1967–1981

  6. [6]

    Ping Chen, Wenjie Zhang, Shuibing He, Yingjie Gu, Zhuwei Peng, Kexin Huang, Xuan Zhan, Weijian Chen, Yi Zheng, Zhefeng Wang, et al. 2024. Optimizing Large Model Training through Overlapped Activation Recomputation. arXiv preprint arXiv:2406.08756 (2024)

  7. [7]

    Tri Dao. 2024. FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning. In The Twelfth International Conference on Learning Representations

  8. [8]

    Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. 2024. The llama 3 herd of models. arXiv preprint arXiv:2407.21783 (2024)

  9. [9]

    Bruce Hendrickson, Robert W Leland, et al. 1995. A Multi-Level Algorithm For Partitioning Graphs. SC 95, 28 (1995), 1–14

  10. [10]

    Yanping Huang, Youlong Cheng, Ankur Bapna, Orhan Firat, Dehao Chen, Mia Chen, HyoukJoong Lee, Jiquan Ngiam, Quoc V Le, Yonghui Wu, et al. 2019. Gpipe: Efficient training of giant neural networks using pipeline parallelism. Advances in neural information processing systems 32 (2019)

  11. [11]

    Technology Innovation Institute. 2023. Falcon 180B. https://falconllm.tii.ae/falcon-180b.html

  12. [12]

    Xianyan Jia, Le Jiang, Ang Wang, Wencong Xiao, Ziji Shi, Jie Zhang, Xinyuan Li, Langshi Chen, Yong Li, Zhen Zheng, et al. 2022. Whale: Efficient giant model training over heterogeneous GPUs. In 2022 USENIX Annual Technical Conference (USENIX ATC 22). 673–688

  13. [13]

    Zhihao Jia, Sina Lin, Charles R Qi, and Alex Aiken. 2018. Exploring Hidden Dimensions in Parallelizing Convolutional Neural Networks. In ICML. 2279–2288

  14. [14]

    Zhihao Jia, Matei Zaharia, and Alex Aiken. 2019. Beyond data and model parallelism for deep neural networks. Proceedings of Machine Learning and Systems 1 (2019), 1–13

  15. [15]

    Albert Q Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, et al. 2024. Mixtral of experts. arXiv preprint arXiv:2401.04088 (2024)

  16. [16]

    Youhe Jiang, Fangcheng Fu, Xupeng Miao, Xiaonan Nie, and Bin Cui. 2023. OSDP: Optimal sharded data parallel for distributed deep learning. In Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence. 2142–2150

  18. [18]

    Youhe Jiang, Fangcheng Fu, Xiaozhe Yao, Guoliang He, Xupeng Miao, Ana Klimovic, Bin Cui, Binhang Yuan, and Eiko Yoneki. 2025. Demystifying Cost-Efficiency in LLM Serving over Heterogeneous GPUs. arXiv preprint arXiv:2502.00722 (2025)

  19. [19]

    Youhe Jiang, Fangcheng Fu, Xiaozhe Yao, Taiyi Wang, Bin Cui, Ana Klimovic, and Eiko Yoneki. 2025. ThunderServe: High-performance and Cost-efficient LLM Serving in Cloud Environments. arXiv preprint arXiv:2502.09334 (2025)

  20. [20]

    Youhe Jiang, Ran Yan, Xiaozhe Yao, Yang Zhou, Beidi Chen, and Binhang Yuan. 2024. HexGen: Generative Inference of Large Language Model over Heterogeneous Environment. In Forty-first International Conference on Machine Learning

  21. [21]

    Youhe Jiang, Ran Yan, and Binhang Yuan. 2025. HexGen-2: Disaggregated Generative Inference of LLMs in Heterogeneous Environment. arXiv preprint arXiv:2502.07903 (2025)

  22. [22]

    George Karypis and Vipin Kumar. 1998. A fast and high quality multilevel scheme for partitioning irregular graphs. SIAM Journal on Scientific Computing 20, 1 (1998), 359–392

  23. [23]

    George Karypis and Vipin Kumar. 1998. Multilevel algorithms for multi-constraint graph partitioning. In SC’98: Proceedings of the 1998 ACM/IEEE Conference on Supercomputing. IEEE, 28–28

  24. [24]

    Brian W Kernighan and Shen Lin. 1970. An efficient heuristic procedure for partitioning graphs. The Bell system technical journal 49, 2 (1970), 291–307

  25. [25]

    Shen Li, Yanli Zhao, Rohan Varma, Omkar Salpekar, Pieter Noordhuis, Teng Li, Adam Paszke, Jeff Smith, Brian Vaughan, Pritam Damania, et al. 2020. PyTorch distributed: experiences on accelerating data parallel training. Proceedings of the VLDB Endowment 13, 12 (2020), 3005–3018

  26. [26]

    Zhiyuan Li, Xun Jian, Yue Wang, Yingxia Shao, and Lei Chen. 2024. DAHA: Accelerating GNN Training with Data and Hardware Aware Execution Planning. Proceedings of the VLDB Endowment 17, 6 (2024), 1364–1376

  27. [27]

    Yixuan Mei, Yonghao Zhuang, Xupeng Miao, Juncheng Yang, Zhihao Jia, and Rashmi Vinayak. 2024. Helix: Distributed Serving of Large Language Models via Max-Flow on Heterogeneous GPUs. arXiv preprint arXiv:2406.01566 (2024)

  28. [28]

    Xupeng Miao, Xiaonan Nie, Yingxia Shao, Zhi Yang, Jiawei Jiang, Lingxiao Ma, and Bin Cui. 2021. Heterogeneity-aware distributed machine learning training via partial reduce. In Proceedings of the 2021 International Conference on Management of Data. 2262–2270

  29. [29]

    Xupeng Miao, Yining Shi, Zhi Yang, Bin Cui, and Zhihao Jia. 2023. Sdpipe: A semi-decentralized framework for heterogeneity-aware pipeline-parallel training. Proceedings of the VLDB Endowment 16, 9 (2023), 2354–2363

  30. [30]

    Xupeng Miao, Yujie Wang, Youhe Jiang, Chunan Shi, Xiaonan Nie, Hailin Zhang, and Bin Cui. 2022. Galvatron: Efficient Transformer Training over Multiple GPUs Using Automatic Parallelism. Proceedings of the VLDB Endowment 16, 3 (2022), 470–479

  31. [31]

    Kabir Nagrecha. 2021. Model-parallel model selection for deep learning systems. In Proceedings of the 2021 international conference on management of data. 2929–2931

  32. [32]

    Deepak Narayanan, Aaron Harlap, Amar Phanishayee, Vivek Seshadri, Nikhil R Devanur, Gregory R Ganger, Phillip B Gibbons, and Matei Zaharia. 2019. PipeDream: generalized pipeline parallelism for DNN training. In Proceedings of the 27th ACM Symposium on Operating Systems Principles. 1–15

  33. [33]

    Deepak Narayanan, Amar Phanishayee, Kaiyu Shi, Xie Chen, and Matei Zaharia. 2021. Memory-efficient pipeline-parallel dnn training. In International Conference on Machine Learning. PMLR, 7937–7947

  34. [34]

    Deepak Narayanan, Mohammad Shoeybi, Jared Casper, Patrick LeGresley, Mostofa Patwary, Vijay Korthikanti, Dmitri Vainbrand, Prethvi Kashinkunti, Julie Bernauer, Bryan Catanzaro, et al. 2021. Efficient large-scale language model training on gpu clusters using megatron-lm. In Proceedings of the International Conference for High Performance Computing, Netw...

  35. [35]

    Nvidia. 2006. GPU Computing Solutions for HPC. https://www.nvidia.com/docs/IO/43395/tesla_product_overview_dec.pdf

  36. [36]

    Nvidia. 2018. NVIDIA Reinvents Computer Graphics with Turing Architecture. https://nvidianews.nvidia.com/news/nvidia-reinvents-computer-graphics-with-turing-architecture

  37. [37]

    Nvidia. 2020. NVIDIA’s New Ampere Data Center GPU in Full Production. https://nvidianews.nvidia.com/news/nvidias-new-ampere-data-center-gpu-in-full-production

  38. [38]

    Nvidia. 2022. NVIDIA Announces Hopper Architecture, the Next Generation of Accelerated Computing. https://nvidianews.nvidia.com/news/nvidia-announces-hopper-architecture-the-next-generation-of-accelerated-computing

  39. [39]

    Nvidia. 2024. NVIDIA Blackwell Platform Arrives to Power a New Era of Computing. https://nvidianews.nvidia.com/news/nvidia-blackwell-platform-arrives-to-power-a-new-era-of-computing

  40. [40]

    OpenAI. 2024. OpenAI GPT-4o. https://platform.openai.com/docs/models/gpt-4o

  41. [41]

    Jeongmin Brian Park, Vikram Sharma Mailthody, Zaid Qureshi, and Wen-mei Hwu. 2024. Accelerating Sampling and Aggregation Operations in GNN Frameworks with GPU Initiated Direct Storage Accesses. Proceedings of the VLDB Endowment 17, 6 (2024), 1227–1240

  42. [42]

    Penghui Qi, Xinyi Wan, Guangxing Huang, and Min Lin. 2024. Zero Bubble (Almost) Pipeline Parallelism. In The Twelfth International Conference on Learning Representations

  43. [43]

    Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He. 2020. Zero: Memory optimizations toward training trillion parameter models. In SC20: International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, 1–16

  45. [45]

    Samyam Rajbhandari, Olatunji Ruwase, Jeff Rasley, Shaden Smith, and Yuxiong He. 2021. Zero-infinity: Breaking the gpu memory wall for extreme scale deep learning. In Proceedings of the international conference for high performance computing, networking, storage and analysis. 1–14

  46. [46]

    Machel Reid, Nikolay Savinov, Denis Teplyashin, Dmitry Lepikhin, Timothy Lillicrap, Jean-baptiste Alayrac, Radu Soricut, Angeliki Lazaridou, Orhan Firat, Julian Schrittwieser, et al. 2024. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530 (2024)

  47. [47]

    Jie Ren, Samyam Rajbhandari, Reza Yazdani Aminabadi, Olatunji Ruwase, Shuangyan Yang, Minjia Zhang, Dong Li, and Yuxiong He. 2021. Zero-offload: Democratizing billion-scale model training. In 2021 USENIX Annual Technical Conference (USENIX ATC 21). 551–564

  49. [49]

    Jay Shah, Ganesh Bikshandi, Ying Zhang, Vijay Thakkar, Pradeep Ramani, and Tri Dao. 2024. Flashattention-3: Fast and accurate attention with asynchrony and low-precision. arXiv preprint arXiv:2407.08608 (2024)

  50. [50]

    Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. 2023. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023)

  51. [51]

    Taegeon Um, Byungsoo Oh, Minyoung Kang, Woo-Yeon Lee, Goeun Kim, Dongseob Kim, Youngtaek Kim, Mohd Muzzammil, and Myeongjae Jeon. 2024. Metis: Fast Automatic Distributed Training on Heterogeneous GPUs. In 2024 USENIX Annual Technical Conference (USENIX ATC 24). 563–578

  52. [52]

    Colin Unger, Zhihao Jia, Wei Wu, Sina Lin, Mandeep Baines, Carlos Efrain Quintero Narvaez, Vinay Ramakrishnaiah, Nirmal Prajapati, Pat McCormick, Jamaludin Mohd-Yusof, et al. 2022. Unity: Accelerating DNN training through joint optimization of algebraic transformations and parallelization. In 16th USENIX Symposium on Operating Systems Design and Imple...

  53. [53]

    Yujie Wang, Youhe Jiang, Xupeng Miao, Fangcheng Fu, Shenhan Zhu, Xiaonan Nie, Yaofeng Tu, and Bin Cui. 2024. Improving Automatic Parallel Training via Balanced Memory Workload Optimization. IEEE Transactions on Knowledge and Data Engineering (2024)

  54. [54]

    Yujie Wang, Shiju Wang, Shenhan Zhu, Fangcheng Fu, Xinyi Liu, Xuefeng Xiao, Huixia Li, Jiashi Li, Faming Wu, and Bin Cui. 2024. Data-Centric and Heterogeneity-Adaptive Sequence Parallelism for Efficient LLM Training. arXiv preprint arXiv:2412.01523 (2024)

  55. [55]

    Yen-Chuen Wei and Chung-Kuan Cheng. 1989. Towards efficient hierarchical designs by ratio cut partitioning. In 1989 IEEE International Conference on Computer-Aided Design. Digest of Technical Papers. IEEE, 298–301

  56. [56]

    Kun Wu, Jeongmin Brian Park, Xiaofan Zhang, Mert Hidayetoğlu, Vikram Sharma Mailthody, Sitao Huang, Steven Sam Lumetta, and Wen-mei Hwu. 2024. TBA: Faster Large Language Model Training Using SSD-Based Activation Offloading. arXiv preprint arXiv:2408.10013 (2024)

  57. [57]

    Bowen Yang, Jian Zhang, Jonathan Li, Christopher Ré, Christopher Aberger, and Christopher De Sa. 2021. Pipemare: Asynchronous pipeline parallel dnn training. Proceedings of Machine Learning and Systems 3 (2021), 269–296

  58. [58]

    Xiaodong Yi, Shiwei Zhang, Ziyue Luo, Guoping Long, Lansong Diao, Chuan Wu, Zhen Zheng, Jun Yang, and Wei Lin. 2020. Optimizing distributed training deployment in heterogeneous GPU clusters. In Proceedings of the 16th International Conference on emerging Networking EXperiments and Technologies. 93–107

  59. [59]

    Alex Young, Bei Chen, Chao Li, Chengen Huang, Ge Zhang, Guanwei Zhang, Heng Li, Jiangcheng Zhu, Jianqun Chen, Jing Chang, et al. 2024. Yi: Open foundation models by 01.ai. arXiv preprint arXiv:2403.04652 (2024)

  60. [60]

    Binhang Yuan, Yongjun He, Jared Davis, Tianyi Zhang, Tri Dao, Beidi Chen, Percy S Liang, Christopher Re, and Ce Zhang. 2022. Decentralized training of foundation models in heterogeneous environments. Advances in Neural Information Processing Systems 35 (2022), 25464–25477

  61. [61]

    Zhen Zhang, Shuai Zheng, Yida Wang, Justin Chiu, George Karypis, Trishul Chilimbi, Mu Li, and Xin Jin. 2022. MiCS: near-linear scaling for training gigantic model on public cloud. Proceedings of the VLDB Endowment 16, 1 (2022), 37–50

  62. [62]

    Yanli Zhao, Andrew Gu, Rohan Varma, Liang Luo, Chien-Chin Huang, Min Xu, Less Wright, Hamid Shojanazeri, Myle Ott, Sam Shleifer, et al. 2023. PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel. Proceedings of the VLDB Endowment 16, 12 (2023), 3848–3860

  63. [63]

    Lianmin Zheng, Zhuohan Li, Hao Zhang, Yonghao Zhuang, Zhifeng Chen, Yanping Huang, Yida Wang, Yuanzhong Xu, Danyang Zhuo, Eric P Xing, et al. 2022. Alpa: Automating inter- and Intra-Operator parallelism for distributed deep learning. In 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22). 559–578

  64. [64]

    Yonghao Zhuang, Lianmin Zheng, Zhuohan Li, Eric Xing, Qirong Ho, Joseph Gonzalez, Ion Stoica, Hao Zhang, and Hexu Zhao. 2023. On optimizing the communication of model parallelism. Proceedings of Machine Learning and Systems 5 (2023).