HexiScale: Facilitating Large Language Model Training over Heterogeneous Hardware

Bin Cui; Binhang Yuan; Fangcheng Fu; Ran Yan; Xiaonan Nie; Youhe Jiang

arxiv: 2409.01143 · v3 · pith:YZU5KN4Inew · submitted 2024-09-02 · 💻 cs.DC

Ran Yan, Youhe Jiang, Xiaonan Nie, Fangcheng Fu, Bin Cui, Binhang Yuan

Pith reviewed 2026-05-23 21:19 UTC · model grok-4.3

classification 💻 cs.DC

keywords heterogeneous GPUs · LLM training · model parallelism · asymmetric allocation · graph partitioning · distributed systems · resource utilization · constrained optimization

The pith

HexiScale enables LLM training on mixed GPUs by asymmetrically partitioning computations in data, pipeline, and tensor parallelism, matching homogeneous performance while delivering 1.5 to 2.4 times higher throughput than prior heterogeneous baselines.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents HexiScale, a system that trains large language models across clusters of different GPUs instead of requiring identical high-performance units. It supports uneven splits of the work in data parallelism, pipeline parallelism, and tensor parallelism, then casts the assignment of those splits to specific GPUs as a constrained optimization problem. An efficient hierarchical graph partitioning algorithm solves the problem so that faster GPUs receive proportionally more computation while communication costs stay manageable. If the approach holds, organizations could train models using whatever accelerators they already own rather than purchasing uniform fleets, without sacrificing speed relative to equal-total-FLOPS homogeneous clusters. Experiments on 7B to 30B models report that the system reaches parity with homogeneous baselines of matching theoretical compute and exceeds existing heterogeneous methods by 1.5× to 2.4×.

Core claim

HexiScale supports asymmetric partition of training computations across heterogeneous GPUs in the scope of data-, pipeline-, and tensor model parallelism. It formalizes the allocation as a constrained optimization problem and solves it with a hierarchical graph partitioning algorithm that fully leverages available computational power, yielding throughput comparable to state-of-the-art homogeneous baselines on equal-FLOPS GPU sets and 1.5× to 2.4× higher throughput than state-of-the-art heterogeneous baselines on the same mixed clusters for models ranging from 7B to 30B parameters.
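To make "allocation as a constrained optimization problem" concrete, here is a minimal toy sketch for the pipeline dimension only. It is not HexiScale's formulation or solver: the function name `best_layer_split`, the TFLOPS figures, and the per-stage memory caps are all hypothetical, and the search is brute force. It picks how many transformer layers each pipeline stage gets so that the slowest stage is as fast as possible, subject to a memory cap per stage.

```python
from itertools import product


def best_layer_split(num_layers, stage_tflops, stage_mem_layers):
    """Exhaustively search layer counts per pipeline stage.

    Objective: minimize the slowest stage's relative compute time,
    modeled as layers / TFLOPS (an assumed cost model).
    Constraints: every stage gets at least one layer, at most its
    memory-limited layer count, and all layers are assigned.
    """
    best = None
    ranges = [range(1, cap + 1) for cap in stage_mem_layers]
    for split in product(*ranges):
        if sum(split) != num_layers:
            continue  # violates the "assign every layer" constraint
        bottleneck = max(n / f for n, f in zip(split, stage_tflops))
        if best is None or bottleneck < best[0]:
            best = (bottleneck, split)
    if best is None:
        raise ValueError("no feasible split under the memory caps")
    return best[1], best[0]


# Hypothetical three-stage pipeline: one fast stage and two slower ones.
split, bottleneck = best_layer_split(
    num_layers=32,
    stage_tflops=[312.0, 125.0, 125.0],  # assumed peak TFLOPS per stage
    stage_mem_layers=[24, 12, 12],       # assumed memory caps, in layers
)
print(split, round(bottleneck, 4))
```

With these assumed numbers the fast stage takes 18 layers and each slow stage takes 7, whereas an even 11/11/10 split would leave the slow stages as the bottleneck. Brute force is only viable at toy scale, which is presumably why the paper turns to a graph partitioning heuristic instead.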

What carries the argument

The hierarchical graph partitioning algorithm that solves the constrained optimization problem for asymmetric allocation of training computations across heterogeneous GPUs while controlling communication and synchronization overheads.
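The paper's Figure 2 caption names four steps for the first phase: coarsen, partition, project, and refine. The sketch below shows that generic multilevel pattern on a toy GPU graph. It is an illustration under stated assumptions, not the authors' algorithm: the bandwidth threshold `fast_bw`, the greedy balancing rule, and the refinement rule (which looks only at compute balance and ignores communication cost) are all simplifications chosen here.

```python
from collections import defaultdict


def coarsen(gpus, edges, fast_bw):
    """(i) Merge GPUs joined by links of at least fast_bw into super-nodes."""
    parent = {g: g for g in gpus}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for (a, b), bw in edges.items():
        if bw >= fast_bw:
            parent[find(a)] = find(b)
    clusters = defaultdict(list)
    for g in gpus:
        clusters[find(g)].append(g)
    return list(clusters.values())


def partition(clusters, tflops, k):
    """(ii) Greedy balance: the heaviest super-node joins the lightest group."""
    def weight(cluster):
        return sum(tflops[g] for g in cluster)

    groups, load = [[] for _ in range(k)], [0.0] * k
    for cluster in sorted(clusters, key=weight, reverse=True):
        i = load.index(min(load))
        groups[i].append(cluster)
        load[i] += weight(cluster)
    return groups


def project(groups):
    """(iii) Expand super-nodes back into a per-GPU group label."""
    return {g: i for i, grp in enumerate(groups) for c in grp for g in c}


def group_loads(assign, tflops, k):
    load = [0.0] * k
    for g, i in assign.items():
        load[i] += tflops[g]
    return load


def refine(assign, tflops, k):
    """(iv) Move single GPUs from the heaviest to the lightest group
    while that narrows the compute gap between those two groups."""
    assign = dict(assign)
    while True:
        load = group_loads(assign, tflops, k)
        hi, lo = load.index(max(load)), load.index(min(load))
        gap = load[hi] - load[lo]
        movable = [g for g, i in assign.items()
                   if i == hi and 0 < tflops[g] < gap]
        if not movable:
            return assign
        best = min(movable, key=lambda g: abs(gap - 2 * tflops[g]))
        assign[best] = lo


def multilevel_partition(gpus, edges, tflops, k, fast_bw):
    coarse = partition(coarsen(gpus, edges, fast_bw), tflops, k)
    return refine(project(coarse), tflops, k)


# Hypothetical cluster: three machines with four GPUs each.
machines = {"m0": 312.0, "m1": 125.0, "m2": 125.0}  # assumed TFLOPS per GPU
gpus = [f"{m}-g{i}" for m in machines for i in range(4)]
tflops = {g: machines[g.split("-")[0]] for g in gpus}
edges = {(a, b): (300.0 if a.split("-")[0] == b.split("-")[0] else 10.0)
         for i, a in enumerate(gpus) for b in gpus[i + 1:]}
assign = multilevel_partition(gpus, edges, tflops, k=2, fast_bw=100.0)
print(group_loads(assign, tflops, 2))
```

Coarsening by link bandwidth keeps GPUs on the same machine together, so the expensive balancing step runs on a handful of super-nodes rather than on every GPU. A production refinement step would also weigh the communication cost of splitting a machine, which this sketch deliberately leaves out.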

If this is right

LLM training becomes possible on clusters containing mixed GPU generations or vendors without requiring replacement of the entire set.
Total cluster utilization rises because every GPU receives a share of work proportional to its speed rather than being limited by the slowest device.
The same allocation method produces consistent speedups across model scales from 7B to 30B parameters.
Heterogeneous clusters can deliver training throughput within the range of homogeneous clusters that have identical aggregate floating-point capacity.
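The second point above, giving each GPU work in proportion to its speed, can be sketched for data parallelism as a proportional batch split. This is a minimal illustration, not HexiScale's scheduler: `split_batch` is a hypothetical helper, peak TFLOPS stands in for measured speed, and rounding uses the largest-remainder method.

```python
def split_batch(global_batch, tflops):
    """Split global_batch samples across GPUs in proportion to tflops.

    Uses largest-remainder rounding so the integer shares always sum
    to global_batch.
    """
    total = sum(tflops)
    exact = [global_batch * f / total for f in tflops]
    shares = [int(x) for x in exact]  # floor, since all values are positive
    leftover = global_batch - sum(shares)
    # Hand the remaining samples to the GPUs with the largest fractions.
    order = sorted(range(len(tflops)),
                   key=lambda i: exact[i] - shares[i], reverse=True)
    for i in order[:leftover]:
        shares[i] += 1
    return shares


# Hypothetical mix: two fast GPUs and two slower ones (assumed peak TFLOPS).
speeds = [312.0, 312.0, 125.0, 125.0]
shares = split_batch(64, speeds)
print(shares)  # [23, 23, 9, 9]

# Relative time per step is set by the slowest GPU to finish its share.
proportional = max(s / f for s, f in zip(shares, speeds))
uniform = max(16 / f for f in speeds)
print(proportional < uniform)  # True
```

Under an even split the slow GPUs would each process 16 samples and everyone else would wait for them; the proportional split shortens that critical path, at the cost of uneven gradient contributions that a real system has to weight correctly.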

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Operators could incrementally add newer GPUs to an existing cluster and still obtain near-linear scaling without re-purchasing the older units.
The optimization formulation might be adapted to other distributed workloads such as distributed inference or scientific simulation codes that already use multiple parallelism styles.
Dynamic re-partitioning could be added later so the system reacts automatically when GPUs are added, removed, or experience thermal throttling.

Load-bearing premise

The hierarchical graph partitioning algorithm can solve the allocation problem fast enough and with low enough communication overhead that the gains from better compute distribution are not erased by extra synchronization costs.
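The premise can be explored with a rough latency model. The sketch below assumes a simple synchronous linear pipeline in which each microbatch passes through the stages in order, so a step takes the sum of the stage times plus (microbatches − 1) times the slowest stage, plus a communication term that is not overlapped with compute. All numbers are invented for illustration, and the model is far cruder than the paper's simulator.

```python
def pipeline_step_time(stage_times, microbatches, comm_time):
    """Rough per-step latency of a synchronous linear pipeline.

    stage_times: seconds one microbatch spends in each stage (assumed).
    comm_time: non-overlapped communication per step (assumed).
    """
    bottleneck = max(stage_times)
    steady_state = microbatches * bottleneck     # bottleneck stage busy time
    fill_drain = sum(stage_times) - bottleneck   # pipeline bubble
    return {
        "steady_state": steady_state,
        "fill_drain": fill_drain,
        "comm": comm_time,
        "total": steady_state + fill_drain + comm_time,
    }


# Even layer split on mixed GPUs: the fast stage idles behind the slow ones.
uniform = pipeline_step_time([0.10, 0.25, 0.25], microbatches=8, comm_time=0.05)

# Asymmetric split: balanced stages, but assume it triples communication.
asymmetric = pipeline_step_time([0.16, 0.16, 0.16], microbatches=8, comm_time=0.15)

print(round(uniform["total"], 2), round(asymmetric["total"], 2))

# Extra communication the asymmetric plan could absorb before it loses.
break_even_comm = uniform["total"] - (asymmetric["total"] - asymmetric["comm"])
print(round(break_even_comm, 2))
```

In this toy setting the asymmetric plan still wins after tripling communication, and it would keep winning until its communication term reached the break-even value. The premise fails exactly when real synchronization costs exceed that margin.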

What would settle it

Measure end-to-end training throughput of HexiScale on a heterogeneous GPU cluster against a homogeneous cluster whose total theoretical FLOPS match; if throughput falls materially below the homogeneous case or fails to exceed other heterogeneous systems by at least 1.5×, the central performance claims do not hold.
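The test described above can be written down as a small check. This is a sketch only: `claims_hold` is a hypothetical helper, the throughput numbers are placeholders rather than results from the paper, and the 10% `parity_tolerance` is an assumption standing in for "materially below", which the text does not quantify.

```python
def claims_hold(hexi_hetero, homo_baseline, hetero_baselines,
                parity_tolerance=0.10, min_speedup=1.5):
    """Check both performance claims from measured throughputs.

    hexi_hetero: HexiScale throughput on the heterogeneous cluster.
    homo_baseline: best homogeneous system on an equal-FLOPS cluster.
    hetero_baselines: {system name: throughput} on the same mixed cluster.
    """
    parity = hexi_hetero >= (1.0 - parity_tolerance) * homo_baseline
    speedups = {name: hexi_hetero / tput
                for name, tput in hetero_baselines.items()}
    beats_all = all(s >= min_speedup for s in speedups.values())
    return parity and beats_all, speedups


# Placeholder throughputs in tokens per second, not measurements.
ok, speedups = claims_hold(
    hexi_hetero=100.0,
    homo_baseline=105.0,
    hetero_baselines={"baseline_a": 60.0, "baseline_b": 45.0},
)
print(ok, {k: round(v, 2) for k, v in speedups.items()})
```

Any real replication would also need repeated runs and variance estimates before comparing against the thresholds, since a single throughput number per system cannot distinguish a 1.5× gain from noise.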

Figures

Figures reproduced from arXiv: 2409.01143 by Bin Cui, Binhang Yuan, Fangcheng Fu, Ran Yan, Xiaonan Nie, Youhe Jiang.

**Figure 1.** Case study comparing the state-of-the-art training system Megatron with HexiScale. Both systems run their optimal parallel strategies on the given three machines. Plan 3 is a potentially good configuration. We fine-tune this plan by maximizing the number of transformer layers that use high intra-machine bandwidth for data parallel communication.

**Figure 2.** First phase: the global graph is partitioned into three groups of GPUs in four steps: (i) coarsen, (ii) partition, (iii) project, and (iv) refine. The GPUs in the global graph are divided into three groups, which are then constructed as three pipelines. The key insight of the first-phase algorithm is to partition the GPUs into multiple groups, with each group forming a separate pipeline.

**Figure 3.** Second phase: each pipeline is created in three steps. (i) GPUs with high-bandwidth connections are grouped by graph partitioning. (ii) An intra-group strategy is searched separately for each machine, i.e., for the GPUs in the same machine. (iii) The pipeline stage order is determined by permuting all intra-group strategies with a top-τ greedy search algorithm until a pipeline path is generated.

**Figure 4.** End-to-end experiments of HexiScale compared with other systems under various experimental settings with Llama-2 (7B) and Llama-2 (13B) models.
[Plot text recovered from the extraction: panels Homo-RDMA, Homo-Ethernet, and Hetero-Setting-3 for Llama (30B); axis MFU (%); legend HexiScale, Galvatron, Megatron, FSDP.]

**Figure 5.** End-to-end experiments of HexiScale compared with other systems under various experimental settings with the Llama (30B) model.

**Figure 6.** Breakdown experiments of HexiScale with Llama2 (7B), Llama2 (13B), and Llama (30B) models under heterogeneous settings 1 and 3.
[Plot text recovered from the extraction: panels Llama2 (7B) and Llama2 (13B) under Hetero-1 and Hetero-2, and Llama (30B) under Hetero-3; axis Latency (ms); legend Comm Time, PP Bubble Time, and Compute Time for Galvatron and HexiScale.]

**Figure 7.** Breakdown of end-to-end time across different heterogeneous experimental settings and models. We benchmark the per-batch communication time, computation time, and pipeline bubble time for HexiScale and Galvatron.

**Figure 8.** Convergence comparison of the proposed search strategy and random graph partition with Llama-2 (7B) (left) and (30B) (right) models, where both run 20 times.

**Figure 11.** Latency breakdown of HexiScale and Metis in heterogeneous setting 3 with the Llama (30B) model.

**Figure 10.** HexiScale vs. Metis and Galvatron.
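The Figure 3 caption mentions a top-τ greedy search for ordering pipeline stages. One plausible reading, sketched here purely as an assumption, is a beam-style search that keeps the τ cheapest partial orders at each step and extends them until a full pipeline path exists. The function name, the group labels, and the link costs below are hypothetical; the paper's actual scoring function is not given in this page.

```python
def top_tau_order(groups, link_cost, tau):
    """Order pipeline stages by keeping the tau cheapest partial paths.

    groups: stage candidates to be ordered into one pipeline path.
    link_cost[(a, b)]: assumed cost of sending activations from a to b.
    Returns (order, total_cost) for the best complete path found.
    """
    # Every group is a possible first stage; pruning starts after step one.
    beams = [((g,), 0.0) for g in groups]
    for _ in range(len(groups) - 1):
        candidates = []
        for path, cost in beams:
            for g in groups:
                if g not in path:
                    step = link_cost[(path[-1], g)]
                    candidates.append((path + (g,), cost + step))
        # Greedy pruning: only the tau cheapest partial paths survive.
        beams = sorted(candidates, key=lambda item: item[1])[:tau]
    return beams[0]


# Hypothetical symmetric link costs between four intra-machine groups.
names = ["A", "B", "C", "D"]
cheap = {("A", "B"), ("B", "C"), ("C", "D")}
link_cost = {}
for x in names:
    for y in names:
        if x != y:
            near = (x, y) in cheap or (y, x) in cheap
            link_cost[(x, y)] = 1.0 if near else 5.0

order, cost = top_tau_order(names, link_cost, tau=2)
print(order, cost)
```

With τ equal to 1 this reduces to a plain greedy walk, and with τ large enough it becomes exhaustive search over all permutations, so τ trades search time against the risk of pruning the best order early.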

Original abstract

Training large language models (LLMs) is a computationally intensive task, which is typically conducted in data centers with homogeneous high-performance GPUs. In this paper, we explore an alternative approach by deploying training computations across heterogeneous GPUs to enable better flexibility and efficiency for heterogeneous resource utilization. Toward this end, we propose a novel system, HexiScale, that can flexibly support asymmetric partition of training computations in the scope of data-, pipeline-, and tensor model parallelism. We further formalize the allocation of asymmetric partitioned training computations over a set of heterogeneous GPUs as a constrained optimization problem and propose an efficient hierarchical graph partitioning algorithm. Our approach effectively allocates training computations across heterogeneous GPUs, fully leveraging the available computational power. We compare the performance of HexiScale with state-of-the-art homogeneous and heterogeneous training systems. When training LLMs at different scales (from 7B to 30B), empirical results demonstrate that: (i) compared to state-of-the-art homogeneous baselines running over homogeneous GPUs, HexiScale achieves similar performance when running over heterogeneous GPUs with the same theoretical FLOPS; (ii) compared to state-of-the-art heterogeneous baselines running on the same heterogeneous clusters, HexiScale delivers 1.5× to 2.4× higher throughput.
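The comparison at "the same theoretical FLOPS" is usually expressed as model FLOPs utilization (MFU), the metric on the axis of the paper's end-to-end plots. The sketch below uses the common approximation of 6 FLOPs per parameter per training token, which ignores attention FLOPs, and treats a mixed cluster's peak as the sum of per-GPU peaks. The throughput and GPU figures are placeholders, not values from the paper.

```python
def mfu(params, tokens_per_sec, peak_tflops_per_gpu):
    """Model FLOPs utilization of a training run.

    params: number of model parameters.
    tokens_per_sec: measured end-to-end training throughput.
    peak_tflops_per_gpu: peak TFLOPS of every GPU in the cluster;
        heterogeneous clusters simply list different values.
    Assumes roughly 6 FLOPs per parameter per token (forward plus
    backward), an approximation that leaves out attention.
    """
    achieved = 6.0 * params * tokens_per_sec
    peak = sum(peak_tflops_per_gpu) * 1e12
    return achieved / peak


# Placeholder numbers: a 7B model on eight fast and eight slower GPUs.
cluster = [312.0] * 8 + [125.0] * 8  # assumed peak TFLOPS per GPU
value = mfu(params=7e9, tokens_per_sec=2e4, peak_tflops_per_gpu=cluster)
print(f"{value:.1%}")
```

Because the denominator is aggregate peak compute, two clusters with equal total TFLOPS are directly comparable on MFU even when one of them mixes GPU types.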

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

HexiScale claims to match homogeneous LLM training performance on equal-FLOPS heterogeneous GPUs and to beat other heterogeneous systems by 1.5× to 2.4× via asymmetric three-way parallelism and a hierarchical graph partitioner, but the abstract supplies no implementation details or experimental controls to assess those numbers.

HexiScale is a system that trains LLMs on mixed GPUs by allowing asymmetric splits across data, pipeline, and tensor parallelism, then solves the resulting allocation as a constrained optimization with a hierarchical graph partitioning algorithm. The core idea is to avoid wasting fast GPUs on the pace of the slowest ones while keeping communication manageable. The paper reports that this matches standard homogeneous baselines when total FLOPS are the same and improves throughput 1.5× to 2.4× over prior heterogeneous approaches on 7B to 30B models.

That framing of the problem is useful; real clusters often have heterogeneous hardware, and most existing frameworks either ignore it or handle it coarsely. The three-dimensional asymmetry plus the graph-based solver is the concrete step forward from earlier work that typically treated heterogeneity in only one dimension or used simpler heuristics. The empirical direction is also reasonable: direct throughput comparisons on the same clusters make the claims falsifiable in principle.

The main weakness is that the abstract gives almost no experimental substance. There are no descriptions of the exact baselines, the GPU mixes tested, run counts, variance, or ablations showing that the hierarchical solver is what drives the gains rather than implementation tricks or favorable hardware. Without those, the 1.5× to 2.4× figure is hard to interpret. The optimization formulation itself is only sketched at a high level, so it is unclear how well it scales or how communication overhead is bounded in practice.

This work is aimed at people building or tuning distributed training stacks who care about cost-effective use of whatever GPUs are available. Systems researchers who already follow Megatron, DeepSpeed, or similar frameworks would see the most direct value.

It is worth sending to peer review because the problem is timely and the approach is specific enough to be critiqued and improved, even though the current evidence is thin and will need substantial expansion on experiments and reproducibility.

Referee Report

1 major / 0 minor

Summary. The paper proposes HexiScale, a system for LLM training on heterogeneous GPUs that supports asymmetric partitioning across data, pipeline, and tensor parallelism. It formalizes the allocation as a constrained optimization problem and solves it with a hierarchical graph partitioning algorithm. For 7B–30B models, it claims performance comparable to homogeneous baselines on equal-FLOPS heterogeneous hardware and 1.5×–2.4× higher throughput than state-of-the-art heterogeneous baselines on the same clusters.

Significance. If the empirical claims hold, the work could meaningfully advance distributed training by enabling efficient use of mixed GPU resources, reducing reliance on uniform high-end clusters. The hierarchical partitioning approach addresses a practical optimization challenge in heterogeneous settings.

major comments (1)

[Abstract] The central claims of comparable performance to homogeneous baselines and 1.5×–2.4× throughput gains over heterogeneous baselines are stated without any visible implementation details, benchmark configurations, error bars, ablation studies, or experimental methodology. These details are load-bearing for verifying the throughput results.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the review and the opportunity to respond. We address the single major comment below.

Referee: [Abstract] The central claims of comparable performance to homogeneous baselines and 1.5×–2.4× throughput gains over heterogeneous baselines are stated without any visible implementation details, benchmark configurations, error bars, ablation studies, or experimental methodology. These details are load-bearing for verifying the throughput results.

Authors: We agree that the abstract itself is a concise summary and does not contain implementation details, benchmark configurations, error bars, ablation studies, or experimental methodology; this is by design given length constraints. The full manuscript provides these elements in Section 3 (system architecture and asymmetric parallelism), Section 4 (hierarchical graph partitioning algorithm and optimization formulation), and especially Section 5 (evaluation), which details the hardware clusters, model sizes (7B–30B), baseline systems, throughput measurements with error bars, ablation studies on partitioning strategies, and full experimental methodology. The abstract claims are therefore supported by the body of the paper rather than standing alone.

Revision: no

Circularity Check

0 steps flagged

No significant circularity; claims are empirical comparisons

The paper presents HexiScale as a system that formalizes asymmetric allocation as a constrained optimization problem solved by a hierarchical graph partitioning algorithm, then reports empirical throughput measurements against homogeneous and heterogeneous baselines for 7B-30B models. No equations, fitted parameters, or predictions are described that reduce to inputs by construction. No self-citations are invoked as load-bearing uniqueness theorems. The central claims rest on direct experimental comparisons rather than any derivation chain that could exhibit circularity. The provided text contains no internal reductions of the form 'prediction equals fit by definition.'

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Review performed on abstract only; the main unstated premise is that the proposed optimization and partitioning approach incurs acceptable overhead on real hardware.

axioms (1)

Domain assumption: Asymmetric partitioning of data, pipeline, and tensor parallelism can be performed without prohibitive communication costs on heterogeneous GPUs.
Source: implicit in the design of HexiScale as described in the abstract.

pith-pipeline@v0.9.0 · 5766 in / 1192 out tokens · 21857 ms · 2026-05-23T21:19:04.677033+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: We formalize the allocation of asymmetric partitioned training computations over a set of heterogeneous GPUs as a constrained optimization problem and propose an efficient hierarchical graph partitioning algorithm.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: HexiScale achieves comparable MFU when running over heterogeneous GPUs compared to state-of-the-art training systems running over homogeneous high-performance GPUs with the same total peak FLOPS.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Autopoiesis: A Self-Evolving System Paradigm for LLM Serving Under Runtime Dynamics
cs.DC · 2026-04 · unverdicted · novelty 7.0
Autopoiesis uses LLM-driven program synthesis to evolve serving policies online during deployment, delivering up to 53% and average 34% gains over prior LLM serving systems under runtime dynamics.

HexiSeq: Accommodating Long Context Training of LLMs over Heterogeneous Hardware
cs.DC · 2026-05 · unverdicted · novelty 6.0
HexiSeq optimizes sequence and head partitioning across mixed GPUs to improve long-context LLM training throughput by up to 1.72x in simulations.

HARP: Orchestrating Automated Parallel Training on Heterogeneous GPU Clusters
cs.DC · 2025-09 · unverdicted · novelty 6.0
HARP provides a fine-grained inter-operator parallel planner and a heterogeneity-aware 1F1B scheduler that together improve training throughput by 1.3x-1.6x on mixed GPU clusters compared with current homogeneous-orie...

Rethinking Data Curation in LLM Training: Online Reweighting Offers Better Generalization than Offline Methods
cs.LG · 2026-04 · unverdicted · novelty 5.0
ADAPT is an online reweighting framework for LLM training that outperforms offline data selection and mixing methods in cross-benchmark generalization under equal compute.

Reference graph

Works this paper leans on

64 extracted references · 64 canonical work pages · cited by 4 Pith papers · 7 internal anchors

[1]

Xin Ai, Qiange Wang, Chunyu Cao, Yanfeng Zhang, Chaoyi Chen, Hao Yuan, Yu Gu, and Ge Yu. 2024. NeutronOrch: Rethinking Sample- Based GNN Training under CPU-GPU Heterogeneous Environments. Proceedings of the VLDB Endowment 17, 8 (2024), 1995–2008

work page 2024
[2]

Amazon. 2024. Amazon EC2 Instance types. https://aws.amazon.com/ ec2/instance-types/

work page 2024
[3]

Anthropic. 2024. The Claude 3 Model Family: Opus, Sonnet, Haiku. https://www-cdn.anthropic.com/ de8ba9b01c9ab7cbabf5c33b80b7bbc618857627/Model_Card_ Claude_3.pdf

work page 2024
[4]

Rishi Bommasani, Drew A Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, et al. 2021. On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258 (2021)

work page internal anchor Pith review Pith/arXiv arXiv 2021
[5]

Zhenkun Cai, Xiao Yan, Kaihao Ma, Yidi Wu, Yuzhen Huang, James Cheng, Teng Su, and Fan Yu. 2021. Tensoropt: Exploring the tradeoffs in distributed dnn training with auto-parallelism. IEEE Transactions on Parallel and Distributed Systems 33, 8 (2021), 1967–1981

work page 2021
[6]

Ping Chen, Wenjie Zhang, Shuibing He, Yingjie Gu, Zhuwei Peng, Kexin Huang, Xuan Zhan, Weijian Chen, Yi Zheng, Zhefeng Wang, et al. 2024. Optimizing Large Model Training through Overlapped Activation Recomputation. arXiv preprint arXiv:2406.08756 (2024)

work page arXiv 2024
[7]

Tri Dao. 2024. FlashAttention-2: Faster Attention with Better Paral- lelism and Work Partitioning. In The Twelfth International Conference on Learning Representations

work page 2024
[8]

Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Ka- dian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. 2024. The llama 3 herd of models. arXiv preprint arXiv:2407.21783 (2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024
[9]

Bruce Hendrickson, Robert W Leland, et al. 1995. A Multi-Level Algo- rithm For Partitioning Graphs. SC 95, 28 (1995), 1–14

work page 1995
[10]

Yanping Huang, Youlong Cheng, Ankur Bapna, Orhan Firat, Dehao Chen, Mia Chen, HyoukJoong Lee, Jiquan Ngiam, Quoc V Le, Yonghui Wu, et al. 2019. Gpipe: Efficient training of giant neural networks using pipeline parallelism. Advances in neural information processing systems 32 (2019)

work page 2019
[11]

Technology Innovation Institute. 2023. Falcon 180B. https://falconllm. tii.ae/falcon-180b.html

work page 2023
[12]

Xianyan Jia, Le Jiang, Ang Wang, Wencong Xiao, Ziji Shi, Jie Zhang, Xinyuan Li, Langshi Chen, Yong Li, Zhen Zheng, et al. 2022. Whale: Efficient giant model training over heterogeneous {GPUs }. In 2022 USENIX Annual Technical Conference (USENIX ATC 22) . 673–688

work page 2022
[13]

Zhihao Jia, Sina Lin, Charles R Qi, and Alex Aiken. 2018. Exploring Hidden Dimensions in Parallelizing Convolutional Neural Networks.. In ICML, Vol. 2279. 2288

work page 2018
[14]

Zhihao Jia, Matei Zaharia, and Alex Aiken. 2019. Beyond data and model parallelism for deep neural networks. Proceedings of Machine Learning and Systems 1 (2019), 1–13

work page 2019
[15]

Albert Q Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, et al . 2024. Mixtral of experts. arXiv preprint arXiv:2401.04088 (2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024
[16]

Youhe Jiang, Fangcheng Fu, Xupeng Miao, Xiaonan Nie, and Bin Cui

work page
[17]

In Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence

OSDP: Optimal sharded data parallel for distributed deep learn- ing. In Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence. 2142–2150

work page
[18]

Youhe Jiang, Fangcheng Fu, Xiaozhe Yao, Guoliang He, Xupeng Miao, Ana Klimovic, Bin Cui, Binhang Yuan, and Eiko Yoneki. 2025. De- mystifying Cost-Efficiency in LLM Serving over Heterogeneous GPUs. arXiv preprint arXiv:2502.00722 (2025)

work page arXiv 2025
[19]

Youhe Jiang, Fangcheng Fu, Xiaozhe Yao, Taiyi Wang, Bin Cui, Ana Klimovic, and Eiko Yoneki. 2025. ThunderServe: High-performance and Cost-efficient LLM Serving in Cloud Environments. arXiv preprint arXiv:2502.09334 (2025)

work page arXiv 2025
[20]

Youhe Jiang, Ran Yan, Xiaozhe Yao, Yang Zhou, Beidi Chen, and Bin- hang Yuan. 2024. HexGen: Generative Inference of Large Language Model over Heterogeneous Environment. In Forty-first International Conference on Machine Learning

work page 2024
[21]

Youhe Jiang, Ran Yan, and Binhang Yuan. 2025. HexGen-2: Disaggre- gated Generative Inference of LLMs in Heterogeneous Environment. arXiv preprint arXiv:2502.07903 (2025)

work page arXiv 2025
[22]

George Karypis and Vipin Kumar. 1998. A fast and high quality mul- tilevel scheme for partitioning irregular graphs. SIAM Journal on scientific Computing 20, 1 (1998), 359–392

work page 1998
[23]

George Karypis and Vipin Kumar. 1998. Multilevel algorithms for multi-constraint graph partitioning. In SC’98: Proceedings of the 1998 ACM/IEEE Conference on Supercomputing. IEEE, 28–28

work page 1998
[24]

Brian W Kernighan and Shen Lin. 1970. An efficient heuristic proce- dure for partitioning graphs. The Bell system technical journal 49, 2 (1970), 291–307

work page 1970
[25]

Shen Li, Yanli Zhao, Rohan Varma, Omkar Salpekar, Pieter Noordhuis, Teng Li, Adam Paszke, Jeff Smith, Brian Vaughan, Pritam Damania, et al. 2020. PyTorch distributed: experiences on accelerating data parallel training. Proceedings of the VLDB Endowment 13, 12 (2020), 3005–3018

work page 2020
[26]

Zhiyuan Li, Xun Jian, Yue Wang, Yingxia Shao, and Lei Chen. 2024. DAHA: Accelerating GNN Training with Data and Hardware Aware Execution Planning. Proceedings of the VLDB Endowment 17, 6 (2024), 1364–1376. 12

work page 2024
[27]

Yixuan Mei, Yonghao Zhuang, Xupeng Miao, Juncheng Yang, Zhihao Jia, and Rashmi Vinayak. 2024. Helix: Distributed Serving of Large Lan- guage Models via Max-Flow on Heterogeneous GPUs. arXiv preprint arXiv:2406.01566 (2024)

work page arXiv 2024
[28]

Xupeng Miao, Xiaonan Nie, Yingxia Shao, Zhi Yang, Jiawei Jiang, Lingxiao Ma, and Bin Cui. 2021. Heterogeneity-aware distributed machine learning training via partial reduce. In Proceedings of the 2021 International Conference on Management of Data . 2262–2270

work page 2021
[29]

Xupeng Miao, Yining Shi, Zhi Yang, Bin Cui, and Zhihao Jia. 2023. Sdpipe: A semi-decentralized framework for heterogeneity-aware pipeline-parallel training. Proceedings of the VLDB Endowment 16, 9 (2023), 2354–2363

work page 2023
[30]

Xupeng Miao, Yujie Wang, Youhe Jiang, Chunan Shi, Xiaonan Nie, Hailin Zhang, and Bin Cui. 2022. Galvatron: Efficient Transformer Training over Multiple GPUs Using Automatic Parallelism.Proceedings of the VLDB Endowment 16, 3 (2022), 470–479

work page 2022
[31]

Kabir Nagrecha. 2021. Model-parallel model selection for deep learn- ing systems. In Proceedings of the 2021 international conference on management of data. 2929–2931

work page 2021
[32]

Deepak Narayanan, Aaron Harlap, Amar Phanishayee, Vivek Seshadri, Nikhil R Devanur, Gregory R Ganger, Phillip B Gibbons, and Matei Zaharia. 2019. PipeDream: generalized pipeline parallelism for DNN training. In Proceedings of the 27th ACM Symposium on Operating Systems Principles. 1–15

work page 2019
[33]

Deepak Narayanan, Amar Phanishayee, Kaiyu Shi, Xie Chen, and Matei Zaharia. 2021. Memory-efficient pipeline-parallel dnn training. In International Conference on Machine Learning . PMLR, 7937–7947

work page 2021
[34]

Deepak Narayanan, Mohammad Shoeybi, Jared Casper, Patrick LeGres- ley, Mostofa Patwary, Vijay Korthikanti, Dmitri Vainbrand, Prethvi Kashinkunti, Julie Bernauer, Bryan Catanzaro, et al. 2021. Efficient large-scale language model training on gpu clusters using megatron- lm. In Proceedings of the International Conference for High Performance Computing, Netw...

work page 2021
[35]

Nvidia. 2006. GPU Computing Solutions for HPC. https://www.nvidia. com/docs/IO/43395/tesla_product_overview_dec.pdf

work page 2006
[36]

Nvidia. 2018. NVIDIA Reinvents Computer Graphics with Turing Architecture. https://nvidianews.nvidia.com/news/nvidia-reinvents- computer-graphics-with-turing-architecture

work page 2018
[37]

Nvidia. 2020. NVIDIA’s New Ampere Data Center GPU in Full Pro- duction. https://nvidianews.nvidia.com/news/nvidias-new-ampere- data-center-gpu-in-full-production

work page 2020
[38]

Nvidia. 2022. NVIDIA Announces Hopper Architec- ture, the Next Generation of Accelerated Computing. https://nvidianews.nvidia.com/news/nvidia-announces-hopper- architecture-the-next-generation-of-accelerated-computing

work page 2022
[39]

Nvidia. 2024. NVIDIA Blackwell Platform Arrives to Power a New Era of Computing. https://nvidianews.nvidia.com/news/nvidia-blackwell- platform-arrives-to-power-a-new-era-of-computing

work page 2024
[40]

OpenAI. 2024. OpenAI GPT-4o. https://platform.openai.com/docs/ models/gpt-4o

work page 2024
[41]

Jeongmin Brian Park, Vikram Sharma Mailthody, Zaid Qureshi, and Wen-mei Hwu. 2024. Accelerating Sampling and Aggregation Opera- tions in GNN Frameworks with GPU Initiated Direct Storage Accesses. Proceedings of the VLDB Endowment 17, 6 (2024), 1227–1240

work page 2024
[42]

Penghui Qi, Xinyi Wan, Guangxing Huang, and Min Lin. 2024. Zero Bubble (Almost) Pipeline Parallelism. In The Twelfth International Con- ference on Learning Representations

work page 2024
[43]

Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He

work page
[44]

In SC20: International Conference for High Performance Computing, Networking, Storage and Analysis

Zero: Memory optimizations toward training trillion param- eter models. In SC20: International Conference for High Performance Computing, Networking, Storage and Analysis . IEEE, 1–16

work page
[45]

Samyam Rajbhandari, Olatunji Ruwase, Jeff Rasley, Shaden Smith, and Yuxiong He. 2021. Zero-infinity: Breaking the gpu memory wall for extreme scale deep learning. In Proceedings of the international conference for high performance computing, networking, storage and analysis. 1–14

work page 2021
[46]

Machel Reid, Nikolay Savinov, Denis Teplyashin, Dmitry Lepikhin, Timothy Lillicrap, Jean-baptiste Alayrac, Radu Soricut, Angeliki Lazari- dou, Orhan Firat, Julian Schrittwieser, et al. 2024. Gemini 1.5: Unlock- ing multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530 (2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024
[47]

Jie Ren, Samyam Rajbhandari, Reza Yazdani Aminabadi, Olatunji Ruwase, Shuangyan Yang, Minjia Zhang, Dong Li, and Yuxiong He

work page
[48]

In 2021 USENIX Annual Technical Conference (USENIX ATC 21)

{Zero-offload}: Democratizing {billion-scale} model training. In 2021 USENIX Annual Technical Conference (USENIX ATC 21) . 551–564

work page 2021
[49]

Jay Shah, Ganesh Bikshandi, Ying Zhang, Vijay Thakkar, Pradeep Ra- mani, and Tri Dao. 2024. Flashattention-3: Fast and accurate attention with asynchrony and low-precision. arXiv preprint arXiv:2407.08608 (2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024
[50]

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie- Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. 2023. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023
[51]

Taegeon Um, Byungsoo Oh, Minyoung Kang, Woo-Yeon Lee, Goeun Kim, Dongseob Kim, Youngtaek Kim, Mohd Muzzammil, and Myeong- jae Jeon. 2024. Metis: Fast Automatic Distributed Training on Het- erogeneous {GPUs }. In 2024 USENIX Annual Technical Conference (USENIX ATC 24). 563–578

[52]

Colin Unger, Zhihao Jia, Wei Wu, Sina Lin, Mandeep Baines, Carlos Efrain Quintero Narvaez, Vinay Ramakrishnaiah, Nirmal Prajapati, Pat McCormick, Jamaludin Mohd-Yusof, et al. 2022. Unity: Accelerating DNN training through joint optimization of algebraic transformations and parallelization. In 16th USENIX Symposium on Operating Systems Design and Imple...

[53]

Yujie Wang, Youhe Jiang, Xupeng Miao, Fangcheng Fu, Shenhan Zhu, Xiaonan Nie, Yaofeng Tu, and Bin Cui. 2024. Improving Automatic Parallel Training via Balanced Memory Workload Optimization. IEEE Transactions on Knowledge and Data Engineering (2024)

[54]

Yujie Wang, Shiju Wang, Shenhan Zhu, Fangcheng Fu, Xinyi Liu, Xuefeng Xiao, Huixia Li, Jiashi Li, Faming Wu, and Bin Cui. 2024. Data-Centric and Heterogeneity-Adaptive Sequence Parallelism for Efficient LLM Training. arXiv preprint arXiv:2412.01523 (2024)

[55]

Yen-Chuen Wei and Chung-Kuan Cheng. 1989. Towards efficient hierarchical designs by ratio cut partitioning. In 1989 IEEE International Conference on Computer-Aided Design. Digest of Technical Papers. IEEE, 298–301

[56]

Kun Wu, Jeongmin Brian Park, Xiaofan Zhang, Mert Hidayetoğlu, Vikram Sharma Mailthody, Sitao Huang, Steven Sam Lumetta, and Wen-mei Hwu. 2024. TBA: Faster Large Language Model Training Using SSD-Based Activation Offloading. arXiv preprint arXiv:2408.10013 (2024)

[57]

Bowen Yang, Jian Zhang, Jonathan Li, Christopher Ré, Christopher Aberger, and Christopher De Sa. 2021. Pipemare: Asynchronous pipeline parallel dnn training. Proceedings of Machine Learning and Systems 3 (2021), 269–296

[58]

Xiaodong Yi, Shiwei Zhang, Ziyue Luo, Guoping Long, Lansong Diao, Chuan Wu, Zhen Zheng, Jun Yang, and Wei Lin. 2020. Optimizing distributed training deployment in heterogeneous GPU clusters. In Proceedings of the 16th International Conference on emerging Networking EXperiments and Technologies. 93–107

[59]

Alex Young, Bei Chen, Chao Li, Chengen Huang, Ge Zhang, Guanwei Zhang, Heng Li, Jiangcheng Zhu, Jianqun Chen, Jing Chang, et al. 2024. Yi: Open foundation models by 01.AI. arXiv preprint arXiv:2403.04652 (2024)

[60]

Binhang Yuan, Yongjun He, Jared Davis, Tianyi Zhang, Tri Dao, Beidi Chen, Percy S Liang, Christopher Re, and Ce Zhang. 2022. Decentralized training of foundation models in heterogeneous environments. Advances in Neural Information Processing Systems 35 (2022), 25464–25477

[61]

Zhen Zhang, Shuai Zheng, Yida Wang, Justin Chiu, George Karypis, Trishul Chilimbi, Mu Li, and Xin Jin. 2022. MiCS: near-linear scaling for training gigantic model on public cloud. Proceedings of the VLDB Endowment 16, 1 (2022), 37–50

[62]

Yanli Zhao, Andrew Gu, Rohan Varma, Liang Luo, Chien-Chin Huang, Min Xu, Less Wright, Hamid Shojanazeri, Myle Ott, Sam Shleifer, et al. 2023. PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel. Proceedings of the VLDB Endowment 16, 12 (2023), 3848–3860

[63]

Lianmin Zheng, Zhuohan Li, Hao Zhang, Yonghao Zhuang, Zhifeng Chen, Yanping Huang, Yida Wang, Yuanzhong Xu, Danyang Zhuo, Eric P Xing, et al. 2022. Alpa: Automating inter- and Intra-Operator parallelism for distributed deep learning. In 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22). 559–578

[64]

Yonghao Zhuang, Lianmin Zheng, Zhuohan Li, Eric Xing, Qirong Ho, Joseph Gonzalez, Ion Stoica, Hao Zhang, and Hexu Zhao. 2023. On optimizing the communication of model parallelism. Proceedings of Machine Learning and Systems 5 (2023).

A Cost Modeling

In this section, we model the Comm-Cost, Comp-Cost, and Mem-Cumsum step by step. First we model cost for...

