HexiScale: Facilitating Large Language Model Training over Heterogeneous Hardware
Pith reviewed 2026-05-23 21:19 UTC · model grok-4.3
The pith
HexiScale enables LLM training on mixed GPUs by asymmetrically partitioning computations in data, pipeline and tensor parallelism, matching homogeneous performance while delivering 1.5 to 2.4 times higher throughput than prior heterogeneous baselines.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
HexiScale supports asymmetric partition of training computations across heterogeneous GPUs in the scope of data-, pipeline-, and tensor model parallelism. It formalizes the allocation as a constrained optimization problem and solves it with a hierarchical graph partitioning algorithm that fully leverages available computational power, yielding throughput comparable to state-of-the-art homogeneous baselines on equal-FLOPS GPU sets and 1.5× to 2.4× higher throughput than state-of-the-art heterogeneous baselines on the same mixed clusters for models ranging from 7B to 30B parameters.
What carries the argument
The hierarchical graph partitioning algorithm that solves the constrained optimization problem for asymmetric allocation of training computations across heterogeneous GPUs while controlling communication and synchronization overheads.
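To make the balancing idea concrete, here is a minimal sketch of one coarse level of such a scheme: grouping mixed GPUs so that each group has roughly equal aggregate compute. It is not the paper's algorithm. It substitutes a plain greedy longest-processing-time heuristic for graph partitioning, and it ignores the communication and memory constraints that the paper's formulation includes. The function name, GPU names, and relative speeds are all hypothetical.

```python
from typing import Dict, List


def balance_groups(gpu_speed: Dict[str, float], num_groups: int) -> List[List[str]]:
    """Greedy LPT heuristic: visit GPUs fastest first and place each one
    into the group whose total speed is currently the lowest."""
    groups: List[List[str]] = [[] for _ in range(num_groups)]
    totals = [0.0] * num_groups
    for name, speed in sorted(gpu_speed.items(), key=lambda kv: -kv[1]):
        weakest = totals.index(min(totals))
        groups[weakest].append(name)
        totals[weakest] += speed
    return groups


# Hypothetical cluster: relative speeds, not measured values.
cluster = {"fast_0": 4.0, "fast_1": 4.0, "mid_0": 2.0,
           "mid_1": 2.0, "slow_0": 1.0, "slow_1": 1.0}
groups = balance_groups(cluster, num_groups=2)
group_totals = [sum(cluster[g] for g in grp) for grp in groups]
print(groups, group_totals)
```

A real planner would also have to weigh link bandwidth between the GPUs placed in the same group, which is where a graph formulation earns its keep.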
If this is right
- LLM training becomes possible on clusters containing mixed GPU generations or vendors without requiring replacement of the entire set.
- Total cluster utilization rises because every GPU receives a share of work proportional to its speed rather than being limited by the slowest device.
- The same allocation method produces consistent speedups across model scales from 7B to 30B parameters.
- Heterogeneous clusters can deliver training throughput within the range of homogeneous clusters that have identical aggregate floating-point capacity.
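The utilization point in the list above can be illustrated with a small data-parallel example: splitting a global batch in proportion to device speed versus splitting it evenly. This is a sketch under simplifying assumptions (step time is purely compute-bound and linear in sample count), with hypothetical function names and relative speeds. It is not taken from the paper.

```python
from typing import List


def proportional_split(total: int, speeds: List[float]) -> List[int]:
    """Split `total` samples in proportion to `speeds`, using the
    largest-remainder method so the integer shares sum to `total`."""
    raw = [total * s / sum(speeds) for s in speeds]
    shares = [int(r) for r in raw]
    leftover = total - sum(shares)
    by_remainder = sorted(range(len(speeds)),
                          key=lambda i: raw[i] - shares[i], reverse=True)
    for i in by_remainder[:leftover]:
        shares[i] += 1
    return shares


def step_time(shares: List[int], speeds: List[float]) -> float:
    """A synchronous step finishes when the slowest worker finishes."""
    return max(n / s for n, s in zip(shares, speeds))


speeds = [4.0, 2.0, 1.0, 1.0]      # hypothetical relative speeds
uniform = [16, 16, 16, 16]         # even split of 64 samples
balanced = proportional_split(64, speeds)
print(balanced, step_time(uniform, speeds), step_time(balanced, speeds))
```

Under these assumptions the even split is gated by the slowest device, while the proportional split lets all devices finish together.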
Where Pith is reading between the lines
- Operators could incrementally add newer GPUs to an existing cluster and still obtain near-linear scaling without re-purchasing the older units.
- The optimization formulation might be adapted to other distributed workloads such as distributed inference or scientific simulation codes that already use multiple parallelism styles.
- Dynamic re-partitioning could be added later so the system reacts automatically when GPUs are added, removed, or experience thermal throttling.
Load-bearing premise
The hierarchical graph partitioning algorithm can solve the allocation problem fast enough and with low enough communication overhead that the gains from better compute distribution are not erased by extra synchronization costs.
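The trade-off in this premise can be written as a one-line break-even model. The sketch below assumes step time is simply balanced compute time plus any extra synchronization time introduced by asymmetric partitioning. The function names and numbers are hypothetical and nothing here is measured from HexiScale.

```python
def net_speedup(baseline_step_s: float, balanced_compute_s: float,
                extra_sync_s: float) -> float:
    """Speedup over the baseline once added synchronization time is counted."""
    return baseline_step_s / (balanced_compute_s + extra_sync_s)


def break_even_sync(baseline_step_s: float, balanced_compute_s: float) -> float:
    """Extra synchronization time at which the balancing gain is fully erased."""
    return baseline_step_s - balanced_compute_s


# Hypothetical step times in seconds.
print(net_speedup(16.0, 8.0, 2.0))    # gain survives a small overhead
print(break_even_sync(16.0, 8.0))     # overhead budget before the gain is gone
```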
What would settle it
Measure end-to-end training throughput of HexiScale on a heterogeneous GPU cluster against a homogeneous cluster whose total theoretical FLOPS match; if throughput falls materially below the homogeneous case or fails to exceed other heterogeneous systems by at least 1.5×, the central performance claims do not hold.
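The test above can be phrased as a simple check on three measured throughputs. In this sketch the 1.5× threshold comes from the text, but the 10% parity tolerance is an assumption standing in for "materially below", and the function name and sample numbers are hypothetical.

```python
def central_claims_hold(hexiscale_tps: float,
                        homogeneous_tps: float,
                        hetero_baseline_tps: float,
                        parity_tolerance: float = 0.10,  # assumed, not from the paper
                        min_speedup: float = 1.5) -> bool:
    """Return True only if both headline claims survive the measurement."""
    near_parity = hexiscale_tps >= (1.0 - parity_tolerance) * homogeneous_tps
    beats_baseline = hexiscale_tps >= min_speedup * hetero_baseline_tps
    return near_parity and beats_baseline


# Hypothetical tokens-per-second figures, for illustration only.
print(central_claims_hold(950.0, 1000.0, 500.0))
print(central_claims_hold(800.0, 1000.0, 500.0))
```

The homogeneous cluster in such a comparison must be matched on total theoretical FLOPS for the parity check to mean anything.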
Original abstract
Training large language models (LLMs) is a computationally intensive task, which is typically conducted in data centers with homogeneous high-performance GPUs. In this paper, we explore an alternative approach by deploying training computations across heterogeneous GPUs to enable better flexibility and efficiency for heterogeneous resource utilization. Toward this end, we propose a novel system, HexiScale, that can flexibly support asymmetric partition of training computations in the scope of data-, pipeline-, and tensor model parallelism. We further formalize the allocation of asymmetric partitioned training computations over a set of heterogeneous GPUs as a constrained optimization problem and propose an efficient hierarchical graph partitioning algorithm. Our approach effectively allocates training computations across heterogeneous GPUs, fully leveraging the available computational power. We compare the performance of HexiScale with state-of-the-art homogeneous and heterogeneous training systems. When training LLMs at different scales (from 7B to 30B), empirical results demonstrate that: (i) compared to state-of-the-art homogeneous baselines running over homogeneous GPUs, HexiScale achieves similar performance when running over heterogeneous GPUs with the same theoretical FLOPS; (ii) compared to state-of-the-art heterogeneous baselines running on the same heterogeneous clusters, HexiScale delivers $1.5\times$ to $2.4\times$ higher throughput.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes HexiScale, a system for LLM training on heterogeneous GPUs that supports asymmetric partitioning across data, pipeline, and tensor parallelism. It formalizes the allocation as a constrained optimization problem and solves it with a hierarchical graph partitioning algorithm. For 7B–30B models, it claims performance comparable to homogeneous baselines on equal-FLOPS heterogeneous hardware and 1.5×–2.4× higher throughput than state-of-the-art heterogeneous baselines on the same clusters.
Significance. If the empirical claims hold, the work could meaningfully advance distributed training by enabling efficient use of mixed GPU resources, reducing reliance on uniform high-end clusters. The hierarchical partitioning approach addresses a practical optimization challenge in heterogeneous settings.
major comments (1)
- [Abstract] The central claims of comparable performance to homogeneous baselines and 1.5×–2.4× throughput gains over heterogeneous baselines are stated without any visible implementation details, benchmark configurations, error bars, ablation studies, or experimental methodology. These details are load-bearing for verifying the throughput results.
Simulated Author's Rebuttal
We thank the referee for the review and the opportunity to respond. We address the single major comment below.
- Referee: [Abstract] The central claims of comparable performance to homogeneous baselines and 1.5×–2.4× throughput gains over heterogeneous baselines are stated without any visible implementation details, benchmark configurations, error bars, ablation studies, or experimental methodology. These details are load-bearing for verifying the throughput results.
  Authors: We agree that the abstract itself is a concise summary and does not contain implementation details, benchmark configurations, error bars, ablation studies, or experimental methodology; this is by design given length constraints. The full manuscript provides these elements in Section 3 (system architecture and asymmetric parallelism), Section 4 (hierarchical graph partitioning algorithm and optimization formulation), and especially Section 5 (evaluation), which details the hardware clusters, model sizes (7B–30B), baseline systems, throughput measurements with error bars, ablation studies on partitioning strategies, and full experimental methodology. The abstract claims are therefore supported by the body of the paper rather than standing alone.
  Revision: no
Circularity Check
No significant circularity; claims are empirical comparisons
The paper presents HexiScale as a system that formalizes asymmetric allocation as a constrained optimization problem solved by a hierarchical graph partitioning algorithm, then reports empirical throughput measurements against homogeneous and heterogeneous baselines for 7B-30B models. No equations, fitted parameters, or predictions are described that reduce to inputs by construction. No self-citations are invoked as load-bearing uniqueness theorems. The central claims rest on direct experimental comparisons rather than any derivation chain that could exhibit circularity. The provided text contains no internal reductions of the form 'prediction equals fit by definition.'
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Asymmetric partitioning of data, pipeline, and tensor parallelism can be performed without prohibitive communication costs on heterogeneous GPUs.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean, reality_from_one_distinction (unclear)
  unclear: Relation between the paper passage and the cited Recognition theorem.
  "We formalize the allocation of asymmetric partitioned training computations over a set of heterogeneous GPUs as a constrained optimization problem and propose an efficient hierarchical graph partitioning algorithm."
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (unclear)
  unclear: Relation between the paper passage and the cited Recognition theorem.
  "HexiScale achieves comparable MFU when running over heterogeneous GPUs compared to state-of-the-art training systems running over homogeneous high-performance GPUs with the same total peak FLOPS."
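For readers unfamiliar with the metric in the passage just quoted, model FLOPS utilization (MFU) is the ratio of useful training FLOPs per second to the cluster's total peak FLOPs per second. The sketch below uses the common approximation of 6 FLOPs per parameter per token for a dense transformer's forward and backward pass, which ignores attention FLOPs and any recomputation. The function name and all numbers are hypothetical.

```python
from typing import List


def mfu(num_params: float, tokens_per_second: float,
        peak_flops_per_gpu: List[float]) -> float:
    """Approximate MFU for dense-transformer training on a mixed cluster.

    Uses ~6 FLOPs per parameter per token (forward plus backward) and
    divides by the summed peak FLOPs/s of every GPU in the cluster.
    """
    achieved_flops_per_s = 6.0 * num_params * tokens_per_second
    return achieved_flops_per_s / sum(peak_flops_per_gpu)


# Hypothetical: a 7B-parameter model at 10,000 tokens/s on two GPUs
# with an assumed peak of 1e15 FLOPs/s each.
print(mfu(7e9, 10_000, [1e15, 1e15]))
```

Because the denominator is a sum over devices, the same formula applies to homogeneous and heterogeneous clusters, which is what makes an equal-peak-FLOPS comparison possible.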
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 4 Pith papers
- Autopoiesis: A Self-Evolving System Paradigm for LLM Serving Under Runtime Dynamics
  Autopoiesis uses LLM-driven program synthesis to evolve serving policies online during deployment, delivering up to 53% and average 34% gains over prior LLM serving systems under runtime dynamics.
- HexiSeq: Accommodating Long Context Training of LLMs over Heterogeneous Hardware
  HexiSeq optimizes sequence and head partitioning across mixed GPUs to improve long-context LLM training throughput by up to 1.72x in simulations.
- HARP: Orchestrating Automated Parallel Training on Heterogeneous GPU Clusters
  HARP provides a fine-grained inter-operator parallel planner and a heterogeneity-aware 1F1B scheduler that together improve training throughput by 1.3x-1.6x on mixed GPU clusters compared with current homogeneous-orie...
- Rethinking Data Curation in LLM Training: Online Reweighting Offers Better Generalization than Offline Methods
  ADAPT is an online reweighting framework for LLM training that outperforms offline data selection and mixing methods in cross-benchmark generalization under equal compute.