HARP: Orchestrating Automated Parallel Training on Heterogeneous GPU Clusters
Pith reviewed 2026-05-18 12:18 UTC · model grok-4.3
The pith
Harp automates parallel training for heterogeneous GPU clusters to deliver 1.3x-1.6x higher performance than current frameworks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Harp introduces a fine-grained planner that efficiently searches a wide space for the inter-operator parallel strategy, enabling it to alleviate communication overheads while maintaining balanced loads across heterogeneous accelerators. In addition, Harp implements a heterogeneity-aware 1F1B scheduler that adaptively adjusts the execution timing and ordering of microbatches based on network characteristics, maximizing computation-communication overlap under cross-cluster interconnects while incurring only minimal memory overhead.
What carries the argument
The fine-grained planner searching inter-operator parallel strategies combined with the heterogeneity-aware 1F1B scheduler for adaptive microbatch timing.
If this is right
- Training performance improves by 1.3x to 1.6x on heterogeneous setups compared to state-of-the-art frameworks.
- Communication overheads are reduced through better parallel strategy selection.
- Loads are balanced across accelerators of different capabilities.
- Computation and communication overlap is maximized with low memory cost.
Where Pith is reading between the lines
- Similar planning and scheduling ideas could apply to other heterogeneous computing environments like those mixing CPUs and GPUs.
- The scheduler's adaptability suggests benefits in dynamic environments where network conditions change during training.
- Future systems might use the same search technique to handle additional forms of parallelism automatically.
Load-bearing premise
The fine-grained planner can efficiently search the inter-operator parallel strategy space and the heterogeneity-aware scheduler can adapt execution timing without incurring prohibitive search or memory overhead on realistic heterogeneous clusters.
What would settle it
A benchmark on a real heterogeneous GPU cluster where Harp shows no speedup or incurs high planning and memory costs compared to baseline frameworks would challenge the central claims.
Figures
read the original abstract
With the rapid evolution of GPU architectures, the heterogeneity of model training infrastructures is steadily increasing. In such environments, effectively utilizing all available heterogeneous accelerators becomes critical for distributed model training. However, existing frameworks, which are primarily designed for homogeneous clusters, often exhibit significant resource underutilization when deployed on heterogeneous accelerators and networks. In this paper, we present Harp, an automated parallel training framework designed specifically for heterogeneous clusters. Harp introduces a fine-grained planner that efficiently searches a wide space for the inter-operator parallel strategy, enabling Harp to alleviate communication overheads while maintaining balanced loads across heterogeneous accelerators. In addition, Harp implements a heterogeneity-aware 1F1B scheduler that adaptively adjusts the execution timing and ordering of microbatches based on network characteristics, maximizing computation-communication overlap under cross-cluster interconnects while incurring only minimal memory overhead. Our evaluation results show that Harp can deliver 1.3x-1.6x higher performance on heterogeneous clusters than state-of-the-art training frameworks.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents HARP, an automated parallel training framework for heterogeneous GPU clusters. It proposes a fine-grained planner for searching inter-operator parallel strategies to alleviate communication overheads and balance loads across heterogeneous accelerators, along with a heterogeneity-aware 1F1B scheduler that adaptively adjusts microbatch execution timing and ordering to maximize computation-communication overlap with minimal memory overhead. The central empirical claim is that HARP delivers 1.3x-1.6x higher performance than state-of-the-art frameworks on heterogeneous clusters.
Significance. If the performance claims hold under rigorous evaluation, this work would be significant for the field of distributed deep learning systems. As GPU clusters become more heterogeneous due to evolving architectures, frameworks that efficiently utilize mixed hardware can lead to substantial improvements in training efficiency and resource utilization. The focus on automated planning and adaptive scheduling addresses a practical pain point, and if the overheads are indeed minimal, it could influence the design of future training systems.
major comments (2)
- [Evaluation] Evaluation section: The reported 1.3x-1.6x speedups are presented without details on cluster configurations (GPU types/counts, heterogeneity degree, network topologies), baseline implementations, statistical significance, or results across varying model sizes. These omissions are load-bearing for the central performance claim, as the speedups cannot be assessed for robustness or generalizability without them.
- [Planner and Scheduler] Planner and scheduler sections: The fine-grained planner is asserted to efficiently enumerate inter-operator strategies and the 1F1B scheduler to adapt without prohibitive overhead, but no concrete bounds or measurements are supplied (e.g., planner runtime vs. operator count or heterogeneity degree; peak memory delta vs. standard 1F1B). If search costs scale combinatorially or adaptation inflates memory, the net speedup disappears, directly undermining the headline result.
minor comments (2)
- [Abstract] Abstract: 'State-of-the-art training frameworks' is not named; explicitly listing baselines (e.g., Megatron-LM, DeepSpeed) would improve clarity.
- [Introduction] Notation: Inter-operator parallel strategy terms could be defined earlier with a small example to aid readers.
Simulated Author's Rebuttal
We thank the referee for their detailed and constructive review of our manuscript. We address each of the major comments below and outline the revisions we plan to make to address the concerns raised.
read point-by-point responses
-
Referee: [Evaluation] Evaluation section: The reported 1.3x-1.6x speedups are presented without details on cluster configurations (GPU types/counts, heterogeneity degree, network topologies), baseline implementations, statistical significance, or results across varying model sizes. These omissions are load-bearing for the central performance claim, as the speedups cannot be assessed for robustness or generalizability without them.
Authors: We agree that providing these details is essential for validating the performance claims. Accordingly, we will revise the Evaluation section to include comprehensive information on the cluster setups, including specific GPU models and quantities used in our heterogeneous testbeds, the degree of heterogeneity, and network configurations. We will also detail the baseline implementations, report results with statistical measures such as means and standard deviations from repeated experiments, and present performance data for a broader range of model sizes. These changes will be incorporated in the next version of the manuscript. revision: yes
-
Referee: [Planner and Scheduler] Planner and scheduler sections: The fine-grained planner is asserted to efficiently enumerate inter-operator strategies and the 1F1B scheduler to adapt without prohibitive overhead, but no concrete bounds or measurements are supplied (e.g., planner runtime vs. operator count or heterogeneity degree; peak memory delta vs. standard 1F1B). If search costs scale combinatorially or adaptation inflates memory, the net speedup disappears, directly undermining the headline result.
Authors: We recognize the need for explicit measurements to demonstrate that the overheads remain low. In the revised manuscript, we will add quantitative results on the planner's runtime scaling with respect to the number of operators and the level of heterogeneity. Additionally, we will include comparisons of peak memory consumption between our heterogeneity-aware 1F1B scheduler and the standard 1F1B approach. These measurements will show that the overheads are minimal and do not offset the achieved speedups. We will also provide any relevant theoretical analysis on the search efficiency. revision: yes
Circularity Check
No circularity: empirical system claims rest on direct benchmarking
full rationale
The paper presents an engineering framework (Harp) whose central claims are performance speedups measured on heterogeneous GPU clusters. These results are obtained by running the implemented planner and scheduler against baselines on concrete workloads, not by any closed mathematical derivation, fitted parameter renamed as prediction, or self-citation chain. No equations appear that would reduce the reported 1.3x-1.6x gains to quantities defined from the same runs; the planner's search and the 1F1B scheduler's adaptation are algorithmic descriptions whose overheads are asserted to be low and then validated empirically. The derivation chain is therefore self-contained against external benchmarks and contains none of the enumerated circularity patterns.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al . 2023. Gpt-4 technical report. arXiv preprint arXiv:2303.08774(2023)
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[2]
Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al . 2020. Language models are few-shot learners.Advances in neural information processing systems33 (2020), 1877–1901
work page 2020
-
[3]
Tri Dao. 2024. FlashAttention-2: Faster Attention with Better Paral- lelism and Work Partitioning. InInternational Conference on Learning Representations (ICLR)
work page 2024
-
[4]
Fu, Stefano Ermon, Atri Rudra, and Christopher Ré
Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, and Christopher Ré
-
[5]
InAdvances in Neural Information Processing Systems (NeurIPS)
FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness. InAdvances in Neural Information Processing Systems (NeurIPS)
-
[6]
Roy Frostig, Matthew James Johnson, and Chris Leary. 2018. Compiling machine learning programs via high-level tracing.Systems for Machine Learning4, 9 (2018)
work page 2018
-
[7]
Yanping Huang, Youlong Cheng, Ankur Bapna, Orhan Firat, Dehao Chen, Mia Chen, HyoukJoong Lee, Jiquan Ngiam, Quoc V Le, Yonghui Wu, et al. 2019. Gpipe: Efficient training of giant neural networks using pipeline parallelism.Advances in neural information processing systems32 (2019)
work page 2019
-
[8]
Vijay Anand Korthikanti, Jared Casper, Sangkug Lym, Lawrence McAfee, Michael Andersch, Mohammad Shoeybi, and Bryan Catan- zaro. 2023. Reducing activation recomputation in large transformer models.Proceedings of Machine Learning and Systems5 (2023)
work page 2023
- [9]
-
[10]
Shigang Li and Torsten Hoefler. 2021. Chimera: efficiently training large-scale neural networks with bidirectional pipelines. InProceed- ings of the International Conference for High Performance Computing, Networking, Storage and Analysis. 1–14
work page 2021
-
[11]
Lianmin Zheng. 2022. Github repository: alpa-projects/alpa.https: //github.com/alpa-projects/alpa, Last accessed on 2025-01-05
work page 2022
-
[12]
Philipp Moritz, Robert Nishihara, Stephanie Wang, Alexey Tumanov, Richard Liaw, Eric Liang, Melih Elibol, Zongheng Yang, William Paul, Michael I Jordan, et al. 2018. Ray: A distributed framework for emerg- ing {AI} applications. In13th USENIX symposium on operating systems design and implementation (OSDI 18). 561–577
work page 2018
-
[13]
Deepak Narayanan, Aaron Harlap, Amar Phanishayee, Vivek Seshadri, Nikhil R Devanur, Gregory R Ganger, Phillip B Gibbons, and Matei Zaharia. 2019. PipeDream: generalized pipeline parallelism for DNN training. InProceedings of the 27th ACM symposium on operating sys- tems principles. 1–15
work page 2019
-
[14]
Deepak Narayanan, Amar Phanishayee, Kaiyu Shi, Xie Chen, and Matei Zaharia. 2021. Memory-efficient pipeline-parallel dnn training. InInternational Conference on Machine Learning. PMLR, 7937–7947
work page 2021
-
[15]
Deepak Narayanan, Mohammad Shoeybi, Jared Casper, Patrick LeGres- ley, Mostofa Patwary, Vijay Korthikanti, Dmitri Vainbrand, Prethvi Kashinkunti, Julie Bernauer, Bryan Catanzaro, et al. 2021. Efficient large-scale language model training on gpu clusters using megatron- lm. InProceedings of the International Conference for High Performance Computing, Netwo...
work page 2021
-
[16]
Nvidia. 2025. Hopper Architecture.https://www.nvidia.cn/data- center/technologies/hopper-architecture/, Last accessed on 2025-09- 21
work page 2025
-
[17]
Nvidia. 2025. Hopper Architecture.https://www.nvidia.cn/data- center/technologies/blackwell-architecture/, Last accessed on 2025- 09-21
work page 2025
-
[18]
Nvidia. 2025. Nvida SuperNIC.https://www.nvidia.cn/networking/ products/ethernet/supernic/, Last accessed on 2025-09-21
work page 2025
-
[19]
Nvidia. 2025. The nvidia collective communication library.https: //github.com/openxla/xla, Last accessed on 2025-09-21
work page 2025
- [20]
-
[21]
Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever, et al
-
[22]
Improving language understanding by generative pre-training. (2018)
work page 2018
-
[23]
Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. 2019. Language models are unsupervised multitask learners.OpenAI blog1, 8 (2019), 9. 13 Preprint, September 2025, Antian Liang, Zhigang Zhao, Kai Zhang, Xuri Shi, Chuantao Li, Chunxiao Wang, Zhenying He, Yinan Jing, and X. Sean Wang
work page 2019
-
[26]
Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He
-
[27]
InSC20: International Conference for High Performance Computing, Networking, Storage and Analysis
Zero: Memory optimizations toward training trillion param- eter models. InSC20: International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, 1–16
-
[28]
Max Ryabinin, Tim Dettmers, Michael Diskin, and Alexander Borzunov
-
[29]
InInternational Conference on Machine Learn- ing
Swarm parallelism: Training large models can be surprisingly communication-efficient. InInternational Conference on Machine Learn- ing. PMLR, 29416–29440
-
[30]
ShanHe Team. 2025. ShanHe SuperComputing Platform.https://www. shanhe.com/, Last accessed on 2025-09-21
work page 2025
-
[31]
Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. 2019. Megatron-lm: Training multi- billion parameter language models using model parallelism.arXiv preprint arXiv:1909.08053(2019)
work page internal anchor Pith review Pith/arXiv arXiv 2019
-
[32]
Taegeon Um, Byungsoo Oh, Minyoung Kang, Woo-Yeon Lee, Goeun Kim, Dongseob Kim, Youngtaek Kim, Mohd Muzzammil, and Myeong- jae Jeon. 2024. Metis: Fast Automatic Distributed Training on Het- erogeneous {GPUs }. In2024 USENIX Annual Technical Conference (USENIX ATC 24). 563–578
work page 2024
- [33]
-
[34]
XLA and TensorFlow teams. 2017. XLA — TensorFlow, com- piled.https://tensorflow.google.cn/xla?hl=zh-cn#inspect_compiled_ programs, Last accessed on 2024-05-07
work page 2017
-
[35]
Si Xu, Zixiao Huang, Yan Zeng, Shengen Yan, Xuefei Ning, Haolin Ye, Sipei Gu, Chunsheng Shui, Zhezheng Lin, Hao Zhang, et al. 2024. HetHub: A Heterogeneous distributed hybrid training system for large- scale models.arXiv e-prints(2024), arXiv–2405
work page 2024
-
[36]
Yuanzhong Xu, HyoukJoong Lee, Dehao Chen, Blake Hechtman, Yan- ping Huang, Rahul Joshi, Maxim Krikun, Dmitry Lepikhin, Andy Ly, Marcello Maggioni, et al. 2021. GSPMD: general and scalable paral- lelization for ML computation graphs.arXiv preprint arXiv:2105.04663 (2021)
work page internal anchor Pith review arXiv 2021
-
[37]
Ran Yan, Youhe Jiang, Xiaonan Nie, Fangcheng Fu, Bin Cui, and Binhang Yuan. 2024. HexiScale: Accommodating Large Language Model Training over Heterogeneous Environment.arXiv preprint arXiv:2409.01143(2024)
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[38]
Binhang Yuan, Yongjun He, Jared Davis, Tianyi Zhang, Tri Dao, Beidi Chen, Percy S Liang, Christopher Re, and Ce Zhang. 2022. Decentral- ized training of foundation models in heterogeneous environments. Advances in Neural Information Processing Systems35 (2022), 25464– 25477
work page 2022
-
[39]
Jinghui Zhang, Geng Niu, Qiangsheng Dai, Haorui Li, Zhihua Wu, Fang Dong, and Zhiang Wu. 2023. PipePar: Enabling fast DNN pipeline parallel training in heterogeneous GPU clusters.Neurocomputing555 (2023), 126661
work page 2023
-
[40]
Shiwei Zhang, Lansong Diao, Chuan Wu, Zongyan Cao, Siyu Wang, and Wei Lin. 2024. HAP: SPMD DNN Training on Heterogeneous GPU Clusters with Automated Program Synthesis. InProceedings of the Nineteenth European Conference on Computer Systems. 524–541
work page 2024
- [41]
-
[42]
Lianmin Zheng, Zhuohan Li, Hao Zhang, Yonghao Zhuang, Zhifeng Chen, Yanping Huang, Yida Wang, Yuanzhong Xu, Danyang Zhuo, Eric P Xing, et al. 2022. Alpa: Automating inter-and {Intra-Operator} parallelism for distributed deep learning. In16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22). 559–578
work page 2022
-
[43]
Yonghao Zhuang, Lianmin Zheng, Zhuohan Li, Eric Xing, Qirong Ho, Joseph Gonzalez, Ion Stoica, Hao Zhang, and Hexu Zhao. 2023. On optimizing the communication of model parallelism.Proceedings of Machine Learning and Systems5 (2023). 14
work page 2023
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.