Recognition: no theorem link
ResiHP: Taming LLM Training Failures with Dynamic Hybrid Parallelism
Pith reviewed 2026-05-12 03:57 UTC · model grok-4.3
The pith
ResiHP detects genuine hardware failures during hybrid-parallel LLM training by predicting workload-induced iteration-time variation, then dynamically resizes parallelism groups to keep throughput high.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
ResiHP enables robust failure detection and fine-grained adaptation for hybrid parallel training. It employs a workload-aware execution time predictor that disentangles failures from iteration time fluctuations while remaining lightweight for online detection. The Scheduler dynamically adapts parallelism group sizes, model partitioning, and workload scheduling policies to improve training efficiency under failures.
What carries the argument
The workload-aware execution time predictor, which forecasts expected iteration times from current data properties to flag only true hardware slowdowns, paired with the Scheduler that reconfigures hybrid parallelism groups, partitions, and assignment rules in response.
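The predict-then-compare idea behind the Detector can be made concrete. Below is a minimal sketch, not the paper's implementation: it fits iteration time as a linear function of token count over a sliding window and flags an iteration only when it is slow relative to its workload-adjusted prediction. The linear model form, window size, and 1.25× threshold are all illustrative assumptions.

```python
from collections import deque

class WorkloadAwareDetector:
    """Sketch of a workload-aware fail-slow detector (illustrative, not the paper's code).

    Assumption: iteration time is roughly linear in the number of tokens
    processed, so a running least-squares fit of time vs. tokens predicts
    the expected time; only large positive residuals are flagged.
    """

    def __init__(self, window=64, threshold=1.25):
        self.window = deque(maxlen=window)  # (tokens, seconds) history
        self.threshold = threshold          # flag if observed > threshold * predicted

    def _fit(self):
        # Closed-form simple linear regression over the history window.
        n = len(self.window)
        sx = sum(t for t, _ in self.window)
        sy = sum(s for _, s in self.window)
        sxx = sum(t * t for t, _ in self.window)
        sxy = sum(t * s for t, s in self.window)
        denom = n * sxx - sx * sx
        if denom == 0:
            return 0.0, sy / n  # all token counts equal: fall back to mean
        slope = (n * sxy - sx * sy) / denom
        intercept = (sy - slope * sx) / n
        return slope, intercept

    def observe(self, tokens, seconds):
        """Return True if this iteration looks like a genuine slowdown."""
        if len(self.window) >= 8:  # need some history before judging
            slope, intercept = self._fit()
            predicted = slope * tokens + intercept
            if predicted > 0 and seconds > self.threshold * predicted:
                return True  # slow even after accounting for workload size
        self.window.append((tokens, seconds))  # healthy sample: extend history
        return False
```

Note the detector compares against a per-iteration prediction, not a fixed mean, so long sequences alone never trip it; a naive moving-average threshold would.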
If this is right
- Training jobs continue at high speed even when some GPUs slow down, instead of waiting for the slowest device.
- Overhead from repeated failure checks drops because only genuine problems trigger the scheduler.
- Hybrid parallelism configurations stay balanced across devices after a failure instead of becoming permanently skewed.
- Overall cluster utilization rises because adaptations happen at the level of groups and partitions rather than whole restarts.
Where Pith is reading between the lines
- The same predictor-plus-scheduler pattern could be tested on data-parallel or pipeline-parallel training to see whether the separation of fluctuation from failure generalizes.
- Lower failure overhead might let teams run longer continuous jobs, reducing the frequency of checkpointing and recovery steps.
- Because the detector stays lightweight, it could be added to existing training frameworks without large changes to the core loop.
Load-bearing premise
The workload-aware predictor can reliably tell hardware failures apart from ordinary iteration time changes caused by sequence length differences in the training data.
What would settle it
Running the system on a known dataset with controlled sequence-length variation and no actual hardware faults, then measuring whether it still issues frequent false failure alerts that trigger unnecessary adaptations and reduce overall throughput.
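That falsification test can be prototyped in simulation before touching a cluster. The sketch below is my construction, not the paper's methodology: it models healthy hardware whose iteration time scales linearly with a heavy-tailed token count and counts how often a naive moving-average detector, with no workload awareness, raises spurious alerts. The distribution parameters and the 1.25× threshold are assumptions.

```python
import random

def simulate_false_positives(iters=2000, threshold=1.25, seed=0):
    """Toy experiment: with NO injected faults, how often does a naive
    moving-average detector fire purely from sequence-length variability?
    """
    rng = random.Random(seed)
    a, b = 0.001, 0.1  # assumed seconds-per-token slope and fixed overhead
    history = []
    naive_alerts = 0
    for _ in range(iters):
        tokens = int(rng.lognormvariate(7.0, 0.6))  # skewed sequence lengths
        seconds = a * tokens + b                    # healthy hardware, no fault
        if len(history) >= 16:
            recent_mean = sum(history[-16:]) / 16
            if seconds > threshold * recent_mean:
                naive_alerts += 1                   # spurious alert
        history.append(seconds)
    return naive_alerts / iters
```

A nonzero rate here quantifies exactly the spurious-detection problem the abstract attributes to prior systems; running a workload-aware predictor on the same trace should drive the rate toward zero.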
Original abstract
Hybrid parallelism underpins large-scale LLM training across tens of thousands of GPUs. At such scale, hardware failures on individual devices lead to performance skew across devices, diminishing overall training efficiency. Existing resilient systems overlook sequence length variability in datasets and device performance skew under hybrid parallelism. As a result, (1) iteration time fluctuations induced by sequence length variability can trigger spurious fail-slow detections, and (2) failures are mitigated through individual adaptations in hybrid parallelism, leading to unnecessary detection overhead and inefficient resilient training. To respond, this paper presents ResiHP, a resilient system that enables robust failure detection and fine-grained adaptation for hybrid parallel training. First, we develop a Detector to accurately identify failures. In particular, it employs a workload-aware execution time predictor that disentangles failures from iteration time fluctuations while remaining lightweight for online detection. Second, we design a Scheduler that dynamically adapts parallelism group sizes, model partitioning, and workload scheduling policies to improve training efficiency under failures. Experiments show that ResiHP improves training throughput by 1.04–4.39× compared with state-of-the-art resilient training systems under diverse failure scenarios in a 256-GPU cluster.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces ResiHP, a resilient system for hybrid-parallel LLM training at scale. It proposes a Detector component that uses a workload-aware execution time predictor to distinguish genuine hardware failures from iteration-time variance caused by sequence-length variability, paired with a Scheduler that dynamically adjusts parallelism group sizes, model partitioning, and workload policies. Experiments on a 256-GPU cluster are reported to yield 1.04–4.39× throughput gains versus prior resilient training systems under diverse failure scenarios.
Significance. If the empirical claims hold after proper validation, ResiHP would address a practical bottleneck in large-scale LLM training by reducing spurious fail-slow detections and enabling fine-grained, low-overhead recovery under hybrid parallelism. The combination of lightweight online prediction and dynamic adaptation could improve cluster utilization in production environments where failures are common.
major comments (2)
- [Detector / workload-aware predictor] The central claim that the workload-aware execution time predictor reliably separates hardware failures from sequence-length-induced variance (while remaining lightweight for online use) lacks any quantitative support. No model form, feature set, prediction-error distribution, false-positive rate under realistic length distributions, or measured overhead appears in the Detector description; without these, it is impossible to verify that the predictor solves the spurious-detection problem that the abstract attributes to prior systems.
- [Experiments / evaluation] The reported 1.04–4.39× throughput improvements are presented without essential experimental details: the specific baselines, failure-injection methodology, number of trials, statistical significance tests, workload characteristics (model size, dataset sequence-length distribution), or cluster configuration parameters. These omissions make the quantitative results unverifiable and prevent assessment of whether the gains are attributable to the proposed techniques.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback on our manuscript. We appreciate the opportunity to clarify the technical contributions and strengthen the presentation. Below we provide point-by-point responses to the major comments. We will revise the manuscript to incorporate the requested details.
Point-by-point responses
- Referee: [Detector / workload-aware predictor] The central claim that the workload-aware execution time predictor reliably separates hardware failures from sequence-length-induced variance (while remaining lightweight for online use) lacks any quantitative support. No model form, feature set, prediction-error distribution, false-positive rate under realistic length distributions, or measured overhead appears in the Detector description; without these, it is impossible to verify that the predictor solves the spurious-detection problem that the abstract attributes to prior systems.
Authors: We agree that the current high-level description of the Detector does not provide sufficient quantitative evidence. In the revised manuscript we will expand Section 3.2 with the exact model form (a lightweight online linear regressor), the feature set (per-iteration sequence-length statistics plus a short history of execution times), the observed prediction-error distribution, false-positive rates measured on realistic sequence-length distributions drawn from the C4 dataset, and the measured online overhead (less than 1 % of iteration time on the target hardware). These additions will allow readers to verify that the predictor successfully reduces spurious fail-slow detections while remaining suitable for online use. revision: yes
- Referee: [Experiments / evaluation] The reported 1.04–4.39× throughput improvements are presented without essential experimental details: the specific baselines, failure-injection methodology, number of trials, statistical significance tests, workload characteristics (model size, dataset sequence-length distribution), or cluster configuration parameters. These omissions make the quantitative results unverifiable and prevent assessment of whether the gains are attributable to the proposed techniques.
Authors: We acknowledge that the Evaluation section currently omits several reproducibility details. In the revision we will add a dedicated subsection that specifies: the exact baseline systems and their configurations, the failure-injection methodology (including single- and multi-GPU failure patterns and rates), the number of independent trials per scenario together with statistical significance testing, the workload characteristics (model sizes, sequence-length distribution statistics from the training corpus), and the full 256-GPU cluster configuration (GPU type, interconnect, and software stack). These additions will make the reported throughput gains verifiable and will clarify their attribution to ResiHP’s techniques. revision: yes
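The statistical significance testing promised in this response could take many forms; one simple, distribution-free option is a bootstrap confidence interval on the speedup ratio across independent trials. The sketch below is a generic illustration with hypothetical throughput numbers, not the paper's actual evaluation procedure.

```python
import random

def speedup_confidence_interval(baseline, system, n_boot=10000, seed=0):
    """Paired bootstrap 95% CI for the speedup (ratio of mean throughputs).

    `baseline` and `system` are per-trial throughput lists for the same
    scenario (hypothetical data); trials are resampled in pairs.
    """
    assert len(baseline) == len(system)
    rng = random.Random(seed)
    n = len(baseline)
    ratios = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # resample trial indices
        b = sum(baseline[i] for i in idx) / n
        s = sum(system[i] for i in idx) / n
        ratios.append(s / b)
    ratios.sort()
    return ratios[int(0.025 * n_boot)], ratios[int(0.975 * n_boot)]
```

If the lower end of the interval stays above 1.0 for every failure scenario, the claimed gains would be statistically defensible rather than single-run artifacts.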
Circularity Check
No circularity: claims rest on system design and empirical measurements
Full rationale
The paper describes an engineering system (ResiHP) consisting of a Detector with a workload-aware execution time predictor and a Scheduler for dynamic hybrid parallelism adaptations. Central claims concern measured throughput gains (1.04–4.39×) under injected failures in a 256-GPU cluster. No mathematical derivation chain, first-principles result, or prediction is presented that reduces by construction to its own inputs, fitted parameters, or self-citations. The predictor is introduced as a design component to address a stated limitation of prior systems; its accuracy is not presupposed but is part of the evaluated artifact. The claims therefore rest on empirical measurements against external baselines rather than on circular reasoning.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption Hardware failures produce detectable performance skew under hybrid parallelism that can be separated from normal iteration-time variation.
- domain assumption Dynamic changes to parallelism group sizes, model partitioning, and scheduling policies can improve efficiency without introducing prohibitive overhead.
invented entities (2)
- ResiHP Detector: no independent evidence
- ResiHP Scheduler: no independent evidence