MegaScale-Omni: A Hyper-Scale, Workload-Resilient System for Multimodal LLM Training in Production
Pith reviewed 2026-05-12 02:00 UTC · model grok-4.3
The pith
MegaScale-Omni uses decoupled parallelism and adaptive balancing to deliver up to 7.57× higher throughput for dynamic multimodal LLM training.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
MegaScale-Omni is an industrial-grade MLLM training system built on encoder-LLM multiplexing. It applies long-short sequence parallelism to encoders for variable-length samples and full 5D parallelism to the LLM backbone under a communication-efficient layout. Unified representations support flexible colocation and a joint pipeline that adds workload resilience. Decentralized grouped reordering in data loaders together with adaptive resharding from encoder to LLM ranks handles balancing. The system is deployed for in-house tasks on thousands of GPUs and reports 1.27×–7.57× throughput gains versus four prior systems under production-grade dynamic workloads.
What carries the argument
Encoder-LLM multiplexing scheme that decouples long-short sequence parallelism for encoders from 5D parallelism for the LLM backbone, supported by unified representations, joint pipeline execution, and decentralized grouped reordering plus adaptive resharding for workload balance.
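As a hedged illustration of the decoupling idea (not the paper's implementation), encoder ranks can pick a sequence-parallel degree per sample based on its length, while the LLM backbone keeps a single fixed layout. The threshold, the maximum degree, and the data/tensor/pipeline/context/expert reading of "5D" are all assumptions here; the paper does not enumerate them in the text above.

```python
# Hypothetical sketch: per-sample sequence-parallel degree for encoders,
# chosen independently of the LLM backbone's fixed parallelization layout.
# Threshold and cap are illustrative values, not taken from the paper.

def encoder_sp_degree(sample_len, long_threshold=2048, max_sp=8):
    """Short samples run on one rank; long samples are split across
    enough ranks that each shard stays under the threshold."""
    if sample_len <= long_threshold:
        return 1
    degree = -(-sample_len // long_threshold)  # ceiling division
    return min(degree, max_sp)

# The LLM layout stays fixed regardless of encoder-side decisions.
# "5D" is assumed here to mean data/tensor/pipeline/context/expert parallelism.
LLM_LAYOUT = {"dp": 4, "tp": 8, "pp": 2, "cp": 2, "ep": 1}

# Example: a 512-token sample stays local, a 4096-token sample spans 2 ranks.
short_degree = encoder_sp_degree(512)
long_degree = encoder_sp_degree(4096)
```

The point of the sketch is only that the encoder-side decision is a pure function of the sample, so it can change batch to batch without touching `LLM_LAYOUT`.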
If this is right
- Throughput improves between 1.27 and 7.57 times under production-grade dynamic workloads compared to four existing systems.
- The system supports deployment at hyper-scale with thousands of GPUs for in-house MLLM training tasks.
- Decoupled strategies allow encoders to process variable-length samples independently of the LLM backbone's parallelization.
- Adaptive resharding and grouped reordering maintain efficiency as modality mixtures and sequence lengths change.
- Unified representations enable flexible colocation of encoders and the LLM without static coupling of their resource decisions.
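The balancing behavior the bullets describe can be sketched with a standard greedy longest-first heuristic; the paper's actual decentralized grouped-reordering algorithm is not specified in the text above, so everything below is a stand-in under that assumption.

```python
# Hypothetical sketch of workload balancing: assign variable-length samples
# to ranks, largest first, always to the currently lightest rank (a greedy
# longest-processing-time heuristic, used here as a stand-in for the paper's
# decentralized grouped reordering).
import heapq

def reorder_into_groups(sample_lengths, num_ranks):
    """Return one list of sample lengths per rank."""
    heap = [(0, rank, []) for rank in range(num_ranks)]  # (load, rank, samples)
    heapq.heapify(heap)
    for length in sorted(sample_lengths, reverse=True):
        load, rank, samples = heapq.heappop(heap)  # lightest rank so far
        samples.append(length)
        heapq.heappush(heap, (load + length, rank, samples))
    return [samples for _, _, samples in sorted(heap, key=lambda t: t[1])]

def imbalance(groups):
    """Max rank load over mean rank load; 1.0 means perfectly balanced."""
    loads = [sum(g) for g in groups]
    return max(loads) / (sum(loads) / len(loads))

lengths = [4096, 128, 2048, 256, 1024, 512, 64, 3072]
groups = reorder_into_groups(lengths, num_ranks=4)
```

On this toy input the greedy assignment lands below the imbalance of a naive round-robin split, which is the effect the grouped-reordering bullet claims at production scale.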
Where Pith is reading between the lines
- The same decoupling pattern could reduce idle compute in other training pipelines that combine separate modality-specific models with a shared backbone.
- Production clusters might adopt the decentralized balancing layer to lower the manual tuning required when data distributions shift over time.
- If the joint pipeline scales cleanly, future multimodal systems could treat encoder and LLM stages as interchangeable modules rather than fixed coupled stages.
Load-bearing premise
The tested dynamic workloads and hardware environment are representative of broader production use cases, and the specific combination of long-short sequence parallelism, 5D LLM parallelism, and decentralized balancing generalizes without major additional engineering.
What would settle it
A side-by-side throughput measurement of MegaScale-Omni against the same four baseline systems on a fresh collection of production dynamic workload traces that include different modality mixture proportions and sequence length distributions than those used in the original experiments.
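A minimal sketch of generating such fresh traces follows; the modality names, mixture proportions, length ranges, and batch size are all illustrative placeholders, none taken from the paper.

```python
# Hypothetical workload-trace generator for a replication test: batches of
# samples drawn from a configurable modality mixture, with per-modality
# sequence-length ranges. All distribution choices are illustrative.
import random

def make_trace(num_batches, mixture, length_params, samples_per_batch=32, seed=0):
    """mixture: {modality: weight}; length_params: {modality: (lo, hi)}.
    Returns a list of batches, each a list of (modality, length) pairs."""
    rng = random.Random(seed)
    modalities = list(mixture)
    weights = [mixture[m] for m in modalities]
    trace = []
    for _ in range(num_batches):
        batch = []
        for _ in range(samples_per_batch):
            m = rng.choices(modalities, weights=weights)[0]
            lo, hi = length_params[m]
            batch.append((m, rng.randint(lo, hi)))
        trace.append(batch)
    return trace

trace = make_trace(
    num_batches=100,
    mixture={"image": 0.5, "video": 0.2, "audio": 0.1, "text": 0.2},
    length_params={"image": (256, 1024), "video": (2048, 16384),
                   "audio": (128, 512), "text": (64, 8192)},
)
```

Shifting `mixture` and `length_params` between runs would produce the "different modality mixture proportions and sequence length distributions" the settling experiment calls for; measured throughput on each trace would then be compared across systems.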
Original abstract
As the foundational component of versatile AI applications, training a multimodal large language model (MLLM) relies on multimodal datasets with dynamic modality mixture proportions and sample length distributions. However, existing MLLM systems remain inefficient under dynamic workloads, due to statically coupled decisions of resource allocation and model parallelization between encoders and the LLM backbone. This paper presents MegaScale-Omni, an industrial-grade MLLM training system tailored for dynamic workload adaptation and hyper-scale deployment. MegaScale-Omni is built upon the training scheme of encoder-LLM multiplexing with three key innovations: (1) Decoupled parallelism strategies with long-short sequence parallelism for encoders to process variable-length samples, and full-fledged 5D parallelism for the LLM backbone, both organized under a communication-efficient parallelization layout. (2) Unified encoder-LLM representations for flexible, extensible colocation, and a new paradigm of encoder-LLM joint pipeline with workload resilience. (3) Workload balancing techniques via decentralized grouped reordering in data loaders and adaptive resharding from encoder to LLM ranks. MegaScale-Omni is deployed as the foundation of our in-house large-scale MLLM training tasks with thousands of GPUs. Our experimental results demonstrate $1.27\times$–$7.57\times$ throughput improvement under production-grade dynamic workloads, as compared to four state-of-the-art systems.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents MegaScale-Omni, an industrial MLLM training system for dynamic workloads with variable modality mixtures and sequence lengths. It uses encoder-LLM multiplexing with three innovations: decoupled long-short sequence parallelism for encoders plus 5D parallelism for the LLM backbone, unified representations enabling a joint pipeline, and decentralized grouped reordering plus adaptive resharding for balancing. The system is deployed on thousands of GPUs for in-house tasks and claims 1.27×–7.57× throughput gains versus four SOTA baselines under production-grade dynamic workloads.
Significance. If the empirical results hold under representative conditions, the work would be significant for production-scale MLLM training by addressing static coupling inefficiencies in resource allocation and parallelization. The reported scale of deployment and concrete throughput improvements over multiple baselines indicate potential practical impact on training efficiency for multimodal models with fluctuating workloads.
major comments (3)
- [Abstract] The headline throughput gains (1.27×–7.57×) are stated without any description of the experimental setup: hardware configuration (e.g., GPU types, interconnect), workload generation method, exact modality mixture proportions, sequence length distributions, or the number of runs for statistical significance. This is load-bearing for the central claim because the gains constitute the primary evidence of superiority.
- [§5] Evaluation: No tables or figures detail the tested dynamic workloads or sensitivity to parameters such as modality ratios and length variance; without these, it is impossible to judge whether the measured improvements generalize beyond the authors' in-house traces or are specific to the particular combination of long-short encoder parallelism, 5D LLM parallelism, and decentralized balancing.
- [§3] Design: The communication-efficient parallelization layout and the adaptive resharding mechanism are described only at a high level, without quantitative analysis (e.g., communication-volume reductions or load-imbalance metrics before/after reordering), leaving unclear how much each innovation contributes to the reported gains over the baseline systems.
minor comments (2)
- [§3] The term '5D parallelism' is used without an explicit breakdown of the five dimensions or a diagram showing the mapping to encoder and LLM ranks.
- [§5] The four state-of-the-art baseline systems are named only in the abstract; their configurations and any modifications for fair comparison should be detailed in the evaluation section.
Simulated Author's Rebuttal
We thank the referee for the constructive comments, which identify opportunities to improve the clarity and completeness of our presentation. We address each major comment below and have revised the manuscript to incorporate the requested details and analyses.
Point-by-point responses
- Referee: [Abstract] The headline throughput gains (1.27×–7.57×) are stated without any description of the experimental setup: hardware configuration (e.g., GPU types, interconnect), workload generation method, exact modality mixture proportions, sequence length distributions, or the number of runs for statistical significance. This is load-bearing for the central claim because the gains constitute the primary evidence of superiority.
Authors: We agree that the abstract benefits from additional context on the experimental conditions. In the revised manuscript, we have expanded the abstract to briefly note the hardware (clusters of thousands of A100/H100 GPUs with high-bandwidth interconnects), the use of production traces for dynamic workloads with variable modality mixtures and sequence lengths, and that results are averaged over multiple runs. This provides necessary context for the reported gains while remaining within abstract length constraints. revision: yes
- Referee: [§5] Evaluation: No tables or figures detail the tested dynamic workloads or sensitivity to parameters such as modality ratios and length variance; without these, it is impossible to judge whether the measured improvements generalize beyond the authors' in-house traces or are specific to the particular combination of long-short encoder parallelism, 5D LLM parallelism, and decentralized balancing.
Authors: We acknowledge the value of more detailed workload characterization. The revised manuscript adds a new subsection and accompanying tables/figures in §5 that describe the modality mixture proportions, sequence length distributions from the in-house production traces, and sensitivity analyses across varying modality ratios and length variances. These additions demonstrate that the throughput gains are robust across a range of dynamic conditions. revision: yes
- Referee: [§3] Design: The communication-efficient parallelization layout and the adaptive resharding mechanism are described only at a high level, without quantitative analysis (e.g., communication-volume reductions or load-imbalance metrics before/after reordering), leaving unclear how much each innovation contributes to the reported gains over the baseline systems.
Authors: We agree that quantitative breakdowns would better isolate the contributions of each technique. In the revised §3, we have added measurements of communication volume reductions from the decoupled long-short sequence parallelism and 5D LLM layout, as well as load imbalance metrics before and after decentralized grouped reordering and adaptive resharding. An ablation study has also been included to quantify the impact of each innovation relative to the baselines. revision: yes
Circularity Check
No circularity: empirical system description with no derivations or self-referential predictions
Full rationale
The paper describes an engineering system for MLLM training, introducing three innovations (decoupled parallelism, unified representations with joint pipeline, and workload balancing via decentralized reordering and resharding) and reports empirical throughput gains (1.27×–7.57×) from production deployment on thousands of GPUs versus four SOTA baselines. No equations, fitted parameters, predictions, or uniqueness theorems appear in the provided text or abstract. Claims rest on external measurements rather than internal definitions or self-citation chains that reduce the result to its inputs. The representativeness concern raised in the skeptic note is an external validity issue, not a circularity reduction.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: standard assumptions about communication costs, hardware homogeneity, and workload characteristics in large-scale GPU clusters.