arxiv: 2605.08962 · v1 · submitted 2026-05-09 · 💻 cs.DC

Recognition: no theorem link

MegaScale-Omni: A Hyper-Scale, Workload-Resilient System for MultiModal LLM Training in Production

Chunyu Xue , Yangrui Chen , Jianyu Jiang , Ningxin Zheng , Junda Feng , Jingji Chen , Shixiong Zhao , Shen Yan

show 9 more authors

Yi Lin Lei Shi Zanbo Wang Lishu Luo Faming Wu Haibin Lin Xin Liu Yanghua Peng Quan Chen

Authors on Pith no claims yet

Pith reviewed 2026-05-12 02:00 UTC · model grok-4.3

classification 💻 cs.DC

keywords multimodal LLM trainingMLLM systemsdynamic workloadsparallelism decouplingworkload balancingencoder-LLM multiplexinglarge-scale trainingthroughput optimization

0 comments

The pith

MegaScale-Omni uses decoupled parallelism and adaptive balancing to deliver up to 7.57 times higher throughput for dynamic multimodal LLM training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper presents MegaScale-Omni, a system for training multimodal large language models at large scale when the data includes varying proportions of different modalities and samples of different lengths. Existing approaches tie resource decisions too rigidly to the model structure, causing inefficiency as workloads change. The new design separates how encoders handle short and long sequences from the main model's five-dimensional parallel setup, while adding unified data representations and decentralized methods to rebalance work across processors. A sympathetic reader would care because successful adaptation to dynamic conditions could make large-scale training more practical and less wasteful in real production settings with thousands of GPUs.

Core claim

MegaScale-Omni is an industrial-grade MLLM training system built on encoder-LLM multiplexing. It applies long-short sequence parallelism to encoders for variable-length samples and full 5D parallelism to the LLM backbone under a communication-efficient layout. Unified representations support flexible colocation and a joint pipeline that adds workload resilience. Decentralized grouped reordering in data loaders together with adaptive resharding from encoder to LLM ranks handles balancing. The system is deployed for in-house tasks on thousands of GPUs and reports 1.27×-7.57× throughput gains versus four prior systems under production-grade dynamic workloads.

What carries the argument

Encoder-LLM multiplexing scheme that decouples long-short sequence parallelism for encoders from 5D parallelism for the LLM backbone, supported by unified representations, joint pipeline execution, and decentralized grouped reordering plus adaptive resharding for workload balance.

If this is right

Throughput improves between 1.27 and 7.57 times under production-grade dynamic workloads compared to four existing systems.
The system supports deployment at hyper-scale with thousands of GPUs for in-house MLLM training tasks.
Decoupled strategies allow encoders to process variable-length samples independently of the LLM backbone's parallelization.
Adaptive resharding and grouped reordering maintain efficiency as modality mixtures and sequence lengths change.
Unified representations enable flexible colocation of encoders and the LLM without static coupling of their resource decisions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same decoupling pattern could reduce idle compute in other training pipelines that combine separate modality-specific models with a shared backbone.
Production clusters might adopt the decentralized balancing layer to lower the manual tuning required when data distributions shift over time.
If the joint pipeline scales cleanly, future multimodal systems could treat encoder and LLM stages as interchangeable modules rather than fixed coupled stages.

Load-bearing premise

The tested dynamic workloads and hardware environment are representative of broader production use cases, and the specific combination of long-short sequence parallelism, 5D LLM parallelism, and decentralized balancing generalizes without major additional engineering.

What would settle it

A side-by-side throughput measurement of MegaScale-Omni against the same four baseline systems on a fresh collection of production dynamic workload traces that include different modality mixture proportions and sequence length distributions than those used in the original experiments.

read the original abstract

As the foundational component of versatile AI applications, training an multimodal large language model (MLLM) relies on multimodal datasets with dynamic modality mixture proportions and sample length distributions. However, existing MLLM systems remain inefficient under dynamic workloads, due to statically coupled decisions of resource allocation and model parallelization between encoders and the LLM backbone. This paper presents MegaScale-Omni, an industrial-grade MLLM training system tailored for dynamic workload adaption and hyper-scale deployment. MegaScale-Omni is built upon the training scheme of encoder-LLM multiplexing with three key innovations: (1) Decoupled parallelism strategies with long-short sequence parallelism for encoders to process variable-length samples, and full-fledged 5D parallelism for the LLM backbone, both organized under a communication-efficient parallelization layout. (2) Unified encoder-LLM representations for flexible, extensible colocation, and a new paradigm of encoder-LLM joint pipeline with workload resilience. (3) Workload balancing techniques via decentralized grouped reordering in data loaders and adaptive resharding from encoder to LLM ranks. MegaScale-Omni is deployed as the foundation of our in-house large-scale MLLM training tasks with thousands of GPUs. Our experimental results demonstrate $1.27\times$-$7.57\times$ throughput improvement under production-grade dynamic workloads, as compared to four state-of-the-art systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper presents MegaScale-Omni, an industrial MLLM training system for dynamic workloads with variable modality mixtures and sequence lengths. It uses encoder-LLM multiplexing with three innovations: decoupled long-short sequence parallelism for encoders plus 5D parallelism for the LLM backbone, unified representations enabling a joint pipeline, and decentralized grouped reordering plus adaptive resharding for balancing. The system is deployed on thousands of GPUs for in-house tasks and claims 1.27×–7.57× throughput gains versus four SOTA baselines under production-grade dynamic workloads.

Significance. If the empirical results hold under representative conditions, the work would be significant for production-scale MLLM training by addressing static coupling inefficiencies in resource allocation and parallelization. The reported scale of deployment and concrete throughput improvements over multiple baselines indicate potential practical impact on training efficiency for multimodal models with fluctuating workloads.

major comments (3)

[Abstract] Abstract: The headline throughput gains (1.27×–7.57×) are stated without any description of the experimental setup, including hardware configuration (e.g., GPU types, interconnect), workload generation method, exact modality mixture proportions, sequence length distributions, or number of runs for statistical significance. This is load-bearing for the central claim because the gains constitute the primary evidence of superiority.
[§5] §5 (Evaluation section): No tables or figures detail the tested dynamic workloads or sensitivity to parameters such as modality ratios and length variance; without these, it is impossible to evaluate whether the measured improvements generalize beyond the authors' in-house traces or are specific to the particular combination of long-short encoder parallelism, 5D LLM parallelism, and decentralized balancing.
[§3] §3 (Design): The communication-efficient parallelization layout and adaptive resharding mechanism are described at a high level without quantitative analysis (e.g., communication volume reductions or load imbalance metrics before/after reordering), leaving unclear how much each innovation contributes to the reported gains versus baseline systems.

minor comments (2)

[§3] The term '5D parallelism' is used without an explicit breakdown of the five dimensions or a diagram showing the mapping to encoder and LLM ranks.
[§5] The four state-of-the-art baseline systems are named only in the abstract; their configurations and any modifications for fair comparison should be detailed in the evaluation section.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments, which identify opportunities to improve the clarity and completeness of our presentation. We address each major comment below and have revised the manuscript to incorporate the requested details and analyses.

read point-by-point responses

Referee: [Abstract] Abstract: The headline throughput gains (1.27×–7.57×) are stated without any description of the experimental setup, including hardware configuration (e.g., GPU types, interconnect), workload generation method, exact modality mixture proportions, sequence length distributions, or number of runs for statistical significance. This is load-bearing for the central claim because the gains constitute the primary evidence of superiority.

Authors: We agree that the abstract benefits from additional context on the experimental conditions. In the revised manuscript, we have expanded the abstract to briefly note the hardware (clusters of thousands of A100/H100 GPUs with high-bandwidth interconnects), the use of production traces for dynamic workloads with variable modality mixtures and sequence lengths, and that results are averaged over multiple runs. This provides necessary context for the reported gains while remaining within abstract length constraints. revision: yes
Referee: [§5] §5 (Evaluation section): No tables or figures detail the tested dynamic workloads or sensitivity to parameters such as modality ratios and length variance; without these, it is impossible to evaluate whether the measured improvements generalize beyond the authors' in-house traces or are specific to the particular combination of long-short encoder parallelism, 5D LLM parallelism, and decentralized balancing.

Authors: We acknowledge the value of more detailed workload characterization. The revised manuscript adds a new subsection and accompanying tables/figures in §5 that describe the modality mixture proportions, sequence length distributions from the in-house production traces, and sensitivity analyses across varying modality ratios and length variances. These additions demonstrate that the throughput gains are robust across a range of dynamic conditions. revision: yes
Referee: [§3] §3 (Design): The communication-efficient parallelization layout and adaptive resharding mechanism are described at a high level without quantitative analysis (e.g., communication volume reductions or load imbalance metrics before/after reordering), leaving unclear how much each innovation contributes to the reported gains versus baseline systems.

Authors: We agree that quantitative breakdowns would better isolate the contributions of each technique. In the revised §3, we have added measurements of communication volume reductions from the decoupled long-short sequence parallelism and 5D LLM layout, as well as load imbalance metrics before and after decentralized grouped reordering and adaptive resharding. An ablation study has also been included to quantify the impact of each innovation relative to the baselines. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical system description with no derivations or self-referential predictions

full rationale

The paper describes an engineering system for MLLM training, introducing three innovations (decoupled parallelism, unified representations with joint pipeline, and workload balancing via decentralized reordering and resharding) and reports empirical throughput gains (1.27×–7.57×) from production deployment on thousands of GPUs versus four SOTA baselines. No equations, fitted parameters, predictions, or uniqueness theorems appear in the provided text or abstract. Claims rest on external measurements rather than internal definitions or self-citation chains that reduce the result to its inputs. The representativeness concern raised in the skeptic note is an external validity issue, not a circularity reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The work is an applied systems paper that relies on standard distributed training assumptions rather than new theoretical constructs.

axioms (1)

domain assumption Standard assumptions about communication costs, hardware homogeneity, and workload characteristics in large-scale GPU clusters.
Invoked implicitly when claiming efficiency of the parallelization layout and balancing techniques.

pith-pipeline@v0.9.0 · 5606 in / 1226 out tokens · 31854 ms · 2026-05-12T02:00:03.442170+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

67 extracted references · 67 canonical work pages · 13 internal anchors

[1]

https://github.com/ NVIDIA/TransformerEngine, 2022

Nvidia transformer engine. https://github.com/ NVIDIA/TransformerEngine, 2022

work page 2022
[2]

https://github.com/ InternLM/InternLM-techreport, 2023

Internlm technical report. https://github.com/ InternLM/InternLM-techreport, 2023

work page 2023
[3]

https://docs

Nvidia context parallel package. https://docs. nvidia.com/megatron-core/developer-guide/ latest/api-guide/context_parallel.html, 2025

work page 2025
[4]

https://deepmind.google/ models/gemini/flash/, 2025

Gemini 2.5 flash. https://deepmind.google/ models/gemini/flash/, 2025

work page 2025
[5]

Qwen2.5-VL Technical Report

Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhaohai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, and Jun- yang Lin. Qwen2.5-vl technical report,...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[6]

Catarina and B

S. Boettcher and S. Mertens. Analysis of the karmarkar-karp differencing algorithm. The European Physical Journal B, 65(1):131–140, Au- gust 2008. ISSN 1434-6036. doi: 10.1140/epjb/ e2008-00320-9. URL http://dx.doi.org/10.1140/ epjb/e2008-00320-9

work page doi:10.1140/epjb/ 2008
[7]

Florian Bordes, Richard Yuanzhe Pang, Anurag Ajay, Alexander C. Li, Adrien Bardes, Suzanne Petryk, Oscar Mañas, Zhiqiu Lin, Anas Mahmoud, Bar- gav Jayaraman, Mark Ibrahim, Melissa Hall, Yun- yang Xiong, Jonathan Lebensold, Candace Ross, Srihari Jayakumar, Chuan Guo, Diane Bouchacourt, Haider Al-Tahan, Karthik Padthe, Vasu Sharma, HuXu, XiaoqingEllenTan, M...

work page arXiv 2024
[8]

Internlm2 technical report

Zheng Cai, Maosong Cao, Haojiong Chen, Kai Chen, Keyu Chen, Xin Chen, Xun Chen, Zehui Chen, Zhi Chen, Pei Chu, Xiaoyi Dong, Haodong Duan, Qi Fan, Zhaoye Fei, Yang Gao, Jiaye Ge, Chenya Gu, Yuzhe Gu, Tao Gui, Aijia Guo, Qipeng Guo, Conghui He, Yingfan Hu, Ting Huang, Tao Jiang, Penglong Jiao, Zhenjiang Jin, Zhikai Lei, Jiaxing Li, Jingwen Li, Linyang Li, S...

work page arXiv 2024
[9]

Flux: Fast software-based communication overlap on gpus through kernel fusion.arXiv preprint arXiv:2406.06858,

Liwen Chang, Wenlei Bao, Qi Hou, Chengquan Jiang, Ningxin Zheng, Yinmin Zhong, Xuanrun Zhang, Zuquan Song, Ziheng Jiang, Haibin Lin, et al. Flux: Fast software-based communication over- lap on gpus through kernel fusion.arXiv preprint arXiv:2406.06858, 2024

work page arXiv 2024
[10]

FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning

Tri Dao. Flashattention-2: Faster attention with better parallelism and work partitioning, 2023. URL https://arxiv.org/abs/2307.08691

work page internal anchor Pith review Pith/arXiv arXiv 2023
[11]

Patch n’ pack: Navit, a vision trans- former for any aspect ratio and resolution, 2023

Mostafa Dehghani, Basil Mustafa, Josip Djo- longa, Jonathan Heek, Matthias Minderer, Mathilde Caron, Andreas Steiner, Joan Puigcerver, Robert Geirhos, Ibrahim Alabdulmohsin, Avital Oliver, Pi- otr Padlewski, Alexey Gritsenko, Mario Lučić, and Neil Houlsby. Patch n’ pack: Navit, a vision trans- former for any aspect ratio and resolution, 2023. URLhttps://a...

work page arXiv 2023
[12]

An image is worth 16x16 words: Transformers for image recognition at scale,

Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, GeorgHeigold, SylvainGelly, JakobUszko- reit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale,

work page
[13]

URLhttps://arxiv.org/abs/2010.11929

work page internal anchor Pith review Pith/arXiv arXiv 2010
[14]

Optimus: Acceler- ating large-scale multi-modal llm training by bubble exploitation

Weiqi Feng, Yangrui Chen, Shaoyu Wang, Yanghua Peng, Haibin Lin, and Minlan Yu. Optimus: Acceler- ating large-scale multi-modal llm training by bubble exploitation. arXiv preprint arXiv:2408.03505, 2024

work page arXiv 2024
[15]

Seedream 3.0 Technical Report

Yu Gao, Lixue Gong, Qiushan Guo, Xiaoxia Hou, Zhichao Lai, Fanshi Li, Liang Li, Xiaochen Lian, Chao Liao, Liyang Liu, Wei Liu, Yichun Shi, Shiqi Sun, Yu Tian, Zhi Tian, Peng Wang, Rui Wang, Xuanda Wang, Xun Wang, Ye Wang, Guofeng Wu, Jie Wu, Xin Xia, Xuefeng Xiao, Zhonghua Zhai, Xinyu Zhang, Qi Zhang, Yuwei Zhang, Shijia Zhao, Jianchao Yang, and Weilin Hu...

work page internal anchor Pith review arXiv 2025
[16]

Bytescale: Efficient scaling of llm training with a 2048k context length on more than 12,000 gpus,

Hao Ge, Junda Feng, Qi Huang, Fangcheng Fu, Xi- aonan Nie, Lei Zuo, Haibin Lin, Bin Cui, and Xin 17 Liu. Bytescale: Efficient scaling of llm training with a 2048k context length on more than 12,000 gpus,

work page
[17]

URLhttps://arxiv.org/abs/2502.21231

work page arXiv
[18]

Seed1.5-VL Technical Report

Dong Guo, Faming Wu, Feida Zhu, Fuxing Leng, Guang Shi, Haobin Chen, Haoqi Fan, Jian Wang, Jianyu Jiang, Jiawei Wang, Jingji Chen, Jingjia Huang, Kang Lei, Liping Yuan, Lishu Luo, Pengfei Liu, Qinghao Ye, Rui Qian, Shen Yan, Shixiong Zhao, Shuai Peng, Shuangye Li, Sihang Yuan, Sijin Wu, Tianheng Cheng, Weiwei Liu, Wenqian Wang, Xianhan Zeng, Xiao Liu, Xia...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[19]

Step-audio: Unified understanding and generation in intelligent speech interaction, 2025

Ailin Huang, Boyong Wu, Bruce Wang, Chao Yan, Chen Hu, Chengli Feng, Fei Tian, Feiyu Shen, Jing- bei Li, Mingrui Chen, Peng Liu, Ruihang Miao, Wang You, Xi Chen, Xuerui Yang, Yechang Huang, Yuxiang Zhang, Zheng Gong, Zixin Zhang, Hongyu Zhou, Jianjian Sun, Brian Li, Chengting Feng, Changyi Wan, Hanpeng Hu, Jianchang Wu, Jiangjie Zhen, Ranchen Ming, Song Y...

work page arXiv 2025
[20]

DISTMM: Accelerating distributed multimodal model training

Jun Huang, Zhen Zhang, Shuai Zheng, Feng Qin, and Yida Wang. DISTMM: Accelerating distributed multimodal model training. In 21st USENIX Symposium on NetworkedSystems Design and Implementation (NSDI 24), pages 1157–1171, Santa Clara, CA, April 2024. USENIX Association. ISBN978-1-939133-39-7. URL https://www.usenix. org/conference/nsdi24/presentation/huang

work page 2024
[21]

Gpipe: Efficient training of giant neural net- works using pipeline parallelism.Advancesin neural information processing systems, 32, 2019

Yanping Huang, Youlong Cheng, Ankur Bapna, Orhan Firat, Dehao Chen, Mia Chen, HyoukJoong Lee, Jiquan Ngiam, Quoc V Le, Yonghui Wu, et al. Gpipe: Efficient training of giant neural net- works using pipeline parallelism.Advancesin neural information processing systems, 32, 2019

work page 2019
[22]

DeepSpeed Ulysses: System Optimizations for Enabling Training of Extreme Long Sequence Transformer Models

Sam Ade Jacobs, Masahiro Tanaka, Chengming Zhang, Minjia Zhang, Shuaiwen Leon Song, Samyam Rajbhandari, and Yuxiong He. Deepspeed ulysses: System optimizations for enabling training of ex- 18 treme long sequence transformer models, 2023. URL https://arxiv.org/abs/2309.14509

work page internal anchor Pith review arXiv 2023
[23]

Ganger, Tianqi Chen, and Zhihao Jia

Byungsoo Jeon, Mengdi Wu, Shiyi Cao, Sunghyun Kim, Sunghyun Park, Neeraj Aggarwal, Colin Unger, Daiyaan Arfeen, Peiyuan Liao, Xupeng Miao, Mo- hammad Alizadeh, Gregory R. Ganger, Tianqi Chen, and Zhihao Jia. Graphpipe: Improving performance and scalability of dnn training with graph pipeline parallelism, 2024. URL https://arxiv.org/abs/ 2406.17145

work page arXiv 2024
[24]

Dynapipe: Optimizing multi- task training through dynamic pipelines

Chenyu Jiang, Zhen Jia, Shuai Zheng, Yida Wang, and Chuan Wu. Dynapipe: Optimizing multi- task training through dynamic pipelines. In Proceedings oftheNineteenthEuropean Conference on Computer Systems, pages 542–559, 2024

work page 2024
[25]

Megascale: Scaling large language model training to more than 10,000 gpus,

Ziheng Jiang, Haibin Lin, Yinmin Zhong, Qi Huang, Yangrui Chen, Zhi Zhang, Yanghua Peng, Xiang Li, Cong Xie, Shibiao Nong, Yulu Jia, Sun He, Hongmin Chen, Zhihao Bai, Qi Hou, Shipeng Yan, Ding Zhou, Yiyao Sheng, Zhuo Jiang, Haohan Xu, Haoran Wei, Zhang Zhang, Pengfei Nie, Leqi Zou, Sida Zhao, Liang Xiang, Zherui Liu, Zhe Li, Xiaoying Jia, Jianxi Ye, Xin J...

work page
[26]

URLhttps://arxiv.org/abs/2402.15627

work page arXiv
[27]

Auto-Encoding Variational Bayes

Diederik P Kingma and Max Welling. Auto-encoding variational bayes, 2022. URLhttps://arxiv.org/ abs/1312.6114

work page internal anchor Pith review Pith/arXiv arXiv 2022
[28]

Reducing activation recomputation in large transformer models, 2022

Vijay Korthikanti, Jared Casper, Sangkug Lym, Lawrence McAfee, Michael Andersch, Mohammad Shoeybi, and Bryan Catanzaro. Reducing activation recomputation in large transformer models, 2022. URLhttps://arxiv.org/abs/2205.05198

work page arXiv 2022
[29]

Alina Kuznetsova, Hassan Rom, Neil Alldrin, Jasper Uijlings, Ivan Krasin, Jordi Pont-Tuset, Shahab Kamali, Stefan Popov, Matteo Malloci, Alexander Kolesnikov, Tom Duerig, and Vittorio Ferrari. The open images dataset v4: Unified image classifica- tion, object detection, and visual relationship de- tection at scale.International Journal of Computer Vision,...

work page 1956
[30]

URL http: //dx.doi.org/10.1007/s11263-020-01316-z

doi: 10.1007/s11263-020-01316-z. URL http: //dx.doi.org/10.1007/s11263-020-01316-z

work page doi:10.1007/s11263-020-01316-z
[31]

arXiv preprint arXiv:2006.15704 , author =

Shen Li, Yanli Zhao, Rohan Varma, Omkar Salpekar, Pieter Noordhuis, Teng Li, Adam Paszke, Jeff Smith, Brian Vaughan, Pritam Damania, et al. Pytorch dis- tributed: Experiences on accelerating data parallel training. arXiv preprint arXiv:2006.15704, 2020

work page arXiv 2006
[32]

Ter- apipe: Token-level pipeline parallelism for train- ing large-scale language models

Zhuohan Li, Siyuan Zhuang, Shiyuan Guo, Danyang Zhuo, Hao Zhang, Dawn Song, and Ion Stoica. Ter- apipe: Token-level pipeline parallelism for train- ing large-scale language models. In International Conference on Machine Learning, pages 6543–6552. PMLR, 2021

work page 2021
[33]

Mogao: An omni founda- tion model for interleaved multi-modal generation,

Chao Liao, Liyang Liu, Xun Wang, Zhengxiong Luo, Xinyu Zhang, Wenliang Zhao, Jie Wu, Liang Li, Zhi Tian, and Weilin Huang. Mogao: An omni founda- tion model for interleaved multi-modal generation,

work page
[34]

URLhttps://arxiv.org/abs/2505.05472

work page arXiv
[35]

{nnScaler}:{Constraint- Guided} parallelization plan generation for deep learning training

Zhiqi Lin, Youshan Miao, Quanlu Zhang, Fan Yang, Yi Zhu, Cheng Li, Saeed Maleki, Xu Cao, Ning Shang, Yilei Yang, et al. {nnScaler}:{Constraint- Guided} parallelization plan generation for deep learning training. In 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24), pages 347–363, 2024

work page 2024
[36]

Video swin transformer,

ZeLiu, JiaNing, YueCao, YixuanWei, ZhengZhang, Stephen Lin, and Han Hu. Video swin transformer,

work page
[37]

URLhttps://arxiv.org/abs/2106.13230

work page arXiv
[38]

Pipedream: Generalized pipeline parallelism for dnn training

Deepak Narayanan, Aaron Harlap, Amar Phan- ishayee, Vivek Seshadri, Nikhil R Devanur, Gre- gory R Ganger, Phillip B Gibbons, and Matei Za- haria. Pipedream: Generalized pipeline parallelism for dnn training. InProceedings of the 27th ACM Symposium on Operating Systems Principles, pages 1–15, 2019

work page 2019
[39]

Effi- cient large-scale language model training on gpu clusters using megatron-lm

Deepak Narayanan, Mohammad Shoeybi, Jared Casper, Patrick LeGresley, Mostofa Patwary, Vijay Korthikanti, Dmitri Vainbrand, Prethvi Kashinkunti, Julie Bernauer, Bryan Catanzaro, et al. Effi- cient large-scale language model training on gpu clusters using megatron-lm. In Proceedings of the International Conference for High Performance Computing, Networking,...

work page 2021
[40]

OpenAI, :, Aaron Hurst, Adam Lerer, Adam P. Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, Aleksander Mądry, Alex Baker- Whitcomb, Alex Beutel, Alex Borzunov, Alex Car- ney, Alex Chow, Alex Kirillov, Alex Nichol, Alex Paino, Alex Renzin, Alex Tachard Passos, Alexan- der Kirillov, Alexi Christakis,...

work page internal anchor Pith review Pith/arXiv arXiv 2024
[41]

Librispeech: An ASR corpus based on public domain audio books

Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur. Librispeech: An asr corpus based on public domain audio books. In2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5206–5210, 2015. doi: 10.1109/ICASSP.2015.7178964

work page doi:10.1109/icassp.2015.7178964 2015
[42]

Zero bubble pipeline parallelism.arXiv preprint arXiv:2401.10241, 2023

Penghui Qi, Xinyi Wan, Guangxing Huang, and Min Lin. Zero bubble pipeline parallelism.arXiv preprint arXiv:2401.10241, 2023

work page arXiv 2023
[43]

Zero: Memory optimizations to- ward training trillion parameter models

Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He. Zero: Memory optimizations to- ward training trillion parameter models. InSC20: 20 International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1–16. IEEE, 2020

work page 2020
[44]

Deepspeed- moe: Advancing mixture-of-experts inference and trainingtopowernext-generationaiscale, 2022

Samyam Rajbhandari, Conglong Li, Zhewei Yao, Minjia Zhang, Reza Yazdani Aminabadi, Ammar Ah- mad Awan, Jeff Rasley, and Yuxiong He. Deepspeed- moe: Advancing mixture-of-experts inference and trainingtopowernext-generationaiscale, 2022. URL https://arxiv.org/abs/2201.05596

work page arXiv 2022
[45]

Deepspeed: System optimizations enable training deep learning models with over 100 billion parameters

Jeff Rasley, Samyam Rajbhandari, Olatunji Ruwase, and Yuxiong He. Deepspeed: System optimiza- tions enable training deep learning models with over 100 billion parameters. InProceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD ’20, page 3505–3506, New York, NY, USA, 2020. Associa- tion for Computing Machiner...

work page doi:10.1145/3394486.3406703 2020
[46]

Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism

Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catan- zaro. Megatron-lm: Training multi-billion parameter language models using model parallelism. arXiv preprint arXiv:1909.08053, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1909
[47]

LLaMA: Open and Efficient Foundation Language Models

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Ham- bro, Faisal Azhar, et al. Llama: Open and effi- cient foundation language models. arXiv preprint arXiv:2302.13971, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[48]

Unity: Accelerat- ing {DNN} training through joint optimization of algebraic transformations and parallelization

Colin Unger, Zhihao Jia, Wei Wu, Sina Lin, Man- deep Baines, Carlos Efrain Quintero Narvaez, Vinay Ramakrishnaiah, Nirmal Prajapati, Pat McCormick, Jamaludin Mohd-Yusof, et al. Unity: Accelerat- ing {DNN} training through joint optimization of algebraic transformations and parallelization. In 16th USENIX Symposium on Operating Systems Design and Implement...

work page 2022
[49]

Bytecheckpoint: A unified checkpointing system for llm development

Borui Wan, Mingji Han, Yiyao Sheng, Zhichao Lai, Mofan Zhang, Junda Zhang, Yanghua Peng, Haibin Lin, Xin Liu, and Chuan Wu. Bytecheckpoint: A unified checkpointing system for llm development. arXiv preprint arXiv:2407.20143, 2024

work page arXiv 2024
[50]

A compara- tive study of discrete speech tokens for semantic- related tasks with large language models, 2024

Dingdong Wang, Mingyu Cui, Dongchao Yang, Xueyuan Chen, and Helen Meng. A compara- tive study of discrete speech tokens for semantic- related tasks with large language models, 2024. URL https://arxiv.org/abs/2411.08742

work page arXiv 2024
[51]

Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Yang Fan, Kai Dang, Mengfei Du, Xuancheng Ren, Rui Men, Dayiheng Liu, Chang Zhou, Jingren Zhou, and Junyang Lin. Qwen2-vl: Enhancing vision-language model’s per- ception of the world at any resolution, 2024. URL https://arxiv.org/abs/2409.12191

work page internal anchor Pith review Pith/arXiv arXiv 2024
[52]

Overlap communication with de- pendent computation via decomposition in large deep learning models

Shibo Wang, Jinliang Wei, Amit Sabne, Andy Davis, Berkin Ilbeyi, Blake Hechtman, Dehao Chen, Karthik Srinivasa Murthy, Marcello Maggioni, Qiao Zhang, et al. Overlap communication with de- pendent computation via decomposition in large deep learning models. In Proceedings of the 28th ACM International Conference on Architectural Support forProgrammingLangu...

work page 2022
[53]

Spindle: Efficient distributed training of multi-task large models via wavefront scheduling,

Yujie Wang, Shenhan Zhu, Fangcheng Fu, Xupeng Miao, Jie Zhang, Juan Zhu, Fan Hong, Yong Li, and Bin Cui. Spindle: Efficient distributed training of multi-task large models via wavefront scheduling,

work page
[54]

URLhttps://arxiv.org/abs/2409.03365

work page arXiv
[55]

Wlb-llm: Workload-balanced 4d parallelism for large language model training, 2025

Zheng Wang, Anna Cai, Xinfeng Xie, Zaifeng Pan, Yue Guan, Weiwei Chu, Jie Wang, Shikai Li, Jianyu Huang, Chris Cai, Yuchen Hao, and Yufei Ding. Wlb-llm: Workload-balanced 4d parallelism for large language model training, 2025. URLhttps://arxiv. org/abs/2503.17924

work page arXiv 2025
[56]

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Perric Cistac, Clara Ma, Yacine Jernite, Julien Plu, Can- wen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush. Transformers: State-of-the-Art Natural Language Processing. pages 38–45. Association for Computa- tional Linguistics...

work page 2020
[57]

Adaptra: Straggler-resilient hybrid-parallel training with pipeline adaptation, 2025

Tianyuan Wu, Lunxi Cao, Hanfeng Lu, Xiaoxiao Jiang, Yinghao Yu, Siran Yang, Guodong Yang, Jiamang Wang, Lin Qu, Liping Zhang, and Wei Wang. Adaptra: Straggler-resilient hybrid-parallel training with pipeline adaptation, 2025. URLhttps: //arxiv.org/abs/2504.19232

work page arXiv 2025
[58]

Pipeweaver: Addressing data dynamicity in large multimodal model training with dynamic interleaved pipeline.arXiv preprint arXiv:2504.14145,

Zhenliang Xue, Hanpeng Hu, Xing Chen, Yimin Jiang, Yixin Song, Zeyu Mi, Yibo Zhu, Daxin Jiang, Yubin Xia, and Haibo Chen. Pipeweaver: Addressing data dynamicity in large multimodal model training with dynamic interleaved pipeline, 2025. URLhttps: //arxiv.org/abs/2504.14145

work page arXiv 2025
[59]

Gigaspeech 2: An evolving, large-scale and multi- domain asr corpus for low-resource languages with automated crawling, transcription and refinement,

Yifan Yang, Zheshu Song, Jianheng Zhuo, Mingyu Cui, Jinpeng Li, Bo Yang, Yexing Du, Ziyang Ma, Xunying Liu, Ziyuan Wang, Ke Li, Shuai Fan, Kai Yu, Wei-Qiang Zhang, Guoguo Chen, and Xie Chen. Gigaspeech 2: An evolving, large-scale and multi- domain asr corpus for low-resource languages with automated crawling, transcription and refinement,

work page
[60]

URLhttps://arxiv.org/abs/2406.11546

work page arXiv
[61]

Berg, and Tamara L

Licheng Yu, Patrick Poirson, Shan Yang, Alexan- der C. Berg, and Tamara L. Berg. Modeling con- 21 text in referring expressions, 2016. URL https: //arxiv.org/abs/1608.00272

work page arXiv 2016
[62]

Yu Zhang, Wei Han, James Qin, Yongqiang Wang, Ankur Bapna, Zhehuai Chen, Nanxin Chen, Bo Li, Vera Axelrod, Gary Wang, Zhong Meng, Ke Hu, Andrew Rosenberg, Rohit Prabhavalkar, Daniel S. Park, Parisa Haghani, Jason Riesa, Ginger Perng, Hagen Soltau, Trevor Strohman, Bhuvana Ram- abhadran, Tara Sainath, Pedro Moreno, Chung- Cheng Chiu, Johan Schalkwyk, Franç...

work page arXiv 2023
[63]

Disttrain: Addressing model and data heterogeneity with disaggregated training for multimodal large language models

Zili Zhang, Yinmin Zhong, Ranchen Ming, Hanpeng Hu, Jianjian Sun, Zheng Ge, Yibo Zhu, and Xin Jin. Disttrain: Addressing model and data heterogeneity with disaggregated training for multimodal large language models. arXiv preprint arXiv:2408.04275, 2024

work page arXiv 2024
[64]

Overlord: Ultimate scaling of dataloader for multi-source large foundation model training,

Juntao Zhao, Qi Lu, Wei Jia, Borui Wan, Lei Zuo, Junda Feng, Jianyu Jiang, Yangrui Chen, Shuaishuai Cao, Jialing He, Kaihua Jiang, Yuanzhe Hu, Shibiao Nong, Yanghua Peng, Haibin Lin, Xin Liu, and Chuan Wu. Overlord: Ultimate scaling of dataloader for multi-source large foundation model training,

work page
[65]

URLhttps://arxiv.org/abs/2504.09844

work page internal anchor Pith review Pith/arXiv arXiv
[66]

PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel

Yanli Zhao, Andrew Gu, Rohan Varma, Liang Luo, Chien-Chin Huang, Min Xu, Less Wright, Hamid Shojanazeri, Myle Ott, Sam Shleifer, et al. Pytorch fsdp: experiences on scaling fully sharded data par- allel. arXiv preprint arXiv:2304.11277, 2023

work page internal anchor Pith review arXiv 2023
[67]

Alpa: Automating inter-and intra-operator parallelism for distributed deep learning

Lianmin Zheng, Zhuohan Li, Hao Zhang, Yong- hao Zhuang, Zhifeng Chen, Yanping Huang, Yida Wang, Yuanzhong Xu, Danyang Zhuo, Eric P Xing, et al. Alpa: Automating inter-and intra-operator parallelism for distributed deep learning. In16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22), pages 559–578, 2022. 22

work page 2022