Heterogeneous Parallelism for Multimodal Large Language Model Training
Pith reviewed 2026-06-29 18:39 UTC · model grok-4.3
The pith
Multimodal LLM training can assign independent parallelism layouts to encoders and the core LLM, raising TFLOPS per GPU by up to 49.3 percent when the modules share GPUs.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Heterogeneous parallelism lets modules in one end-to-end graph use independent layouts and rank placements. Boundary communicators implement forward and backward layout transforms that preserve tensor semantics across those layouts. The design supports colocated execution on shared GPUs and non-colocated execution on disjoint rank sets, with added scheduling logic for each mode. Evaluation across multimodal workloads and GPU scales shows colocated heterogeneity improves TFLOPS per GPU by up to 49.3 percent while non-colocated heterogeneity improves aggregate token throughput by up to 13.0 percent and TFLOPS per GPU by up to 9.6 percent, with loss convergence parity to homogeneous baselines.
What carries the argument
Boundary communicators that materialize forward activations for the destination layout and route backward gradients back to the source layout.
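The forward and backward transforms can be pictured with a toy, single-process sketch. It assumes both layouts shard only the sequence dimension into contiguous, equal-sized pieces; the class name `BoundaryCommunicator` and the helpers `shard` and `gather` are hypothetical names for illustration, not the paper's API. A real system would use distributed collectives rather than Python lists.

```python
from typing import List

Shard = List[float]


def shard(seq: List[float], parts: int) -> List[Shard]:
    """Split a sequence into `parts` contiguous, equal-sized shards."""
    if len(seq) % parts != 0:
        raise ValueError("sequence length must divide evenly across ranks")
    step = len(seq) // parts
    return [seq[i * step:(i + 1) * step] for i in range(parts)]


def gather(shards: List[Shard]) -> List[float]:
    """Concatenate per-rank shards back into the full sequence."""
    return [x for s in shards for x in s]


class BoundaryCommunicator:
    """Toy stand-in for a boundary between two layouts (hypothetical)."""

    def __init__(self, src_parts: int, dst_parts: int) -> None:
        self.src_parts = src_parts
        self.dst_parts = dst_parts

    def forward(self, src_shards: List[Shard]) -> List[Shard]:
        # Materialize activations in the destination layout.
        assert len(src_shards) == self.src_parts
        return shard(gather(src_shards), self.dst_parts)

    def backward(self, dst_grad_shards: List[Shard]) -> List[Shard]:
        # Route gradients back to the source layout.
        assert len(dst_grad_shards) == self.dst_parts
        return shard(gather(dst_grad_shards), self.src_parts)


if __name__ == "__main__":
    comm = BoundaryCommunicator(src_parts=2, dst_parts=4)
    activations = shard([float(i) for i in range(8)], 2)
    print(comm.forward(activations))
```

In this sketch, passing the forward output straight into `backward` returns the original source shards, which is the round-trip property a boundary must preserve.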
If this is right
- Colocated heterogeneous configurations can raise TFLOPS per GPU by up to 49.3 percent.
- Non-colocated heterogeneous configurations can raise aggregate token throughput by up to 13.0 percent.
- Non-colocated heterogeneous configurations can raise TFLOPS per GPU by up to 9.6 percent.
- Loss convergence remains equivalent to homogeneous baselines across the tested workloads.
- The gains appear across different multimodal workloads and GPU cluster scales.
Where Pith is reading between the lines
- The same boundary handling could be tested on models that contain more than two distinct modules, such as separate vision, audio, and language components.
- Dynamic rank allocation that assigns hardware clusters sized to each module's optimal layout becomes feasible once boundary costs are controlled.
- The approach may reduce the hardware homogeneity requirement for large training runs when modules have mismatched scaling needs.
Load-bearing premise
Boundary communicators can move activations and gradients between independent layouts without introducing correctness errors or prohibitive overhead.
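A back-of-envelope estimate helps show what "prohibitive overhead" would mean here. The sketch below assumes the boundary tensor has shape tokens × hidden, that gradients match activations in shape and dtype, and that transfer time is simply bytes divided by bandwidth. It ignores latency, collective algorithm factors, and overlap with compute. All function names and numbers are hypothetical and are not taken from the paper.

```python
def boundary_bytes(tokens: int, hidden: int, bytes_per_elem: int = 2) -> int:
    """Bytes of activations crossing the boundary once in the forward pass."""
    return tokens * hidden * bytes_per_elem


def round_trip_bytes(tokens: int, hidden: int, bytes_per_elem: int = 2) -> int:
    """Forward activations plus backward gradients of the same shape."""
    return 2 * boundary_bytes(tokens, hidden, bytes_per_elem)


def transfer_seconds(num_bytes: int, bandwidth_gb_per_s: float) -> float:
    """Idealized transfer time: bytes over bandwidth, no latency term."""
    return num_bytes / (bandwidth_gb_per_s * 1e9)


if __name__ == "__main__":
    # Hypothetical example: 4096 tokens, hidden size 1024, 2-byte elements.
    total = round_trip_bytes(tokens=4096, hidden=1024)
    print(total, transfer_seconds(total, bandwidth_gb_per_s=100.0))
```

Comparing this estimate with the measured step time gives a rough bound on how much of a step the boundary could consume.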
What would settle it
An experiment that applies heterogeneous layouts to a multimodal workload and measures either a higher final loss than the matched homogeneous run or no gain in TFLOPS per GPU would undercut the convergence-parity claim or the performance claim, respectively.
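The test described above can be written as a small check. This is a sketch under stated assumptions: the `RunResult` type, the `claim_survives` function, and the loss tolerance are hypothetical, and the numbers in the usage example are made up to illustrate the arithmetic, not measured results.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunResult:
    final_loss: float
    tflops_per_gpu: float


def relative_gain(baseline: float, candidate: float) -> float:
    """Fractional improvement of candidate over baseline."""
    return (candidate - baseline) / baseline


def claim_survives(homo: RunResult, hetero: RunResult,
                   loss_tol: float = 0.01) -> bool:
    """True if the heterogeneous run matches loss and improves TFLOPS/GPU."""
    # Assumption: parity means the final loss is within an absolute tolerance.
    loss_ok = hetero.final_loss <= homo.final_loss + loss_tol
    faster = relative_gain(homo.tflops_per_gpu, hetero.tflops_per_gpu) > 0.0
    return loss_ok and faster


if __name__ == "__main__":
    homo = RunResult(final_loss=2.000, tflops_per_gpu=100.0)    # made-up
    hetero = RunResult(final_loss=2.005, tflops_per_gpu=149.3)  # made-up
    print(claim_survives(homo, hetero))
```

A matched pair of runs that makes `claim_survives` return `False` is the kind of evidence that would count against the claims.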
Original abstract
Foundation model training is becoming multimodal, from post-training pipelines to large-scale pretraining. As modality coverage broadens, context windows grow, and encoder and LLM scales diverge, a single LLM-centric TP/CP/PP/DP/EP layout increasingly limits throughput. This coupling forces encoders to inherit LLM-driven sharding and placement choices that can add communication, limit encoder parallelism, or constrain the LLM schedule; the mismatch is most pronounced at long contexts, where LLM context parallelism is needed for the fused multimodal sequence but encoder inputs remain bounded. We present heterogeneous parallelism for multimodal large language model training, an abstraction that lets modules in one end-to-end graph use independent layouts and rank placements, supporting colocated execution on shared GPUs and non-colocated execution on disjoint rank sets. The key challenge is preserving boundary tensor semantics across independent layouts: forward activations must be materialized for the destination layout, while backward gradients must be routed back to the source layout. We address this with boundary communicators that implement forward and backward layout transforms, plus scheduling extensions for both placement modes. We evaluate optimized homogeneous, colocated heterogeneous, and non-colocated heterogeneous configurations across multimodal workloads and GPU scales to characterize when added layout and placement freedom exposes a better operating point. Across this sweep, colocated heterogeneity improves TFLOPS/GPU by up to 49.3%, while non-colocated heterogeneity improves aggregate token throughput by up to 13.0% and TFLOPS/GPU by up to 9.6%. We validate loss convergence parity against homogeneous baselines and release the system as an open-source Megatron-LM extension.
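The two placement modes can be made concrete with a small configuration sketch. It assumes a module's rank count is the product TP × CP × PP × DP and leaves EP out entirely. It further assumes that colocated modules each span the full rank set and that non-colocated modules occupy disjoint subsets. The `Layout` type and both check functions are hypothetical, and the example degrees are illustrative only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Layout:
    tp: int = 1  # tensor parallel degree
    cp: int = 1  # context parallel degree
    pp: int = 1  # pipeline parallel degree
    dp: int = 1  # data parallel degree

    @property
    def ranks(self) -> int:
        # Assumption: EP is not modeled in this sketch.
        return self.tp * self.cp * self.pp * self.dp


def colocated_ok(encoder: Layout, llm: Layout, world_size: int) -> bool:
    """Both modules run on the same GPUs, each with its own layout."""
    return encoder.ranks == world_size and llm.ranks == world_size


def non_colocated_ok(encoder: Layout, llm: Layout, world_size: int) -> bool:
    """Modules run on disjoint rank sets that together fit the cluster."""
    return encoder.ranks + llm.ranks <= world_size


if __name__ == "__main__":
    # Illustrative only: encoder uses pure data parallelism, LLM adds TP.
    print(colocated_ok(Layout(dp=32), Layout(tp=2, dp=16), world_size=32))
    print(non_colocated_ok(Layout(dp=8), Layout(cp=2, dp=12), world_size=32))
```

The point of the sketch is that the encoder and the LLM no longer have to agree on any single degree, only on how the ranks are shared or split.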
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces heterogeneous parallelism for multimodal LLM training, allowing independent TP/CP/PP/DP/EP layouts and rank placements for different modules (e.g., encoders vs. LLM) within one end-to-end graph. It addresses cross-layout tensor semantics via boundary communicators that materialize forward activations and route backward gradients, plus scheduling extensions for colocated (shared-GPU) and non-colocated (disjoint ranks) modes. Empirical evaluation across workloads reports colocated heterogeneity improving TFLOPS/GPU by up to 49.3%, non-colocated improving aggregate token throughput by up to 13.0% and TFLOPS/GPU by up to 9.6%, with loss convergence parity to homogeneous baselines; the system is released as an open-source Megatron-LM extension.
Significance. If the reported gains hold after accounting for boundary communicator costs, the work could meaningfully improve efficiency in multimodal pretraining and post-training by decoupling modality-specific parallelism choices, especially at long contexts. The open-source release is a clear strength supporting reproducibility.
major comments (2)
- [Boundary communicators design] The central TFLOPS/GPU claims (49.3% colocated, 9.6% non-colocated) rest on the assumption that layout transforms add only negligible communication and synchronization overhead. No quantitative breakdown of added all-to-all or point-to-point volume, nor scaling with context length or layout divergence (e.g., CP active on LLM but not encoder), is provided to bound this cost.
- [Evaluation] The peak improvements are stated without accompanying details on the exact model sizes, GPU counts, workload characteristics, number of runs, variance, or conditions achieving the 49.3%, 13.0%, and 9.6% figures, nor any error analysis, which prevents full assessment of robustness against the homogeneous baselines.
minor comments (1)
- [Abstract] The abstract refers to results 'across this sweep' without enumerating the range of configurations or workloads tested.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive review. The two major comments highlight areas where additional analysis and experimental details would strengthen the presentation. We address each point below and will revise the manuscript accordingly.
- Referee: [Boundary communicators design] The central TFLOPS/GPU claims (49.3% colocated, 9.6% non-colocated) rest on the assumption that layout transforms add only negligible communication and synchronization overhead. No quantitative breakdown of added all-to-all or point-to-point volume, nor scaling with context length or layout divergence (e.g., CP active on LLM but not encoder), is provided to bound this cost.
  Authors: We agree that an explicit quantitative breakdown of boundary communicator overhead would improve the paper. The current manuscript reports net end-to-end TFLOPS/GPU and throughput gains after all communication (including boundary transforms), which implicitly shows that the added cost is more than offset by the parallelism benefits. However, we did not include a dedicated micro-benchmark isolating all-to-all/point-to-point volume or its scaling with context length and layout divergence. In the revision we will add a new subsection (and associated figure) that measures this overhead across the evaluated workloads, including cases with and without CP on the encoder.
  Revision: yes
- Referee: [Evaluation] The peak improvements are stated without accompanying details on the exact model sizes, GPU counts, workload characteristics, number of runs, variance, or conditions achieving the 49.3%, 13.0%, and 9.6% figures, nor any error analysis, which prevents full assessment of robustness against the homogeneous baselines.
  Authors: We acknowledge that the main text could more explicitly enumerate the precise configurations and statistical details behind the reported peaks. The evaluation section and appendix already contain the model sizes, GPU counts, and workload descriptions, and all experiments were run with multiple seeds; however, variance and error bars are not shown in the primary figures. We will revise the evaluation section to list the exact conditions for each peak number, state the number of runs (three), and add error bars or standard-deviation shading to the relevant plots.
  Revision: yes
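The promised error reporting over three runs could look like the sketch below. It uses only the Python standard library. The one-standard-deviation separation rule is a crude illustrative criterion rather than a formal statistical test, the function names are hypothetical, and the sample values are made up.

```python
from statistics import mean, stdev
from typing import Sequence, Tuple


def summarize(samples: Sequence[float]) -> Tuple[float, float]:
    """Return (mean, sample standard deviation) for repeated runs."""
    if len(samples) < 2:
        raise ValueError("need at least two runs to estimate spread")
    return mean(samples), stdev(samples)


def gain_is_outside_noise(baseline: Sequence[float],
                          candidate: Sequence[float]) -> bool:
    """Crude check: candidate's lower band clears baseline's upper band."""
    b_mean, b_std = summarize(baseline)
    c_mean, c_std = summarize(candidate)
    return (c_mean - c_std) > (b_mean + b_std)


if __name__ == "__main__":
    homogeneous = [100.0, 101.0, 99.0]      # made-up TFLOPS/GPU, three seeds
    heterogeneous = [110.0, 112.0, 111.0]   # made-up TFLOPS/GPU, three seeds
    print(summarize(homogeneous), gain_is_outside_noise(homogeneous, heterogeneous))
```

With only three runs the standard deviation is itself noisy, so these bands are best read as a sanity check rather than a confidence interval.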
Circularity Check
No circularity detected; results are direct empirical measurements
The paper is a systems/engineering contribution that introduces boundary communicators and scheduling extensions for heterogeneous parallelism, then reports measured TFLOPS/GPU and throughput deltas from implemented colocated and non-colocated configurations. No mathematical derivation chain, fitted parameters presented as predictions, or load-bearing self-citations exist in the provided text. All performance claims are framed as outcomes of direct execution on multimodal workloads rather than reductions to inputs by construction. The loss-convergence parity check is an external validation step, not a self-referential fit.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Tensor semantics must be preserved across independent parallelism layouts via explicit forward and backward transforms.
invented entities (1)
- boundary communicators (no independent evidence)
Reference graph
Works this paper leans on
- [1] W.-L. Chiang, Z. Li, Z. Lin, Y. Sheng, Z. Wu, H. Zhang, L. Zheng, S. Zhuang, Y. Zhuang, J. E. Gonzalez, I. Stoica, and E. P. Xing. Vicuna: An open-source chatbot impressing GPT-4 with 90%* ChatGPT quality. https://lmsys.org/blog/2023-03-30-vicuna/, March 2023.
- [2] Cornstarch Project. Using Cornstarch 5d parallelism. https://cornstarch-org.github.io/parallelization/cornstarch_parallel/, 2025. Accessed: 2026-05-06.
- [3] W. Feng, Y. Chen, S. Wang, Y. Peng, H. Lin, and M. Yu. Optimus: Accelerating large-scale multi-modal LLM training by bubble exploitation. In 2025 USENIX Annual Technical Conference (USENIX ATC 25). USENIX Association, 2025. URL https://www.usenix.org/conference/atc25/presentation/feng
- [4] J. Huang, Z. Zhang, S. Zheng, F. Qin, and Y. Wang. DISTMM: Accelerating distributed multimodal model training. In 21st USENIX Symposium on Networked Systems Design and Implementation (NSDI 24), pages 1157–1171. USENIX Association, 2024. URL https://www.usenix.org/conference/nsdi24/presentation/huang
- [6] I. Jang, R. Lu, N. Bansal, A. Chen, and M. Chowdhury. Efficient distributed MLLM training with Cornstarch. arXiv preprint arXiv:2503.11367, 2025.
- [8] C. Jiang, Z. Cai, Y. Tian, Z. Jia, Y. Wang, and C. Wu. DCP: Addressing input dynamism in long-context training via dynamic context parallelism. In ACM SIGOPS 31st Symposium on Operating Systems Principles (SOSP 25), 2025. doi: 10.1145/3731569.3764849
- [10] D. Narayanan et al. Efficient large-scale language model training on GPU clusters using Megatron-LM. arXiv preprint arXiv:2104.04473, 2021.
- [11] Y. Niu, H. Xiao, D. Liu, W. Zhou, and J. Li. DHP: Efficient scaling of MLLM training with dynamic hybrid parallelism. arXiv preprint arXiv:2602.21788, 2026.
- [12] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, and I. Sutskever. Learning transferable visual models from natural language supervision. In Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pages 8748–8763. PM... (2021)
- [13] S. Rajbhandari, J. Rasley, O. Ruwase, and Y. He. ZeRO: Memory optimizations toward training trillion parameter models. arXiv preprint arXiv:1910.02054, 2020.
- [14] M. Shoeybi et al. Megatron-LM: Training multi-billion parameter language models using model parallelism. arXiv preprint arXiv:1909.08053, 2019.
- [15] S. Smith et al. Using DeepSpeed and Megatron to train Megatron-Turing NLG 530B, a large-scale generative language model. arXiv preprint arXiv:2201.11990, 2022.
- [17] Y. Wang, S. Zhu, F. Fu, X. Miao, J. Zhang, J. Zhu, F. Hong, Y. Li, and B. Cui. Efficient multi-task large model training via data heterogeneity-aware model management. Proceedings of the VLDB Endowment, 18(1), 2025.
- [19] Z. Xue, H. Hu, X. Chen, Y. Jiang, Y. Song, Z. Mi, Y. Zhu, D. Jiang, Y. Xia, and H. Chen. PipeWeaver: Addressing data dynamicity in large multimodal model training with dynamic interleaved pipeline. In Proceedings of the 31st ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2026. doi: 10.1145/37...
- [21] J. Zhao, Q. Lu, W. Jia, B. Wan, L. Zuo, J. Feng, J. Jiang, Y. Chen, S. Cao, J. He, K. Jiang, Y. Hu, S. Nong, Y. Peng, H. Lin, and C. Wu. MegaScale-Data: Scaling dataloader for multisource large foundation model training. arXiv preprint arXiv:2504.09844, 2025.
- [22] Y. Zhao et al. PyTorch FSDP: Experiences on scaling fully sharded data parallel. arXiv preprint arXiv:2304.11277, 2023.
A. Appendix
This appendix reports the layouts, peak memory, step times, tokens/s gains, and TFLOPS/GPU gains supporting Section 4. All parallelism columns use TP/CP/...
Each colocated bridge group is therefore a 2-rank all-gather, not a global DP32 or DP16 collective. This small-group fan-in explains why colocated forward bridge time is low. NC forward is larger because it includes both cross-island activation transfer and LLM-side activation fanout: the encoder island sends activations to the LLM receiver ranks, and tho...