
arxiv: 2605.27918 · v1 · pith:STM76AJ3new · submitted 2026-05-27 · 💻 cs.DC

Addressing Variable Heterogeneity in Distributed Multimodal Training with Entrain

Pith reviewed 2026-06-29 10:17 UTC · model grok-4.3

classification 💻 cs.DC
keywords multimodal LLM training · distributed training · model parallelism · load balancing · heterogeneous workloads · microbatch scheduling · variability reduction

The pith

A single static model-parallel configuration suffices for optimal load balancing in multimodal LLM training when profiling shifts to macroscopic batches.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Entrain addresses heterogeneity in multimodal LLM datasets by challenging the need for dynamic model parallelism. It shifts profiling from micro-level samples to macroscopic batches and proves that one static configuration achieves optimal balance. A hierarchical microbatch assignment then defers excess workload to stabilize variability within iterations. The reported result is up to 10.6× lower workload variability across microbatches and up to 1.40× higher end-to-end throughput than existing baselines. The insight matters for efficient distributed training of models that handle varied data modalities, because it avoids complex dynamic adjustments.

Core claim

The paper establishes that under a macroscopic batch profiling paradigm, a single static model-parallel configuration is sufficient for optimal load balancing despite data variability and sample-level entanglement in multimodal datasets. It further introduces a hierarchical microbatch assignment algorithm that defers excess workload within each iteration to reduce variability across microbatches.
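The deferral idea can be illustrated with a small greedy sketch. It is not the paper's algorithm: the function name `pairwise_deferral`, the pairing rule (heaviest overloaded microbatch with lightest underloaded one), and the move criterion are all assumptions made for illustration. Only LLM workloads are moved, so encoder work stays where it was assigned.

```python
def pairwise_deferral(llm_loads: list[list[float]]) -> tuple[list[int], list[float]]:
    """Greedy sketch: defer LLM work from overloaded to underloaded microbatches.

    llm_loads[i] holds the per-sample LLM workload of microbatch i.
    Returns the interleaved execution order and each microbatch's LLM load
    after deferral, indexed by the original microbatch id.
    """
    if not llm_loads:
        raise ValueError("need at least one microbatch")
    totals = [sum(samples) for samples in llm_loads]
    mean = sum(totals) / len(totals)
    by_load = sorted(range(len(totals)), key=lambda i: totals[i], reverse=True)
    over = [i for i in by_load if totals[i] > mean]             # heaviest first
    under = [i for i in reversed(by_load) if totals[i] <= mean]  # lightest first

    balanced = list(totals)
    sequence: list[int] = []
    for o, u in zip(over, under):
        # Try the largest samples first; move one only if that lowers the
        # maximum load of the pair.
        for w in sorted(llm_loads[o], reverse=True):
            if balanced[u] + w < balanced[o]:
                balanced[o] -= w
                balanced[u] += w
        # The underloaded partner runs immediately after the overloaded one.
        sequence += [o, u]

    # Microbatches without a partner keep their load and run last.
    paired = set(sequence)
    sequence += [i for i in range(len(totals)) if i not in paired]
    return sequence, balanced


order, loads = pairwise_deferral([[5.0, 4.0, 3.0], [1.0, 1.0], [6.0, 2.0], [2.0, 1.0]])
print(order, loads)
```

In this toy input the heaviest microbatch drops from 12 to 7 units of LLM work while the total stays at 25. A production scheduler would also have to account for pipeline dependencies and activation memory, which this sketch ignores.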

What carries the argument

The shift to macroscopic batch-level profiling that enables static model parallelism, paired with the hierarchical microbatch assignment algorithm.
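As a rough illustration of how a batch-level workload ratio could translate into a fixed GPU split, the sketch below uses proportional allocation with rounding. The function name `allocate_gpus`, the 8-GPU budget, and the clamp that keeps at least one GPU per side are assumptions, not details taken from the paper.

```python
import math


def allocate_gpus(encoder_work: float, llm_work: float, total_gpus: int) -> tuple[int, int]:
    """Split total_gpus between encoder and LLM in proportion to their workload."""
    if total_gpus < 2:
        raise ValueError("need at least one GPU for each side")
    if encoder_work <= 0 or llm_work <= 0:
        raise ValueError("workloads must be positive")
    share = encoder_work / (encoder_work + llm_work)
    # Round half up, then clamp so that each side keeps at least one GPU.
    encoder_gpus = math.floor(share * total_gpus + 0.5)
    encoder_gpus = max(1, min(total_gpus - 1, encoder_gpus))
    return encoder_gpus, total_gpus - encoder_gpus


# Hypothetical vision-to-text workload ratios observed on different batches.
for ratio in (2.3, 2.43, 2.6, 3.0):
    print(ratio, allocate_gpus(ratio, 1.0, 8))
```

With 8 GPUs, all four ratios round to the same 6:2 split. That is one way to see why modest batch-to-batch drift in the ratio need not change the configuration.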

If this is right

  • A single fixed model-parallel setup can replace dynamic reconfiguration for load balancing.
  • Workload variability across microbatches is reduced by up to 10.6×.
  • End-to-end training throughput improves by up to 1.40× over baselines.
  • Both inter-modality and batch-level variability are addressed without dynamic changes.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This approach may simplify system design by avoiding runtime reconfiguration overhead in other variable workloads.
  • Future systems could test if batch-level profiling generalizes to non-multimodal heterogeneous training.
  • Hardware resource allocation might become more predictable with static configurations.

Load-bearing premise

Batch-level variability can be profiled accurately enough that sample-level entanglement does not require changing the model-parallel configuration dynamically.
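A quick simulation can show the kind of behavior this premise relies on. It is a minimal sketch under strong assumptions: per-sample workloads are synthetic lognormal draws, the two modalities are sampled independently, and the name `batch_ratio_spread` is made up for this example. It does not use the paper's datasets.

```python
import random
import statistics


def batch_ratio_spread(num_batches: int, batch_size: int, seed: int = 0) -> float:
    """Standard deviation of the vision/text workload ratio across batches."""
    rng = random.Random(seed)
    ratios = []
    for _ in range(num_batches):
        # Hypothetical per-sample workloads, drawn independently per modality.
        vision = sum(rng.lognormvariate(0.0, 1.0) for _ in range(batch_size))
        text = sum(rng.lognormvariate(0.0, 0.5) for _ in range(batch_size))
        ratios.append(vision / text)
    return statistics.pstdev(ratios)


for n in (1, 4, 16, 64, 256):
    print(f"N={n:>3}  spread={batch_ratio_spread(200, n):.3f}")
```

Under these assumptions the spread of the batch-level ratio shrinks as the batch size grows. If vision and text workloads within a sample were strongly correlated or heavy-tailed, the rate of shrinkage could differ, which is exactly the case the premise leaves open.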

What would settle it

Measure workload variability under the static configuration on datasets where sample-level effects dominate batch averages, and check whether load balancing fails.

Figures

Figures reproduced from arXiv: 2605.27918 by Insu Jang, Mosharaf Chowdhury.

Figure 1
Figure 1: Multimodal LLM architecture.
We have implemented Entrain on top of PyTorch and Cornstarch [16]. We evaluate Entrain on vision-language models (based on Qwen2.5Vision with Llama3-1b and 3b models) across four multimodal datasets with distinct distributions. Compared to DistTrain [52] and DIP [48], Entrain reduces workload variability across microbatches by up to 10.6×, improving end-to-end training through…
Figure 2
Figure 2: Visualization of different pipeline parallel schedules using 8 microbatches for a vision language model (VLM).
Figure 3
Figure 3: Distributions of number of vision and text tokens in various datasets. They are independently varying.
This pipeline design requires strict temporal and spatial balance: execution times must remain consistent across consecutive microbatches, and all stages must have roughly equal execution times for a given microbatch. Violations in either dimension immediately produce pipeline bubbles and stragglers. Fig…
Figure 4
Figure 4: Workload ratio of vision encoder (Qwen2Vision) and LLM (Llama3-1B) across 100 samples in datasets.
Figure 5
Figure 5: Workload ratio variability with different sample sizes in LLaVA-150K dataset. The mean ratio of vision-to-text workload of the entire dataset is 2.43. With larger sample size $N$, the ratio between batches becomes more stable and converges to the dataset mean.
Workload heterogeneity refers to the systematic difference in computational characteristics across modalities. Modality-specific encoders and the LLM …
Figure 7
Figure 7: Entrain design overview.
  • Macroscopic profiling-based parallelization (§4): Entrain profiles large batch to derive a parallel configuration that provides optimal load balance across modalities throughout training.
  • Microscopic workload balancing via deferral (§5): Inspired by the producer-consumer model, Entrain decouples microbatch partitioning across modalities and balances workload by deferring exce…
Figure 8
Figure 8: The hierarchical microbatch assignment algorithm. The number of boxes represents the amount of workload of each sample, where green boxes are for encoder workload and orange boxes are for LLM workload.
… analytical iteration time of pipeline schedule S [54]: …
Figure 9
Figure 9: A visualization of pairwise deferral optimization with microbatches in Figure 8b.
5.2 Pairwise Deferral Optimization. After stratified sample assignment, Entrain balances LLM execution time across microbatches by deferring selected sample’s LLM computation from overloaded microbatches to underloaded ones, leaving the encoder schedule intact. After deferral, microbatches are paired and reordered so that each…
Figure 10
Figure 10: Visualization of pipeline parallelism schedule of microbatches in Figure 8c with and without backward dependency optimization.
The algorithm returns $T^*$ and the complete matching $P$. Microbatches are then arranged into an interleaved execution sequence $(ol_0, ul_0, ol_1, ul_1, \ldots)$ per $P$ (line 15), with matched pairs transferring selected samples’ LLM workload from the overloaded to the immediately followi…
Figure 11
Figure 11: End-to-end training performance of Entrain and the baselines.
Figure 12
Figure 12: Pipeline schedule visualization of Entrain and the baselines on SynthChartNet dataset on Qwen2.5Vision+Llama3-3b VLM.
… non-deferral pipeline schedules. However, increased memory consumption is minuscule because pairwise schedule only holds encoder activations for up to a single microbatch interval, and activation memory is very small compared to the total memory consumption. Rather, more balanced workloa…
Figure 14
Figure 14: Sensitivity analysis on SynthChartNet dataset. Ratios are encoder-to-LLM workload ratios.
Figure 15
Figure 15: Variability of modality forward time across microbatches in SynthChartNet dataset on VLMs. Each line represents a DP replica.
… the sum of forward time of all corresponding pipeline stages. DistTrain’s heuristic microbatch reordering algorithm focuses on mitigating pipeline bubbles from the holistic view, hence it fails to address the data variability between microbatches and shows high variability in b…
Figure 16
Figure 16: Comparison of 3-stage pipeline parallel schedule without and with pairwise deferral optimization.
From the binomial argument in Section B.1, the estimator at batch size $n$ already lands in this region with high probability: $\Pr(\bar{w}_n \in V(C_{ref}) \mid C_{ref}) \ge 1 - p_{error}$ (11). Because proportional allocation maps continuous workload ratios to integer GPU counts via rounding, each decision region $V(C_{ref})$ …
Figure 17
Figure 17: SynthChartNet on Qwen2.5Vision+Llama3-1b VLM.
… the follow-up microbatches, resulting in a more balanced pipeline schedule. Pairwise deferral only happens in LLM, thus the encoder schedule remains unchanged.
Figure 18
Figure 18: LLaVA-150k on Qwen2.5Vision+Llama3-1b VLM.
Figure 22
Figure 22: ChartQA on Qwen2.5Vision+Llama3-1b VLM.
Figure 23
Figure 23: CocoQA on Qwen2.5Vision+Llama3-1b VLM.
Figure 26
Figure 26: Sensitivity analysis of the profiling batch size on CocoQA dataset.
Figure 25
Figure 25: Sensitivity analysis of the profiling batch size on ChartQA dataset.
Figure 28
Figure 28: Variability of modality forward time across microbatches in ChartQA dataset.
Original abstract

Multimodal LLM datasets are inherently heterogeneous, with significant data variability. Although each modality exhibits independent variability, sample-level entanglement makes it difficult to balance workloads across both modalities and batches. We present Entrain, a distributed MLLM training framework that addresses both heterogeneity and variability in multimodal training workloads. Entrain challenges the intuition that dynamic data variability requires dynamic model parallelism by shifting the profiling paradigm from micro-level samples to macroscopic batches. We prove that a single, static model-parallel configuration suffices for optimal load balancing under this paradigm. At the microscopic scale, Entrain introduces a hierarchical microbatch assignment algorithm that defers excess workload within each iteration to stabilize variability across microbatches. Evaluations show that Entrain reduces workload variability across microbatches by up to 10.6$\times$, improving end-to-end training throughput by up to 1.40$\times$ over existing baselines.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper presents Entrain, a distributed training framework for multimodal LLMs that addresses data heterogeneity and variability across modalities and batches. It shifts profiling from micro-level samples to macroscopic batches and proves that a single static model-parallel configuration suffices for optimal load balancing. At the micro scale, it uses a hierarchical microbatch assignment algorithm to defer excess workload and stabilize variability within iterations. Evaluations claim up to 10.6× reduction in workload variability across microbatches and up to 1.40× end-to-end throughput improvement over baselines.

Significance. If the proof holds and the batch-level profiling assumption is valid despite sample entanglement, the result would be significant for distributed systems in multimodal training: it offers a simpler alternative to dynamic model parallelism, with the provided proof and quantitative gains as explicit strengths. This could reduce reconfiguration overhead in large-scale heterogeneous workloads.

major comments (1)
  1. [theoretical analysis section] The central proof that a single static model-parallel configuration suffices (theoretical analysis section) rests on the modeling choice that batch-level aggregates capture variability without sample-level entanglement dominating. No explicit bounds, assumptions, or conditions on entanglement effects at batch scale are provided to support that the static choice remains optimal across batches with high intra-batch variance, which directly undermines the optimality claim.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for highlighting the need for clearer assumptions in the theoretical analysis. We address the concern below and will revise the manuscript accordingly.

Point-by-point responses
  1. Referee: [theoretical analysis section] The central proof that a single static model-parallel configuration suffices (theoretical analysis section) rests on the modeling choice that batch-level aggregates capture variability without sample-level entanglement dominating. No explicit bounds, assumptions, or conditions on entanglement effects at batch scale are provided to support that the static choice remains optimal across batches with high intra-batch variance, which directly undermines the optimality claim.

    Authors: The proof models workload at the batch level, where aggregate statistics across modalities allow a single static model-parallel partition to minimize expected imbalance; the hierarchical microbatch assignment then defers excess work within an iteration to bound realized variance. Sample entanglement is acknowledged in the manuscript but treated as a second-order effect once batch aggregates are fixed. We agree that the current text lacks explicit bounds or conditions on when entanglement could invalidate the static optimum. In revision we will add a dedicated subsection stating the modeling assumptions (e.g., bounded per-sample variance relative to batch size, Lipschitz continuity of modality workloads) and deriving a simple concentration bound showing that the static configuration remains within a constant factor of optimal with high probability when batch size exceeds a threshold derived from the variance model. revision: yes
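    For orientation, one standard form such a concentration bound could take is Hoeffding's inequality. This is an illustrative sketch, not the bound the authors commit to: it assumes i.i.d. per-sample workloads $w_1, \dots, w_n$ bounded in $[a, b]$ with mean $\mu$, and the symbols $a$, $b$, $\varepsilon$, and $\delta$ are introduced here only for the example.

```latex
\Pr\left( \left| \bar{w}_n - \mu \right| \ge \varepsilon \right)
  \le 2 \exp\left( -\frac{2 n \varepsilon^{2}}{(b - a)^{2}} \right),
\qquad \bar{w}_n = \frac{1}{n} \sum_{i=1}^{n} w_i .

% Requiring the right-hand side to be at most \delta gives a batch-size threshold:
n \ge \frac{(b - a)^{2}}{2 \varepsilon^{2}} \ln\frac{2}{\delta}
\quad \Longrightarrow \quad
\Pr\left( \left| \bar{w}_n - \mu \right| < \varepsilon \right) \ge 1 - \delta .
```

    Under these assumptions the required batch size grows with the square of the workload range and only logarithmically with the target confidence. Correlated samples or unbounded workloads would need a different argument.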

Circularity Check

0 steps flagged

No significant circularity detected

Full rationale

The paper claims a proof that shifting profiling to macroscopic batches makes a single static model-parallel configuration optimal for load balancing, with a hierarchical microbatch assignment algorithm to stabilize variability. No equations, definitions, or steps in the abstract or claims reduce this optimality result to fitted parameters, self-definitions, or self-citation chains by construction. The result is presented as an independent mathematical argument based on the paradigm shift, not as a renaming or prediction forced by inputs. The derivation is self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no explicit free parameters, axioms, or invented entities; the framework is described at the level of high-level design choices without detailing underlying mathematical assumptions or fitted constants.



Reference graph

Works this paper leans on

57 extracted references · 32 canonical work pages · 13 internal anchors

  1. [1]

    Inclusion AI and Ant Group. 2025. Ming-Omni: A Unified Multimodal Model for Perception and Generation. arXiv:2506.09344 [cs.AI] https://arxiv.org/abs/2506.09344

  2. [2]

    Meta AI. 2024. The Llama 3 Herd of Models. https://arxiv.org/abs/2407.21783

  3. [3]

    Meta AI. 2025. The Llama 4 herd: The beginning of a new era of natively multimodal AI innovation. https://ai.meta.com/blog/llama-4-multimodal-intelligence/. [Accessed Feb 08, 2026]

  4. [4]

    Yushi Bai, Xin Lv, Jiajie Zhang, Yuze He, Ji Qi, Lei Hou, Jie Tang, Yuxiao Dong, and Juanzi Li. 2024. LongAlign: A Recipe for Long Context Alignment of Large Language Models. In EMNLP 24. https://aclanthology.org/2024.findings-emnlp.74/

  5. [5]

    Brian Chmiel, Maxim Fishman, Ron Banner, and Daniel Soudry. 2025. FP4 All the Way: Fully Quantized Training of Large Language Models. In NeurIPS 25. https://neurips.cc/virtual/2025/loc/san-diego/poster/116331

  6. [6]

    Weiwei Chu, Xinfeng Xie, Jiecao Yu, Jie Wang, Amar Phanishayee, Chunqiang Tang, Yuchen Hao, Jianyu Huang, Mustafa Ozdal, Jun Wang, Vedanuj Goswami, Naman Goyal, Abhishek Kadian, Andrew Gu, Chris Cai, Feng Tian, Xiaodong Wang, Min Si, Pavan Balaji, Ching-Hsiang Chu, and Jongsoo Park. 2025. Scaling Llama 3 Training with Efficient Parallelism Strategies. ...

  7. [7]

    Matteo Croci, Massimiliano Fasi, Nicholas J. Higham, Theo Mary, and Mantas Mikaitis. [n. d.]. Stochastic Rounding: implementation, error analysis and applications. Royal Society Open Science ([n. d.]). arXiv:https://royalsocietypublishing.org/rsos/article-pdf/doi/10.1098/rsos.211631/998073/rsos.211631.pdf doi:10.1098/rsos.211631

  8. [8]

    Weiqi Feng, Yangrui Chen, Shaoyu Wang, Yanghua Peng, Haibin Lin, and Minlan Yu. 2025. Optimus: Accelerating Large-Scale Multi-Modal LLM Training by Bubble Exploitation. In ATC 25

  9. [9]

    Hao Ge, Junda Feng, Qi Huang, Fangcheng Fu, Xiaonan Nie, Lei Zuo, Haibin Lin, Bin Cui, and Xin Liu. 2025. ByteScale: Communication-Efficient Scaling of LLM Training with a 2048K Context Length on 16384 GPUs. In SIGCOMM 25. https://doi.org/10.1145/3718958.3754352

  10. [10]

    Hao Ge, Fangcheng Fu, Haoyang Li, Xuanyu Wang, Sheng Lin, Yujie Wang, Xiaonan Nie, Hailin Zhang, Xupeng Miao, and Bin Cui. 2024. Enabling Parallelism Hot Switching for Efficient Training of Large Language Models. In SOSP 24. doi:10.1145/3694715.3695969

  11. [11]

    Google. 2023. Gemini: A Family of Highly Capable Multimodal Models. arXiv:2312.11805 [cs.CL] https://arxiv.org/abs/2312.11805

  12. [12]

    Ronald Lewis Graham. 1969. Bounds on Multiprocessing Timing Anomalies. SIAM J. Appl. Math. 17, 2 (1969), 416–429. https://doi.org/10.1137/0117039

  13. [13]

    Diandian Gu, Peng Sun, Qinghao Hu, Ting Huang, Xun Chen, Yingtong Xiong, Guoteng Wang, Qiaoling Chen, Shangchun Zhao, Jiarui Fang, Yonggang Wen, Tianwei Zhang, Xin Jin, and Xuanzhe Liu. 2024. LoongTrain: Efficient Training of Long-Sequence LLMs with Head-Context Parallelism. https://arxiv.org/abs/2406.18485

  14. [14]

    Training Compute-Optimal Large Language Models

    Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, Tom Hennigan, Eric Noland, Katie Millican, George van den Driessche, Bogdan Damoc, Aurelia Guy, Simon Osindero, Karen Simonyan, Erich Elsen, Jack W. Rae, Oriol Vinyals, and Laurent S...

  15. [15]

    Jun Huang, Zhen Zhang, Shuai Zheng, Feng Qin, and Yida Wang. 2024. DISTMM: Accelerating Distributed Multimodal Model Training. In NSDI 24. https://www.usenix.org/conference/nsdi24/presentation/huang

  16. [16]

    Insu Jang, Runyu Lu, Nikhil Bansal, Ang Chen, and Mosharaf Chowdhury. 2025. Efficient Distributed MLLM Training with Cornstarch. arXiv:2503.11367 [cs.DC] https://arxiv.org/abs/2503.11367

  17. [17]

    Insu Jang, Zhenning Yang, Zhen Zhang, Xin Jin, and Mosharaf Chowdhury. 2023. Oobleck: Resilient Distributed Training of Large Models Using Pipeline Templates. In SOSP 23. doi:10.1145/3600006.3613152

  18. [18]

    Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. 2020. Scaling Laws for Neural Language Models. arXiv:2001.08361 [cs.LG] https://arxiv.org/abs/2001.08361

  19. [19]

    Matej Kosec, Mario Michael Krell, Sergio P. Perez, and Andrew Fitzgibbon. 2022. Efficient Sequence Packing without Cross-contamination: Accelerating Large Language Models without Impacting Performance. https://arxiv.org/abs/2107.02027

  20. [20]

    Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, and Chunyuan Li. 2025. LLaVA-OneVision: Easy Visual Task Transfer. TMLR 25 (2025). https://openreview.net/forum?id=zKv8qULV6n

  21. [21]

    Shenggui Li, Hongxin Liu, Zhengda Bian, Jiarui Fang, Haichen Huang, Yuliang Liu, Boxiang Wang, and Yang You. 2023. Colossal-AI: A Unified Deep Learning System For Large-Scale Parallel Training. In ICPP 23. doi:10.1145/3605573.3605613

  22. [22]

    Shenggui Li, Fuzhao Xue, Chaitanya Baranwal, Yongbin Li, and Yang You. 2023. Sequence Parallelism: Long Sequence Training from System Perspective. In ACL 23. doi:10.18653/v1/2023.acl-long.134

  23. [23]

    Wanchao Liang, Tianyu Liu, Less Wright, Will Constable, Andrew Gu, Chien-Chin Huang, Iris Zhang, Wei Feng, Howard Huang, Junjie Wang, Sanket Purandare, Gokul Nadathur, and Stratos Idreos. 2025. TorchTitan: One-stop PyTorch native solution for production ready LLM pre-training. In ICLR 25. https://openreview.net/forum?id=SFN6Wm7YBI

  24. [24]

    Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. 2024. Improved Baselines with Visual Instruction Tuning. In CVPR 24. 26296–26306

  25. [25]

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. 2023. Visual Instruction Tuning. In NIPS 23. https://proceedings.neurips.cc/paper_files/paper/2023/file/6dcf277ea32ce3288914faf369fe6de0-Paper-Conference.pdf

  26. [26]

    Zuyan Liu, Yuhao Dong, Jiahui Wang, Ziwei Liu, Winston Hu, Jiwen Lu, and Yongming Rao. 2025. Ola: Pushing the Frontiers of Omni-Modal Language Model. arXiv:2502.04328 [cs.CV] https://arxiv.org/abs/2502.04328

  27. [27]

    Haoyu Lu, Wen Liu, Bo Zhang, Bingxuan Wang, Kai Dong, Bo Liu, Jingxiang Sun, Tongzheng Ren, Zhuoshu Li, Hao Yang, Yaofeng Sun, Chengqi Deng, Hanwei Xu, Zhenda Xie, and Chong Ruan. 2024. DeepSeek-VL: Towards Real-World Vision-Language Understanding. arXiv:2403.05525 [cs.AI] https://arxiv.org/abs/2403.05525

  28. [28]

    Ahmed Masry, Do Xuan Long, Jia Qing Tan, Shafiq Joty, and Enamul Hoque. 2022. ChartQA: A Benchmark for Question Answering about Charts with Visual and Logical Reasoning. In ACL 22. https://aclanthology.org/2022.findings-acl.177/

  29. [29]

    Deepak Narayanan, Aaron Harlap, Amar Phanishayee, Vivek Seshadri, Nikhil R. Devanur, Gregory R. Ganger, Phillip B. Gibbons, and Matei Zaharia. 2019. PipeDream: generalized pipeline parallelism for DNN training. In SOSP 19. doi:10.1145/3341301.3359646

  30. [30]

    Deepak Narayanan, Mohammad Shoeybi, Jared Casper, Patrick LeGresley, Mostofa Patwary, Vijay Korthikanti, Dmitri Vainbrand, Prethvi Kashinkunti, Julie Bernauer, Bryan Catanzaro, Amar Phanishayee, and Matei Zaharia. 2021. Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM. In SC 21. doi:10.1145/3458817.3476209

  31. [31]

    Ahmed Nassar, Matteo Omenetti, Maksym Lysak, Nikolaos Livathinos, Christoph Auer, Lucas Morin, Rafael Teixeira de Lima, Yusik Kim, A. Said Gurbuz, Michele Dolfi, and Peter W. J. Staar. 2025. SmolDocling: An ultra-compact vision-language model for end-to-end multi-modal document conversion. In ICCV 25. 21972–21983

  32. [32]

    OpenAI. 2024. GPT-4o System Card. arXiv:2410.21276 [cs.CL] https://arxiv.org/abs/2410.21276

  33. [33]

    OpenAI. 2026. OpenAI GPT-5 System Card. arXiv:2601.03267 [cs.CL] https://arxiv.org/abs/2601.03267

  34. [34]

    Houwen Peng, Kan Wu, Yixuan Wei, Guoshuai Zhao, Yuxiang Yang, Ze Liu, Yifan Xiong, Ziyue Yang, Bolin Ni, Jingcheng Hu, Ruihang Li, Miaosen Zhang, Chen Li, Jia Ning, Ruizhe Wang, Zheng Zhang, Shuguang Liu, Joe Chau, Han Hu, and Peng Cheng. 2023. FP8-LM: Training FP8 Large Language Models. arXiv:2310.18313 [cs.LG] https://arxiv.org/abs/2310.18313

  35. [35]

    Penghui Qi, Xinyi Wan, Guangxing Huang, and Min Lin. 2024. Zero Bubble (Almost) Pipeline Parallelism. In ICLR 24. https://openreview.net/forum?id=tuzTN0eIO5

  36. [36]

    Jeff Rasley, Samyam Rajbhandari, Olatunji Ruwase, and Yuxiong He. 2020. DeepSpeed: System Optimizations Enable Training Deep Learning Models with Over 100 Billion Parameters. In KDD 20. doi:10.1145/3394486.3406703

  37. [37]

    Mengye Ren, Ryan Kiros, and Richard Zemel. 2015. Exploring Models and Data for Image Question Answering. In NIPS 15. https://proceedings.neurips.cc/paper_files/paper/2015/file/831c2f88a604a07ca94314b56a4921b8-Paper.pdf

  38. [38]

    Xiao Sun, Jungwook Choi, Chia-Yu Chen, Naigang Wang, Swagath Venkataramani, Vijayalakshmi (Viji) Srinivasan, Xiaodong Cui, Wei Zhang, and Kailash Gopalakrishnan. 2019. Hybrid 8-bit Floating Point (HFP8) Training and Inference for Deep Neural Networks. In NeurIPS 19. https://proceedings.neurips.cc/paper_files/paper/2019/file/65fc9fb4897a89789352e211ca2d398...

  39. [39]

    Colin Unger, Zhihao Jia, Wei Wu, Sina Lin, Mandeep Baines, Carlos Efrain Quintero Narvaez, Vinay Ramakrishnaiah, Nirmal Prajapati, Pat McCormick, Jamaludin Mohd-Yusof, Xi Luo, Dheevatsa Mudigere, Jongsoo Park, Misha Smelyanskiy, and Alex Aiken. 2022. Unity: Accelerating DNN Training Through Joint Optimization of Algebraic Transformations and Paralleliza...

  40. [40]

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is All you Need. In NIPS 17

  41. [41]

    Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Yang Fan, Kai Dang, Mengfei Du, Xuancheng Ren, Rui Men, Dayiheng Liu, Chang Zhou, Jingren Zhou, and Junyang Lin. 2024. Qwen2-VL: Enhancing Vision-Language Model’s Perception of the World at Any Resolution. arXiv:2409.12191 [cs.CV] https://a...

  42. [42]

    Yujie Wang, Shiju Wang, Shenhan Zhu, Fangcheng Fu, Xinyi Liu, Xuefeng Xiao, Huixia Li, Jiashi Li, Faming Wu, and Bin Cui. 2025. FlexSP: Accelerating Large Language Model Training via Flexible Sequence Parallelism. In ASPLOS 25. doi:10.1145/3676641.3715998

  43. [43]

    Zheng Wang, Anna Cai, Xinfeng Xie, Zaifeng Pan, Yue Guan, Weiwei Chu, Jie Wang, Shikai Li, Jianyu Huang, Chris Cai, Yuchen Hao, and Yufei Ding. 2025. WLB-LLM: workload-balanced 4D parallelism for large language model training. In OSDI 25. https://www.usenix.org/conference/osdi25/presentation/wang-zheng

  44. [44]

    Luis Wiedmann, Orr Zohar, Amir Mahla, Xiaohan Wang, Rui Li, Thibaud Frere, Leandro von Werra, Aritra Roy Gosthipaty, and Andrés Marafioti. 2025. FineVision: Open Data Is All You Need. https://arxiv.org/abs/2510.17269

  45. [45]

    Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Remi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander Rush. 2020. Transformers: State-of-the-Art Natural Language Processing. In EMNLP 20.

  46. [46]

    Zhiyu Wu, Xiaokang Chen, Zizheng Pan, Xingchao Liu, Wen Liu, Damai Dai, Huazuo Gao, Yiyang Ma, Chengyue Wu, Bingxuan Wang, Zhenda Xie, Yu Wu, Kai Hu, Jiawei Wang, Yaofeng Sun, Yukun Li, Yishi Piao, Kang Guan, Aixin Liu, Xin Xie, Yuxiang You, Kai Dong, Xingkai Yu, Haowei Zhang, Liang Zhao, Yisong Wang, and Chong Ruan. 2024. DeepSeek-VL2: Mixture-of-Experts...

  47. [47]

    Jin Xu, Zhifang Guo, Hangrui Hu, Yunfei Chu, Xiong Wang, Jinzheng He, Yuxuan Wang, Xian Shi, Ting He, Xinfa Zhu, Yuanjun Lv, Yongqi Wang, Dake Guo, He Wang, Linhan Ma, Pei Zhang, Xinyu Zhang, Hongkun Hao, Zishan Guo, Baosong Yang, Bin Zhang, Ziyang Ma, Xipin Wei, Shuai Bai, Keqin Chen, Xuejing Liu, Peng Wang, Mingkun Yang, Dayiheng Liu, Xingzhang Ren, Bo ...

  48. [48]

    Zhenliang Xue, Hanpeng Hu, Xing Chen, Yimin Jiang, Yixin Song, Zeyu Mi, Yibo Zhu, Daxin Jiang, Yubin Xia, and Haibo Chen. 2026. DIP: Efficient Large Multimodal Model Training with Dynamic Interleaved Pipeline. In ASPLOS 26. doi:10.1145/3779212.3790154

  49. [49]

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jing Zhou, Jingren Zhou, Junyang Lin, Kai Dang, Keqin Bao, Kexin Yang, ...

  50. [50]

    Yongqiang Yao, Jingru Tan, Kaihuan Liang, Feizhao Zhang, Jiahao Hu, Shuo Wu, Yazhe Niu, Ruihao Gong, Dahua Lin, and Ningyi Xu. 2025. Hierarchical Balance Packing: Towards Efficient Supervised Fine-tuning for Long-Context LLM. In NeurIPS 25. https://openreview.net/forum?id=6BHDre6WQW

  51. [51]

    Jinbin Zhang, Nasib Ullah, Erik Schultheis, and Rohit Babbar. 2025. ELMO : Efficiency via Low-precision and Peak Memory Optimization in Large Output Spaces. In ICML 25. https://openreview.net/forum?id=d6CTIPrTTC

  52. [52]

    Zili Zhang, Yinmin Zhong, Yimin Jiang, Hanpeng Hu, Jianjian Sun, Zheng Ge, Yibo Zhu, Daxin Jiang, and Xin Jin. 2025. DistTrain: Addressing Model and Data Heterogeneity with Disaggregated Training for Multimodal Large Language Models. In SIGCOMM 25. https://doi.org/10.1145/3718958.3750472

  53. [53]

    Siyan Zhao, Daniel Mingyi Israel, Guy Van den Broeck, and Aditya Grover. 2025. Prepacking: A Simple Method for Fast Prefilling and Increased Throughput in Large Language Models. In AISTATS 25. https://openreview.net/forum?id=LBVD4krAq2

  54. [54]

    Lianmin Zheng, Zhuohan Li, Hao Zhang, Yonghao Zhuang, Zhifeng Chen, Yanping Huang, Yida Wang, Yuanzhong Xu, Danyang Zhuo, Eric P. Xing, Joseph E. Gonzalez, and Ion Stoica. 2022. Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning. In OSDI 22. https://www.usenix.org/conference/osdi22/presentation/zheng-lianmin