Mosaic: Towards Efficient Training of Multimodal Models with Spatial Resource Multiplexing

Anbang Wu; Chen Chen; Chunyu Xue; Qizhen Weng; Quan Chen; Yanbo Wang; Yin Chen; Yu Feng; Yuxuan Wang

arxiv: 2605.18710 · v1 · pith:MBWEQDKAnew · submitted 2026-05-18 · 💻 cs.DC

Yanbo Wang, Yuxuan Wang, Chen Chen, Chunyu Xue, Yu Feng, Anbang Wu, Quan Chen, Yin Chen, Qizhen Weng

Pith reviewed 2026-05-20 07:49 UTC · model grok-4.3

classification 💻 cs.DC

keywords multimodal models · training efficiency · GPU resource sharing · spatial multiplexing · performance modeling · heuristic allocation · distributed training

The pith

Multimodal models train faster when multiple modules share each GPU under controlled resource shares.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to show that giving an entire GPU to one module at a time leaves most of the hardware idle because individual modules rarely saturate all resources. It proposes instead to run several modules together on the same GPU while assigning each a fixed share of compute and memory, which raises overall utilization and shortens total training time. Apollo realizes the idea through an execution engine that enforces arbitrary quotas, a performance model that predicts runtimes for different quota combinations, and heuristics that select good sharing plans without exhaustive search. Testbed results show speedups reaching 1.31 times on standard multimodal architectures. A reader cares because multimodal models keep growing in size and complexity, so any gain in hardware efficiency directly affects what can be trained in practice.

Core claim

Apollo deploys multimodal models with temporal-spatial multiplexing so that multiple modules colocate on a GPU under explicit resource quotas. A flexible execution engine supports arbitrary quotas, a performance model estimates execution time for each allocation, and heuristics use the model to produce high-quality deployment plans, yielding measured speedups up to 1.31x.

What carries the argument

The performance model that estimates module execution time under different resource allocation plans and feeds those estimates to heuristics that choose deployment plans.
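To make the model-then-heuristic pipeline concrete, here is a minimal sketch of how predicted runtimes could drive the choice of a quota split for two colocated modules. Everything in it is an assumption made for illustration: the names `predict_time` and `best_split`, the toy model in which runtime scales inversely with SM quota up to a saturation point, the 10% quota grid, and the workload numbers. The text available here does not describe the paper's actual model or heuristics, and this toy ignores cross-module interference entirely.

```python
def predict_time(work: float, saturation: float, quota: float) -> float:
    """Toy performance model (hypothetical): runtime shrinks as the SM
    quota grows, until the module saturates at `saturation`."""
    effective = min(quota, saturation)
    return work / effective


def best_split(modules, step=0.1):
    """Grid-search the quota split for two colocated modules.

    `modules` is a list of two (work, saturation) tuples. Returns
    (quota_a, quota_b, predicted_stage_time), where the stage time is the
    slower module's predicted time because both run concurrently.
    """
    (work_a, sat_a), (work_b, sat_b) = modules
    n = round(1 / step)
    best = None
    for i in range(1, n):  # each module gets at least one grid step
        qa = i * step
        qb = 1.0 - qa
        stage = max(predict_time(work_a, sat_a, qa),
                    predict_time(work_b, sat_b, qb))
        if best is None or stage < best[2]:
            best = (round(qa, 2), round(qb, 2), stage)
    return best


# Made-up workload: a heavy module that can use the whole GPU and a light
# module that stops benefiting beyond 30% of the SMs.
plan = best_split([(8.0, 1.0), (1.5, 0.3)])
print(plan)
```

Under this toy model, running the two modules back to back on the whole GPU would take 8.0 + 5.0 = 13.0 time units, while the selected 0.8/0.2 split is predicted to finish in 10.0. The gain comes entirely from the light module's inability to saturate the GPU, which is the utilization gap the paper targets.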

If this is right

Wall-clock training time for popular multimodal models drops on a fixed number of GPUs.
Average GPU utilization rises because colocated modules fill idle capacity that a single module leaves unused.
Deployment plans can be generated quickly without testing every possible quota combination.
Resource shares can be tuned separately for each module type to balance heterogeneous workloads.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same quota-based sharing could apply to inference serving where batch sizes and latency targets differ from training.
Fewer total GPUs might suffice for a given training workload, reducing both monetary cost and energy consumption.
Pairing the static performance model with runtime measurements could enable dynamic quota adjustments as training progresses.

Load-bearing premise

The performance model accurately predicts how execution time changes when resource shares are varied.

What would settle it

Running Apollo on a new multimodal model and finding that its chosen allocation plan produces no speedup or a slowdown compared with the standard one-module-per-GPU baseline.

Figures

Figures reproduced from arXiv: 2605.18710 by Anbang Wu, Chen Chen, Chunyu Xue, Qizhen Weng, Quan Chen, Yanbo Wang, Yin Chen, Yu Feng, Yuxuan Wang.

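**Figure 2.** Edge-grade MMs with high modal complexity.

Table rows captured with the caption (truncated at source):

| Model | Module | Layers | Dim. | TFLOPs | CI |
|---|---|---|---|---|---|
| Qwen3-VL (8.1B) | Qwen3 LLM | 36 | 4096 | 22.27 | 145.2 |
| | Vision | 27 | 4096 | 2.58 | 82.4 |
| | Text | 1 | 4096 | 0.15 | 2.1 |
| Unified-IO 2 (3.8B) | UIO-2 LLM | 48 | 3072 | 16.70 | 110.5 |
| | Vision | 11 | 768 | 1.48 | 24.6 |
| | Audio | 11 | 768 | 1.06 | 21.8 |
| | Text | 1 | 3072 | 0.10 | 4.5 |
| ImageBind (1.2B) | Vision | 24 | 1024 | 4.17 | 35.2 |
| | Audio | 12 | 768 | 2.09 | 22.8 |
| | Text | 12 | 768 | 1.04 | 20.5 |
| OFASys (6.3B) | OFASys LLM | 36 | 1280 | 4.… | |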

**Figure 3.** Behaviors of different MM deployment schemes when training the CLIP model on four GPUs. V, T, and A denote vision-encoder, text-encoder and alignment modules.

Adjacent paper text (truncated at source): … of typical MMs, including their Compute Intensity (CI). Table 1 suggests that the per-module compute intensity can vary by over an order of magnitude across different modules, exhibiting strong cross-module heterogeneity. For example, for the Qwen3-…

**Figure 6.** Green Context is lightweight in both memory and time overhead.

Adjacent paper text (truncated at source): … for each module with the desired SM quota. Nonetheless, creating or destroying such a GC-stream would incur non-negligible overheads in the critical path (e.g., for reclaiming the stream objects as well as the associated GC states). Our later testbed evaluations (Fig. 11b) show that, when training the ImageBind model on 8 × H100, it takes up to 6…

**Figure 7.** MM modules’ scaling curves are smooth with respect to DP Degree and SM Ratio.

Adjacent paper text (truncated at source): … collocated onto the same GPU. These challenges render existing solutions inadequate and here we respectively address them. Towards comprehensive modeling with symmetry-based solution pruning and smoothness-based grid sampling. To overcome the first challenge, Mosaic builds a scaling surface through symmetry-based pruning and smo…
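The smoothness claim in this caption is what makes sparse profiling plausible: if runtime varies smoothly with DP degree and SM ratio, a coarse grid of measurements can be interpolated instead of profiling every configuration. The sketch below shows plain bilinear interpolation over such a grid. It is an assumed stand-in, not the paper's sampling scheme, and the function name `bilinear`, the grid points, and the timings are all invented for illustration.

```python
from bisect import bisect_right


def bilinear(xs, ys, grid, x, y):
    """Interpolate grid[i][j] (measured at xs[i], ys[j]) at point (x, y).

    xs and ys must be sorted ascending, and (x, y) must lie inside the grid.
    """
    if not (xs[0] <= x <= xs[-1] and ys[0] <= y <= ys[-1]):
        raise ValueError("query point lies outside the sampled grid")
    # Index of the lower corner of the grid cell that contains (x, y).
    i = min(bisect_right(xs, x) - 1, len(xs) - 2)
    j = min(bisect_right(ys, y) - 1, len(ys) - 2)
    tx = (x - xs[i]) / (xs[i + 1] - xs[i])
    ty = (y - ys[j]) / (ys[j + 1] - ys[j])
    low = grid[i][j] * (1 - ty) + grid[i][j + 1] * ty
    high = grid[i + 1][j] * (1 - ty) + grid[i + 1][j + 1] * ty
    return low * (1 - tx) + high * tx


# Hypothetical profiled runtimes in milliseconds (rows: DP degree,
# columns: SM ratio). These numbers are made up.
dp_degrees = [1, 2, 4]
sm_ratios = [0.25, 0.5, 1.0]
times_ms = [
    [80.0, 44.0, 26.0],  # DP = 1
    [42.0, 24.0, 15.0],  # DP = 2
    [23.0, 14.0, 10.0],  # DP = 4
]

# Estimate an unprofiled point: DP degree 2 with 75% of the SMs.
estimate = bilinear(dp_degrees, sm_ratios, times_ms, 2, 0.75)
print(estimate)
```

Linear interpolation is only trustworthy where the surface really is smooth; a kink between grid points (for example, where a module stops scaling) would call for a finer grid in that region.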

**Figure 8.** Memory bandwidth contention does degrade performance, and simple linear modeling is insufficient. We train two modules (text-encoder and audio-encoder) from the OFASys model on an H100 GPU.

Adjacent paper text (truncated at source): … GPU set $\mathcal{G}$ to minimize stage latency. The resulting iteration time is $T_{\text{iteration}}(\mathcal{S}, \mathcal{G}) = \sum_{S_i \in \mathcal{S}} T_{\text{stage}}(S_i, \mathcal{G})$. This decision has two coupled levels. The upper level decides which independent modules should be group…
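The iteration-time expression in the excerpt above is a plain sum of per-stage latencies over the chosen set of stages. A minimal sketch of that objective follows, comparing two candidate groupings of modules into stages. The stage latencies are invented placeholders; in the paper they would come from the performance model for a given GPU set, which is not reproduced here.

```python
def iteration_time(stages, stage_latency):
    """Iteration time as the sum of per-stage latencies."""
    return sum(stage_latency[stage] for stage in stages)


# Hypothetical stage latencies in milliseconds on a fixed GPU set.
# A stage is the set of modules that run together.
stage_latency_ms = {
    frozenset({"vision"}): 20.0,
    frozenset({"audio"}): 18.0,
    frozenset({"vision", "audio"}): 27.0,  # colocated, with some slowdown
    frozenset({"llm"}): 120.0,
}

separate = [frozenset({"vision"}), frozenset({"audio"}), frozenset({"llm"})]
grouped = [frozenset({"vision", "audio"}), frozenset({"llm"})]

t_separate = iteration_time(separate, stage_latency_ms)
t_grouped = iteration_time(grouped, stage_latency_ms)
print(t_separate, t_grouped)
```

With these made-up numbers, grouping the two encoders into one stage wins (147.0 ms against 158.0 ms) even though the colocated stage is slower than either encoder alone. That trade-off is why the grouping decision and the per-stage resource decision are coupled.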

**Figure 9.** Figure 9 depicts the average per-iteration time of each MM under different deployment methods. It confirms that Mosaic consistently achieves the best training efficiency, with 1.07×–1.31× speedup over Spindle (the second best), 1.10×–1.42× over DistMM, and 1.17×–1.48× over Megatron-LM. Such superiority aligns with our previous analysis in Sec. 2.2. Moreover, we also notice that the performance benefit of Mosaic …

**Figure 10.** GPU utilization when training different MMs.

Adjacent paper text (truncated at source): Mosaic does achieve the highest GPU utilization for all the models: on average, the GPU utilization under Mosaic is 47.0%, yet under Spindle, DistMM, and Megatron-LM, they are respectively 38.3%, 31.0%, and 26.8%. In particular, for the highly-heterogeneous OFASys model, the most underutilized GPU under Spindle, which hosts the IMU module, has a utilization of…
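The utilization averages quoted in this excerpt are easier to compare once they are expressed as ratios and percentage-point gaps. The short calculation below uses only the four figures stated in the excerpt; the variable names are ours.

```python
# Average GPU utilization (%) as quoted in the excerpt.
mosaic_util = 47.0
baseline_util = {"Spindle": 38.3, "DistMM": 31.0, "Megatron-LM": 26.8}

# Relative gain (ratio) and absolute gain (percentage points) per baseline.
relative_gain = {name: mosaic_util / util for name, util in baseline_util.items()}
point_gain = {name: mosaic_util - util for name, util in baseline_util.items()}

for name in baseline_util:
    print(f"{name}: {relative_gain[name]:.2f}x relative, "
          f"+{point_gain[name]:.1f} points")
```

This works out to roughly 1.23x over Spindle, 1.52x over DistMM, and 1.75x over Megatron-LM, or 8.7, 16.0, and 20.2 percentage points. Note that even the best average, 47.0%, leaves more than half of the measured capacity unused.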

**Figure 12.** Mosaic’s interference-aware performance model outperforms the other baseline methods in both prediction error and end-to-end performance.

Plot residue (axes and legend only): panel (a) "Search time ablation study", x-axis Number of Modules (2 to 20), y-axis Search Time (s) on a log scale, annotation "Timeout (> Module 6)", series Brute-force, GAHC, GAHC+caching, Mosaic mapping-solver; panel (b) x-axis Number of Modules (2 to 20), y-axis truncated at source ("Optimal Ra…").

**Figure 13.** Mosaic’s search algorithm finds near-optimal solution with high time efficiency.

Adjacent paper text (truncated at source): … the interference-unaware performance model. Fig. 12b reveals that the performance model in Eq. 8 achieves a training speedup of 11%–24%. Moreover, that speedup level increases when the OFASys model comprises more modules, because with more modules, the cross-module performance interference is more salient. In summary, our i…

**Figure 14.** Figure 14b reports the resulting search time and median optimality ratio compared to the optimal plan found by exhaustive enumeration. Coarse granularities substantially reduce search overhead but compromise the solution quality: at 30% and 20%, the search finishes in 5.32 s and 8.46 s, while [text interrupted by plot residue]

Plot residue (axes and legend only): panel (a) "Throughput under differen…" (truncated at source), x-axis Number of GPUs (8, 16, 32), y-axis Normalized Throughput, series Mosaic, Spindle, DistMM, Megatron.

Original abstract

With the wide adoption of Multimodal Models (MMs) in real-world scenarios, it is significant to efficiently train emerging MMs that exhibit increasingly complex module architectures. For MM deployment, existing works allocate a GPU to only one MM module in a temporal-multiplexing manner; this compromises training efficiency because a single module often fails to achieve high GPU utilization. To improve GPU utilization and enable efficient MM training, we propose deploying MMs in a temporal-spatial multiplexing manner, allowing multiple MM modules to colocate on a GPU with well-controlled resource quotas. In this paper, we propose Apollo, an efficient MM training system that applies temporal-spatial multiplexing. We first develop a flexible and lightweight execution engine that supports MM training with arbitrary resource quotas, and then build a comprehensive and accurate performance model to estimate module execution time under different allocation plans. With the performance model, we further adopt effective heuristics to derive high-quality MM deployment plans efficiently. Testbed experiments confirm that Apollo effectively improves the training efficiency of popular MMs, with a training speedup of up to 1.31x.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

Apollo (Mosaic) gives a practical way to colocate MM modules on GPUs with a performance model and heuristics, but the speedup rests on unvalidated predictions for new allocations.

The core takeaway is that this work builds a system to run multiple modules from multimodal models on one GPU at once, using controlled resource shares instead of strict time-slicing, and reports up to 1.31x faster training on testbeds for popular models. The execution engine handles arbitrary quotas, a performance model estimates runtimes for different plans, and heuristics pick the deployments. That combination is the actual new piece here, applied specifically to the complex module structures in current MMs. It addresses a clear utilization gap that temporal-only approaches leave on the table, and the testbed results show measurable gains without needing new hardware. Credit for shipping something that runs and produces numbers on real workloads rather than stopping at simulation.

The soft spot is the performance model itself. The abstract calls it comprehensive and accurate, yet gives no numbers on prediction error, how it was fit, or checks against colocation cases outside its development set. If those estimates drift for unseen quotas, the heuristics can pick plans that underperform the baseline, which undercuts the main claim. The stress-test note on this point lands because the provided text does not show independent validation. Baselines and workload details are also thin in what is visible.

This is for systems people who train or deploy large multimodal models and want to stretch existing GPUs further. A reader working on resource schedulers or training efficiency would pick up usable ideas from the engine and heuristics. It is worth sending to peer review because the problem is timely, the empirical direction is concrete, and the gaps are fixable with more model diagnostics and clearer experimental controls rather than fatal to the contribution.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes Apollo, a training system for multimodal models (MMs) that shifts from temporal multiplexing (one module per GPU) to temporal-spatial multiplexing, allowing multiple modules to colocate on a GPU under controlled resource quotas. It introduces a flexible execution engine supporting arbitrary quotas, a performance model to predict module execution time for different allocation plans, heuristics that use the model to select deployment plans, and testbed results claiming up to 1.31x training speedup on popular MMs.

Significance. If the performance model is shown to be accurate for unseen colocation scenarios and the speedups are reproducible with proper controls, the work could meaningfully improve GPU utilization for training complex, module-heavy multimodal models. The practical systems contribution of a lightweight engine plus heuristic planning is relevant to the distributed systems and ML systems communities.

major comments (2)

[Performance Model] Performance model (as described after the execution engine): the manuscript states the model is 'comprehensive and accurate' yet supplies no information on whether it is analytical or learned, which features or counters it uses, the fitting or training procedure, or measured prediction error (e.g., MAPE) on colocation workloads not seen during development. Because the heuristics directly consume the model's estimates to choose plans, systematic mis-prediction for novel allocations can produce deployments whose real runtime exceeds the temporal-multiplexing baseline, directly undermining the central 1.31x speedup claim.
[Evaluation] Experimental results (abstract and evaluation section): the reported speedups lack any description of the exact baselines, the set of MMs and workloads chosen, whether error bars or multiple runs are reported, or how post-hoc tuning of quotas or heuristics was prevented. Without these controls the empirical support for the efficiency claim remains weak.
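The first major comment asks for a measured prediction error such as MAPE on colocation workloads held out from model development. For reference, a minimal sketch of that metric is given below. The function name `mape` and the sample timings are illustrative assumptions; none of the numbers come from the paper.

```python
def mape(predicted, measured):
    """Mean absolute percentage error, in percent.

    Each error is normalized by the measured value, so measured values
    must be non-zero.
    """
    if len(predicted) != len(measured) or not measured:
        raise ValueError("need two equally sized, non-empty sequences")
    if any(m == 0 for m in measured):
        raise ValueError("measured values must be non-zero")
    total = sum(abs(p - m) / abs(m) for p, m in zip(predicted, measured))
    return 100.0 * total / len(measured)


# Hypothetical held-out colocation cases: model predictions against
# measured per-iteration times in milliseconds (made-up values).
held_out_measured_ms = [40.0, 25.0, 80.0, 50.0]
held_out_predicted_ms = [44.0, 24.0, 72.0, 50.0]

error = mape(held_out_predicted_ms, held_out_measured_ms)
print(f"MAPE on held-out cases: {error:.1f}%")
```

A single aggregate can hide the cases that matter most for plan selection, so reporting the worst-case error next to the mean would address the referee's concern more directly.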

minor comments (2)

Clarify whether the system name 'Apollo' and the paper title 'Mosaic' refer to the same artifact or whether one is a prior name.
[Evaluation] Add a short table or paragraph listing the specific multimodal models, dataset sizes, and GPU types used in the testbed.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments highlight important areas for improving clarity around the performance model and evaluation details. We address each major comment below and will revise the manuscript to incorporate the requested information.

Referee: [Performance Model] Performance model (as described after the execution engine): the manuscript states the model is 'comprehensive and accurate' yet supplies no information on whether it is analytical or learned, which features or counters it uses, the fitting or training procedure, or measured prediction error (e.g., MAPE) on colocation workloads not seen during development. Because the heuristics directly consume the model's estimates to choose plans, systematic mis-prediction for novel allocations can produce deployments whose real runtime exceeds the temporal-multiplexing baseline, directly undermining the central 1.31x speedup claim.

Authors: We agree that the manuscript does not currently supply the requested details on the performance model. In the revised version, we will add a new subsection that specifies the model's construction (analytical or learned), the exact features and hardware counters employed, the profiling and fitting procedure, and quantitative accuracy results including MAPE measured on colocation scenarios held out from model development. This addition will allow readers to evaluate the model's suitability for guiding the heuristics and will directly address concerns about potential mis-prediction affecting the reported speedups. revision: yes
Referee: [Evaluation] Experimental results (abstract and evaluation section): the reported speedups lack any description of the exact baselines, the set of MMs and workloads chosen, whether error bars or multiple runs are reported, or how post-hoc tuning of quotas or heuristics was prevented. Without these controls the empirical support for the efficiency claim remains weak.

Authors: We concur that the evaluation section requires additional methodological details to strengthen reproducibility and support for the efficiency claims. In the revision, we will expand the evaluation to explicitly define the baselines (including the temporal-multiplexing configuration), enumerate the specific multimodal models and workloads used, report results across multiple independent runs with error bars, and describe the experimental protocol employed to prevent post-hoc tuning of quotas or heuristics. These changes will provide the necessary controls and transparency. revision: yes
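The promised reporting of multiple independent runs with error bars amounts to a mean and a spread per configuration. A minimal sketch using Python's standard `statistics` module is shown below. The run times are hypothetical placeholders, not measurements from the paper, and the helper name `summarize` is ours.

```python
from statistics import mean, stdev


def summarize(times_s):
    """Mean and sample standard deviation of per-iteration times (seconds)."""
    if len(times_s) < 2:
        raise ValueError("need at least two runs to estimate spread")
    return mean(times_s), stdev(times_s)


# Hypothetical per-iteration times from three independent runs each.
baseline_runs_s = [2.10, 2.06, 2.14]
candidate_runs_s = [1.75, 1.71, 1.79]

base_mean, base_std = summarize(baseline_runs_s)
cand_mean, cand_std = summarize(candidate_runs_s)
speedup = base_mean / cand_mean

print(f"baseline:  {base_mean:.2f} s ± {base_std:.2f}")
print(f"candidate: {cand_mean:.2f} s ± {cand_std:.2f}")
print(f"speedup of means: {speedup:.2f}x")
```

Reporting the spread alongside the ratio lets a reader judge whether a speedup near the low end of the reported range is distinguishable from run-to-run noise.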

Circularity Check

0 steps flagged

No significant circularity; claims rest on empirical testbed validation

The paper outlines a systems proposal: a flexible execution engine for arbitrary GPU quotas, a performance model estimating module execution times under allocation plans, heuristics to select deployment plans, and testbed experiments measuring up to 1.31x speedup on popular multimodal models. No equations, analytical derivations, fitted parameters renamed as predictions, or self-citations are described in the abstract or provided text that would reduce any result to its inputs by construction. The speedup claim is grounded in direct measurements rather than model-based predictions that could be tautological, rendering the derivation chain self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated. The approach implicitly assumes that resource quotas can be enforced without significant interference and that execution times are predictable from allocation plans.

pith-pipeline@v0.9.0 · 5741 in / 1136 out tokens · 36509 ms · 2026-05-20T07:49:13.735884+00:00 · methodology

Reference graph

Works this paper leans on

79 extracted references · 79 canonical work pages · 7 internal anchors

[1]

GPT-4 Technical Report

Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F. L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., et al. Gpt-4 technical report.arXiv preprint arXiv:2303.08774(2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023
[2]

arXiv, 2022

Bai, J., et al.Ofasys: A multi-modal multi-task learning system for building generalist models. arXiv, 2022. arXiv:2212.04408

work page arXiv 2022
[3]

Bai, S., Cai, Y., Chen, R., et al.Qwen3-vl technical report. arXiv,

work page
[4]

ACM Manag

B¨other, M., Robroek, T., Gsteiger, V., Holzinger, R., Ma, X., T¨oz¨un, P., and Klimovic, A.Modyn: Data-centric machine learning pipeline orchestration.Proc. ACM Manag. Data 3, 1 (Feb. 2025)

work page 2025
[5]

InProceedings of the IEEE/CVF International Conference on Computer Vision(2021), pp

Caron, M., Touvron, H., Misra, I., J´egou, H., Mairal, J., Bojanowski, P., and Joulin, A.Emerging properties in self-supervised vision transformers. InProceedings of the IEEE/CVF International Conference on Computer Vision(2021), pp. 9650–9660

work page 2021
[6]

Chen, W., Li, Z., and Xin, S.Omnivlm: A token-compressed, sub- billion-parameter vision-language model for efficient on-device infer- ence, 2024

work page 2024
[7]

Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling

Chen, Z., W ang, W., Cao, Y., et al.Expanding performance bound- aries of open-source multimodal models with model, data, and test- time scaling. arXiv, 2024. arXiv:2412.05271

work page internal anchor Pith review Pith/arXiv arXiv 2024
[8]

InInternational Conference on Learning Representations(2021)

Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., and Houlsby, N.An image is worth 16x16 words: Transformers for image recognition at scale. InInternational Conference on Learning Representations(2021)

work page 2021
[9]

R., and Smith, H.Applied Regression Analysis, 3 ed

Draper, N. R., and Smith, H.Applied Regression Analysis, 3 ed. John Wiley & Sons, 1998

work page 1998
[10]

Driess, D., Xia, F., Sajjadi, M. S. M., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., Huang, W., Chebotar, Y., Sermanet, P., Duckworth, D., Levine, S., V anhoucke, V., Hausman, K., Toussaint, M., Greff, K., Zeng, A., Mordatch, I., and Florence, P.PaLM-e: An embodied multimodal language model. InProceedings of the 40th In...

work page 2023
[11]

InProceedings of the 41st International Conference on Machine Learning(2024), vol

Duan, J., Lu, R., Duanmu, H., Li, X., Zhang, X., Lin, D., Stoica, I., and Zhang, H.MuxServe: Flexible spatial-temporal multiplexing for multiple LLM serving. InProceedings of the 41st International Conference on Machine Learning(2024), vol. 235 ofProceedings of Machine Learning Research, PMLR, pp. 11905–11917

work page 2024
[12]

In2025 USENIX Annual Technical Conference (USENIX ATC 25)(2025), pp

Feng, W., Chen, Y., W ang, S., Peng, Y., Lin, H., and Yu, M.Optimus: Accelerating {Large-Scale} {Multi-Modal} {LLM} training by bubble exploitation. In2025 USENIX Annual Technical Conference (USENIX ATC 25)(2025), pp. 161–177

work page 2025
[13]

InInternational Conference on Learning Representations (2024), B

Garg, S., Farajtabar, M., Pouransari, H., Vemulapalli, R., Mehta, S., Tuzel, O., Shankar, V., and Faghri, F.Tic-clip: Continual training of clip models. InInternational Conference on Learning Representations (2024), B. Kim, Y. Yue, S. Chaudhuri, K. Fragkiadaki, M. Khan, and Y. Sun, Eds., vol. 2024, pp. 16649–16684

work page 2024
[14]

InProceedings of the 42nd International Conference on Machine Learning(13–19 Jul 2025), A

Ge, C., W ang, X., Zhang, Z., Chen, H., Fan, J., Huang, L., Xue, H., and Zhu, W.Dynamic mixture of curriculum LoRA experts for continual multimodal instruction tuning. InProceedings of the 42nd International Conference on Machine Learning(13–19 Jul 2025), A. Singh, M. Fazel, D. Hsu, S. Lacoste-Julien, F. Berkenkamp, T. Maharaj, K. Wagstaff, and J. Zhu, Ed...

work page 2025
[15]

Gemini 2.5: Pushing the frontier with ad- vanced reasoning, multimodality, long context, and next generation agentic capabilities

Gemini Team, Google. Gemini 2.5: Pushing the frontier with ad- vanced reasoning, multimodality, long context, and next generation agentic capabilities. Technical report, 2025

work page 2025
[16]

V., Joulin, A., and Misra, I.Imagebind: One embedding space to bind them all

Girdhar, R., El-Nouby, A., Liu, Z., Singh, M., Alwala, K. V., Joulin, A., and Misra, I.Imagebind: One embedding space to bind them all. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition(2023). [17]Google DeepMind. Gemini 3.1 Pro model card. Model card, 2026

work page 2023
[17]

In16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22)(2022), USENIX Association, pp

Han, M., Zhang, H., Chen, R., and Chen, H.Microsecond-scale preemption for concurrent gpu-accelerated dnn inferences. In16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22)(2022), USENIX Association, pp. 539–558

work page 2022
[18]

InProceedings of the 2025 USENIX 13 Wang et al

He, Y., and et al.LLMStation: Resource multiplexing in tuning and serving large language models. InProceedings of the 2025 USENIX 13 Wang et al. Annual Technical Conference(2025), USENIX Association

work page 2025
[19]

M., and Porikli, F.Distilling multi-modal large language models for autonomous driving

Hegde, D., Y asarla, R., Cai, H., Han, S., Bhattacharyya, A., Maha- jan, S., Liu, L., Garrepalli, R., Patel, V. M., and Porikli, F.Distilling multi-modal large language models for autonomous driving. InPro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)(June 2025), pp. 27575–27585

work page 2025
[20]

In21st USENIX Symposium on Networked Systems Design and Implementation (NSDI 24)(2024), pp

Huang, J., Zhang, Z., Zheng, S., Qin, F., and W ang, Y.{DISTMM}: Accelerating distributed multimodal model training. In21st USENIX Symposium on Networked Systems Design and Implementation (NSDI 24)(2024), pp. 1157–1171

work page 2024
[21]

GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism

Huang, Y., Cheng, Y., Bapna, A., Firat, O., Chen, M., Chen, D., Lee, H., Ngiam, J., Le, Q. V., Wu, Y., and Chen, Z.Gpipe: Effi- cient training of giant neural networks using pipeline parallelism. InAdvances in Neural Information Processing Systems (NeurIPS)(2019). arXiv:1811.06965

work page internal anchor Pith review Pith/arXiv arXiv 2019
[22]

InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition(2024)

Huang, Z., He, Y., Yu, J., Zhang, F., Si, C., Jiang, Y., Zhang, Y., Wu, T., Jin, Q., Chanpaisit, N., Wang, Y., Chen, X., Wang, L., Lin, D., Qiao, Y., and Liu, Z.VBench: Comprehensive benchmark suite for video generative models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition(2024)

work page 2024
[23]

Jang, I., Lu, R., Bansal, N., Chen, A., and Chowdhury, M.Efficient distributed MLLM training with Cornstarch, 2025

work page 2025
[24]

R., Chen, T., and Jia, Z.Graphpipe: Improving performance and scalability of dnn training with graph pipeline parallelism

Jeon, B., Wu, M., Cao, S., Kim, S., Park, S., Aggarwal, N., Unger, C., Arfeen, D., Liao, P., Miao, X., Alizadeh, M., Ganger, G. R., Chen, T., and Jia, Z.Graphpipe: Improving performance and scalability of dnn training with graph pipeline parallelism. InProceedings of the 30th ACM International Conference on Architectural Support for Programming Languages ...

work page 2025
[25]

J., Pertsch, K., Karamcheti, S., Xiao, T., Balakrishna, A., Nair, S., Rafailov, R., Foster, E

Kim, M. J., Pertsch, K., Karamcheti, S., Xiao, T., Balakrishna, A., Nair, S., Rafailov, R., Foster, E. P., Sanketi, P. R., Vuong, Q., Kollar, T., Burchfiel, B., Tedrake, R., Sadigh, D., Levine, S., Liang, P., and Finn, C.OpenVLA: An open-source vision-language-action model. InProceedings of The 8th Conference on Robot Learning(2025), vol. 270 ofProceeding...

work page 2025
[26]

Efficient sequence packing without cross-contamination: Accelerating large language models without impacting performance.arXiv preprint arXiv:2107.02027, 2021

Krell, M. M., Kosec, M., Perez, S. P., and Fitzgibbon, A.Efficient sequence packing without cross-contamination: Accelerating large language models without impacting performance.arXiv preprint arXiv:2107.02027(2022)

work page arXiv 2022
[27]

Megrez-omni technical report, 2025

Li, B., Li, Y., Li, Z., Liu, C., Liu, W., Niu, G., Tan, Z., Xu, H., Yao, Z., Yuan, T., Zhou, D., Zhuang, Y., Yan, S., Dai, G., and Wang, Y. Megrez-omni technical report, 2025

work page 2025
[28]

Li, S., Xue, F., Baranwal, C., Li, Y., and You, Y.Sequence paral- lelism: Long sequence training from system perspective.arXiv preprint arXiv:2105.13120(2021)

work page arXiv 2021
[29]

Li, S., Zhao, Y., V arma, R., Salpekar, O., Noordhuis, P., Li, T., Paszke, A., Smith, J., V aughan, B., Damania, P., et al.Pytorch distributed: Experiences on accelerating data parallel training.arXiv preprint arXiv:2006.15704(2020)

work page internal anchor Pith review Pith/arXiv arXiv 2006
[30]

Journal of Manufacturing Systems 85(2026), 531–556

Liu, C., Qian, Y., Tang, D., Zhu, H., Pang, J., and Cai, Q.From insight to action: Embodied multi-agent system integrating vision language model for digital twin-assisted human-robot collaborative assembly. Journal of Manufacturing Systems 85(2026), 531–556

work page 2026
[31]

J.Visual instruction tuning.Ad- vances in neural information processing systems 36(2023), 34892–34916

Liu, H., Li, C., Wu, Q., and Lee, Y. J.Visual instruction tuning.Ad- vances in neural information processing systems 36(2023), 34892–34916

work page 2023
[32]

InThe Twelfth International Conference on Learning Representations(2024)

Liu, H., Zaharia, M., and Abbeel, P.Ring attention with blockwise transformers for near-infinite context. InThe Twelfth International Conference on Learning Representations(2024)

work page 2024
[33]

Liu, Z., Dong, Y., Wang, J., Liu, Z., Hu, W., Lu, J., and Rao, Y.Ola: Pushing the frontiers of omni-modal language model.arXiv preprint arXiv:2502.04328(2025)

work page arXiv 2025
[34]

InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2024)

Lu, J., Clark, C., Lee, S., Zhang, Z., Khosla, S., Marten, R., Hoiem, D., and Kembhavi, A.Unified-io 2: Scaling autoregressive multimodal models with vision, language, audio, and action. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2024)

work page 2024
[35]

InProceedings of the 32nd ACM International Conference on Multimedia(2024), pp

Lu, S., Guo, L., W ang, W., Zhao, Z., Yue, T., Liu, J., and Liu, S.Collab- orative training of tiny-large vision language models. InProceedings of the 32nd ACM International Conference on Multimedia(2024), pp. 4928– 4937

work page 2024
[36]

Edge ai software market worth 8 .89 billion by 2031

MarketsandMarkets. Edge ai software market worth 8 .89 billion by 2031. Web Page, 2026

work page 2031
[37]

Kimi K2.5: Visual agentic intelligence

Moonshot AI. Kimi K2.5: Visual agentic intelligence. Technical blog, 2026

work page 2026
[38]

R., Ganger, G

Narayanan, D., Harlap, A., Phanishayee, A., Seshadri, V., Devanur, N. R., Ganger, G. R., Gibbons, P. B., and Zaharia, M.Pipedream: Generalized pipeline parallelism for dnn training. InProceedings of the 27th ACM Symposium on Operating Systems Principles (SOSP)(2019)

work page 2019
[39]

Multi-instance gpu user guide

NVIDIA. Multi-instance gpu user guide. NVIDIA Documentation,

work page
[40]

Cuda c++ programming guide.https://docs.nvidia.com/cu da/cuda-c-programming-guide/, 2026

NVIDIA. Cuda c++ programming guide.https://docs.nvidia.com/cu da/cuda-c-programming-guide/, 2026

work page 2026
[41]

Cuda c++ programming guide: Green contexts

NVIDIA Corporation. Cuda c++ programming guide: Green contexts. https://docs.nvidia.com/cuda/cuda-programming-guide/04-special- topics/green-contexts.html, 2025. Accessed: 2026-02-04

work page 2025
[42]

Cuda multi-process service (mps) overview

NVIDIA Corporation. Cuda multi-process service (mps) overview. https://docs.nvidia.com/deploy/pdf/CUDA Multi Process Service Overview.pdf, 2025. Accessed: 2026-02-04

work page 2025
[43]

NVIDIA Corporation, 2026

NVIDIA Corporation.NVIDIA Nsight Systems User Guide: GPU Metrics. NVIDIA Corporation, 2026. Version 2026.2. [45]OpenAI. Introducing GPT-5.4. OpenAI product release, 2026

work page 2026
[44]

[47]Perron, L., and Furnon, V.Or-tools

Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., et al.Pytorch: An imperative style, high-performance deep learning library.Advances in neural information processing systems 32(2019). [47]Perron, L., and Furnon, V.Or-tools. [48]Portes, J., Trott, A., Havens, S., King, D., Venigalla, A...

work page 2019
[45]

Qwen3.5-397B-A17B model card

Qwen Team. Qwen3.5-397B-A17B model card. Hugging Face model card, 2026

work page 2026
[46]

Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., and Sutskever, I.Learning transferable visual models from natural lan- guage supervision. InProceedings of the 38th International Conference on Machine Learning(2021), M. Meila and T. Zhang, Eds., vol. 139 of Proceedings of ...

work page 2021
[47]

Rajbhandari, S., Li, C., Yao, Z., Zhang, M., Aminabadi, R. Y., Awan, A. A., Rasley, J., and He, Y. DeepSpeed-MoE: Advancing mixture-of-experts inference and training to power next-generation AI scale. In Proceedings of the 39th International Conference on Machine Learning (2022), vol. 162 of Proceedings of Machine Learning Research, PMLR, pp. 18332–18346.
[48]

Rao, Y., Zhao, W., Liu, B., Lu, J., Zhou, J., and Hsieh, C.-J. DynamicViT: Efficient vision transformers with dynamic token sparsification. In Advances in Neural Information Processing Systems (2021), vol. 34.
[49]

Shao, Z., Yu, Z., Yu, J., Ouyang, X., Zheng, L., Gai, Z., Wang, M., Kuang, Z., and Ding, J. Imp: Highly capable large multimodal models for mobile devices. IEEE Transactions on Multimedia 27 (2025), 2961–2974.
[50]

Shoeybi, M., Patwary, M., Puri, R., LeGresley, P., Casper, J., and Catanzaro, B. Megatron-LM: Training multi-billion parameter language models using model parallelism. arXiv preprint arXiv:1909.08053 (2019).
[51]

Shubha, S. S., Shen, H., and Iyer, A. USHER: Holistic interference avoidance for resource optimized ML inference. In 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24) (2024), USENIX Association, pp. 947–964.
[52]

Sima, C., Renz, K., Chitta, K., Chen, L., Zhang, H., Xie, C., Beisswenger, J., Luo, P., Geiger, A., and Li, H. DriveLM: Driving with graph visual question answering. In Computer Vision – ECCV 2024 (2024), Springer, pp. 256–274.
[53]

Siru, C., Yuanchao, S., Cong, W., and Jiming, C. A survey on edge multimodal large models: compression, inference acceleration, and applications. National Science Open
[54]

Strati, F., Ma, X., and Klimovic, A. Orion: Interference-aware, fine-grained GPU sharing for ML applications. In Proceedings of the Nineteenth European Conference on Computer Systems (2024), EuroSys ’24, Association for Computing Machinery, pp. 1075–1092.
[55]

Team, Q. Qwen3.5-Omni technical report. arXiv preprint arXiv:2604.15804 (2026).
[56]

Teed, Z., and Deng, J. RAFT: Recurrent all-pairs field transforms for optical flow. In Proceedings of the European Conference on Computer Vision (2020), pp. 402–419.
[57]

The Business Research Company. On-device multimodal AI market report 2026. Market report, 2026.
[58]

Tian, X., Gu, J., Li, B., Liu, Y., Wang, Y., Zhao, Z., Zhan, K., Jia, P., Lang, X., and Zhao, H. DriveVLM: The convergence of autonomous driving and large vision-language models. In Proceedings of The 8th Conference on Robot Learning (06–09 Nov 2025), P. Agrawal, O. Kroemer, and W. Burgard, Eds., vol. 270 of Proceedings of Machine Learning Research, PMLR, p...
[59]

Tschannen, M., et al. SigLIP 2: Multilingual vision-language encoders with improved semantic understanding, localization, and dense features. arXiv, 2025. arXiv:2502.14786.
[60]

Udandarao, V., Roth, K., Dziadzio, S., Prabhu, A., Cherti, M., Vinyals, O., Hénaff, O., Albanie, S., Akata, Z., and Bethge, M. A practitioner's guide to real-world continual multimodal pretraining. In Advances in Neural Information Processing Systems (2024), A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang, Eds., vol. 37,...
[61]

Um, T., Oh, B., Kang, M., Lee, W.-Y., Kim, G., Kim, D., Kim, Y., Muzzammil, M., and Jeon, M. Metis: Fast automatic distributed training on heterogeneous GPUs. In 2024 USENIX Annual Technical Conference (USENIX ATC 24) (Santa Clara, CA, July 2024), USENIX Association, pp. 563–578.
[62]

Wang, Y., Wang, Y., Chen, C., Xue, C., Weng, Q., Chen, Y., Li, Z., Zhu, X., Yang, Y., Chen, Q., et al. Suika: Efficient and high-quality re-scheduling of 3D-parallelized LLM training jobs in shared clusters. In Proceedings of the 21st European Conference on Computer Systems (2026), pp. 2002–2021.
[63]

Wang, Y., Zhu, S., Fu, F., Miao, X., Zhang, J., Zhu, J., Hong, F., Li, Y., and Cui, B. Spindle: Efficient distributed training of multi-task large models via wavefront scheduling. arXiv, 2024. arXiv:2409.03365.
[64]

Wen, Z., Gao, Y., Li, W., He, C., and Zhang, L. Token pruning in multimodal large language models: Are we solving the right problem? Findings of ACL 2025.
[66]

Wu, Y., Li, D., Chen, Y., Jiang, R., Zou, H. P., Huang, W.-C., Li, Y., Fang, L., Wang, Z., and Yu, P. S. Multi-agent autonomous driving systems with large language models: A survey of recent advances, resources, and future directions. Findings of the Association for Computational Linguistics: EMNLP 2025 (2025).
[67]

Xu, Z., Zhang, Y., Xie, E., Zhao, Z., Guo, Y., Wong, K.-Y. K., Li, Z., and Zhao, H. DriveGPT4: Interpretable end-to-end autonomous driving via large language model. IEEE Robotics and Automation Letters 9, 10 (2024), 8186–8193.
[68]

Xue, C., Chen, Y., Jiang, J., Zheng, N., Feng, J., Chen, J., Zhao, S., Yan, S., Lin, Y., Shi, L., et al. MegaScale-Omni: A hyper-scale, workload-resilient system for multimodal LLM training in production. In Proceedings of the 21st European Conference on Computer Systems (2026), pp. 675–692.
[69]

Xue, Z., Hu, H., Chen, X., Jiang, Y., Song, Y., Mi, Z., Zhu, Y., Jiang, D., Xia, Y., and Chen, H. PipeWeaver: Addressing data dynamicity in large multimodal model training with dynamic interleaved pipeline. arXiv, 2025. arXiv:2504.14145.
[70]

Yao, Y., Yu, T., Zhang, A., Wang, C., Cui, J., Zhu, H., Cai, T., Chen, C., Li, H., Zhao, W., He, Z., Chen, Q., Zhou, R., Zou, Z., Zhang, H., Hu, S., Zheng, Z., Zhou, J., Cai, J., Han, X., Zeng, G., Li, D., Liu, Z., and Sun, M. Efficient GPT-4V level multimodal large language model for deployment on edge devices. Nature Communications 16, 1 (July 2025), 5509.
[71]

Yao, Y., Yu, T., Zhang, A., Wang, C., Cui, J., Zhu, H., Cai, T., Chen, C., Li, H., Zhao, W., He, Z., Chen, Q., Zhou, R., Zou, Z., Zhang, H., Hu, S., Zheng, Z., Zhou, J., Cai, J., Han, X., Zeng, G., Li, D., Liu, Z., and Sun, M. Efficient GPT-4V level multimodal large language model for deployment on edge devices. Nature Communications 16, 5509 (2025).
[72]

Yu, P., and Chowdhury, M. Salus: Fine-grained GPU sharing primitives for deep learning applications. In Proceedings of Machine Learning and Systems (2020).
[73]

Yurtsever, E., Lambert, J., Carballo, A., and Takeda, K. A survey of autonomous driving: Common practices and emerging technologies. IEEE Access 8 (2020), 58443–58469.

Z.ai. GLM-4.6. Technical blog, 2025.
[74]

Zhang, D., Qi, S., Wu, Y., Xiao, X., Wang, X., and Chen, L. Fast-slow efficient training for multimodal large language models via visual token pruning, 2026.
[75]

Zhang, S., Xu, A., Chen, Q., Zhao, H., Cui, W., Wang, Z., Li, Y., Xiao, L., and Guo, M. Efficient performance-aware GPU sharing with compatibility and isolation through kernel space interception. In 2025 USENIX Annual Technical Conference (USENIX ATC 25) (2025), USENIX Association, pp. 1003–1019.
[76]

Zhao, H., Zhu, F., Guo, H., Wang, M., Wang, R., Meng, G., and Zhang, Z. MLLM-CL: Continual learning for multimodal large language models, 2025.
[77]

Zhen, H., Qiu, X., Chen, P., Yang, J., Yan, X., Du, Y., Hong, Y., and Gan, C. 3D-VLA: A 3D vision-language-action generative world model. In Proceedings of the 41st International Conference on Machine Learning (2024), vol. 235 of Proceedings of Machine Learning Research, PMLR, pp. 61229–61245.
[78]

Zheng, L., Li, Z., Zhang, H., Zhuang, Y., Chen, Z., Huang, Y., Wang, Y., Xu, Y., Zhuo, D., Xing, E. P., Gonzalez, J. E., and Stoica, I. Alpa: Automating inter- and intra-operator parallelism for distributed deep learning. In 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22) (Carlsbad, CA, July 2022), USENIX Association, pp. 559–578.
[79]

Zitkovich, B., Yu, T., Xu, S., Xu, P., Xiao, T., Xia, F., Wu, J., Wohlhart, P., Welker, S., Wahid, A., Vuong, Q., Vanhoucke, V., Tran, H., Soricut, R., Singh, A., Singh, J., Sermanet, P., Sanketi, P. R., Salazar, G., Ryoo, M. S., Reymann, K., Rao, K., Pertsch, K., Mordatch, I., Michalewski, H., Lu, Y., Levine, S., Lee, L., Lee, T.-W. E., Leal, I., Kua...

[41] [41]

Cuda c++ programming guide: Green contexts

NVIDIA Corporation. Cuda c++ programming guide: Green contexts. https://docs.nvidia.com/cuda/cuda-programming-guide/04-special- topics/green-contexts.html, 2025. Accessed: 2026-02-04

work page 2025

[42] [42]

Cuda multi-process service (mps) overview

NVIDIA Corporation. Cuda multi-process service (mps) overview. https://docs.nvidia.com/deploy/pdf/CUDA Multi Process Service Overview.pdf, 2025. Accessed: 2026-02-04

work page 2025

[43] [43]

NVIDIA Corporation, 2026

NVIDIA Corporation.NVIDIA Nsight Systems User Guide: GPU Metrics. NVIDIA Corporation, 2026. Version 2026.2. [45]OpenAI. Introducing GPT-5.4. OpenAI product release, 2026

work page 2026

[44] [44]

[47]Perron, L., and Furnon, V.Or-tools

Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., et al.Pytorch: An imperative style, high-performance deep learning library.Advances in neural information processing systems 32(2019). [47]Perron, L., and Furnon, V.Or-tools. [48]Portes, J., Trott, A., Havens, S., King, D., Venigalla, A...

work page 2019

[45] [45]

Qwen3.5-397B-A17B model card

Qwen Team. Qwen3.5-397B-A17B model card. Hugging Face model card, 2026

work page 2026

[46] [46]

Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., and Sutskever, I.Learning transferable visual models from natural lan- guage supervision. InProceedings of the 38th International Conference on Machine Learning(2021), M. Meila and T. Zhang, Eds., vol. 139 of Proceedings of ...

work page 2021

[47] [47]

Y., Awan, A

Rajbhandari, S., Li, C., Y ao, Z., Zhang, M., Aminabadi, R. Y., Awan, A. A., Rasley, J., and He, Y.DeepSpeed-MoE: Advancing mixture-of- experts inference and training to power next-generation AI scale. In Proceedings of the 39th International Conference on Machine Learning (2022), vol. 162 ofProceedings of Machine Learning Research, PMLR, pp. 18332–18346

work page 2022

[48] [48]

In Advances in Neural Information Processing Systems(2021), vol

Rao, Y., Zhao, W., Liu, B., Lu, J., Zhou, J., and Hsieh, C.-J.Dynamicvit: Efficient vision transformers with dynamic token sparsification. In Advances in Neural Information Processing Systems(2021), vol. 34

work page 2021

[49] [49]

Shao, Z., Yu, Z., Yu, J., Ouyang, X., Zheng, L., Gai, Z., Wang, M., Kuang, Z., and Ding, J.Imp: Highly capable large multimodal models for mobile devices.IEEE Transactions on Multimedia 27(2025), 2961– 2974

work page 2025

[50] [50]

Shoeybi, M., Patwary, M., Puri, R., LeGresley, P., Casper, J., and Catanzaro, B.Megatron-lm: Training multi-billion parameter lan- guage models using model parallelism.arXiv preprint arXiv:1909.08053 (2019)

work page internal anchor Pith review Pith/arXiv arXiv 1909

[51] [51]

S., Shen, H., and Iyer, A.USHER: Holistic interference avoidance for resource optimized ML inference

Shubha, S. S., Shen, H., and Iyer, A.USHER: Holistic interference avoidance for resource optimized ML inference. In18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24)(2024), USENIX Association, pp. 947–964. 14 Mosaic: Towards Efficient Training of Multimodal Models with Spatial Resource Multiplexing

work page 2024

[52] [52]

InComputer Vision – ECCV 2024 (2024), Springer, pp

Sima, C., Renz, K., Chitta, K., Chen, L., Zhang, H., Xie, C., Beiss- wenger, J., Luo, P., Geiger, A., and Li, H.DriveLM: Driving with graph visual question answering. InComputer Vision – ECCV 2024 (2024), Springer, pp. 256–274

work page 2024

[53] [53]

Siru, C., Yuanchao, S., Cong, W., and Jiming, C.A survey on edge multimodal large models: compression, inference acceleration, and applications.National Science Open

work page

[54] [54]

InProceedings of the Nineteenth European Conference on Computer Systems(2024), EuroSys ’24, Association for Computing Machinery, pp

Strati, F., Ma, X., and Klimovic, A.Orion: Interference-aware, fine-grained gpu sharing for ml applications. InProceedings of the Nineteenth European Conference on Computer Systems(2024), EuroSys ’24, Association for Computing Machinery, pp. 1075–1092

work page 2024

[55] [55]

Qwen3.5-Omni Technical Report

Team, Q.Qwen3. 5-omni technical report.arXiv preprint arXiv:2604.15804(2026)

work page internal anchor Pith review Pith/arXiv arXiv 2026

[56] [56]

InProceedings of the European Conference on Computer Vision(2020), pp

Teed, Z., and Deng, J.RAFT: Recurrent all-pairs field transforms for optical flow. InProceedings of the European Conference on Computer Vision(2020), pp. 402–419

work page 2020

[57] [57]

On-device multimodal ai market report 2026

The Business Research Company. On-device multimodal ai market report 2026. Market report, 2026

work page 2026

[58] [58]

InProceedings of The 8th Conference on Robot Learning(06–09 Nov 2025), P

Tian, X., Gu, J., Li, B., Liu, Y., Wang, Y., Zhao, Z., Zhan, K., Jia, P., Lang, X., and Zhao, H.Drivevlm: The convergence of autonomous driving and large vision-language models. InProceedings of The 8th Conference on Robot Learning(06–09 Nov 2025), P. Agrawal, O. Kroe- mer, and W. Burgard, Eds., vol. 270 ofProceedings of Machine Learning Research, PMLR, p...

work page 2025

[59] [59]

SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features

Tschannen, M., et al.Siglip 2: Multilingual vision-language encoders with improved semantic understanding. arXiv, 2025. arXiv:2502.14786

work page internal anchor Pith review Pith/arXiv arXiv 2025

[60] [60]

A practitioner's guide to real-world continual multimodal pretrain- ing

Udandarao, V., Roth, K., Dziadzio, S., Prabhu, A., Cherti, M., Vinyals, O., H´enaff, O., Albanie, S., Akata, Z., and Bethge, M. A practitioner's guide to real-world continual multimodal pretrain- ing. InAdvances in Neural Information Processing Systems(2024), A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang, Eds., vol. 37,...

work page 2024

[61] [61]

In2024 USENIX Annual Technical Conference (USENIX ATC 24)(Santa Clara, CA, July 2024), USENIX Association, pp

Um, T., Oh, B., Kang, M., Lee, W.-Y., Kim, G., Kim, D., Kim, Y., Muzza- mmil, M., and Jeon, M.Metis: Fast automatic distributed training on heterogeneous GPUs. In2024 USENIX Annual Technical Conference (USENIX ATC 24)(Santa Clara, CA, July 2024), USENIX Association, pp. 563–578

work page 2024

[62] [62]

InProceedings of the 21st European Conference on Computer Systems (2026), pp

Wang, Y., Wang, Y., Chen, C., Xue, C., Weng, Q., Chen, Y., Li, Z., Zhu, X., Y ang, Y., Chen, Q., et al.Suika: Efficient and high-quality re-scheduling of 3d-parallelized llm training jobs in shared clusters. InProceedings of the 21st European Conference on Computer Systems (2026), pp. 2002–2021

work page 2026

[63] [63]

arXiv, 2024

W ang, Y., Zhu, S., Fu, F., Miao, X., Zhang, J., Zhu, J., Hong, F., Li, Y., and Cui, B.Spindle: Efficient distributed training of multi-task large models via wavefront scheduling. arXiv, 2024. arXiv:2409.03365

work page arXiv 2024

[64] [64]

Wen, Z., Gao, Y., Li, W., He, C., and Zhang, L.Token pruning in multimodal large language models: Are we solving the right problem?,

work page

[65] [65]

Findings of ACL 2025

work page 2025

[66] [66]

P., Huang, W.-C., Li, Y., Fang, L., W ang, Z., and Yu, P

Wu, Y., Li, D., Chen, Y., Jiang, R., Zou, H. P., Huang, W.-C., Li, Y., Fang, L., W ang, Z., and Yu, P. S.Multi-agent autonomous driving systems with large language models: A survey of recent advances, resources, and future directions.Findings of the Association for Computational Linguistics: EMNLP 2025(2025)

work page 2025

[67] [67]

K., Li, Z., and Zhao, H.Drivegpt4: Interpretable end-to-end autonomous driving via large language model.IEEE Robotics and Automation Letters 9, 10 (2024), 8186–8193

Xu, Z., Zhang, Y., Xie, E., Zhao, Z., Guo, Y., Wong, K.-Y. K., Li, Z., and Zhao, H.Drivegpt4: Interpretable end-to-end autonomous driving via large language model.IEEE Robotics and Automation Letters 9, 10 (2024), 8186–8193

work page 2024

[68] [68]

InPro- ceedings of the 21st European Conference on Computer Systems(2026), pp

Xue, C., Chen, Y., Jiang, J., Zheng, N., Feng, J., Chen, J., Zhao, S., Y an, S., Lin, Y., Shi, L., et al.Megascale-omni: A hyper-scale, workload- resilient system for multimodal llm training in production. InPro- ceedings of the 21st European Conference on Computer Systems(2026), pp. 675–692

work page 2026

[69] [69]

arXiv, 2025

Xue, Z., Hu, H., Chen, X., Jiang, Y., Song, Y., Mi, Z., Zhu, Y., Jiang, D., Xia, Y., and Chen, H.Pipeweaver: Addressing data dynamicity in large multimodal model training with dynamic interleaved pipeline. arXiv, 2025. arXiv:2504.14145

work page arXiv 2025

[70] [70]

Yao, Y., Yu, T., Zhang, A., Wang, C., Cui, J., Zhu, H., Cai, T., Chen, C., Li, H., Zhao, W., He, Z., Chen, Q., Zhou, R., Zou, Z., Zhang, H., Hu, S., Zheng, Z., Zhou, J., Cai, J., Han, X., Zeng, G., Li, D., Liu, Z., and Sun, M.Efficient GPT-4V level multimodal large language model for deployment on edge devices.Nature Communications 16, 1 (jul 2025), 5509

work page 2025

[71] [71]

Y ao, Y., Yu, T., Zhang, A., W ang, C., Cui, J., Zhu, H., Cai, T., Chen, C., Li, H., Zhao, W., He, Z., Chen, Q., Zhou, R., Zou, Z., Zhang, H., Hu, S., Zheng, Z., Zhou, J., Cai, J., Han, X., Zeng, G., Li, D., Liu, Z., and Sun, M.Efficient GPT-4V level multimodal large language model for deployment on edge devices.Nature Communications 16, 5509 (2025)

work page 2025

[72] [72]

InProceedings of Machine Learning and Systems(2020)

Yu, P., and Chowdhury, M.Salus: Fine-grained gpu sharing primi- tives for deep learning applications. InProceedings of Machine Learning and Systems(2020)

[73]

Yurtsever, E., Lambert, J., Carballo, A., and Takeda, K. A survey of autonomous driving: Common practices and emerging technologies. IEEE access 8 (2020), 58443–58469

Z.ai. GLM-4.6. Technical blog, 2025

[74]

Zhang, D., Qi, S., Wu, Y., Xiao, X., Wang, X., and Chen, L. Fast-slow efficient training for multimodal large language models via visual token pruning, 2026

[75]

Zhang, S., Xu, A., Chen, Q., Zhao, H., Cui, W., Wang, Z., Li, Y., Xiao, L., and Guo, M. Efficient Performance-Aware GPU sharing with compatibility and isolation through kernel space interception. In 2025 USENIX Annual Technical Conference (USENIX ATC 25) (2025), USENIX Association, pp. 1003–1019

[76]

Zhao, H., Zhu, F., Guo, H., Wang, M., Wang, R., Meng, G., and Zhang, Z. Mllm-cl: Continual learning for multimodal large language models, 2025

[77]

Zhen, H., Qiu, X., Chen, P., Yang, J., Yan, X., Du, Y., Hong, Y., and Gan, C. 3D-VLA: A 3D vision-language-action generative world model. In Proceedings of the 41st International Conference on Machine Learning (2024), vol. 235 of Proceedings of Machine Learning Research, PMLR, pp. 61229–61245

[78]

Zheng, L., Li, Z., Zhang, H., Zhuang, Y., Chen, Z., Huang, Y., Wang, Y., Xu, Y., Zhuo, D., Xing, E. P., Gonzalez, J. E., and Stoica, I. Alpa: Automating inter- and Intra-Operator parallelism for distributed deep learning. In 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22) (Carlsbad, CA, July 2022), USENIX Association, pp. 559–578

[79]

Zitkovich, B., Yu, T., Xu, S., Xu, P., Xiao, T., Xia, F., Wu, J., Wohlhart, P., Welker, S., Wahid, A., Vuong, Q., Vanhoucke, V., Tran, H., Soricut, R., Singh, A., Singh, J., Sermanet, P., Sanketi, P. R., Salazar, G., Ryoo, M. S., Reymann, K., Rao, K., Pertsch, K., Mordatch, I., Michalewski, H., Lu, Y., Levine, S., Lee, L., Lee, T.-W. E., Leal, I., Kua...
