Mosaic: Towards Efficient Training of Multimodal Models with Spatial Resource Multiplexing
Pith reviewed 2026-05-20 07:49 UTC · model grok-4.3
The pith
Multimodal models train faster when multiple modules share each GPU under controlled resource quotas.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Apollo deploys multimodal models with temporal-spatial multiplexing so that multiple modules colocate on a GPU under explicit resource quotas. A flexible execution engine supports arbitrary quotas, a performance model estimates execution time for each allocation, and heuristics use the model to produce high-quality deployment plans, yielding measured speedups up to 1.31x.
What carries the argument
The performance model that estimates module execution time under different resource allocation plans and feeds those estimates to heuristics that choose deployment plans.
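The source does not say how the performance model or the heuristics are built, so the following is only a toy sketch of the general idea: a model maps a resource quota to an estimated execution time, and a search over quotas picks the plan with the lowest estimate. Everything here is an assumption made for illustration: the functions `estimate_time`, `temporal_baseline`, and `best_split`, the saturation-based time formula, the exhaustive 10% quota grid, the premise that the two colocated modules can run concurrently, and the example numbers.

```python
def estimate_time(work, saturation, quota):
    """Hypothetical model: time shrinks with quota until the module saturates.

    work       -- abstract amount of compute (assumed unit)
    saturation -- largest GPU fraction the module can actually use (assumed)
    quota      -- GPU fraction granted to the module, in (0, 1]
    """
    if not 0 < quota <= 1:
        raise ValueError("quota must be in (0, 1]")
    return work / min(quota, saturation)


def temporal_baseline(modules):
    """One module at a time, each with the whole GPU, run back to back."""
    return sum(estimate_time(work, sat, 1.0) for work, sat in modules)


def best_split(mod_a, mod_b, steps=10):
    """Try every quota split on a coarse grid and keep the fastest estimate.

    Assumes the two colocated modules run concurrently, so the estimated
    step time is the slower of the two.
    """
    best_quota, best_time = None, float("inf")
    for i in range(1, steps):
        quota_a = i / steps
        quota_b = (steps - i) / steps
        step_time = max(
            estimate_time(*mod_a, quota_a),
            estimate_time(*mod_b, quota_b),
        )
        if step_time < best_time:
            best_quota, best_time = quota_a, step_time
    return best_quota, best_time


# Made-up modules: a small encoder that saturates at 40% of the GPU,
# and a larger language model that can use the whole device.
encoder = (2.0, 0.4)
llm = (6.0, 1.0)

quota, colocated = best_split(encoder, llm)
baseline = temporal_baseline([encoder, llm])
print(f"encoder quota={quota:.1f}, colocated={colocated:.2f}, baseline={baseline:.2f}")
```

In this toy setting colocation wins only because the encoder cannot use a full GPU on its own. A real planner would depend entirely on how well the estimates match measured times, which is the premise examined below.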
If this is right
- Wall-clock training time for popular multimodal models drops on a fixed number of GPUs.
- Average GPU utilization rises because colocated modules fill idle capacity that a single module leaves unused.
- Deployment plans can be generated quickly without testing every possible quota combination.
- Resource shares can be tuned separately for each module type to balance heterogeneous workloads.
Where Pith is reading between the lines
- The same quota-based sharing could apply to inference serving where batch sizes and latency targets differ from training.
- Fewer total GPUs might suffice for a given training workload, reducing both monetary cost and energy consumption.
- Pairing the static performance model with runtime measurements could enable dynamic quota adjustments as training progresses.
Load-bearing premise
The performance model accurately predicts how execution time changes when resource shares are varied.
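One way to test this premise is the metric the referee report names, mean absolute percentage error (MAPE), computed on allocations that were not used to build the model. The sketch below shows the calculation only; the function name `mape` and the sample numbers are illustrative and do not come from the paper.

```python
def mape(predicted, measured):
    """Mean absolute percentage error, in percent.

    predicted -- model estimates of execution time
    measured  -- observed execution times for the same allocations (non-zero)
    """
    if len(predicted) != len(measured) or not measured:
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(abs(p - m) / abs(m) for p, m in zip(predicted, measured))
    return 100.0 * total / len(measured)


# Made-up numbers: estimates vs. measurements for two held-out quota plans.
error = mape([110.0, 90.0], [100.0, 100.0])
print(f"MAPE = {error:.1f}%")
```

A low error on held-out colocation plans would support the premise; a high one would mean the heuristics are choosing plans from unreliable estimates.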
What would settle it
Running Apollo on a new multimodal model and finding that its chosen allocation plan produces no speedup or a slowdown compared with the standard one-module-per-GPU baseline.
Original abstract
With the wide adoption of Multimodal Models (MMs) in real-world scenarios, it is significant to efficiently train emerging MMs that exhibit increasingly complex module architectures. For MM deployment, existing works allocate a GPU to only one MM module in a temporal-multiplexing manner; this compromises training efficiency because a single module often fails to achieve high GPU utilization. To improve GPU utilization and enable efficient MM training, we propose deploying MMs in a temporal-spatial multiplexing manner, allowing multiple MM modules to colocate on a GPU with well-controlled resource quotas. In this paper, we propose Apollo, an efficient MM training system that applies temporal-spatial multiplexing. We first develop a flexible and lightweight execution engine that supports MM training with arbitrary resource quotas, and then build a comprehensive and accurate performance model to estimate module execution time under different allocation plans. With the performance model, we further adopt effective heuristics to derive high-quality MM deployment plans efficiently. Testbed experiments confirm that Apollo effectively improves the training efficiency of popular MMs, with a training speedup of up to 1.31x.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes Apollo, a training system for multimodal models (MMs) that shifts from temporal multiplexing (one module per GPU) to temporal-spatial multiplexing, allowing multiple modules to colocate on a GPU under controlled resource quotas. It introduces a flexible execution engine supporting arbitrary quotas, a performance model to predict module execution time for different allocation plans, heuristics that use the model to select deployment plans, and testbed results claiming up to 1.31x training speedup on popular MMs.
Significance. If the performance model is shown to be accurate for unseen colocation scenarios and the speedups are reproducible with proper controls, the work could meaningfully improve GPU utilization for training complex, module-heavy multimodal models. The practical systems contribution of a lightweight engine plus heuristic planning is relevant to the distributed systems and ML systems communities.
major comments (2)
- [Performance Model] Performance model (as described after the execution engine): the manuscript states the model is 'comprehensive and accurate' yet supplies no information on whether it is analytical or learned, which features or counters it uses, the fitting or training procedure, or measured prediction error (e.g., MAPE) on colocation workloads not seen during development. Because the heuristics directly consume the model's estimates to choose plans, systematic mis-prediction for novel allocations can produce deployments whose real runtime exceeds the temporal-multiplexing baseline, directly undermining the central 1.31x speedup claim.
- [Evaluation] Experimental results (abstract and evaluation section): the reported speedups lack any description of the exact baselines, the set of MMs and workloads chosen, whether error bars or multiple runs are reported, or how post-hoc tuning of quotas or heuristics was prevented. Without these controls the empirical support for the efficiency claim remains weak.
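The second comment asks for multiple runs and error bars. As a rough illustration of what that reporting could look like, the sketch below summarizes speedup over repeated runs using Python's standard `statistics` module. The function `speedup_summary`, the choice to divide the mean baseline time by each run's time, and the timing values are all assumptions for the example, not the paper's protocol.

```python
from statistics import mean, stdev


def speedup_summary(baseline_times, system_times):
    """Return (mean speedup, sample standard deviation) over repeated runs.

    Each system run is compared against the mean baseline time; this is one
    simple convention among several reasonable ones.
    """
    if len(system_times) < 2:
        raise ValueError("need at least two runs to report a spread")
    base = mean(baseline_times)
    speedups = [base / t for t in system_times]
    return mean(speedups), stdev(speedups)


# Made-up wall-clock times (seconds per iteration) from three runs each.
baseline_runs = [10.0, 10.2, 9.8]
colocated_runs = [8.1, 7.9, 8.0]

avg, spread = speedup_summary(baseline_runs, colocated_runs)
print(f"speedup = {avg:.2f}x +/- {spread:.2f}")
```

Reporting the spread alongside the mean makes it possible to tell whether a headline figure such as 1.31x is distinguishable from run-to-run noise.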
minor comments (2)
- Clarify whether the system name 'Apollo' and the paper title 'Mosaic' refer to the same artifact or whether one is a prior name.
- [Evaluation] Add a short table or paragraph listing the specific multimodal models, dataset sizes, and GPU types used in the testbed.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. The comments highlight important areas for improving clarity around the performance model and evaluation details. We address each major comment below and will revise the manuscript to incorporate the requested information.
Point-by-point responses
- Referee: [Performance Model] Performance model (as described after the execution engine): the manuscript states the model is 'comprehensive and accurate' yet supplies no information on whether it is analytical or learned, which features or counters it uses, the fitting or training procedure, or measured prediction error (e.g., MAPE) on colocation workloads not seen during development. Because the heuristics directly consume the model's estimates to choose plans, systematic mis-prediction for novel allocations can produce deployments whose real runtime exceeds the temporal-multiplexing baseline, directly undermining the central 1.31x speedup claim.
Authors: We agree that the manuscript does not currently supply the requested details on the performance model. In the revised version, we will add a new subsection that specifies the model's construction (analytical or learned), the exact features and hardware counters employed, the profiling and fitting procedure, and quantitative accuracy results including MAPE measured on colocation scenarios held out from model development. This addition will allow readers to evaluate the model's suitability for guiding the heuristics and will directly address concerns about potential mis-prediction affecting the reported speedups.
Revision: yes
- Referee: [Evaluation] Experimental results (abstract and evaluation section): the reported speedups lack any description of the exact baselines, the set of MMs and workloads chosen, whether error bars or multiple runs are reported, or how post-hoc tuning of quotas or heuristics was prevented. Without these controls the empirical support for the efficiency claim remains weak.
Authors: We concur that the evaluation section requires additional methodological details to strengthen reproducibility and support for the efficiency claims. In the revision, we will expand the evaluation to explicitly define the baselines (including the temporal-multiplexing configuration), enumerate the specific multimodal models and workloads used, report results across multiple independent runs with error bars, and describe the experimental protocol employed to prevent post-hoc tuning of quotas or heuristics. These changes will provide the necessary controls and transparency.
Revision: yes
Circularity Check
No significant circularity; claims rest on empirical testbed validation
Full rationale
The paper outlines a systems proposal: a flexible execution engine for arbitrary GPU quotas, a performance model that estimates module execution times under allocation plans, heuristics that select deployment plans, and testbed experiments measuring up to 1.31x speedup on popular multimodal models. Neither the abstract nor the provided text describes equations, analytical derivations, fitted parameters renamed as predictions, or self-citations that would reduce any result to its inputs by construction. The speedup claim rests on direct testbed measurements rather than on model-based predictions that could be tautological, so the argument does not depend on its own outputs.