Investigating Cross-Modal Skill Injection: Scenarios, Methods, and Hyperparameters

Fandong Meng; Hao Zhou; Jie Zhou; Lean Wang; Lei Li; Xu Sun; Yuanxin Liu; Zhiyu Xu

arXiv: 2605.19523 · v1 · pith:GW7UCTK5new · submitted 2026-05-19 · 💻 cs.CL · cs.AI · cs.CV

Zhiyu Xu, Lean Wang, Yuanxin Liu, Lei Li, Hao Zhou, Fandong Meng, Jie Zhou, Xu Sun

Pith reviewed 2026-05-20 06:04 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.CV

keywords model merging · vision-language models · cross-modal transfer · skill injection · instruction following · mathematical reasoning · hyperparameter tuning

The pith

Cross-modal skill injection from LLMs to VLMs succeeds in instruction-following and cross-lingual tasks but struggles with mathematical reasoning, with TA and DARE outperforming other merging methods.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper studies an efficient way to add specialized skills to Vision-Language Models by merging them with expert Large Language Models instead of running expensive new training rounds. The investigation covers different use cases, merging techniques, and tuning choices, revealing that the approach transfers skills effectively for following instructions and working across languages. It also shows that some established merging techniques deliver stronger results than others and that choosing the right settings for those techniques matters a great deal.

Core claim

Integrating a domain-expert LLM into a VLM through model merging induces emergent cross-modal capabilities without extra training data or large compute costs. The method works well for instruction-following and cross-lingual scenarios yet falls short on mathematical reasoning. Classic approaches such as TA and DARE achieve better performance than alternative merging methods, and these methods depend on careful hyperparameter selection.

What carries the argument

Cross-modal skill injection, the process of merging a domain-expert LLM into a VLM to transfer skills and create new cross-modal abilities.
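
As a rough illustration of what such a merge involves, the sketch below applies task arithmetic to toy weights. It assumes that TA abbreviates task arithmetic (the page does not expand the acronym) and that only parameters shared between the VLM's language backbone and the LLM are merged. The names `base_llm`, `expert_llm`, `vlm`, `task_vector`, `inject`, and the scaling coefficient `lam` are hypothetical and are not taken from the paper.

```python
from typing import Dict, List

Weights = Dict[str, List[float]]


def task_vector(expert: Weights, base: Weights) -> Weights:
    """Delta between an expert LLM and the base LLM it was fine-tuned from."""
    return {k: [e - b for e, b in zip(expert[k], base[k])] for k in expert}


def inject(vlm: Weights, delta: Weights, lam: float) -> Weights:
    """Add a scaled task vector to every VLM parameter that has a matching key."""
    merged: Weights = {}
    for name, values in vlm.items():
        if name in delta:
            merged[name] = [v + lam * d for v, d in zip(values, delta[name])]
        else:
            # Parameters without a counterpart (e.g. vision-side weights) are copied as-is.
            merged[name] = list(values)
    return merged


# Toy weights, hypothetical values chosen only for illustration.
base_llm = {"layer0.w": [0.0, 1.0, 2.0]}
expert_llm = {"layer0.w": [0.5, 1.0, 1.0]}
vlm = {"layer0.w": [0.1, 1.1, 2.1], "vision.proj": [3.0, 4.0]}

merged = inject(vlm, task_vector(expert_llm, base_llm), lam=0.5)
print(merged)
```

The coefficient `lam` is the kind of hyperparameter the paper reports as critical: a small value leaves the VLM nearly unchanged, while a large one may overwrite what multimodal training taught the language backbone.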

If this is right

VLMs can gain instruction-following improvements through LLM merging without new data collection.
Cross-lingual capabilities transfer reliably when the source LLM contains the relevant language skills.
Mathematical reasoning shows limited benefit from the same merging process.
TA and DARE merging methods produce stronger results than other tested techniques.
Success with these methods requires systematic tuning of their specific hyperparameters.
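
To make the DARE item more concrete, here is a minimal sketch under the assumption that DARE refers to the drop-and-rescale scheme: each entry of the expert-minus-base delta is zeroed with probability `drop_rate`, and the survivors are rescaled by `1 / (1 - drop_rate)` so the expected delta is preserved. The function names, the flat-list weight representation, and the toy values are hypothetical.

```python
import random
from typing import List


def dare(delta: List[float], drop_rate: float, seed: int = 0) -> List[float]:
    """Randomly zero entries of a delta vector and rescale the survivors."""
    if not 0.0 <= drop_rate < 1.0:
        raise ValueError("drop_rate must be in [0, 1)")
    rng = random.Random(seed)
    scale = 1.0 / (1.0 - drop_rate)
    return [0.0 if rng.random() < drop_rate else d * scale for d in delta]


def merge(vlm: List[float], delta: List[float], lam: float, drop_rate: float) -> List[float]:
    """Add the sparsified, rescaled delta to the VLM weights with coefficient lam."""
    sparse_delta = dare(delta, drop_rate)
    return [v + lam * d for v, d in zip(vlm, sparse_delta)]


# Toy delta (expert minus base); the values are hypothetical.
delta = [0.2] * 10_000
sparse = dare(delta, drop_rate=0.9)
kept = sum(1 for d in sparse if d != 0.0)
print(kept, sum(sparse) / len(sparse))
```

In this sketch both `lam` and `drop_rate` are free choices, which is consistent with the point that these methods need systematic tuning.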

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Model merging may offer a practical route for keeping VLMs current with new domains as they appear.
The weaker math results point to a possible need for different merging strategies or additional alignment steps for reasoning-heavy tasks.
Teams could adopt TA or DARE first and then run targeted hyperparameter searches to upgrade existing VLMs quickly.
The method might extend to other modalities if similar domain-expert models become available.

Load-bearing premise

Merging a domain-expert LLM into a VLM can induce emergent cross-modal capabilities without requiring additional training data or significant computational overhead.

What would settle it

Running math-reasoning benchmarks on a VLM after merging it with a math-specialized LLM and finding no gain over the unmodified VLM.

Figures

Figures reproduced from arXiv: 2605.19523 by Fandong Meng, Hao Zhou, Jie Zhou, Lean Wang, Lei Li, Xu Sun, Yuanxin Liu, Zhiyu Xu.

**Figure 2.** Distribution of regret-over-random (RoR)

**Figure 3.** Regret-over-random (RoR) heatmap across optimization algorithms, scenarios and merging methods. Lower values (blue) indicate better performance. Instruction-following tasks show larger gaps between local and global methods.

**Figure 4.** Normalized regret curves for validation and …

Original abstract

Vision-Language Models (VLMs) have demonstrated remarkable proficiency in general multi-modal understanding; yet they struggle to efficiently acquire continually evolving domain-specific skills. Conventional approaches to enhancing VLM capabilities, such as Supervised Fine-Tuning (SFT), require extensive dataset curation and substantial computational resources. Model merging has emerged as an efficient alternative that enables the transfer of domain-specific expertise from Large Language Models (LLMs) to VLMs without incurring additional training data requirements or significant computational overhead. Unlike conventional merging of homogeneous LLMs, which mainly aggregates existing capabilities, cross-modal skill injection aims to induce emergent cross-modal capabilities by integrating a domain-expert LLM into a VLM. However, existing research lacks a systematic analysis of the applicability and methodology of cross-modal skill injection. In this study, we investigate cross-modal skill injection across three main aspects: scenarios, methods, and hyperparameters. For scenarios, we find that cross-modal skill injection generally performs well in instruction-following and cross-lingual settings, yet struggles with mathematical reasoning. For methods, we find that classic approaches such as TA and DARE consistently achieve superior performance over alternative merging methods. We also provide a systematic and quantitative analysis of the hyperparameter tuning that these classic methods critically depend on.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper investigates cross-modal skill injection, where domain-expert LLMs are merged into VLMs to transfer skills without extra training data or fine-tuning. It examines performance across scenarios, finding the approach effective for instruction-following and cross-lingual tasks but weak for mathematical reasoning; compares merging methods, concluding that classic approaches like TA and DARE outperform alternatives; and provides quantitative analysis of the hyperparameters these methods require.

Significance. If the empirical results hold, the work offers a practical, low-overhead alternative to SFT for updating VLMs with evolving domain skills. The scenario-specific findings and method comparisons could guide efficient multi-modal model maintenance, adding to the model-merging literature by extending it to cross-modal settings.

major comments (1)

The abstract and results summary report directional superiority of TA and DARE and scenario differences, but without visible details on dataset sizes, number of runs, variance, statistical tests, or full controls in the experimental sections, the robustness of these central empirical claims cannot be verified. This is load-bearing for the headline findings on methods and scenarios.

minor comments (2)

Define acronyms TA and DARE on first use in the abstract and introduction for clarity.
The hyperparameter analysis would benefit from additional tables or figures showing sensitivity ranges and optimal values across scenarios.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the thorough review and valuable feedback on our manuscript. We agree that additional experimental details are necessary to substantiate the robustness of our central claims regarding method superiority and scenario-specific outcomes. We will revise the paper accordingly to address this concern.

Point-by-point responses

Referee: The abstract and results summary report directional superiority of TA and DARE and scenario differences, but without visible details on dataset sizes, number of runs, variance, statistical tests, or full controls in the experimental sections, the robustness of these central empirical claims cannot be verified. This is load-bearing for the headline findings on methods and scenarios.

Authors: We agree that the current manuscript lacks sufficient transparency in the experimental reporting, which is critical for verifying the reliability of our findings on TA/DARE outperforming other methods and the differential performance across scenarios. In the revised version, we will expand the 'Experiments' and 'Setup' sections with: (1) exact dataset sizes and sources for each scenario (instruction-following, cross-lingual, mathematical reasoning); (2) number of runs (we will report results averaged over 5 independent runs with different random seeds); (3) variance measures including standard deviations for all reported metrics; (4) statistical tests such as paired t-tests or Wilcoxon tests to assess significance of differences between methods; and (5) a complete description of experimental controls, including baseline models, hyperparameter search ranges, and evaluation protocols. These additions will directly support the headline empirical claims without altering the core results.

revision: yes
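
The reporting promised in the rebuttal (a mean and standard deviation over 5 seeds, plus a paired test between methods) could be computed roughly as sketched below. The per-seed scores are hypothetical placeholders and not results from the paper, and `summarize` and `paired_t` are illustrative helper names.

```python
import math
import statistics
from typing import Sequence, Tuple


def summarize(scores: Sequence[float]) -> Tuple[float, float]:
    """Mean and sample standard deviation over independent runs."""
    return statistics.mean(scores), statistics.stdev(scores)


def paired_t(a: Sequence[float], b: Sequence[float]) -> float:
    """Paired t statistic for two methods evaluated on the same seeds (df = n - 1)."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equally long samples with at least 2 runs")
    diffs = [x - y for x, y in zip(a, b)]
    sd = statistics.stdev(diffs)
    if sd == 0.0:
        raise ValueError("all paired differences are identical; t is undefined")
    return statistics.mean(diffs) / (sd / math.sqrt(len(diffs)))


# Hypothetical per-seed benchmark scores for two merging methods.
method_a = [61.0, 62.0, 63.0, 62.0, 62.0]
method_b = [60.0, 60.0, 61.0, 61.0, 60.0]

mean_a, std_a = summarize(method_a)
t_stat = paired_t(method_a, method_b)
print(f"A: {mean_a:.2f} +/- {std_a:.2f}, paired t = {t_stat:.3f}")
```

With 5 runs the statistic has 4 degrees of freedom, so a two-sided test at the 5% level would compare its absolute value against a critical value of about 2.776.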

Circularity Check

0 steps flagged

No significant circularity; empirical investigation with no derivation chain

full rationale

The paper is an empirical study reporting experimental outcomes on cross-modal skill injection across scenarios, methods, and hyperparameters. It makes no claims of first-principles derivations, mathematical proofs, or parameter-free predictions that could reduce to fitted inputs or self-referential definitions. All findings (e.g., performance differences by scenario and method superiority of TA/DARE) are grounded in direct experimental comparisons rather than any internal reduction to the paper's own assumptions or prior self-citations. The setup explicitly notes the absence of extra training data or overhead, and results are presented as observations from controlled tests, making the work self-contained against external benchmarks with no load-bearing circular steps.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claims rest on the empirical validity of merging as a skill-transfer mechanism and on the representativeness of the tested scenarios and models; no new theoretical entities are introduced.

free parameters (1)

merging hyperparameters
The paper states that classic methods critically depend on hyperparameter tuning and provides quantitative analysis of these choices.
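
One simple way to tune such hyperparameters is random search over a validation score, sketched below. The search ranges, the parameter names `lam` and `drop_rate`, and `toy_validation_score` (a stand-in for building a merged model and scoring it on a validation set) are all hypothetical; the paper's own optimization algorithms are not reproduced here.

```python
import random
from typing import Callable, Dict, Tuple

Config = Dict[str, float]


def random_search(
    evaluate: Callable[[Config], float], n_trials: int, seed: int = 0
) -> Tuple[Config, float]:
    """Sample configurations uniformly and keep the best-scoring one."""
    if n_trials < 1:
        raise ValueError("n_trials must be at least 1")
    rng = random.Random(seed)
    best_cfg: Config = {}
    best_score = float("-inf")
    for _ in range(n_trials):
        cfg = {"lam": rng.uniform(0.0, 1.5), "drop_rate": rng.uniform(0.0, 0.95)}
        score = evaluate(cfg)
        if score > best_score:
            best_cfg, best_score = cfg, score
    return best_cfg, best_score


def toy_validation_score(cfg: Config) -> float:
    """Hypothetical objective that peaks at lam = 0.8, drop_rate = 0.5."""
    return -((cfg["lam"] - 0.8) ** 2) - (cfg["drop_rate"] - 0.5) ** 2


best_cfg, best_score = random_search(toy_validation_score, n_trials=200)
print(best_cfg, best_score)
```

Random search is also a natural baseline for judging other optimizers, since any method that cannot beat it at an equal evaluation budget adds little.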

axioms (1)

domain assumption: Model merging can transfer domain-specific expertise from LLMs to VLMs to induce cross-modal capabilities
Invoked in the definition and motivation of cross-modal skill injection as an efficient alternative to SFT.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: cross-modal skill injection aims to induce emergent cross-modal capabilities by integrating a domain-expert LLM into a VLM

IndisputableMonolith/Foundation/BranchSelection.lean · branch_selection · unclear
Paper passage: classic approaches such as TA and DARE consistently achieve superior performance

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.



work page 2022

[21] [21]

The Eleventh International Conference on Learning Representations , year=

Dataless Knowledge Fusion by Merging Weights of Language Models , author=. The Eleventh International Conference on Learning Representations , year=

work page

[22] [22]

Proceedings of the 41st International Conference on Machine Learning , articleno =

Yu, Le and Yu, Bowen and Yu, Haiyang and Huang, Fei and Li, Yongbin , title =. Proceedings of the 41st International Conference on Machine Learning , articleno =. 2024 , publisher =

work page 2024

[23] [23]

2023 , url=

Prateek Yadav and Derek Tam and Leshem Choshen and Colin Raffel and Mohit Bansal , booktitle=. 2023 , url=

work page 2023

[24] [24]

The Eleventh International Conference on Learning Representations , year=

Editing models with task arithmetic , author=. The Eleventh International Conference on Learning Representations , year=

work page

[25] [25]

M eta GPT : Merging Large Language Models Using Model Exclusive Task Arithmetic

Zhou, Yuyan and Song, Liang and Wang, Bingning and Chen, Weipeng. M eta GPT : Merging Large Language Models Using Model Exclusive Task Arithmetic. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. 2024. doi:10.18653/v1/2024.emnlp-main.102

work page doi:10.18653/v1/2024.emnlp-main.102 2024

[26] [26]

2025 , eprint=

NAN: A Training-Free Solution to Coefficient Estimation in Model Merging , author=. 2025 , eprint=

work page 2025

[27] [27]

Forty-second International Conference on Machine Learning , year=

Whoever Started the interference Should End It: Guiding Data-Free Model Merging via Task Vectors , author=. Forty-second International Conference on Machine Learning , year=

work page

[28] [28]

2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , year=

Task Singular Vectors: Reducing Task Interference in Model Merging , author=. 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , year=

work page 2025

[29] [29]

The Eleventh International Conference on Learning Representations , year=

Git Re-Basin: Merging Models modulo Permutation Symmetries , author=. The Eleventh International Conference on Learning Representations , year=

work page

[30] [30]

ArXiv , year=

ZipIt! Merging Models from Different Tasks without Training , author=. ArXiv , year=

work page

[31] [31]

ICML 2024 Workshop on Foundation Models in the Wild , year=

Model Breadcrumbs: Scalable Upcycling of Finetuned Foundation Models via Sparse Task Vectors Merging , author=. ICML 2024 Workshop on Foundation Models in the Wild , year=

work page 2024

[32] [32]

2024 , eprint=

DELLA-Merging: Reducing Interference in Model Merging through Magnitude-Based Sampling , author=. 2024 , eprint=

work page 2024

[33] [33]

Arcee ' s M erge K it: A Toolkit for Merging Large Language Models

Goddard, Charles and Siriwardhana, Shamane and Ehghaghi, Malikeh and Meyers, Luke and Karpukhin, Vladimir and Benedict, Brian and McQuade, Mark and Solawetz, Jacob. Arcee ' s M erge K it: A Toolkit for Merging Large Language Models. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track. 2024. doi:10.18653/v...

work page doi:10.18653/v1/2024.emnlp-industry.36 2024

[34] [34]

Proceedings of the 42nd International Conference on Machine Learning , pages =

Bring Reason to Vision: Understanding Perception and Reasoning through Model Merging , author =. Proceedings of the 42nd International Conference on Machine Learning , pages =. 2025 , editor =

work page 2025

[35] [35]

2024 , eprint=

LMMs-Eval: Reality Check on the Evaluation of Large Multimodal Models , author=. 2024 , eprint=

work page 2024

[36] [36]

LMMs-Eval: Accelerating the Development of Large Multimoal Models , url=

Bo Li and Peiyuan Zhang and Kaichen Zhang and Fanyi Pu and Xinrun Du and Yuhao Dong and Haotian Liu and Yuanhan Zhang and Ge Zhang and Chunyuan Li and Ziwei Liu , publisher =. LMMs-Eval: Accelerating the Development of Large Multimoal Models , url=

work page

[37] [37]

2023 , eprint=

Mistral 7B , author=. 2023 , eprint=

work page 2023

[38] [38]

doi:10.57967/hf/2317 , publisher =

Wang, Shenzhi and Zheng, Yaowei and Wang, Guoyin and Song, Shiji and Huang, Gao , title =. doi:10.57967/hf/2317 , publisher =

work page doi:10.57967/hf/2317

[39] [39]

elyza/Llama-3-ELYZA-JP-8B , url=

Masato Hirakawa and Shintaro Horie and Tomoaki Nakamura and Daisuke Oba and Sam Passaglia and Akira Sasaki , year=. elyza/Llama-3-ELYZA-JP-8B , url=

work page

[40] [40]

Forty-second International Conference on Machine Learning , year=

Data Mixing Optimization for Supervised Fine-Tuning of Large Language Models , author=. Forty-second International Conference on Machine Learning , year=

work page

[41] [41]

MA mmo TH - VL : Eliciting Multimodal Reasoning with Instruction Tuning at Scale

Guo, Jiawei and Zheng, Tianyu and Li, Yizhi and Bai, Yuelin and Li, Bo and Wang, Yubo and Zhu, King and Neubig, Graham and Chen, Wenhu and Yue, Xiang. MA mmo TH - VL : Eliciting Multimodal Reasoning with Instruction Tuning at Scale. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2025. doi:1...

work page doi:10.18653/v1/2025.acl-long.680 2025

[42] [42]

Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval , pages =

Srinivasan, Krishna and Raman, Karthik and Chen, Jiecao and Bendersky, Michael and Najork, Marc , title =. Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval , pages =. 2021 , isbn =. doi:10.1145/3404835.3463257 , abstract =

work page doi:10.1145/3404835.3463257 2021

[43] [43]

Extracting and Combining Abilities For Building Multi-lingual Ability-enhanced Large Language Models

Chen, Zhipeng and Zhou, Kun and Song, Liang and Zhao, Wayne Xin and Wang, Bingning and Chen, Weipeng and Wen, Ji-Rong. Extracting and Combining Abilities For Building Multi-lingual Ability-enhanced Large Language Models. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing. 2025. doi:10.18653/v1/2025.emnlp-main.887

work page doi:10.18653/v1/2025.emnlp-main.887 2025

[44] [44]

ArXiv , year=

RCP-Merging: Merging Long Chain-of-Thought Models with Domain-Specific Models by Considering Reasoning Capability as Prior , author=. ArXiv , year=

work page

[45] [45]

The Thirty-eighth Annual Conference on Neural Information Processing Systems , year=

What matters when building vision-language models? , author=. The Thirty-eighth Annual Conference on Neural Information Processing Systems , year=

work page

[46] [46]

Thirty-seventh Conference on Neural Information Processing Systems , year=

Visual Instruction Tuning , author=. Thirty-seventh Conference on Neural Information Processing Systems , year=

work page

[47] [47]

ArXiv , year=

Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution , author=. ArXiv , year=

work page

[48] [48]

2024 , eprint=

Efficient and Effective Text Encoding for Chinese LLaMA and Alpaca , author=. 2024 , eprint=

work page 2024

[49] [49]

2023 , note =

japanese\_alpaca\_data , author =. 2023 , note =

work page 2023

[50] [50]

The Twelfth International Conference on Learning Representations , year=

Let's Verify Step by Step , author=. The Twelfth International Conference on Learning Representations , year=

work page

[51] [51]

2023 , url =

Mike Conover and Matt Hayes and Ankit Mathur and Jianwei Xie and Jun Wan and Sam Shah and Ali Ghodsi and Patrick Wendell and Matei Zaharia and Reynold Xin , title =. 2023 , url =

work page 2023

[52] [52]

LLaVA-NeXT: Improved reasoning, OCR, and world knowledge , url=

Liu, Haotian and Li, Chunyuan and Li, Yuheng and Li, Bo and Zhang, Yuanhan and Shen, Sheng and Lee, Yong Jae , month=. LLaVA-NeXT: Improved reasoning, OCR, and world knowledge , url=

work page

[53] [53]

Improved Baselines with Visual Instruction Tuning , author=

work page

[54] [54]

UQ-Merge: Uncertainty Guided Multimodal Large Language Model Merging

Qu, Huaizhi and Zhao, Xinyu and Peng, Jie and Lee, Kwonjoon and Dariush, Behzad and Chen, Tianlong. UQ-Merge: Uncertainty Guided Multimodal Large Language Model Merging. Findings of the Association for Computational Linguistics: ACL 2025. 2025. doi:10.18653/v1/2025.findings-acl.73

work page doi:10.18653/v1/2025.findings-acl.73 2025

[55] [55]

2025 , url=

Didi Zhu and Yibing Song and Tao Shen and Ziyu Zhao and Jinluan Yang and Min Zhang and Chao Wu , booktitle=. 2025 , url=

work page 2025

[56] [56]

ArXiv , year=

Graft: Integrating the Domain Knowledge via Efficient Parameter Synergy for MLLMs , author=. ArXiv , year=

work page

[57] [57]

CoRR , volume=

Yiyang Du and Xiaochen Wang and Chi Chen and Jiabo Ye and Yiru Wang and Peng Li and Ming Yan and Ji Zhang and Fei Huang and Zhifang Sui and Maosong Sun and Yang Liu , title=. CoRR , volume=. 2025 , month=

work page 2025

[58] [58]

The Fourteenth International Conference on Learning Representations , year=

VisCodex: Unified Multimodal Code Generation via Merging Vision and Coding Models , author=. The Fourteenth International Conference on Learning Representations , year=

work page

[59] [59]

Transferring Textual Preferences to Vision-Language Understanding through Model Merging

Li, Chen-An and Lin, Tzu-Han and Chen, Yun-Nung and Lee, Hung-yi. Transferring Textual Preferences to Vision-Language Understanding through Model Merging. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). 2025. doi:10.18653/v1/2025.acl-short.72

work page doi:10.18653/v1/2025.acl-short.72 2025

[60] [60]

SIAM review , volume=

Optimization by direct search: New perspectives on some classical and modern methods , author=. SIAM review , volume=. 2003 , publisher=

work page 2003

[61] [61]

1967 , issn =

On the distribution of points in a cube and the approximate evaluation of integrals , journal =. 1967 , issn =. doi:https://doi.org/10.1016/0041-5553(67)90144-9 , url =

work page doi:10.1016/0041-5553(67)90144-9 1967

[62] [62]

Evolutionary Computation , year=

Completely Derandomized Self-Adaptation in Evolution Strategies , author=. Evolutionary Computation , year=

work page

[63] [63]

Journal of Global optimization , volume=

Efficient global optimization of expensive black-box functions , author=. Journal of Global optimization , volume=. 1998 , publisher=

work page 1998

[64] [64]

SIAM Journal on optimization , volume=

On the convergence of pattern search algorithms , author=. SIAM Journal on optimization , volume=. 1997 , publisher=

work page 1997

[65] [65]

The computer journal , volume=

An efficient method for finding the minimum of a function of several variables without calculating derivatives , author=. The computer journal , volume=. 1964 , publisher=

work page 1964

[66] [66]

Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track , year=

LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day , author=. Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track , year=

work page

[67] [67]

Math-LLaVA: Bootstrapping Mathematical Reasoning for Multimodal Large Language Models

Shi, Wenhao and Hu, Zhiqiang and Bin, Yi and Liu, Junhua and Yang, Yang and Ng, See-Kiong and Bing, Lidong and Lee, Roy Ka-Wei. Math-LLaVA: Bootstrapping Mathematical Reasoning for Multimodal Large Language Models. Findings of the Association for Computational Linguistics: EMNLP 2024. 2024. doi:10.18653/v1/2024.findings-emnlp.268

work page doi:10.18653/v1/2024.findings-emnlp.268 2024

[68] [68]

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , month =

Sung, Yi-Lin and Cho, Jaemin and Bansal, Mohit , title =. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , month =. 2022 , pages =

work page 2022

[69] [69]

Training Verifiers to Solve Math Word Problems

Training Verifiers to Solve Math Word Problems , author=. arXiv preprint arXiv:2110.14168 , year=

work page internal anchor Pith review Pith/arXiv arXiv