AdaMMS: Model Merging for Heterogeneous Multimodal Large Language Models with Unsupervised Coefficient Optimization
Pith reviewed 2026-05-22 22:43 UTC · model grok-4.3
The pith
AdaMMS merges heterogeneous multimodal LLMs by mapping architectures, interpolating weights, and selecting coefficients without labeled data.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
AdaMMS merges heterogeneous MLLMs in three steps: a mapping function aligns the differing architectures, linear interpolation of the mapped weights addresses parameter-space asymmetry, and an unsupervised hyper-parameter search selects the interpolation coefficients without labeled data. The paper reports that the resulting merged models outperform prior merging methods on various vision-language benchmarks.
What carries the argument
Three-step pipeline of architecture mapping, linear weight interpolation, and unsupervised coefficient search.
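A minimal sketch of the first two steps, assuming each model is a plain dictionary from parameter name to a flat list of floats and that the mapping is a hypothetical name-to-name table (the text above does not say how the mapping is built). The names `merge_weights`, `name_map`, and `alpha` are illustrative and are not the paper's API; parameters with no counterpart are simply kept from the base model.

```python
def merge_weights(base, other, name_map, alpha):
    """Interpolate mapped parameters as (1 - alpha) * base + alpha * other.

    base, other: dict[str, list[float]] (assumed toy representation of weights)
    name_map:    dict from a base parameter name to its counterpart in `other`
                 (hypothetical mapping, supplied by the caller)
    alpha:       interpolation coefficient, typically in [0, 1]
    """
    merged = {}
    for name, w_base in base.items():
        other_name = name_map.get(name)
        if other_name is None or other_name not in other:
            # No counterpart in the other model: keep the base weights unchanged.
            merged[name] = list(w_base)
            continue
        w_other = other[other_name]
        if len(w_other) != len(w_base):
            raise ValueError(f"shape mismatch for {name!r} -> {other_name!r}")
        merged[name] = [(1 - alpha) * b + alpha * o for b, o in zip(w_base, w_other)]
    return merged
```

With `alpha = 0` the result equals the base model, so under these assumptions the search step reduces to picking a value of `alpha`.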
If this is right
- Merged models combine abilities from MLLMs that have incompatible architectures.
- The merging process requires no labeled task data for coefficient selection.
- Performance gains appear across multiple model combinations on vision-language tasks.
- Linear interpolation actively compensates for parameter asymmetry after mapping.
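The summary does not say how coefficients are chosen without labels. One possible label-free proxy, assumed here purely for illustration, is output stability: prefer the coefficient whose responses on unlabeled prompts change least relative to its neighboring candidates. `select_alpha` and `generate` are hypothetical names, and exact string mismatch stands in for whatever response-difference measure a real system would use.

```python
def select_alpha(candidates, generate):
    """Pick the coefficient whose outputs differ least from adjacent candidates.

    candidates: sorted list of coefficients to try (at least two)
    generate:   callable alpha -> list[str], the merged model's responses on a
                fixed set of unlabeled prompts (hypothetical hook)
    """
    if len(candidates) < 2:
        raise ValueError("need at least two candidate coefficients")
    outputs = [generate(a) for a in candidates]

    def disagreement(xs, ys):
        # Number of prompts on which two candidates respond differently.
        return sum(1 for x, y in zip(xs, ys) if x != y)

    best_alpha, best_score = None, None
    for i, alpha in enumerate(candidates):
        neighbours = [j for j in (i - 1, i + 1) if 0 <= j < len(candidates)]
        score = sum(disagreement(outputs[i], outputs[j]) for j in neighbours) / len(neighbours)
        if best_score is None or score < best_score:
            best_alpha, best_score = alpha, score
    return best_alpha
```

No ground-truth answers enter the score, which is the property the bullet on labeled data relies on; whether such a proxy tracks benchmark accuracy is an empirical question.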
Where Pith is reading between the lines
- The method could reduce the need to train new multimodal models from scratch by allowing reuse of existing heterogeneous checkpoints.
- Extending the unsupervised search to merge more than two models at once would test scalability beyond pairwise combinations.
- If the mapping step generalizes, the same pipeline might apply to other modality pairs such as audio-language models.
Load-bearing premise
The mapping function and linear interpolation step can sufficiently resolve architectural differences and parameter-space asymmetry so that the subsequent unsupervised search produces useful merged models.
What would settle it
Running the full AdaMMS pipeline on a pair of heterogeneous MLLMs and finding that the merged model scores no higher than the stronger individual model on multiple vision-language benchmarks would falsify the central claim.
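The test above can be written as a small check over benchmark scores. This is a sketch under two assumptions not fixed by the text: "stronger individual model" is taken per benchmark, and "multiple" is read as at least two benchmarks. `claim_falsified` and its parameters are hypothetical names.

```python
def claim_falsified(merged, parent_a, parent_b, min_benchmarks=2):
    """Return (falsified, benchmarks) for the settle-it test.

    Each argument maps benchmark name -> score (higher is better).
    A benchmark counts against the claim when the merged model scores no
    higher than the better of the two parent models on it.
    """
    shared = set(merged) & set(parent_a) & set(parent_b)
    not_better = [
        name for name in sorted(shared)
        if merged[name] <= max(parent_a[name], parent_b[name])
    ]
    return len(not_better) >= min_benchmarks, not_better
```

Only benchmarks reported for all three models are compared, so a missing score never counts for or against the claim.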
Original abstract
Recently, model merging methods have demonstrated powerful strengths in combining abilities on various tasks from multiple Large Language Models (LLMs). While previous model merging methods mainly focus on merging homogeneous models with identical architecture, they meet challenges when dealing with Multimodal Large Language Models (MLLMs) with inherent heterogeneous property, including differences in model architecture and the asymmetry in the parameter space. In this work, we propose AdaMMS, a novel model merging method tailored for heterogeneous MLLMs. Our method tackles the challenges in three steps: mapping, merging and searching. Specifically, we first design mapping function between models to apply model merging on MLLMs with different architecture. Then we apply linear interpolation on model weights to actively adapt the asymmetry in the heterogeneous MLLMs. Finally in the hyper-parameter searching step, we propose an unsupervised hyper-parameter selection method for model merging. As the first model merging method capable of merging heterogeneous MLLMs without labeled data, extensive experiments on various model combinations demonstrated that AdaMMS outperforms previous model merging methods on various vision-language benchmarks.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes AdaMMS, a three-step model merging method for heterogeneous Multimodal Large Language Models (MLLMs): (1) a mapping function to align models with differing architectures, (2) linear interpolation on weights to adapt parameter asymmetry, and (3) an unsupervised hyper-parameter search for coefficient optimization. It claims to be the first such method that merges heterogeneous MLLMs without labeled data and reports that extensive experiments on various model combinations show outperformance over prior merging methods on vision-language benchmarks.
Significance. If the claims hold after addressing the alignment and experimental details, the result would be significant as the first unsupervised merging approach for architecturally heterogeneous MLLMs, potentially allowing flexible combination of multimodal capabilities without task-specific labeled data or retraining. The unsupervised coefficient optimization is a positive feature that distinguishes it from supervised alternatives.
major comments (2)
- [Abstract] Abstract (three-step process): The mapping function is invoked to 'design mapping function between models to apply model merging on MLLMs with different architecture,' yet no concrete mechanism is specified for establishing layer correspondence, dimension projection, or handling mismatched components such as vision encoders and LLM backbones. This is load-bearing for the central claim, because without a valid alignment the linear interpolation produces weights outside the meaningful parameter space, rendering the unsupervised search unable to recover useful coefficients and undermining the reported outperformance.
- [Abstract] Abstract (experiments): The claim that 'extensive experiments on various model combinations demonstrated that AdaMMS outperforms previous model merging methods on various vision-language benchmarks' is presented without any description of the model pairs, baselines, benchmarks, metrics, or statistical analysis. This prevents verification that gains are attributable to the mapping/interpolation/search components rather than implementation artifacts.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on the abstract. The comments highlight areas where the abstract could be more self-contained. We will revise the abstract in the next version to incorporate brief but concrete descriptions of the mapping mechanism and experimental setup, while ensuring the full details remain in the body of the paper. Below we address each major comment.
Point-by-point responses
- Referee: [Abstract] Abstract (three-step process): The mapping function is invoked to 'design mapping function between models to apply model merging on MLLMs with different architecture,' yet no concrete mechanism is specified for establishing layer correspondence, dimension projection, or handling mismatched components such as vision encoders and LLM backbones. This is load-bearing for the central claim, because without a valid alignment the linear interpolation produces weights outside the meaningful parameter space, rendering the unsupervised search unable to recover useful coefficients and undermining the reported outperformance.
  Authors: We agree the abstract is too terse on this point. Section 3.1 of the manuscript details the mapping function, which establishes layer correspondence by matching components with similar functional roles across architectures, applies dimension projection via linear transformations to align parameter spaces, and handles mismatched components (vision encoders and LLM backbones) by treating them as separate modules with independent mappings before merging. We will revise the abstract to include a concise clause summarizing these alignment steps so that the three-step process is clearer on first reading.
  Revision: yes
- Referee: [Abstract] Abstract (experiments): The claim that 'extensive experiments on various model combinations demonstrated that AdaMMS outperforms previous model merging methods on various vision-language benchmarks' is presented without any description of the model pairs, baselines, benchmarks, metrics, or statistical analysis. This prevents verification that gains are attributable to the mapping/interpolation/search components rather than implementation artifacts.
  Authors: We concur that the abstract's experimental claim would benefit from additional context. The full manuscript (Section 4) specifies the model pairs, baselines (including prior merging methods), benchmarks, metrics, and statistical analysis. We will revise the abstract to briefly note the scope of the experiments (e.g., multiple heterogeneous MLLM combinations evaluated on standard vision-language benchmarks) to improve verifiability without exceeding abstract length limits.
  Revision: yes
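The first response mentions dimension projection via linear transformations without giving their form. As a sketch only, one way to picture it is to reshape a weight matrix with an output-side and an input-side projection, W' = P_out · W · P_inᵀ, before any interpolation. `project`, `p_out`, and `p_in` are hypothetical names, and how such projections would be obtained is not stated in the text.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    """Swap rows and columns of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*m)]


def project(w, p_out, p_in):
    """Map a weight matrix w (m x n) into a target shape (m2 x n2).

    p_out: m2 x m projection applied on the output side (assumed given)
    p_in:  n2 x n projection applied on the input side (assumed given)
    Returns p_out @ w @ p_in^T.
    """
    return matmul(matmul(p_out, w), transpose(p_in))
```

Once both models' weights share a shape, they can be interpolated element-wise; the referee's concern is precisely whether such an alignment leaves the weights in a meaningful region of parameter space.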
Circularity Check
Empirical three-step procedure with no self-referential reductions or fitted predictions
Full rationale
The paper describes an empirical method consisting of a mapping function, linear interpolation for parameter asymmetry, and unsupervised hyper-parameter search. No equations, derivations, or claims in the abstract reduce any output (e.g., merged model performance) to a quantity defined by the method itself or to a self-citation. The unsupervised search is presented as an optimization step over coefficients rather than a tautological prediction. No uniqueness theorems, ansatzes smuggled via citation, or renaming of known results appear. The central claim rests on experimental outperformance rather than any definitional equivalence, making the derivation self-contained against external benchmarks.
Reference graph
Works this paper leans on
[1] Chi Chen, Yiyang Du, Zheng Fang, Ziyue Wang, Fuwen Luo, Peng Li, Ming Yan, Ji Zhang, Fei Huang, Maosong Sun, and Yang Liu. Model composition for multimodal large language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2024.
[2] Lin Chen, Jinsong Li, Xiaoyi Dong, Pan Zhang, Conghui He, Jiaqi Wang, Feng Zhao, and Dahua Lin. Sharegpt4v: Improving large multi-modal models with better captions. arXiv preprint arXiv:2311.12793, 2023.
[3] Haodong Duan, Junming Yang, Yuxuan Qiao, Xinyu Fang, Lin Chen, Yuan Liu, Amit Agarwal, Zhe Chen, Mo Li, Yubo Ma, Hailong Sun, Xiangyu Zhao, Junbo Cui, Xiaoyi Dong, Yuhang Zang, Pan Zhang, Jiaqi Wang, Dahua Lin, and Kai Chen. Vlmevalkit: An open-source toolkit for evaluating large multi-modality models, 2024.
[4] Clémentine Fourrier, Nathan Habib, Alina Lozovskaya, Konrad Szafer, and Thomas Wolf. Open llm leaderboard v2. https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard, 2024.
[5] Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Mengdan Zhang, Xu Lin, Jinrui Yang, Xiawu Zheng, Ke Li, Xing Sun, et al. Mme: A comprehensive evaluation benchmark for multimodal large language models. arXiv preprint arXiv:2306.13394, 2023.
[6] Charles Goddard, Shamane Siriwardhana, Malikeh Ehghaghi, Luke Meyers, Vlad Karpukhin, Brian Benedict, Mark McQuade, and Jacob Solawetz. Arcee’s mergekit: A toolkit for merging large language models. arXiv preprint arXiv:2403.13257, 2024.
[7] Danna Gurari, Qing Li, Abigale J Stangl, Anhong Guo, Chi Lin, Kristen Grauman, Jiebo Luo, and Jeffrey P Bigham. Vizwiz grand challenge: Answering visual questions from blind people. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3608–3617, 2018.
[8] Wenyi Hong, Weihan Wang, Ming Ding, Wenmeng Yu, Qingsong Lv, Yan Wang, Yean Cheng, Shiyu Huang, Junhui Ji, Zhao Xue, et al. CogVLM2: Visual language models for image and video understanding. arXiv preprint arXiv:2408.16500, 2024.
[9] Drew A Hudson and Christopher D Manning. Gqa: A new dataset for real-world visual reasoning and compositional question answering. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6700–6709, 2019.
[10] Gabriel Ilharco, Marco Tulio Ribeiro, Mitchell Wortsman, Suchin Gururangan, Ludwig Schmidt, Hannaneh Hajishirzi, and Ali Farhadi. Editing models with task arithmetic. arXiv preprint arXiv:2212.04089, 2022.
[11] Bohao Li, Yuying Ge, Yixiao Ge, Guangzhi Wang, Rui Wang, Ruimao Zhang, and Ying Shan. Seed-bench: Benchmarking multimodal large language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13299–13308, 2024.
[12] Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Yanwei Li, Ziwei Liu, and Chunyuan Li. Llava-onevision: Easy visual task transfer. arXiv preprint arXiv:2408.03326, 2024.
[13] Jian Li and Weiheng Lu. A survey on benchmarks of multimodal large language models. arXiv preprint arXiv:2408.08632, 2024.
[14] Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 26296–26306, 2024.
[15] Yuliang Liu, Zhang Li, Biao Yang, Chunyuan Li, Xucheng Yin, Cheng-lin Liu, Lianwen Jin, and Xiang Bai. On the hidden mystery of ocr in large multimodal models. arXiv preprint arXiv:2305.07895, 2023.
[16] Kenneth Marino, Mohammad Rastegari, Ali Farhadi, and Roozbeh Mottaghi. Ok-vqa: A visual question answering benchmark requiring external knowledge. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 3195–3204, 2019.
[17] Nils Reimers and Iryna Gurevych. Sentence-bert: Sentence embeddings using siamese bert-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 2019.
[18] Amanpreet Singh, Vivek Natarajan, Meet Shah, Yu Jiang, Xinlei Chen, Dhruv Batra, Devi Parikh, and Marcus Rohrbach. Towards vqa models that can read. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8317–8326, 2019.
[19] Yi-Lin Sung, Linjie Li, Kevin Lin, Zhe Gan, Mohit Bansal, and Lijuan Wang. An empirical study of multimodal model merging. arXiv preprint arXiv:2304.14933, 2023.
[20] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. Llama: Open and efficient foundation language models.
[21] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems. Curran Associates, Inc., 2017.
[22] Fanqi Wan, Xinting Huang, Deng Cai, Xiaojun Quan, Wei Bi, and Shuming Shi. Knowledge fusion of large language models, 2024.
[23] Fanqi Wan, Ziyi Yang, Longguang Zhong, Xiaojun Quan, Xinting Huang, and Wei Bi. Knowledge fusion of chat llms: A preliminary technical report, 2024.
[24] Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution. arXiv preprint arXiv:2409.12191, 2024.
[25] Weihan Wang, Qingsong Lv, Wenmeng Yu, Wenyi Hong, Ji Qi, Yan Wang, Junhui Ji, Zhuoyi Yang, Lei Zhao, Xixuan Song, et al. Cogvlm: Visual expert for pretrained language models. arXiv preprint arXiv:2311.03079, 2023.
[26] Prateek Yadav, Derek Tam, Leshem Choshen, Colin A Raffel, and Mohit Bansal. TIES-Merging: Resolving interference when merging models. Advances in Neural Information Processing Systems, 36, 2024.
[27] An Yang, Baosong Yang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Zhou, Chengpeng Li, Chengyuan Li, Dayiheng Liu, Fei Huang, Guanting Dong, Haoran Wei, Huan Lin, Jialong Tang, Jialin Wang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Ma, Jianxin Yang, Jin Xu, Jingren Zhou, Jinze Bai, Jinzheng He, Junyang Lin, Kai Dang, Keming Lu, Keqin Chen, Kexin Yang, M...
[28] Qinghao Ye, Haiyang Xu, Jiabo Ye, Ming Yan, Anwen Hu, Haowei Liu, Qi Qian, Ji Zhang, and Fei Huang. mplug-owl2: Revolutionizing multi-modal large language model with modality collaboration. In 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 13040–13051. IEEE, 2024.
[29] Le Yu, Bowen Yu, Haiyang Yu, Fei Huang, and Yongbin Li. Language models are super mario: Absorbing abilities from homologous models as a free lunch. In Forty-first International Conference on Machine Learning, 2024.
[30] Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, et al. MMMU: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9556–9567, 2024.
[31] Kaichen Zhang, Bo Li, Peiyuan Zhang, Fanyi Pu, Joshua Adrian Cahyono, Kairui Hu, Shuai Liu, Yuanhan Zhang, Jingkang Yang, Chunyuan Li, and Ziwei Liu. Lmms-eval: Reality check on the evaluation of large multimodal models, 2024.
[32] Yuyan Zhou, Liang Song, Bingning Wang, and Weipeng Chen. MetaGPT: Merging large language models using model exclusive task arithmetic. arXiv preprint arXiv:2406.11385, 2024.
A. Results on Additional Model Pairs
We conducted experiments on additional model pairs, summarized in Table 9, which highlights the cumulative performance gains across tasks fo...
into CogVLM [25] (Table 2), (3) merging mPLUG-Owl2 into LLaVA-v1.5, (4) merging LLaVA-v1.5 into mPLUG-Owl2 [28] (Table 11), (5) merging CogVLM into mPLUG-Owl2 (Table 12) and (6) merging mPLUG-Owl2 into CogVLM (Table 13). The performance gain for each task is computed as the difference between the performance of our method (or baselines) and the average...