AdaMMS: Model Merging for Heterogeneous Multimodal Large Language Models with Unsupervised Coefficient Optimization

Yiyang Du, Xiaochen Wang, Chi Chen, Jiabo Ye, Yiru Wang, Peng Li, Ming Yan, Ji Zhang, Fei Huang, Zhifang Sui, Maosong Sun, Yang Liu

arxiv: 2503.23733 · v1 · submitted 2025-03-31 · 💻 cs.CL · cs.CV

Pith reviewed 2026-05-22 22:43 UTC · model grok-4.3

keywords: model merging · heterogeneous MLLMs · multimodal large language models · unsupervised coefficient optimization · vision-language benchmarks · parameter interpolation · architecture mapping

The pith

AdaMMS merges heterogeneous multimodal LLMs by mapping architectures, interpolating weights, and selecting coefficients without labeled data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents AdaMMS as a method to combine capabilities from multimodal large language models that differ in architecture and parameter distributions. It proceeds in three steps: a mapping function aligns models with mismatched structures, linear interpolation adjusts for asymmetry in the weight space, and an unsupervised search optimizes the merging coefficients. A sympathetic reader would care because this removes the need for task-specific labels or identical model copies, allowing reuse of existing models to build stronger vision-language systems. The experiments test the approach across multiple model pairs and report gains over earlier merging techniques on standard benchmarks.

Core claim

AdaMMS tackles merging of heterogeneous MLLMs through a mapping function that aligns differing architectures, followed by linear interpolation of weights to address parameter-space asymmetry, and an unsupervised hyper-parameter search that selects coefficients without requiring labeled data; this process yields merged models that outperform prior merging methods on various vision-language benchmarks.

What carries the argument

Three-step pipeline of architecture mapping, linear weight interpolation, and unsupervised coefficient search.
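
The first two steps of that pipeline can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the parameter names, the `KEY_MAP` table, the rule that unmatched parameters keep the base model's values, and the convention that α = 0 returns the base model are all hypothetical choices made here. Weights are plain Python lists so the block runs without dependencies.

```python
from typing import Dict, List

Weights = Dict[str, List[float]]

# Hypothetical mapping from the other model's parameter names to the base model's.
KEY_MAP = {
    "decoder.block0.attn.weight": "llm.layers.0.attn.weight",
    "decoder.block0.mlp.weight": "llm.layers.0.mlp.weight",
}


def map_to_base(other: Weights, key_map: Dict[str, str]) -> Weights:
    """Rename the other model's parameters into the base model's namespace.

    Parameters without a counterpart (for example a model-specific vision
    module) are dropped, so the base model keeps its own copy of them.
    """
    return {key_map[name]: values for name, values in other.items() if name in key_map}


def interpolate(base: Weights, mapped: Weights, alpha: float) -> Weights:
    """Return (1 - alpha) * base + alpha * mapped for parameters both models share."""
    merged: Weights = {}
    for name, base_values in base.items():
        other_values = mapped.get(name)
        if other_values is None or len(other_values) != len(base_values):
            # Assumption: with no valid counterpart, keep the base weights unchanged.
            merged[name] = list(base_values)
            continue
        merged[name] = [
            (1.0 - alpha) * b + alpha * o for b, o in zip(base_values, other_values)
        ]
    return merged
```

A real checkpoint would hold tensors rather than lists, and a mismatch in shape would need the kind of projection the paper's mapping step is meant to supply; the sketch simply falls back to the base weights in that case.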

If this is right

Merged models combine abilities from MLLMs that have incompatible architectures.
The merging process requires no labeled task data for coefficient selection.
Performance gains appear across multiple model combinations on vision-language tasks.
Linear interpolation actively compensates for parameter asymmetry after mapping.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The method could reduce the need to train new multimodal models from scratch by allowing reuse of existing heterogeneous checkpoints.
Extending the unsupervised search to merge more than two models at once would test scalability beyond pairwise combinations.
If the mapping step generalizes, the same pipeline might apply to other modality pairs such as audio-language models.

Load-bearing premise

The mapping function and linear interpolation step can sufficiently resolve architectural differences and parameter-space asymmetry so that the subsequent unsupervised search produces useful merged models.

What would settle it

Running the full AdaMMS pipeline on a pair of heterogeneous MLLMs and finding that the merged model scores no higher than the stronger individual model on multiple vision-language benchmarks would falsify the central claim.
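
The test described above can be written as a small check. This is a sketch only: the function name is hypothetical, it assumes higher scores are better on every benchmark, and it reads "the stronger individual model" per benchmark, which is one possible interpretation.

```python
from typing import Dict, List


def benchmarks_not_improved(
    merged: Dict[str, float],
    model_a: Dict[str, float],
    model_b: Dict[str, float],
) -> List[str]:
    """List benchmarks where the merged model does not beat both originals.

    Only benchmarks reported for all three models are compared.
    Assumes higher scores are better.
    """
    shared = sorted(set(merged) & set(model_a) & set(model_b))
    return [
        name
        for name in shared
        if merged[name] <= max(model_a[name], model_b[name])
    ]
```

If the returned list covers most of the evaluated benchmarks for a given model pair, that pair counts as evidence against the claim.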

Figures

Figures reproduced from arXiv: 2503.23733 by Chi Chen, Fei Huang, Jiabo Ye, Ji Zhang, Maosong Sun, Ming Yan, Peng Li, Xiaochen Wang, Yang Liu, Yiru Wang, Yiyang Du, Zhifang Sui.

**Figure 1.** (a) Illustration of three steps in AdaMMS: Step-1, mapping MLLMs with different model architecture; Step-2, merging MLLMs…

**Figure 3.** Model responses with the change of α in linear interpolation. Similar colors indicate similar responses.

Body text captured with this caption: "… that the model generates consistently near the parameters of the base model with small α, and collapses gradually with larger α. We attribute the phenomenon to the asymmetry in the parameter space, as the two original models have unequal status that comes from the choice of base architecture."

**Figure 4.** Results on linear interpolation at different granularities of…

**Figure 5.** Generation consistency and model performance (score) for MME, MMMU, OCRBench and SeedBench when merging LLaVA…
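
The captions for Figures 3 and 5 suggest that response consistency across values of α is tracked alongside benchmark score. The sketch below assumes, without confirmation from this page, that a coefficient could be chosen from unlabeled prompts by preferring the α whose responses agree most with those at neighbouring α values. The function names are hypothetical, and exact string match is a stand-in for whatever similarity measure the paper actually uses.

```python
from typing import Dict, List


def agreement(a: List[str], b: List[str]) -> float:
    """Fraction of prompts on which two aligned response lists match exactly."""
    if len(a) != len(b) or not a:
        raise ValueError("response lists must be non-empty and aligned")
    return sum(x.strip() == y.strip() for x, y in zip(a, b)) / len(a)


def select_alpha(responses: Dict[float, List[str]]) -> float:
    """Pick the candidate alpha whose responses agree most with its neighbours.

    `responses` maps each candidate alpha to the merged model's outputs on the
    same unlabeled prompts. Ties go to the smallest alpha.
    """
    alphas = sorted(responses)
    if len(alphas) < 2:
        raise ValueError("need at least two candidate alphas")
    best_alpha, best_score = alphas[0], -1.0
    for i, alpha in enumerate(alphas):
        neighbours = [alphas[j] for j in (i - 1, i + 1) if 0 <= j < len(alphas)]
        score = sum(
            agreement(responses[alpha], responses[n]) for n in neighbours
        ) / len(neighbours)
        if score > best_score:
            best_alpha, best_score = alpha, score
    return best_alpha
```

No labels enter the selection: only the merged model's own outputs at each candidate coefficient are compared.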

Original abstract

Recently, model merging methods have demonstrated powerful strengths in combining abilities on various tasks from multiple Large Language Models (LLMs). While previous model merging methods mainly focus on merging homogeneous models with identical architecture, they meet challenges when dealing with Multimodal Large Language Models (MLLMs) with inherent heterogeneous property, including differences in model architecture and the asymmetry in the parameter space. In this work, we propose AdaMMS, a novel model merging method tailored for heterogeneous MLLMs. Our method tackles the challenges in three steps: mapping, merging and searching. Specifically, we first design mapping function between models to apply model merging on MLLMs with different architecture. Then we apply linear interpolation on model weights to actively adapt the asymmetry in the heterogeneous MLLMs. Finally in the hyper-parameter searching step, we propose an unsupervised hyper-parameter selection method for model merging. As the first model merging method capable of merging heterogeneous MLLMs without labeled data, extensive experiments on various model combinations demonstrated that AdaMMS outperforms previous model merging methods on various vision-language benchmarks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

AdaMMS sketches a three-step process for merging heterogeneous MLLMs without labels, but the mapping step is described too vaguely to judge whether it actually aligns the models.

The paper's core idea is to merge MLLMs that differ in architecture and parameter layout by first mapping them, then applying linear interpolation to handle asymmetry, and finally running an unsupervised search for the merge coefficients. This extends earlier homogeneous merging work into a setting that matters for practical multimodal models, where people often want to combine off-the-shelf vision-language models without retraining or labeled data. The unsupervised coefficient step is a sensible direction if the earlier alignment holds.

Referee Report

2 major / 0 minor

Summary. The paper proposes AdaMMS, a three-step model merging method for heterogeneous Multimodal Large Language Models (MLLMs): (1) a mapping function that aligns models with differing architectures, (2) linear interpolation of weights to adapt to parameter-space asymmetry, and (3) an unsupervised hyper-parameter search that selects the merging coefficients. It claims to be the first method that merges heterogeneous MLLMs without labeled data, and it reports that experiments on various model combinations show gains over prior merging methods on vision-language benchmarks.

Significance. If the claims hold after addressing the alignment and experimental details, the result would be significant as the first unsupervised merging approach for architecturally heterogeneous MLLMs, potentially allowing flexible combination of multimodal capabilities without task-specific labeled data or retraining. The unsupervised coefficient optimization is a positive feature that distinguishes it from supervised alternatives.

major comments (2)

[Abstract] Abstract (three-step process): The mapping function is invoked to 'design mapping function between models to apply model merging on MLLMs with different architecture,' yet no concrete mechanism is specified for establishing layer correspondence, dimension projection, or handling mismatched components such as vision encoders and LLM backbones. This is load-bearing for the central claim, because without a valid alignment the linear interpolation produces weights outside the meaningful parameter space, rendering the unsupervised search unable to recover useful coefficients and undermining the reported outperformance.
[Abstract] Abstract (experiments): The claim that 'extensive experiments on various model combinations demonstrated that AdaMMS outperforms previous model merging methods on various vision-language benchmarks' is presented without any description of the model pairs, baselines, benchmarks, metrics, or statistical analysis. This prevents verification that gains are attributable to the mapping/interpolation/search components rather than implementation artifacts.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the abstract. The comments highlight areas where the abstract could be more self-contained. We will revise the abstract in the next version to incorporate brief but concrete descriptions of the mapping mechanism and experimental setup, while ensuring the full details remain in the body of the paper. Below we address each major comment.

Referee: [Abstract] Abstract (three-step process): The mapping function is invoked to 'design mapping function between models to apply model merging on MLLMs with different architecture,' yet no concrete mechanism is specified for establishing layer correspondence, dimension projection, or handling mismatched components such as vision encoders and LLM backbones. This is load-bearing for the central claim, because without a valid alignment the linear interpolation produces weights outside the meaningful parameter space, rendering the unsupervised search unable to recover useful coefficients and undermining the reported outperformance.

Authors: We agree the abstract is too terse on this point. Section 3.1 of the manuscript details the mapping function, which establishes layer correspondence by matching components with similar functional roles across architectures, applies dimension projection via linear transformations to align parameter spaces, and handles mismatched components (vision encoders and LLM backbones) by treating them as separate modules with independent mappings before merging. We will revise the abstract to include a concise clause summarizing these alignment steps so that the three-step process is clearer on first reading.

revision: yes

Referee: [Abstract] Abstract (experiments): The claim that 'extensive experiments on various model combinations demonstrated that AdaMMS outperforms previous model merging methods on various vision-language benchmarks' is presented without any description of the model pairs, baselines, benchmarks, metrics, or statistical analysis. This prevents verification that gains are attributable to the mapping/interpolation/search components rather than implementation artifacts.

Authors: We concur that the abstract's experimental claim would benefit from additional context. The full manuscript (Section 4) specifies the model pairs, baselines (including prior merging methods), benchmarks, metrics, and statistical analysis. We will revise the abstract to briefly note the scope of the experiments (e.g., multiple heterogeneous MLLM combinations evaluated on standard vision-language benchmarks) to improve verifiability without exceeding abstract length limits.

revision: yes

Circularity Check

0 steps flagged

Empirical three-step procedure with no self-referential reductions or fitted predictions

The paper describes an empirical method consisting of a mapping function, linear interpolation for parameter asymmetry, and an unsupervised hyper-parameter search. No equation, derivation, or claim in the abstract reduces an output (e.g., merged-model performance) to a quantity defined by the method itself or to a self-citation. The unsupervised search is presented as an optimization step over coefficients rather than a tautological prediction. No uniqueness theorems, ansatzes smuggled in via citation, or renamings of known results appear. The central claim rests on experimental outperformance rather than on any definitional equivalence, so it is tested against external benchmarks rather than against quantities the method defines.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Based solely on the abstract, the central claim rests on the unstated effectiveness of the mapping and interpolation steps plus the assumption that unsupervised search can replace labeled-data tuning; no free parameters, axioms, or invented entities are explicitly listed.

pith-pipeline@v0.9.0 · 5753 in / 1030 out tokens · 48612 ms · 2026-05-22T22:43:09.941885+00:00 · methodology

Reference graph

Works this paper leans on

33 extracted references · 33 canonical work pages · 8 internal anchors

[1] Chi Chen, Yiyang Du, Zheng Fang, Ziyue Wang, Fuwen Luo, Peng Li, Ming Yan, Ji Zhang, Fei Huang, Maosong Sun, and Yang Liu. Model composition for multimodal large language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2024.

[2] Lin Chen, Jinsong Li, Xiaoyi Dong, Pan Zhang, Conghui He, Jiaqi Wang, Feng Zhao, and Dahua Lin. ShareGPT4V: Improving large multi-modal models with better captions. arXiv preprint arXiv:2311.12793, 2023.

[3] Haodong Duan, Junming Yang, Yuxuan Qiao, Xinyu Fang, Lin Chen, Yuan Liu, Amit Agarwal, Zhe Chen, Mo Li, Yubo Ma, Hailong Sun, Xiangyu Zhao, Junbo Cui, Xiaoyi Dong, Yuhang Zang, Pan Zhang, Jiaqi Wang, Dahua Lin, and Kai Chen. VLMEvalKit: An open-source toolkit for evaluating large multi-modality models, 2024.

[4] Clémentine Fourrier, Nathan Habib, Alina Lozovskaya, Konrad Szafer, and Thomas Wolf. Open LLM leaderboard v2. https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard, 2024.

[5] Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Mengdan Zhang, Xu Lin, Jinrui Yang, Xiawu Zheng, Ke Li, Xing Sun, et al. MME: A comprehensive evaluation benchmark for multimodal large language models. arXiv preprint arXiv:2306.13394, 2023.

[6] Charles Goddard, Shamane Siriwardhana, Malikeh Ehghaghi, Luke Meyers, Vlad Karpukhin, Brian Benedict, Mark McQuade, and Jacob Solawetz. Arcee’s mergekit: A toolkit for merging large language models. arXiv preprint arXiv:2403.13257, 2024.

[7] Danna Gurari, Qing Li, Abigale J Stangl, Anhong Guo, Chi Lin, Kristen Grauman, Jiebo Luo, and Jeffrey P Bigham. VizWiz grand challenge: Answering visual questions from blind people. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3608–3617, 2018.

[8] Wenyi Hong, Weihan Wang, Ming Ding, Wenmeng Yu, Qingsong Lv, Yan Wang, Yean Cheng, Shiyu Huang, Junhui Ji, Zhao Xue, et al. CogVLM2: Visual language models for image and video understanding. arXiv preprint arXiv:2408.16500, 2024.

[9] Drew A Hudson and Christopher D Manning. GQA: A new dataset for real-world visual reasoning and compositional question answering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6700–6709, 2019.

[10] Gabriel Ilharco, Marco Tulio Ribeiro, Mitchell Wortsman, Suchin Gururangan, Ludwig Schmidt, Hannaneh Hajishirzi, and Ali Farhadi. Editing models with task arithmetic. arXiv preprint arXiv:2212.04089, 2022.

[11] Bohao Li, Yuying Ge, Yixiao Ge, Guangzhi Wang, Rui Wang, Ruimao Zhang, and Ying Shan. SEED-Bench: Benchmarking multimodal large language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13299–13308, 2024.

[12] Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Yanwei Li, Ziwei Liu, and Chunyuan Li. LLaVA-OneVision: Easy visual task transfer. arXiv preprint arXiv:2408.03326, 2024.

[13] Jian Li and Weiheng Lu. A survey on benchmarks of multimodal large language models. arXiv preprint arXiv:2408.08632, 2024.

[14] Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 26296–26306, 2024.

[15] Yuliang Liu, Zhang Li, Biao Yang, Chunyuan Li, Xucheng Yin, Cheng-lin Liu, Lianwen Jin, and Xiang Bai. OCRBench: On the hidden mystery of OCR in large multimodal models. arXiv preprint arXiv:2305.07895, 2023.

[16] Kenneth Marino, Mohammad Rastegari, Ali Farhadi, and Roozbeh Mottaghi. OK-VQA: A visual question answering benchmark requiring external knowledge. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3195–3204, 2019.

[17] Nils Reimers and Iryna Gurevych. Sentence-BERT: Sentence embeddings using siamese BERT-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 2019.

[18] Amanpreet Singh, Vivek Natarajan, Meet Shah, Yu Jiang, Xinlei Chen, Dhruv Batra, Devi Parikh, and Marcus Rohrbach. Towards VQA models that can read. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8317–8326, 2019.

[19] Yi-Lin Sung, Linjie Li, Kevin Lin, Zhe Gan, Mohit Bansal, and Lijuan Wang. An empirical study of multimodal model merging. arXiv preprint arXiv:2304.14933, 2023.

[20] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. Llama: Open and efficient foundation language models.

[21] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems. Curran Associates, Inc., 2017.

[22] Fanqi Wan, Xinting Huang, Deng Cai, Xiaojun Quan, Wei Bi, and Shuming Shi. Knowledge fusion of large language models, 2024.

[23] Fanqi Wan, Ziyi Yang, Longguang Zhong, Xiaojun Quan, Xinting Huang, and Wei Bi. Knowledge fusion of chat LLMs: A preliminary technical report, 2024.

[24] Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. Qwen2-VL: Enhancing vision-language model’s perception of the world at any resolution. arXiv preprint arXiv:2409.12191, 2024.

[25] Weihan Wang, Qingsong Lv, Wenmeng Yu, Wenyi Hong, Ji Qi, Yan Wang, Junhui Ji, Zhuoyi Yang, Lei Zhao, Xixuan Song, et al. CogVLM: Visual expert for pretrained language models. arXiv preprint arXiv:2311.03079, 2023.

[26] Prateek Yadav, Derek Tam, Leshem Choshen, Colin A Raffel, and Mohit Bansal. TIES-Merging: Resolving interference when merging models. Advances in Neural Information Processing Systems, 36, 2024.

[27] An Yang, Baosong Yang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Zhou, Chengpeng Li, Chengyuan Li, Dayiheng Liu, Fei Huang, Guanting Dong, Haoran Wei, Huan Lin, Jialong Tang, Jialin Wang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Ma, Jianxin Yang, Jin Xu, Jingren Zhou, Jinze Bai, Jinzheng He, Junyang Lin, Kai Dang, Keming Lu, Keqin Chen, Kexin Yang, M... Qwen2 technical report, 2024.

[28] Qinghao Ye, Haiyang Xu, Jiabo Ye, Ming Yan, Anwen Hu, Haowei Liu, Qi Qian, Ji Zhang, and Fei Huang. mPLUG-Owl2: Revolutionizing multi-modal large language model with modality collaboration. In 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 13040–13051. IEEE, 2024.

[29] Le Yu, Bowen Yu, Haiyang Yu, Fei Huang, and Yongbin Li. Language models are super mario: Absorbing abilities from homologous models as a free lunch. In Forty-first International Conference on Machine Learning, 2024.

[30] Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, et al. MMMU: A massive multi-discipline multimodal understanding and reasoning benchmark for expert AGI. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9556–9567, 2024.

[31] Kaichen Zhang, Bo Li, Peiyuan Zhang, Fanyi Pu, Joshua Adrian Cahyono, Kairui Hu, Shuai Liu, Yuanhan Zhang, Jingkang Yang, Chunyuan Li, and Ziwei Liu. LMMs-Eval: Reality check on the evaluation of large multimodal models, 2024.

[32] Yuyan Zhou, Liang Song, Bingning Wang, and Weipeng Chen. MetaGPT: Merging large language models using model exclusive task arithmetic. arXiv preprint arXiv:2406.11385, 2024.

A. Results on Additional Model Pairs. We conducted experiments on additional model pairs, summarized in Table 9, which highlights the cumulative performance gains across tasks fo...

[33] into CogVLM [25] (Table 2), (3) merging mPLUG-Owl2 into LLaVA-v1.5, (4) merging LLaVA-v1.5 into mPLUG-Owl2 [28] (Table 11), (5) merging CogVLM into mPLUG-Owl2 (Table 12) and (6) merging mPLUG-Owl2 into CogVLM (Table 13). The performance gain for each task is computed as the difference between the performance of our method (or baselines) and the average...

[1] [1]

Model composition for multimodal large language models

Chi Chen, Yiyang Du, Zheng Fang, Ziyue Wang, Fuwen Luo, Peng Li, Ming Yan, Ji Zhang, Fei Huang, Maosong Sun, and Yang Liu. Model composition for multimodal large language models. In Proceedings of the 62nd Annual Meet- ing of the Association for Computational Linguistics (Volume 1: Long Papers), 2024

work page 2024

[2] [2]

ShareGPT4V: Improving Large Multi-Modal Models with Better Captions

Lin Chen, Jinsong Li, Xiaoyi Dong, Pan Zhang, Conghui He, Jiaqi Wang, Feng Zhao, and Dahua Lin. Sharegpt4v: Improving large multi-modal models with better captions. arXiv preprint arXiv:2311.12793, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[3] [3]

Vlmevalkit: An open-source toolkit for evaluating large multi-modality models, 2024

Haodong Duan, Junming Yang, Yuxuan Qiao, Xinyu Fang, Lin Chen, Yuan Liu, Amit Agarwal, Zhe Chen, Mo Li, Yubo Ma, Hailong Sun, Xiangyu Zhao, Junbo Cui, Xiaoyi Dong, Yuhang Zang, Pan Zhang, Jiaqi Wang, Dahua Lin, and Kai Chen. Vlmevalkit: An open-source toolkit for evaluating large multi-modality models, 2024

work page 2024

[4] [4]

Open llm leaderboard v2

Cl ´ementine Fourrier, Nathan Habib, Alina Lozovskaya, Konrad Szafer, and Thomas Wolf. Open llm leaderboard v2. https://huggingface.co/spaces/open- llm- leaderboard/open_llm_leaderboard, 2024

work page 2024

[5] [5]

MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models

Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Mengdan Zhang, Xu Lin, Jinrui Yang, Xiawu Zheng, Ke Li, Xing Sun, et al. Mme: A comprehensive evaluation bench- mark for multimodal large language models. arXiv preprint arXiv:2306.13394, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[6] [6]

Arcee’s mergekit: A toolkit for merging large language models

Charles Goddard, Shamane Siriwardhana, Malikeh Ehghaghi, Luke Meyers, Vlad Karpukhin, Brian Bene- dict, Mark McQuade, and Jacob Solawetz. Arcee’s mergekit: A toolkit for merging large language models. arXiv preprint arXiv:2403.13257, 2024

work page arXiv 2024

[7] [7]

Vizwiz grand challenge: Answering visual questions from blind people

Danna Gurari, Qing Li, Abigale J Stangl, Anhong Guo, Chi Lin, Kristen Grauman, Jiebo Luo, and Jeffrey P Bigham. Vizwiz grand challenge: Answering visual questions from blind people. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3608–3617, 2018

work page 2018

[8] [8]

CogVLM2: Visual Language Models for Image and Video Understanding

Wenyi Hong, Weihan Wang, Ming Ding, Wenmeng Yu, Qingsong Lv, Yan Wang, Yean Cheng, Shiyu Huang, Jun- hui Ji, Zhao Xue, et al. CogVLM2: Visual language mod- els for image and video understanding. arXiv preprint arXiv:2408.16500, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[9] [9]

Gqa: A new dataset for real-world visual reasoning and compositional question answering

Drew A Hudson and Christopher D Manning. Gqa: A new dataset for real-world visual reasoning and compositional question answering. In Proceedings of the IEEE/CVF con- ference on computer vision and pattern recognition , pages 6700–6709, 2019

work page 2019

[10] [10]

Editing Models with Task Arithmetic

Gabriel Ilharco, Marco Tulio Ribeiro, Mitchell Wortsman, Suchin Gururangan, Ludwig Schmidt, Hannaneh Hajishirzi, and Ali Farhadi. Editing models with task arithmetic. arXiv preprint arXiv:2212.04089, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022

[11] [11]

Seed-bench: Bench- marking multimodal large language models

Bohao Li, Yuying Ge, Yixiao Ge, Guangzhi Wang, Rui Wang, Ruimao Zhang, and Ying Shan. Seed-bench: Bench- marking multimodal large language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pat- tern Recognition, pages 13299–13308, 2024

work page 2024

[12] [12]

LLaVA-OneVision: Easy Visual Task Transfer

Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Yanwei Li, Ziwei Liu, and Chunyuan Li. Llava-onevision: Easy visual task transfer. arXiv preprint arXiv:2408.03326, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[13] [13]

A survey on benchmarks of multimodal large language models

Jian Li and Weiheng Lu. A survey on benchmarks of multimodal large language models. arXiv preprint arXiv:2408.08632, 2024

work page arXiv 2024

[14] [14]

Improved baselines with visual instruction tuning

Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. In Pro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 26296–26306, 2024

work page 2024

[15] [15]

OCRBench: On the Hidden Mystery of OCR in Large Multimodal Models

Yuliang Liu, Zhang Li, Biao Yang, Chunyuan Li, Xucheng Yin, Cheng-lin Liu, Lianwen Jin, and Xiang Bai. On the hidden mystery of ocr in large multimodal models. arXiv preprint arXiv:2305.07895, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[16] [16]

Ok-vqa: A visual question answering benchmark requiring external knowledge

Kenneth Marino, Mohammad Rastegari, Ali Farhadi, and Roozbeh Mottaghi. Ok-vqa: A visual question answering benchmark requiring external knowledge. In Proceedings of the IEEE/cvf conference on computer vision and pattern recognition, pages 3195–3204, 2019

work page 2019

[17] [17]

Sentence-bert: Sentence embeddings using siamese bert-networks

Nils Reimers and Iryna Gurevych. Sentence-bert: Sentence embeddings using siamese bert-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Lan- guage Processing. Association for Computational Linguis- tics, 2019

work page 2019

[18] [18]

Towards vqa models that can read

Amanpreet Singh, Vivek Natarajan, Meet Shah, Yu Jiang, Xinlei Chen, Dhruv Batra, Devi Parikh, and Marcus Rohrbach. Towards vqa models that can read. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8317–8326, 2019

work page 2019

[19] [19]

An empirical study of multimodal model merging

Yi-Lin Sung, Linjie Li, Kevin Lin, Zhe Gan, Mohit Bansal, and Lijuan Wang. An empirical study of multimodal model merging. arXiv preprint arXiv:2304.14933, 2023

work page arXiv 2023

[20] [20]

Llama: Open and efficient foundation lan- guage models

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Mar- tinet, Marie-Anne Lachaux, Timoth’ee Lacroix, Baptiste Rozi‘ere, Naman Goyal, Eric Hambro, Faisal Azhar, Aure- lien Rodriguez, Armand Joulin, Edouard Grave, and Guil- laume Lample. Llama: Open and efficient foundation lan- guage models

work page

[21] [21]

Attention is all you need

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszko- reit, Llion Jones, Aidan N Gomez, Ł ukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neu- ral Information Processing Systems. Curran Associates, Inc., 2017

work page 2017

[22] [22]

Knowledge fusion of large language models, 2024

Fanqi Wan, Xinting Huang, Deng Cai, Xiaojun Quan, Wei Bi, and Shuming Shi. Knowledge fusion of large language models, 2024

work page 2024

[23] [23]

Knowledge fusion of chat llms: A preliminary technical report, 2024

Fanqi Wan, Ziyi Yang, Longguang Zhong, Xiaojun Quan, Xinting Huang, and Wei Bi. Knowledge fusion of chat llms: A preliminary technical report, 2024

[24]

Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution. arXiv preprint arXiv:2409.12191, 2024

[25]

CogVLM: Visual Expert for Pretrained Language Models

Weihan Wang, Qingsong Lv, Wenmeng Yu, Wenyi Hong, Ji Qi, Yan Wang, Junhui Ji, Zhuoyi Yang, Lei Zhao, Xixuan Song, et al. Cogvlm: Visual expert for pretrained language models. arXiv preprint arXiv:2311.03079, 2023

[26]

TIES-Merging: Resolving interference when merging models

Prateek Yadav, Derek Tam, Leshem Choshen, Colin A Raffel, and Mohit Bansal. TIES-Merging: Resolving interference when merging models. Advances in Neural Information Processing Systems, 36, 2024

[27]

Qwen2 technical report, 2024

An Yang, Baosong Yang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Zhou, Chengpeng Li, Chengyuan Li, Dayiheng Liu, Fei Huang, Guanting Dong, Haoran Wei, Huan Lin, Jialong Tang, Jialin Wang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Ma, Jianxin Yang, Jin Xu, Jingren Zhou, Jinze Bai, Jinzheng He, Junyang Lin, Kai Dang, Keming Lu, Keqin Chen, Kexin Yang, M...

[28]

mPLUG-Owl2: Revolutionizing multi-modal large language model with modality collaboration

Qinghao Ye, Haiyang Xu, Jiabo Ye, Ming Yan, Anwen Hu, Haowei Liu, Qi Qian, Ji Zhang, and Fei Huang. mPLUG-Owl2: Revolutionizing multi-modal large language model with modality collaboration. In 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 13040–13051. IEEE, 2024

[29]

Language models are super mario: Absorbing abilities from homologous models as a free lunch

Le Yu, Bowen Yu, Haiyang Yu, Fei Huang, and Yongbin Li. Language models are super mario: Absorbing abilities from homologous models as a free lunch. In Forty-first International Conference on Machine Learning, 2024

[30]

MMMU: A massive multi-discipline multimodal understanding and reasoning benchmark for expert AGI

Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, et al. MMMU: A massive multi-discipline multimodal understanding and reasoning benchmark for expert AGI. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9556–9567, 2024

[31]

LMMs-Eval: Reality check on the evaluation of large multimodal models, 2024

Kaichen Zhang, Bo Li, Peiyuan Zhang, Fanyi Pu, Joshua Adrian Cahyono, Kairui Hu, Shuai Liu, Yuanhan Zhang, Jingkang Yang, Chunyuan Li, and Ziwei Liu. LMMs-Eval: Reality check on the evaluation of large multimodal models, 2024

[32]

MetaGPT: Merging large language models using model exclusive task arithmetic

Yuyan Zhou, Liang Song, Bingning Wang, and Weipeng Chen. MetaGPT: Merging large language models using model exclusive task arithmetic. arXiv preprint arXiv:2406.11385, 2024

A. Results on Additional Model Pairs

We conducted experiments on additional model pairs, summarized in Table 9, which highlights the cumulative performance gains across tasks fo...


into CogVLM [25] (Table 2), (3) merging mPLUG-Owl2 into LLaVA-v1.5, (4) merging LLaVA-v1.5 into mPLUG-Owl2 [28] (Table 11), (5) merging CogVLM into mPLUG-Owl2 (Table 12) and (6) merging mPLUG-Owl2 into CogVLM (Table 13). The performance gain for each task is computed as the difference between the performance of our method (or baselines) and the average...
