DiM³: Bridging Multilingual and Multimodal Models via Direction- and Magnitude-Aware Merging
Pith reviewed 2026-05-21 09:07 UTC · model grok-4.3
The pith
DiM3 adds multilingual capabilities to multimodal models by selectively merging residual updates based on direction and magnitude at each parameter.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
DiM3 bridges multilingual and multimodal models by direction- and magnitude-aware merging of residual updates in the language model backbone. This selective composition at each parameter dimension preserves the vision encoder and projector while enhancing multilingual performance across text-only and vision-language tasks, as shown in experiments on LLaVA- and Qwen-based models covering 57 languages.
What carries the argument
Direction- and Magnitude-aware Multilingual Multimodal merging (DiM3), which analyzes the direction and magnitude of the two residual updates to set per-dimension composition weights and thereby reduce destructive interference in the shared parameters.
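As a rough illustration of what a per-column, direction- and magnitude-aware composition could look like, the sketch below follows the weighting quoted from the paper elsewhere on this page, ω = ½(s_mag + s_dir), applied to residual updates that are added to a base weight matrix. Everything else is an assumption made for illustration and is not taken from the paper: the function names, measuring the magnitude change as the change in column norm, measuring the direction change as one minus cosine similarity, and normalizing each score across the sources.

```python
import math


def column(matrix, j):
    """Return column j of a row-major matrix given as a list of lists."""
    return [row[j] for row in matrix]


def norm(v):
    return math.sqrt(sum(x * x for x in v))


def cosine(u, v):
    nu, nv = norm(u), norm(v)
    if nu == 0.0 or nv == 0.0:
        return 1.0  # assumption: a zero vector counts as "no directional change"
    return sum(a * b for a, b in zip(u, v)) / (nu * nv)


def normalise(scores):
    """Scale non-negative scores to sum to 1; fall back to equal shares."""
    total = sum(scores)
    if total == 0.0:
        return [1.0 / len(scores)] * len(scores)
    return [s / total for s in scores]


def merge_columns(base, deltas):
    """Merge residual updates into `base`, one column at a time.

    base:   weight matrix (list of rows).
    deltas: list of residual updates, each with the same shape as `base`.
    The salience definitions below are illustrative assumptions.
    """
    rows, cols = len(base), len(base[0])
    merged = [row[:] for row in base]
    for j in range(cols):
        w = column(base, j)
        d_cols = [column(d, j) for d in deltas]
        shifted = [[a + b for a, b in zip(w, d)] for d in d_cols]
        # Assumed salience signals: change in column norm and in column direction.
        delta_mag = [abs(norm(s) - norm(w)) for s in shifted]
        delta_dir = [max(0.0, 1.0 - cosine(s, w)) for s in shifted]
        s_mag = normalise(delta_mag)
        s_dir = normalise(delta_dir)
        # Weighting as quoted on this page: omega = 1/2 (s_mag + s_dir).
        omega = [0.5 * (m + d) for m, d in zip(s_mag, s_dir)]
        for i in range(rows):
            merged[i][j] = w[i] + sum(o * d[i] for o, d in zip(omega, d_cols))
    return merged
```

Under the normalization assumed here, the weights for each column sum to 1, so a column that neither update touches is left unchanged and a column that only one update touches receives that update in full.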
Load-bearing premise
Multilingual and multimodal residual updates differ enough in direction and magnitude that a selective per-dimension rule can combine them without destructive interference.
What would settle it
Rerun the same benchmarks with the two residual updates combined by uniform averaging instead of the direction-and-magnitude rule. If uniform averaging yields equal or higher multilingual accuracy while preserving multimodal scores, the selective rule adds no value.
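The uniform-averaging control is simple to state in code. A minimal sketch, assuming both residual updates are flattened to equal-length lists defined relative to the same base weights; the function name and the toy values are illustrative only:

```python
def uniform_average_merge(base, delta_a, delta_b):
    """Add the plain mean of two residual updates to the base weights."""
    return [b + 0.5 * (da + db) for b, da, db in zip(base, delta_a, delta_b)]


# Toy example: the updates disagree in sign on the first coordinate
# and agree on the second.
base = [0.0, 0.0]
delta_a = [0.4, 0.2]
delta_b = [-0.4, 0.2]
averaged = uniform_average_merge(base, delta_a, delta_b)
```

In this toy case the first coordinate cancels to zero while the second keeps its shared value, which is the kind of interference a selective rule is meant to reduce. Whether that matters in practice is exactly what the benchmark comparison would show.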
Original abstract
Towards more general and human-like intelligence, large language models should seamlessly integrate both multilingual and multimodal capabilities; however, extending an existing multimodal model to many languages typically requires expensive multilingual multimodal data construction and repeated end-to-end retraining. We study a training-free alternative: injecting multilingual capability into an existing multimodal model by composing residual updates in the shared language model backbone. The key challenge is that multilingual and multimodal updates are heterogeneous, reflecting different functional roles in the shared model. To address this, we propose Direction- and Magnitude-aware Multilingual Multimodal merging (DiM3), which selectively composes the two updates at each parameter dimension while preserving the original vision encoder and multimodal projector. Experiments on multilingual benchmarks in both text-only and vision-language settings, covering 57 languages across LLaVA- and Qwen-based backbones, show that DiM3 consistently outperforms existing merging baselines, substantially improves multilingual performance over the original multimodal model, and remains competitive with dedicated multilingual multimodal fine-tuning while largely retaining general multimodal ability. We further show that DiM3 can be directly applied to already trained multilingual multimodal models and still yield additional gains. Further interpretability analysis shows that DiM3 primarily reshapes intermediate-layer semantic representations, strengthening cross-lingual alignment under both text-only and multimodal inputs while preserving higher-layer task-sensitive structure. Our repository is on https://github.com/wzj1718/DiM3.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes DiM³, a training-free method for composing multilingual and multimodal residual updates via a per-dimension direction- and magnitude-aware rule applied to the shared language-model backbone of models such as LLaVA and Qwen. Experiments across text-only and vision-language multilingual benchmarks spanning 57 languages show that DiM³ outperforms existing merging baselines, substantially improves multilingual performance relative to the original multimodal model, remains competitive with dedicated multilingual multimodal fine-tuning, largely retains general multimodal ability, and can yield further gains when applied to already-trained multilingual multimodal models. Interpretability analysis indicates that the method primarily reshapes intermediate-layer semantic representations to strengthen cross-lingual alignment while preserving higher-layer task structure.
Significance. If the empirical results hold, the work demonstrates a practical, low-cost route to extending multimodal models to many languages without constructing large multilingual multimodal datasets or performing end-to-end retraining. The breadth of evaluation (multiple backbones, 57 languages, both text-only and vision-language settings, plus comparisons to baselines and full fine-tuning) and the public code repository constitute clear strengths for reproducibility and adoption.
major comments (2)
- [Abstract / Experiments] Abstract and experimental results: the reported performance gains are presented without accompanying information on statistical significance, standard deviations across multiple runs, or the precise hyperparameter values used for the merging coefficients; these omissions make it difficult to judge the robustness of the central claim that DiM³ consistently outperforms baselines across 57 languages.
- [Method] Method description: the selective per-dimension composition rule presupposes sufficient heterogeneity between multilingual and multimodal residual updates to avoid destructive interference, yet the manuscript provides no quantitative diagnostic (e.g., cosine similarity or magnitude histograms per layer) that would allow readers to verify when this premise holds.
minor comments (2)
- [Interpretability analysis] Figure captions and axis labels in the interpretability plots could be expanded to clarify which layers correspond to the reported semantic-alignment improvements.
- [Method] A brief statement of the exact number of parameters updated by each residual (multilingual vs. multimodal) would help readers assess the scale of the merging operation.
Simulated Author's Rebuttal
We thank the referee for the positive evaluation and constructive comments. We address the two major comments point by point below, with revisions planned where they strengthen the manuscript without altering its core contributions.
-
Referee: [Abstract / Experiments] Abstract and experimental results: the reported performance gains are presented without accompanying information on statistical significance, standard deviations across multiple runs, or the precise hyperparameter values used for the merging coefficients; these omissions make it difficult to judge the robustness of the central claim that DiM³ consistently outperforms baselines across 57 languages.
Authors: We thank the referee for highlighting this. DiM³ is a fully deterministic, training-free procedure: given fixed residual updates and fixed coefficients, the output is identical across runs, so standard deviations from repeated executions do not apply in the manner they do for stochastic fine-tuning. We will nevertheless strengthen the presentation by adding a table (or appendix) that reports the exact numerical values of all merging coefficients (direction- and magnitude-scaling factors) used in every experiment and backbone. For statistical significance, the breadth of the 57-language evaluation already shows consistent directional gains; we will add a short note on cross-language consistency in the revised text. These changes address the robustness concern while remaining proportionate to a minor revision.
Revision: partial
-
Referee: [Method] Method description: the selective per-dimension composition rule presupposes sufficient heterogeneity between multilingual and multimodal residual updates to avoid destructive interference, yet the manuscript provides no quantitative diagnostic (e.g., cosine similarity or magnitude histograms per layer) that would allow readers to verify when this premise holds.
Authors: We agree that explicit diagnostics would help readers assess the heterogeneity premise. In the revised manuscript we will insert a short analysis subsection (or appendix) that reports (i) layer-wise cosine similarities between the multilingual and multimodal residual vectors and (ii) per-layer magnitude histograms or summary statistics. These quantities are already computable from the updates we used and will be added without new experiments. The added material will directly illustrate the degree of directional and magnitude divergence that motivates the per-dimension rule.
Revision: yes
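The diagnostics discussed in this exchange can be computed directly from the two sets of residual updates. The sketch below is one possible form, assuming each model's updates are available as a mapping from layer name to a flattened list of floats. The function names, the dictionary layout, and the extra sign-conflict statistic are illustrative assumptions, not something the paper or the rebuttal specifies.

```python
import math


def l2_norm(v):
    return math.sqrt(sum(x * x for x in v))


def cosine_similarity(u, v):
    nu, nv = l2_norm(u), l2_norm(v)
    if nu == 0.0 or nv == 0.0:
        return 0.0  # undefined for a zero vector; reported as 0.0 by convention here
    return sum(a * b for a, b in zip(u, v)) / (nu * nv)


def layer_diagnostics(multilingual, multimodal):
    """Compare two residual updates layer by layer.

    Both arguments map a layer name to a flattened residual update.
    Only layers present in both mappings are reported.
    """
    report = {}
    for name in multilingual:
        if name not in multimodal:
            continue
        u, v = multilingual[name], multimodal[name]
        conflicts = sum(1 for a, b in zip(u, v) if a * b < 0)
        report[name] = {
            "cosine": cosine_similarity(u, v),
            "l2_multilingual": l2_norm(u),
            "l2_multimodal": l2_norm(v),
            # Share of coordinates where the two updates have opposite signs.
            "sign_conflict_rate": conflicts / len(u) if u else 0.0,
        }
    return report
```

A cosine near zero together with very different norms would be consistent with the heterogeneity premise, while a cosine near one would suggest the two updates are largely redundant.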
Circularity Check
No significant circularity identified
The paper proposes an explicit, training-free merging rule (DiM3) that selectively composes independently obtained residual updates based on per-dimension direction and magnitude. All performance claims are empirical measurements on held-out benchmarks (multilingual text and vision-language tasks across 57 languages, LLaVA/Qwen backbones) rather than quantities derived from the merging equations themselves. No self-citation chains, fitted parameters renamed as predictions, or self-definitional loops appear in the method definition or central results. The composition rule is defined directly from the updates and tested independently, making the derivation self-contained against external benchmarks.
Lean theorems connected to this paper
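- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: We use both $\delta^{\text{mag}}_{k,j}$ and $\delta^{\text{dir}}_{k,j}$ to estimate source salience... $\omega_{k,j} = \tfrac{1}{2}\left(s^{\text{mag}}_{k,j} + s^{\text{dir}}_{k,j}\right)$ ... $\widetilde{W}_{:,j} = W_{N,:,j} + \sum_{k} \omega_{k,j}\,\Delta(W)^{k}_{:,j}$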
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.