Recognition: 2 theorem links
· Lean Theorem
DiM³: Bridging Multilingual and Multimodal Models via Direction- and Magnitude-Aware Merging
Pith reviewed 2026-05-14 20:20 UTC · model grok-4.3
The pith
Direction- and magnitude-aware selective merging of residual updates injects multilingual capability into a multimodal model without any training.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
DiM³ selectively composes multilingual and multimodal residual updates at each parameter dimension using direction- and magnitude-aware weighting, yielding a model with both capabilities from existing models without any additional training.
What carries the argument
Direction- and magnitude-aware merging of residual updates, which decides the composition ratio per dimension to preserve useful features from each update.
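As a concrete illustration, here is a minimal sketch of per-dimension direction- and magnitude-aware composition of two residual updates. It is not the paper's exact DiM³ rule; the scoring choices, the column-wise orientation, and the function name `merge_residuals_sketch` are assumptions chosen only to mirror the averaged score ω quoted further down this page.

```python
# A minimal sketch, not the paper's exact DiM³ procedure: per-dimension
# direction- and magnitude-aware composition of two residual updates against
# a shared base weight matrix. The scoring choices below are assumptions.
import torch

def merge_residuals_sketch(w_base: torch.Tensor,
                           delta_multiling: torch.Tensor,
                           delta_multimodal: torch.Tensor,
                           eps: float = 1e-8) -> torch.Tensor:
    # Direction signal per column j: cosine similarity of the two updates,
    # rescaled to [0, 1] so that agreement raises the multilingual share.
    s_dir = 0.5 * (1.0 + torch.cosine_similarity(delta_multiling, delta_multimodal, dim=0))
    # Magnitude signal per column j: relative norm of the multilingual update.
    n_ml = delta_multiling.norm(dim=0)
    n_mm = delta_multimodal.norm(dim=0)
    s_mag = n_ml / (n_ml + n_mm + eps)
    # Composition ratio per column, mirroring the averaged score
    # omega_{k,j} = (s_mag_{k,j} + s_dir_{k,j}) / 2 quoted from the paper.
    omega = 0.5 * (s_mag + s_dir)
    # Selective composition instead of uniform averaging.
    merged_delta = omega * delta_multiling + (1.0 - omega) * delta_multimodal
    return w_base + merged_delta

# Usage: for each shared language-model weight W in the backbone, compute
# delta_* = W_finetuned - W_base, then replace W with the merged result.
```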
If this is right
- DiM3 outperforms existing merging baselines on multilingual benchmarks.
- It substantially boosts multilingual performance compared to the original multimodal model.
- It remains competitive with dedicated multilingual multimodal fine-tuning while retaining general multimodal ability.
- It can be applied to already-trained multilingual multimodal models to gain further improvements.
- It reshapes intermediate-layer semantic representations to strengthen cross-lingual alignment under text and multimodal inputs.
Where Pith is reading between the lines
- This selective merging might be useful for combining other types of model adaptations beyond language and vision.
- Reducing the cost of building multilingual multimodal systems could accelerate development of more inclusive AI tools.
- The focus on intermediate layers suggests that alignment tasks in models may be best addressed at mid-depth representations.
Load-bearing premise
The multilingual and multimodal residual updates differ in their directions and magnitudes in a way that allows selective composition without causing interference in the shared parameters.
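This premise can be probed directly before merging. A minimal diagnostic sketch follows, assuming both fine-tuned checkpoints share the base backbone's parameter shapes; the helper name `heterogeneity_profile` and the summary statistics are illustrative, not from the paper.

```python
# Minimal diagnostic sketch: summarize how differently the two residual updates
# point (direction) and scale (magnitude) per column of a shared weight matrix.
import torch

def heterogeneity_profile(w_base, w_multiling, w_multimodal, eps: float = 1e-8):
    d_ml = w_multiling - w_base          # multilingual residual update
    d_mm = w_multimodal - w_base         # multimodal residual update
    cos = torch.cosine_similarity(d_ml, d_mm, dim=0)         # per-column direction agreement
    ratio = d_ml.norm(dim=0) / (d_mm.norm(dim=0) + eps)      # per-column magnitude ratio
    return {
        "cos_mean": float(cos.mean()),
        "cos_frac_negative": float((cos < 0).float().mean()),
        "log_ratio_std": float(ratio.clamp_min(eps).log().std()),
    }

# Broad, near-zero direction agreement and dispersed magnitude ratios would be
# consistent with the premise that the two updates occupy largely
# non-interfering dimensions; strong agreement everywhere would not.
```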
What would settle it
If the DiM³-merged model underperforms the original multimodal model on multimodal tasks, or underperforms a comparable multilingual model on language-only tasks, that would indicate the merging does not work as claimed.
Original abstract
Towards more general and human-like intelligence, large language models should seamlessly integrate both multilingual and multimodal capabilities; however, extending an existing multimodal model to many languages typically requires expensive multilingual multimodal data construction and repeated end-to-end retraining. We study a training-free alternative: injecting multilingual capability into an existing multimodal model by composing residual updates in the shared language model backbone. The key challenge is that multilingual and multimodal updates are heterogeneous, reflecting different functional roles in the shared model. To address this, we propose Direction- and Magnitude-aware Multilingual Multimodal merging (DiM3), which selectively composes the two updates at each parameter dimension while preserving the original vision encoder and multimodal projector. Experiments on multilingual benchmarks in both text-only and vision-language settings, covering 57 languages across LLaVA- and Qwen-based backbones, show that DiM3 consistently outperforms existing merging baselines, substantially improves multilingual performance over the original multimodal model, and remains competitive with dedicated multilingual multimodal fine-tuning while largely retaining general multimodal ability. We further show that DiM3 can be directly applied to already trained multilingual multimodal models and still yield additional gains. Further interpretability analysis shows that DiM3 primarily reshapes intermediate-layer semantic representations, strengthening cross-lingual alignment under both text-only and multimodal inputs while preserving higher-layer task-sensitive structure. Our repository is on https://github.com/wzj1718/DiM3.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces DiM³, a training-free method for merging multilingual and multimodal residual updates in the shared language model backbone of existing multimodal models. By selectively composing updates per parameter dimension based on direction (cosine similarity) and magnitude (norm) signals, DiM³ aims to preserve the original vision encoder and multimodal projector while enhancing multilingual capabilities. Experiments across LLaVA- and Qwen-based backbones on benchmarks covering 57 languages demonstrate that DiM³ outperforms existing merging baselines, substantially improves multilingual performance over the original model, remains competitive with dedicated fine-tuning, and can be applied to already multilingual multimodal models for additional gains. Interpretability analysis indicates that DiM³ primarily affects intermediate-layer semantic representations to strengthen cross-lingual alignment under text-only and multimodal inputs.
Significance. If the results hold, DiM³ provides an efficient, training-free approach to bridge multilingual and multimodal capabilities in large models, reducing the need for expensive data construction and retraining. The method's applicability to multiple backbones and its ability to retain general multimodal abilities while improving multilingual performance make it potentially impactful for developing more general AI systems. The inclusion of interpretability analysis and extension to pre-trained multilingual multimodal models adds to its value, offering insights into how selective merging affects model representations.
major comments (1)
- [Experiments and Interpretability Analysis] The central assumption that multilingual and multimodal residual updates are sufficiently heterogeneous to allow per-dimension direction/magnitude selection without destructive interference on multimodal inputs is load-bearing but only indirectly supported. The interpretability analysis reports improved cross-lingual alignment in intermediate layers, yet no ablation replaces the selection rule with uniform averaging and measures representation drift or task performance specifically on combined text+image inputs (see Experiments and Interpretability sections).
minor comments (1)
- [Abstract] Notation for the method name is rendered as DiM³ in the title but appears as DiM^{3} in the abstract; ensure consistent superscript formatting throughout.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. The central concern about directly validating the heterogeneity assumption via ablation on multimodal inputs is well-taken, and we will strengthen the manuscript accordingly while preserving the original claims supported by existing results.
Point-by-point responses
-
Referee: [Experiments and Interpretability Analysis] The central assumption that multilingual and multimodal residual updates are sufficiently heterogeneous to allow per-dimension direction/magnitude selection without destructive interference on multimodal inputs is load-bearing but only indirectly supported. The interpretability analysis reports improved cross-lingual alignment in intermediate layers, yet no ablation replaces the selection rule with uniform averaging and measures representation drift or task performance specifically on combined text+image inputs (see Experiments and Interpretability sections).
Authors: We agree that an explicit ablation replacing the direction/magnitude selection with uniform averaging, evaluated on representation drift and task performance under combined text+image inputs, would provide more direct support for the heterogeneity assumption. The current manuscript shows that DiM³ largely retains multimodal benchmark performance (VQA, captioning, etc.) while improving multilingual results across 57 languages, and the interpretability analysis indicates preserved higher-layer task structure; these results are consistent with limited destructive interference. To address the point directly, we will add the requested ablation in the revised Experiments and Interpretability sections, reporting cosine similarity drift and downstream multimodal task metrics for both the selective rule and uniform averaging.
Revision: yes
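A hedged sketch of the comparison the authors agree to add, under the assumption that representation drift is measured as the mean per-example cosine distance between hidden states of the merged and original models on the same text+image inputs. The helper names `uniform_average` and `representation_drift` are illustrative, not from the paper.

```python
# Illustrative ablation harness: the selective rule versus uniform averaging,
# scored by representation drift on combined text+image inputs. Hidden-state
# extraction is model-specific and omitted here.
import torch

def uniform_average(w_base: torch.Tensor,
                    delta_a: torch.Tensor,
                    delta_b: torch.Tensor) -> torch.Tensor:
    # Baseline: replace the per-dimension selective rule with plain 0.5/0.5 averaging.
    return w_base + 0.5 * (delta_a + delta_b)

def representation_drift(h_merged: torch.Tensor, h_original: torch.Tensor) -> float:
    # Mean cosine distance between hidden states of the merged model and the
    # original multimodal model, one row per text+image example.
    cos = torch.cosine_similarity(h_merged, h_original, dim=-1)
    return float((1.0 - cos).mean())

# For the revised ablation, report representation_drift and the downstream
# multimodal task metrics side by side for both merging rules.
```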
Circularity Check
No significant circularity in derivation chain
Full rationale
The paper defines DiM³ as an independent merging procedure that selectively composes residual updates using per-dimension cosine similarity and norm signals. This construction is stated directly from the heterogeneity assumption and is validated on external multilingual and multimodal benchmarks (57 languages, LLaVA/Qwen backbones) without any reduction of the central claim to fitted parameters, self-citations, or renamed known results. No equations or steps in the provided text equate outputs to inputs by construction; the method remains self-contained against external evaluation.
Axiom & Free-Parameter Ledger
free parameters (1)
- direction and magnitude selection thresholds
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · tag: unclear
The relation between the paper passage and the cited Recognition theorem is unclear.
Paper passage: "DiM³ performs selective composition rather than uniform aggregation, adaptively assigning their contributions based on local geometric importance in parameter space... $\omega_{k,j} = \tfrac{1}{2}\bigl(s^{\mathrm{mag}}_{k,j} + s^{\mathrm{dir}}_{k,j}\bigr)$"
-
IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean · costAlphaLog_fourth_deriv_at_zero · tag: unclear
The relation between the paper passage and the cited Recognition theorem is unclear.
Paper passage: "Proposition 3.1 (Residual geometry)... $\|W_{k,:,j} - W_{N,:,j}\|_2^2 = \bigl(\delta^{\mathrm{mag}}_{k,j}\bigr)^2 + 2\,m_k(j)\,m_N(j)\,\delta^{\mathrm{dir}}_{k,j}$"
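The quoted identity follows from elementary vector geometry once each column is split into a magnitude and a unit direction. A short derivation, assuming $\delta^{\mathrm{mag}}_{k,j}$ is the difference of column norms and $\delta^{\mathrm{dir}}_{k,j}$ is one minus the columns' cosine similarity (a reading consistent with the quoted form; the paper's own definitions should be checked):

```latex
% Write each column as magnitude times unit direction:
%   W_{k,:,j} = m_k(j)\,\hat{u}_{k,j}, \qquad W_{N,:,j} = m_N(j)\,\hat{u}_{N,j}.
\|W_{k,:,j} - W_{N,:,j}\|_2^2
  = m_k(j)^2 + m_N(j)^2 - 2\,m_k(j)\,m_N(j)\,\hat{u}_{k,j}^{\top}\hat{u}_{N,j}
  = \underbrace{\bigl(m_k(j) - m_N(j)\bigr)^2}_{(\delta^{\mathrm{mag}}_{k,j})^2}
    + 2\,m_k(j)\,m_N(j)\,
      \underbrace{\bigl(1 - \hat{u}_{k,j}^{\top}\hat{u}_{N,j}\bigr)}_{\delta^{\mathrm{dir}}_{k,j}}.
```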
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.