arxiv: 2605.12960 · v2 · pith:KD37OJPHnew · submitted 2026-05-13 · 💻 cs.CL

DiM³: Bridging Multilingual and Multimodal Models via Direction- and Magnitude-Aware Merging

Pith reviewed 2026-05-21 09:07 UTC · model grok-4.3

classification 💻 cs.CL
keywords multilingual multimodal merging · direction and magnitude aware · training-free adaptation · residual update composition · cross-lingual alignment · vision-language models · model merging · parameter selective composition

The pith

DiM3 adds multilingual capabilities to multimodal models by selectively merging residual updates based on direction and magnitude at each parameter.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes DiM3 as a training-free approach to equip existing multimodal models with support for many languages. It does this by composing updates from multilingual and multimodal training in the shared language-model backbone while leaving the vision encoder and projector unchanged. The method decides how to combine the updates at every parameter dimension according to their directions and magnitudes to limit interference. Experiments across LLaVA- and Qwen-based models and 57 languages show gains over standard merging methods, better multilingual results than the original multimodal model, and performance close to full multilingual multimodal fine-tuning. The same procedure can be applied to already multilingual multimodal models for extra improvement and primarily affects intermediate-layer representations to strengthen cross-lingual alignment.

Core claim

DiM3 bridges multilingual and multimodal models by direction- and magnitude-aware merging of residual updates in the language model backbone. This selective composition at each parameter dimension preserves the vision encoder and projector while enhancing multilingual performance across text-only and vision-language tasks, as shown in experiments on LLaVA- and Qwen-based models covering 57 languages.

What carries the argument

Direction- and Magnitude-aware Multilingual Multimodal merging (DiM3), which analyzes the direction and magnitude of the two residual updates to set per-dimension composition weights and thereby reduce destructive interference in the shared parameters.
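
As a rough illustration of what a per-dimension, direction- and magnitude-aware rule can look like, the sketch below merges two residual updates coordinate by coordinate. The specific rule (add the entries when their signs agree, keep only the larger-magnitude entry when they conflict) and the names `merge_residuals`, `apply_merge`, `delta_ml`, and `delta_mm` are assumptions made for illustration; this summary does not state DiM3's actual composition weights.

```python
def merge_residuals(delta_ml, delta_mm):
    """Toy per-dimension merge of two residual updates (flat lists of floats).

    Hypothetical rule, NOT the paper's exact method:
      - same sign, or one entry is zero: add the two entries;
      - opposite signs: keep only the entry with the larger magnitude
        (ties go to the multimodal entry).
    """
    if len(delta_ml) != len(delta_mm):
        raise ValueError("residuals must have the same length")
    merged = []
    for a, b in zip(delta_ml, delta_mm):
        if a * b >= 0:
            merged.append(a + b)  # directions agree: compose both updates
        else:
            merged.append(a if abs(a) > abs(b) else b)  # conflict: larger wins
    return merged


def apply_merge(base, delta_ml, delta_mm):
    """Add the merged residual to the shared base weights."""
    return [w + d for w, d in zip(base, merge_residuals(delta_ml, delta_mm))]


# Made-up toy values, chosen only to exercise each branch of the rule.
base = [1.0, 1.0, 1.0, 1.0]
delta_ml = [0.5, -0.25, 0.0, 0.125]   # multilingual residual
delta_mm = [0.25, 0.75, -0.5, -0.5]   # multimodal residual

print(merge_residuals(delta_ml, delta_mm))    # [0.75, 0.75, -0.5, -0.5]
print(apply_merge(base, delta_ml, delta_mm))  # [1.75, 1.75, 0.5, 0.5]
```

In a real model the residuals would be tensors from the language-model backbone only, since the vision encoder and projector are left unchanged.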

Load-bearing premise

Multilingual and multimodal residual updates differ enough in direction and magnitude that a selective per-dimension rule can combine them without destructive interference.

What would settle it

Running the same benchmarks with uniform averaging of the two residual updates instead of the direction-and-magnitude rule and obtaining equal or higher multilingual accuracy while preserving multimodal scores would falsify the value of the selective rule.
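
The uniform-averaging control described above can be sketched in a few lines. The function name `uniform_average_merge` and the flat-list weight representation are illustrative assumptions, and the toy numbers are made up; the point is only that averaging treats every dimension identically, so conflicting updates partly or fully cancel.

```python
def uniform_average_merge(base, theta_ml, theta_mm):
    """Baseline merge: add the plain average of the two residuals to the base.

    All three arguments are flat lists of floats of equal length
    (a simplifying assumption; real checkpoints hold many tensors).
    """
    merged = []
    for w, w_ml, w_mm in zip(base, theta_ml, theta_mm):
        delta_ml = w_ml - w  # multilingual residual for this dimension
        delta_mm = w_mm - w  # multimodal residual for this dimension
        merged.append(w + 0.5 * (delta_ml + delta_mm))
    return merged


base = [1.0, 2.0]
theta_ml = [1.5, 1.0]  # residuals: +0.5, -1.0
theta_mm = [2.0, 3.0]  # residuals: +1.0, +1.0

# The second dimension has opposing residuals, which cancel under averaging.
print(uniform_average_merge(base, theta_ml, theta_mm))  # [1.75, 2.0]
```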

Figures

Figures reproduced from arXiv: 2605.12960 by Daling Wang, Ercong Nie, Hinrich Schütze, Mengjie Zhao, Mingyang Wang, Shi Feng, Xiaocui Yang, Yongkang Liu, Zijing Wang.

Figure 1: Residual heterogeneity in the shared language model backbone. The panels show residual norm, base-relative reorientation, and cross-residual alignment for ∆ml and ∆mm across layers and modules. Together, these diagnostics reveal that multilingual and multimodal adaptations differ in both update magnitude and geometry, motivating selective rather than uniform composition.
Figure 2: Results on three general multimodal benchmarks
Figure 3: t-SNE visualizations of average-pooled hidden states on multilingual text inputs from XNLI
Figure 4: Layer-wise silhouette scores of multilingual hidden-state representations under multilingual
Figure 5: t-SNE visualizations of average-pooled hidden states for the question spans in multilingual
Figure 6: Additional t-SNE visualizations of average-pooled hidden states on multilingual text inputs
Figure 7: Full layer-wise t-SNE visualizations of average-pooled hidden states on multilingual text
Figure 8: Full layer-wise t-SNE visualizations of average-pooled hidden states on multilingual text
Figure 9: Full layer-wise t-SNE visualizations of average-pooled hidden states on multilingual text
Original abstract

Towards more general and human-like intelligence, large language models should seamlessly integrate both multilingual and multimodal capabilities; however, extending an existing multimodal model to many languages typically requires expensive multilingual multimodal data construction and repeated end-to-end retraining. We study a training-free alternative: injecting multilingual capability into an existing multimodal model by composing residual updates in the shared language model backbone. The key challenge is that multilingual and multimodal updates are heterogeneous, reflecting different functional roles in the shared model. To address this, we propose Direction- and Magnitude-aware Multilingual Multimodal merging (DiM3), which selectively composes the two updates at each parameter dimension while preserving the original vision encoder and multimodal projector. Experiments on multilingual benchmarks in both text-only and vision-language settings, covering 57 languages across LLaVA- and Qwen-based backbones, show that DiM3 consistently outperforms existing merging baselines, substantially improves multilingual performance over the original multimodal model, and remains competitive with dedicated multilingual multimodal fine-tuning while largely retaining general multimodal ability. We further show that DiM3 can be directly applied to already trained multilingual multimodal models and still yield additional gains. Further interpretability analysis shows that DiM3 primarily reshapes intermediate-layer semantic representations, strengthening cross-lingual alignment under both text-only and multimodal inputs while preserving higher-layer task-sensitive structure. Our repository is on https://github.com/wzj1718/DiM3.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes DiM³, a training-free method for composing multilingual and multimodal residual updates via a per-dimension direction- and magnitude-aware rule applied to the shared language-model backbone of models such as LLaVA and Qwen. Experiments across text-only and vision-language multilingual benchmarks spanning 57 languages show that DiM³ outperforms existing merging baselines, substantially improves multilingual performance relative to the original multimodal model, remains competitive with dedicated multilingual multimodal fine-tuning, largely retains general multimodal ability, and can yield further gains when applied to already-trained multilingual multimodal models. Interpretability analysis indicates that the method primarily reshapes intermediate-layer semantic representations to strengthen cross-lingual alignment while preserving higher-layer task structure.

Significance. If the empirical results hold, the work demonstrates a practical, low-cost route to extending multimodal models to many languages without constructing large multilingual multimodal datasets or performing end-to-end retraining. The breadth of evaluation (multiple backbones, 57 languages, both text-only and vision-language settings, plus comparisons to baselines and full fine-tuning) and the public code repository constitute clear strengths for reproducibility and adoption.

major comments (2)
  1. [Abstract / Experiments] Abstract and experimental results: the reported performance gains are presented without accompanying information on statistical significance, standard deviations across multiple runs, or the precise hyperparameter values used for the merging coefficients; these omissions make it difficult to judge the robustness of the central claim that DiM³ consistently outperforms baselines across 57 languages.
  2. [Method] Method description: the selective per-dimension composition rule presupposes sufficient heterogeneity between multilingual and multimodal residual updates to avoid destructive interference, yet the manuscript provides no quantitative diagnostic (e.g., cosine similarity or magnitude histograms per layer) that would allow readers to verify when this premise holds.
minor comments (2)
  1. [Interpretability analysis] Figure captions and axis labels in the interpretability plots could be expanded to clarify which layers correspond to the reported semantic-alignment improvements.
  2. [Method] A brief statement of the exact number of parameters updated by each residual (multilingual vs. multimodal) would help readers assess the scale of the merging operation.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive evaluation and constructive comments. We address the two major comments point by point below, with revisions planned where they strengthen the manuscript without altering its core contributions.

Point-by-point responses
  1. Referee: [Abstract / Experiments] Abstract and experimental results: the reported performance gains are presented without accompanying information on statistical significance, standard deviations across multiple runs, or the precise hyperparameter values used for the merging coefficients; these omissions make it difficult to judge the robustness of the central claim that DiM³ consistently outperforms baselines across 57 languages.

    Authors: We thank the referee for highlighting this. DiM³ is a fully deterministic, training-free procedure: given fixed residual updates and fixed coefficients, the output is identical across runs, so standard deviations from repeated executions do not apply in the manner they do for stochastic fine-tuning. We will nevertheless strengthen the presentation by adding a table (or appendix) that reports the exact numerical values of all merging coefficients (direction- and magnitude-scaling factors) used in every experiment and backbone. For statistical significance, the breadth of the 57-language evaluation already shows consistent directional gains; we will add a short note on cross-language consistency in the revised text. These changes address the robustness concern while remaining proportionate to a minor revision.

    revision: partial

  2. Referee: [Method] Method description: the selective per-dimension composition rule presupposes sufficient heterogeneity between multilingual and multimodal residual updates to avoid destructive interference, yet the manuscript provides no quantitative diagnostic (e.g., cosine similarity or magnitude histograms per layer) that would allow readers to verify when this premise holds.

    Authors: We agree that explicit diagnostics would help readers assess the heterogeneity premise. In the revised manuscript we will insert a short analysis subsection (or appendix) that reports (i) layer-wise cosine similarities between the multilingual and multimodal residual vectors and (ii) per-layer magnitude histograms or summary statistics. These quantities are already computable from the updates we used and will be added without new experiments. The added material will directly illustrate the degree of directional and magnitude divergence that motivates the per-dimension rule.

    revision: yes
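
The layer-wise diagnostics discussed in this exchange (cosine similarity between the two residuals and their magnitudes) can be computed directly from three checkpoints. The sketch below is a minimal pure-Python version; the function names, the dict-of-flat-lists checkpoint layout, and the layer names are assumptions for illustration, and the toy weights are made up rather than taken from the paper.

```python
import math


def l2_norm(v):
    """Euclidean norm of a flat list of floats."""
    return math.sqrt(sum(x * x for x in v))


def cosine_similarity(u, v):
    """Cosine of the angle between u and v; 0.0 by convention if either is zero."""
    nu, nv = l2_norm(u), l2_norm(v)
    if nu == 0.0 or nv == 0.0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (nu * nv)


def residual_diagnostics(base, theta_ml, theta_mm):
    """Per-layer residual norms and cross-residual cosine similarity.

    Each argument maps a (hypothetical) layer name to a flat list of weights.
    """
    report = {}
    for name, w in base.items():
        d_ml = [a - b for a, b in zip(theta_ml[name], w)]  # multilingual residual
        d_mm = [a - b for a, b in zip(theta_mm[name], w)]  # multimodal residual
        report[name] = {
            "norm_ml": l2_norm(d_ml),
            "norm_mm": l2_norm(d_mm),
            "cosine": cosine_similarity(d_ml, d_mm),
        }
    return report


base = {"layer0": [0.0, 0.0], "layer1": [1.0, 1.0]}
theta_ml = {"layer0": [3.0, 4.0], "layer1": [1.0, 2.0]}
theta_mm = {"layer0": [6.0, 8.0], "layer1": [2.0, 1.0]}

for name, stats in residual_diagnostics(base, theta_ml, theta_mm).items():
    print(name, stats)
```

In this toy input, `layer0` has parallel residuals of different size (cosine 1, norms 5 and 10) and `layer1` has orthogonal residuals of equal size (cosine 0, norms 1 and 1), which are the two kinds of divergence the heterogeneity premise refers to.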

Circularity Check

0 steps flagged

No significant circularity identified

full rationale

The paper proposes an explicit, training-free merging rule (DiM3) that selectively composes independently obtained residual updates based on per-dimension direction and magnitude. All performance claims are empirical measurements on held-out benchmarks (multilingual text and vision-language tasks across 57 languages, LLaVA/Qwen backbones) rather than quantities derived from the merging equations themselves. No self-citation chains, fitted parameters renamed as predictions, or self-definitional loops appear in the method definition or central results. The composition rule is defined directly from the updates and tested independently, making the derivation self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The method relies on the empirical observation that multilingual and multimodal updates occupy different directions and magnitudes in parameter space; no new mathematical axioms or invented physical entities are introduced. The only potential free parameters are the per-layer or per-dimension weighting coefficients that implement the selective composition, but these are not enumerated in the abstract.

pith-pipeline@v0.9.0 · 5818 in / 1307 out tokens · 49156 ms · 2026-05-21T09:07:33.371434+00:00 · methodology



Reference graph

Works this paper leans on

71 extracted references · 71 canonical work pages · 10 internal anchors

  1. [1]

    OpenAI GPT-5 System Card

    Aaditya Singh, Adam Fry, Adam Perelman, Adam Tart, Adi Ganesh, Ahmed El-Kishky, Aidan McLaughlin, Aiden Low, AJ Ostrow, Akhila Ananthram, et al. Openai gpt-5 system card.arXiv preprint arXiv:2601.03267, 2025

  2. [2]

    Gemini: A Family of Highly Capable Multimodal Models

    Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: a family of highly capable multimodal models.arXiv preprint arXiv:2312.11805, 2023

  3. [3]

    InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency

    Weiyun Wang, Zhangwei Gao, Lixin Gu, Hengjun Pu, Long Cui, Xingguang Wei, Zhaoyang Liu, Linglin Jing, Shenglong Ye, Jie Shao, et al. Internvl3. 5: Advancing open-source multimodal models in versatility, reasoning, and efficiency.arXiv preprint arXiv:2508.18265, 2025

  4. [4]

    Qwen3 Technical Report

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025

  5. [5]

    xgqa: Cross-lingual visual question answering

    Jonas Pfeiffer, Gregor Geigle, Aishwarya Kamath, Jan-Martin O Steitz, Stefan Roth, Ivan Vuli´c, and Iryna Gurevych. xgqa: Cross-lingual visual question answering. InFindings of the association for computational linguistics: ACL 2022, pages 2497–2511, 2022

  6. [6]

    Maxm: Towards multilingual visual question answering

    Soravit Changpinyo, Linting Xue, Michal Yarom, Ashish Thapliyal, Idan Szpektor, Julien Amelot, Xi Chen, and Radu Soricut. Maxm: Towards multilingual visual question answering. InFindings of the Association for Computational Linguistics: EMNLP 2023, pages 2667–2682, 2023

  7. [7]

    M5–a diverse benchmark to assess the performance of large multimodal models across multilingual and multicultural vision-language tasks

    Florian Schneider and Sunayana Sitaram. M5–a diverse benchmark to assess the performance of large multimodal models across multilingual and multicultural vision-language tasks. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 4309–4345, 2024

  8. [8]

    Cvqa: Culturally-diverse multilingual visual question answering benchmark

    David Orlando Romero Mogrovejo, Chenyang Lyu, Haryo Akbarianto Wibowo, Santiago Gón- gora, Aishik Mandal, Sukannya Purkayastha, Jesus-German Ortiz-Barajas, Emilio Villa Cueva, Jinheon Baek, Soyeong Jeong, et al. Cvqa: Culturally-diverse multilingual visual question answering benchmark. InThe Thirty-eight Conference on Neural Information Processing Systems...

  9. [9]

    Pangea: A fully open multilingual multimodal llm for 39 languages

    Xiang Yue, Yueqi Song, Akari Asai, Seungone Kim, Jean de Dieu Nyandwi, Simran Khanuja, Anjali Kantharuban, Lintang Sutawika, Sathyanarayanan Ramamoorthy, and Graham Neu- big. Pangea: A fully open multilingual multimodal llm for 39 languages. InThe Thirteenth International Conference on Learning Representations, 2024

  10. [10]

    Parrot: Multilingual visual instruction tuning

    Hai-Long Sun, Da-Wei Zhou, Yang Li, Shiyin Lu, Chao Yi, Qing-Guo Chen, Zhao Xu, Weihua Luo, Kaifu Zhang, De-Chuan Zhan, et al. Parrot: Multilingual visual instruction tuning. In Forty-second International Conference on Machine Learning

  11. [11]

    mblip: Efficient bootstrapping of multilingual vision-llms

    Gregor Geigle, Abhay Jain, Radu Timofte, and Goran Glavaš. mblip: Efficient bootstrapping of multilingual vision-llms. InProceedings of the 3rd Workshop on Advances in Language and Vision Research (ALVR), pages 7–25, 2024. 10

  12. [12]

    Unlocking the potential of model merging for low-resource languages

    Mingxu Tao, Chen Zhang, Quzhe Huang, Tianyao Ma, Songfang Huang, Dongyan Zhao, and Yansong Feng. Unlocking the potential of model merging for low-resource languages. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen, editors,Findings of the Association for Computational Linguistics: EMNLP 2024, pages 8705–8720, Miami, Florida, USA, November

  13. [13]

    Association for Computational Linguistics

  14. [14]

    Rossi, and Thien Huu Nguyen

    Thuat Nguyen, Chien Van Nguyen, Viet Dac Lai, Hieu Man, Nghia Trung Ngo, Franck Dernoncourt, Ryan A. Rossi, and Thien Huu Nguyen. CulturaX: A cleaned, enormous, and multilingual dataset for large language models in 167 languages. In Nicoletta Calzolari, Min-Yen Kan, Veronique Hoste, Alessandro Lenci, Sakriani Sakti, and Nianwen Xue, editors, Proceedings o...

  15. [15]

    Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time

    Mitchell Wortsman, Gabriel Ilharco, Samir Ya Gadre, Rebecca Roelofs, Raphael Gontijo-Lopes, Ari S Morcos, Hongseok Namkoong, Ali Farhadi, Yair Carmon, Simon Kornblith, et al. Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time. InInternational conference on machine learning, pages 23965–23998. P...

  16. [16]

    Model tailor: mitigating catastrophic forgetting in multi-modal large language models

    Didi Zhu, Zhongyi Sun, Zexi Li, Tao Shen, Ke Yan, Shouhong Ding, Chao Wu, and Kun Kuang. Model tailor: mitigating catastrophic forgetting in multi-modal large language models. In Proceedings of the 41st International Conference on Machine Learning, pages 62581–62598, 2024

  17. [17]

    Sens-merging: Sensitivity-guided parameter balancing for merging large language models

    Shuqi Liu, Han Wu, Bowei He, Xiongwei Han, Mingxuan Yuan, and Linqi Song. Sens-merging: Sensitivity-guided parameter balancing for merging large language models. InFindings of the Association for Computational Linguistics: ACL 2025, pages 19243–19255, 2025

  18. [18]

    Twin-merging: Dynamic integration of modular expertise in model merging.Advances in Neural Information Processing Systems, 37:78905–78935, 2024

    Zhenyi Lu, Chenghao Fan, Wei Wei, Xiaoye Qu, Dangyang Chen, and Yu Cheng. Twin-merging: Dynamic integration of modular expertise in model merging.Advances in Neural Information Processing Systems, 37:78905–78935, 2024

  19. [19]

    Editing models with task arithmetic

    Gabriel Ilharco, Marco Tulio Ribeiro, Mitchell Wortsman, Ludwig Schmidt, Hannaneh Ha- jishirzi, and Ali Farhadi. Editing models with task arithmetic. InThe Eleventh International Conference on Learning Representations

  20. [20]

    Language models are super mario: Absorbing abilities from homologous models as a free lunch

    Le Yu, Bowen Yu, Haiyang Yu, Fei Huang, and Yongbin Li. Language models are super mario: Absorbing abilities from homologous models as a free lunch. InForty-first International Conference on Machine Learning, 2024

  21. [21]

    Ties-merging: Resolving interference when merging models.Advances in neural information processing systems, 36:7093–7115, 2023

    Prateek Yadav, Derek Tam, Leshem Choshen, Colin A Raffel, and Mohit Bansal. Ties-merging: Resolving interference when merging models.Advances in neural information processing systems, 36:7093–7115, 2023

  22. [22]

    One size does not fit all: A distribution-aware sparsification for more precise model merging, 2025

    Yingfeng Luo, Dingyang Lin, Junxin Wang, Ziqiang Xu, Kaiyan Chang, Tong Zheng, Bei Li, Anxiang Ma, Tong Xiao, Zhengtao Yu, and Jingbo Zhu. One size does not fit all: A distribution-aware sparsification for more precise model merging, 2025

  23. [23]

    Understanding cross-lingual alignment—a survey

    Katharina Hämmerl, Jindˇrich Libovick`y, and Alexander Fraser. Understanding cross-lingual alignment—a survey. InFindings of the Association for Computational Linguistics: ACL 2024, pages 10922–10943, 2024

  24. [24]

    mt5: A massively multilingual pre-trained text-to-text trans- former

    Linting Xue, Noah Constant, Adam Roberts, Mihir Kale, Rami Al-Rfou, Aditya Siddhant, Aditya Barua, and Colin Raffel. mt5: A massively multilingual pre-trained text-to-text trans- former. InProceedings of the 2021 conference of the North American chapter of the association for computational linguistics: Human language technologies, pages 483–498, 2021

  25. [25]

    Multilin- gual pretraining and instruction tuning improve cross-lingual knowledge alignment, but only shallowly

    Changjiang Gao, Hongda Hu, Peng Hu, Jiajun Chen, Jixing Li, and Shujian Huang. Multilin- gual pretraining and instruction tuning improve cross-lingual knowledge alignment, but only shallowly. In Kevin Duh, Helena Gomez, and Steven Bethard, editors,Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguis...

  26. [26]

    Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models

    Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. InInternational conference on machine learning, pages 19730–19742. PMLR, 2023

  27. [27]

    Visual instruction tuning.Advances in neural information processing systems, 36:34892–34916, 2023

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning.Advances in neural information processing systems, 36:34892–34916, 2023

  28. [28]

    Weight normalization: A simple reparameterization to accelerate training of deep neural networks.Advances in neural information processing systems, 29, 2016

    Tim Salimans and Durk P Kingma. Weight normalization: A simple reparameterization to accelerate training of deep neural networks.Advances in neural information processing systems, 29, 2016

  29. [29]

    Dora: Weight-decomposed low-rank adaptation

    Shih-Yang Liu, Chien-Yi Wang, Hongxu Yin, Pavlo Molchanov, Yu-Chiang Frank Wang, Kwang-Ting Cheng, and Min-Hung Chen. Dora: Weight-decomposed low-rank adaptation. In Forty-first International Conference on Machine Learning, 2024

  30. [30]

    The geometry of multilingual language model representations

    Tyler Chang, Zhuowen Tu, and Benjamin Bergen. The geometry of multilingual language model representations. InProceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 119–136, 2022

  31. [31]

    Language surgery in multilingual large language models

    Joanito Agili Lopo, Muhammad Ravi Shulthan Habibi, Tack Hwa Wong, Muhammad Ilham Ghozali, Fajri Koto, Genta Indra Winata, Peerat Limkonchotiwat, Alham Fikri Aji, and Samuel Cahyawijaya. Language surgery in multilingual large language models. InProceedings of the 5th Workshop on Multilingual Representation Learning (MRL 2025), pages 438–467, 2025

  32. [32]

    Do llamas work in english? on the latent language of multilingual transformers

    Chris Wendler, Veniamin Veselovsky, Giovanni Monea, and Robert West. Do llamas work in english? on the latent language of multilingual transformers. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 15366–15394, 2024

  33. [33]

    Understanding multilingualism in mixture-of-experts llms: Routing mechanism, expert specialization, and layerwise steering.arXiv preprint arXiv:2601.14050, 2026

    Yuxin Chen, Zhengzhou Cai, Xiangtian Ji, Weixiang Zhao, An Zhang, Xiang Wang, and Tat- Seng Chua. Understanding multilingualism in mixture-of-experts llms: Routing mechanism, expert specialization, and layerwise steering.arXiv preprint arXiv:2601.14050, 2026

  34. [34]

    From neurons to semantics: Evaluating cross-linguistic alignment capabilities of large language models via neurons alignment

    Chongxuan Huang, Yongshi Ye, Biao Fu, Qifeng Su, and Xiaodong Shi. From neurons to semantics: Evaluating cross-linguistic alignment capabilities of large language models via neurons alignment. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 28956–28974, 2025

  35. [35]

    Language on Demand, Knowledge at Core: Composing LLMs with Encoder-Decoder Translation Models for Extensible Multilinguality

    Mengyu Bu and Yang Feng. Language on demand, knowledge at core: Composing llms with encoder-decoder translation models for extensible multilinguality.arXiv preprint arXiv:2603.17512, 2026

  36. [36]

    Alignx: Advancing multilingual large language models with multilingual representation alignment

    Mengyu Bu, Shaolei Zhang, Zhongjun He, Hua Wu, and Yang Feng. Alignx: Advancing multilingual large language models with multilingual representation alignment. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 6471– 6500, 2025

  37. [37]

    How do large language models handle multilingualism?Advances in Neural Information Processing Systems, 37:15296–15319, 2024

    Yiran Zhao, Wenxuan Zhang, Guizhen Chen, Kenji Kawaguchi, and Lidong Bing. How do large language models handle multilingualism?Advances in Neural Information Processing Systems, 37:15296–15319, 2024

  38. [38]

    Flamingo: a visual language model for few-shot learning.Advances in neural information processing systems, 35:23716–23736, 2022

    Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning.Advances in neural information processing systems, 35:23716–23736, 2022

  39. [39]

    Plam: Training-free plateau-guided model merging for better visual grounding in mllms.arXiv preprint arXiv:2601.07645, 2026

    Zijing Wang, Yongkang Liu, Mingyang Wang, Ercong Nie, Deyuan Chen, Zhengjie Zhao, Shi Feng, Daling Wang, Xiaocui Yang, Yifei Zhang, et al. Plam: Training-free plateau-guided model merging for better visual grounding in mllms.arXiv preprint arXiv:2601.07645, 2026

  40. [40]

    MaLoRA: Gated Modality LoRA for Key-Space Alignment in Multimodal LLM Fine-Tuning

    Xinhan Zheng, Huyu Wu, Xueting Wang, and Haiyun Jiang. Unveiling intrinsic text bias in multimodal large language models through attention key-space analysis.arXiv preprint arXiv:2510.26721, 2025. 12

  41. [41]

    Palo: A polyglot large multimodal model for 5b people

    Hanoona Rasheed, Muhammad Maaz, Abdelrahman Shaker, Salman Khan, Hisham Cholakkal, Rao M Anwer, Tim Baldwin, Michael Felsberg, and Fahad S Khan. Palo: A polyglot large multimodal model for 5b people. InProceedings of the Winter Conference on Applications of Computer Vision, pages 1745–1754, 2025

  42. [42]

    Centurio: On drivers of multilingual ability of large vision- language model

    Gregor Geigle, Florian Schneider, Carolin Holtermann, Chris Biemann, Radu Timofte, Anne Lauscher, and Goran Glavaš. Centurio: On drivers of multilingual ability of large vision- language model. InProceedings of the 63rd Annual Meeting of the Association for Computa- tional Linguistics (Volume 1: Long Papers), pages 2831–2881, 2025

  43. [43]

    Breaking language barriers in visual language models via multilingual textual regularization

    Iñigo Pikabea, Iñaki Lacunza, Oriol Pareras Velasco, Carlos Escolano, Aitor Gonzalez-Agirre, Javier Hernando, and Marta Villegas. Breaking language barriers in visual language models via multilingual textual regularization. InProceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Ch...

  44. [44]

    Language-specific layer matters: Efficient multilingual enhancement for large vision-language models

    Yuchun Fan, Yilin Wang, Yongyu Mu, Lei Huang, Bei Li, Xiaocheng Feng, Tong Xiao, and Jingbo Zhu. Language-specific layer matters: Efficient multilingual enhancement for large vision-language models. InFindings of the Association for Computational Linguistics: EMNLP 2025, pages 12473–12500, 2025

  45. [45]

    Model merging in llms, mllms, and beyond: Methods, theories, applications, and opportu- nities.ACM Computing Surveys, 58(8):1–41, 2026

    Enneng Yang, Li Shen, Guibing Guo, Xingwei Wang, Xiaochun Cao, Jie Zhang, and Dacheng Tao. Model merging in llms, mllms, and beyond: Methods, theories, applications, and opportu- nities.ACM Computing Surveys, 58(8):1–41, 2026

  46. [46]

    Why do more experts fail? a theoretical analysis of model merging.arXiv preprint arXiv:2505.21226, 2025

    Zijing Wang, Xingle Xu, Yongkang Liu, Yiqun Zhang, Peiqin Lin, Shi Feng, Xiaocui Yang, Daling Wang, and Hinrich Schütze. Why do more experts fail? a theoretical analysis of model merging.arXiv preprint arXiv:2505.21226, 2025

  47. [47]

    Scaling intelligence through model merging: A comprehensive survey.Authorea Preprints, 2025

    Zijing Wang, Yongkang Liu, Yingfeng Luo, Ming Wang, Zhen Song, Shi Feng, Xiaocui Yang, Dingyang Lin, Daling Wang, Yifei Zhang, et al. Scaling intelligence through model merging: A comprehensive survey.Authorea Preprints, 2025

  48. [48]

    Localize-and-stitch: Efficient model merging via sparse task arithmetic.arXiv preprint arXiv:2408.13656, 2024

    Yifei He, Yuzheng Hu, Yong Lin, Tong Zhang, and Han Zhao. Localize-and-stitch: Efficient model merging via sparse task arithmetic.arXiv preprint arXiv:2408.13656, 2024

  49. [49]

    Whoever started the interference should end it: Guiding data-free model merging via task vectors

    Runxi Cheng, Feng Xiong, Yongxian Wei, Wanyun Zhu, and Chun Yuan. Whoever started the interference should end it: Guiding data-free model merging via task vectors. In Forty-second International Conference on Machine Learning

  50. [50]

    Adamms: Model merging for heterogeneous multimodal large language models with unsupervised coefficient optimization

    Yiyang Du, Xiaochen Wang, Chi Chen, Jiabo Ye, Yiru Wang, Peng Li, Ming Yan, Ji Zhang, Fei Huang, Zhifang Sui, et al. Adamms: Model merging for heterogeneous multimodal large language models with unsupervised coefficient optimization. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 9413–9422, 2025

  51. [51]

    Model breadcrumbs: Scaling multi-task model merging with sparse masks

    MohammadReza Davari and Eugene Belilovsky. Model breadcrumbs: Scaling multi-task model merging with sparse masks. In European Conference on Computer Vision, pages 270–287. Springer, 2024

  52. [52]

    Magmax: Leveraging model merging for seamless continual learning

    Daniel Marczak, Bartłomiej Twardowski, Tomasz Trzciński, and Sebastian Cygert. Magmax: Leveraging model merging for seamless continual learning. In European Conference on Computer Vision, pages 379–395. Springer, 2024

  53. [53]

    Parameter competition balancing for model merging

    Guodong Du, Junlin Lee, Jing Li, Runhua Jiang, Yifei Guo, Shuyang Yu, Hanting Liu, Sim K Goh, Ho-Kin Tang, Daojing He, et al. Parameter competition balancing for model merging. Advances in Neural Information Processing Systems, 37:84746–84776, 2024

  54. [54]

    Superpose task-specific features for model merging

    Haiquan Qiu, You Wu, Dong Li, Jianmin Guo, and Quanming Yao. Superpose task-specific features for model merging. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 4200–4214, 2025.

  55. [55]

    To see a world in a spark of neuron: Disentangling multi-task interference for training-free model merging

    Zitao Fang, Guodong Du, Shuyang Yu, Yifei Guo, Yiwei Zhang, Yiyao Cao, Jing Li, Ho-Kin Tang, and Sim Kuan Goh. To see a world in a spark of neuron: Disentangling multi-task interference for training-free model merging. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 15731–15751, 2025

  56. [56]

    Optmerge: Unifying multimodal llm capabilities and modalities via model merging. arXiv preprint arXiv:2505.19892, 2025

    Yongxian Wei, Runxi Cheng, Weike Jin, Enneng Yang, Li Shen, Lu Hou, Sinan Du, Chun Yuan, Xiaochun Cao, and Dacheng Tao. Optmerge: Unifying multimodal llm capabilities and modalities via model merging. arXiv preprint arXiv:2505.19892, 2025

  57. [57]

    LLaMA: Open and Efficient Foundation Language Models

    Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023

  58. [58]

    Emma-500: Enhancing massively multilingual adaptation of large language models. arXiv preprint arXiv:2409.17892, 2024

    Shaoxiong Ji, Zihao Li, Jaakko Paavola, Peiqin Lin, Pinzhen Chen, Dayyán O’Brien, Hengyu Luo, Hinrich Schütze, Jörg Tiedemann, and Barry Haddow. Emma-500: Enhancing massively multilingual adaptation of large language models. arXiv preprint arXiv:2409.17892, 2024

  59. [59]

    Qwen2 technical report. 2024

  60. [60]

    Qwen3-VL Technical Report

    Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-vl technical report. arXiv preprint arXiv:2511.21631, 2025

  61. [61]

    AfriqueLLM: How Data Mixing and Model Architecture Impact Continued Pre-training for African Languages

    Hao Yu, Tianyi Xu, Michael A Hedderich, Wassim Hamidouche, Syed Waqas Zamir, and David Ifeoluwa Adelani. Afriquellm: How data mixing and model architecture impact continued pre-training for african languages. arXiv preprint arXiv:2601.06395, 2026

  62. [62]

    Xcopa: A multilingual dataset for causal commonsense reasoning

    Edoardo Maria Ponti, Goran Glavaš, Olga Majewska, Qianchu Liu, Ivan Vulić, and Anna Korhonen. Xcopa: A multilingual dataset for causal commonsense reasoning. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 2362–2376, 2020

  63. [63]

    Few-shot learning with multilingual generative language models

    Xi Victoria Lin, Todor Mihaylov, Mikel Artetxe, Tianlu Wang, Shuohui Chen, Daniel Simig, Myle Ott, Naman Goyal, Shruti Bhosale, Jingfei Du, et al. Few-shot learning with multilingual generative language models. In Proceedings of the 2022 conference on empirical methods in natural language processing, pages 9019–9052, 2022

  64. [64]

    Xnli: Evaluating cross-lingual sentence representations

    Alexis Conneau, Ruty Rinott, Guillaume Lample, Adina Williams, Samuel Bowman, Holger Schwenk, and Veselin Stoyanov. Xnli: Evaluating cross-lingual sentence representations. In Proceedings of the 2018 conference on empirical methods in natural language processing, pages 2475–2485, 2018

  65. [65]

    Visually grounded reasoning across languages and cultures

    Fangyu Liu, Emanuele Bugliarello, Edoardo Maria Ponti, Siva Reddy, Nigel Collier, and Desmond Elliott. Visually grounded reasoning across languages and cultures. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 10467–10485, 2021

  66. [66]

    Afri-mcqa: Multimodal cultural question answering for african languages. arXiv preprint arXiv:2601.05699, 2026

    Atnafu Lambebo Tonja, Srija Anand, Emilio Villa-Cueva, Israel Abebe Azime, Jesujoba Oluwadara Alabi, Muhidin A Mohamed, Debela Desalegn Yadeta, Negasi Haile Abadi, Abigail Oppong, Nnaemeka Casmir Obiefuna, et al. Afri-mcqa: Multimodal cultural question answering for african languages. arXiv preprint arXiv:2601.05699, 2026

  67. [67]

    Are we on the right way for evaluating large vision-language models? Advances in Neural Information Processing Systems, 37:27056–27087, 2024

    Lin Chen, Jinsong Li, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Zehui Chen, Haodong Duan, Jiaqi Wang, Yu Qiao, Dahua Lin, et al. Are we on the right way for evaluating large vision-language models? Advances in Neural Information Processing Systems, 37:27056–27087, 2024

  68. [68]

    Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi

    Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, et al. Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9556–9567, 2024.

  69. [69]

    Seed-bench-2-plus: Benchmarking multimodal large language models with text-rich visual comprehension. arXiv preprint arXiv:2404.16790, 2024

    Bohao Li, Yuying Ge, Yi Chen, Yixiao Ge, Ruimao Zhang, and Ying Shan. Seed-bench-2-plus: Benchmarking multimodal large language models with text-rich visual comprehension. arXiv preprint arXiv:2404.16790, 2024

  70. [70]

    Lessons from the Trenches on Reproducible Evaluation of Language Models

    Stella Biderman, Hailey Schoelkopf, Lintang Sutawika, Leo Gao, Jonathan Tow, Baber Abbasi, Alham Fikri Aji, Pawan Sasanka Ammanamanchi, Sidney Black, Jordan Clive, et al. Lessons from the trenches on reproducible evaluation of language models. arXiv preprint arXiv:2405.14782, 2024

  71. [71]

    Lmms-eval: Reality check on the evaluation of large multimodal models

    Kaichen Zhang, Bo Li, Peiyuan Zhang, Fanyi Pu, Joshua Adrian Cahyono, Kairui Hu, Shuai Liu, Yuanhan Zhang, Jingkang Yang, Chunyuan Li, et al. Lmms-eval: Reality check on the evaluation of large multimodal models. In Findings of the Association for Computational Linguistics: NAACL 2025, pages 881–916, 2025.

A Baselines and Benchmarks

All experiments were...