Recognition: 2 theorem links
· Lean Theorem
DiM³: Bridging Multilingual and Multimodal Models via Direction- and Magnitude-Aware Merging
Pith reviewed 2026-05-14 20:20 UTC · model grok-4.3
The pith
Direction- and magnitude-aware selective merging of residual updates injects multilingual capability into a multimodal model without any training.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
DiM³ selectively composes multilingual and multimodal residual updates at each parameter dimension using direction- and magnitude-aware weighting, yielding a model with both capabilities from existing models without any additional training.
What carries the argument
Direction- and magnitude-aware merging of residual updates, which decides the composition ratio per dimension to preserve useful features from each update.
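As a concrete illustration, here is a minimal sketch of per-dimension direction- and magnitude-aware composition of two residual updates. It is not the paper's exact DiM³ rule; the scoring choices, the column-wise orientation, and the function name `merge_residuals_sketch` are assumptions chosen only to mirror the averaged score ω quoted further down this page.

```python
# A minimal sketch, not the paper's exact DiM³ procedure: per-dimension
# direction- and magnitude-aware composition of two residual updates against
# a shared base weight matrix. The scoring choices below are assumptions.
import torch

def merge_residuals_sketch(w_base: torch.Tensor,
                           delta_multiling: torch.Tensor,
                           delta_multimodal: torch.Tensor,
                           eps: float = 1e-8) -> torch.Tensor:
    # Direction signal per column j: cosine similarity of the two updates,
    # rescaled to [0, 1] so that agreement raises the multilingual share.
    s_dir = 0.5 * (1.0 + torch.cosine_similarity(delta_multiling, delta_multimodal, dim=0))
    # Magnitude signal per column j: relative norm of the multilingual update.
    n_ml = delta_multiling.norm(dim=0)
    n_mm = delta_multimodal.norm(dim=0)
    s_mag = n_ml / (n_ml + n_mm + eps)
    # Composition ratio per column, mirroring the averaged score
    # omega_{k,j} = (s_mag_{k,j} + s_dir_{k,j}) / 2 quoted from the paper.
    omega = 0.5 * (s_mag + s_dir)
    # Selective composition instead of uniform averaging.
    merged_delta = omega * delta_multiling + (1.0 - omega) * delta_multimodal
    return w_base + merged_delta

# Usage: for each shared language-model weight W in the backbone, compute
# delta_* = W_finetuned - W_base, then replace W with the merged result.
```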
If this is right
- DiM3 outperforms existing merging baselines on multilingual benchmarks.
- It substantially boosts multilingual performance compared to the original multimodal model.
- It remains competitive with dedicated multilingual multimodal fine-tuning while retaining general multimodal ability.
- It can be applied to already-trained multilingual multimodal models to gain further improvements.
- It reshapes intermediate-layer semantic representations to strengthen cross-lingual alignment under text and multimodal inputs.
Where Pith is reading between the lines
- This selective merging might be useful for combining other types of model adaptations beyond language and vision.
- Reducing the cost of building multilingual multimodal systems could accelerate development of more inclusive AI tools.
- The focus on intermediate layers suggests that alignment tasks in models may be best addressed at mid-depth representations.
Load-bearing premise
The multilingual and multimodal residual updates differ in their directions and magnitudes in a way that allows selective composition without causing interference in the shared parameters.
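This premise can be probed directly before merging. A minimal diagnostic sketch follows, assuming both fine-tuned checkpoints share the base backbone's parameter shapes; the helper name `heterogeneity_profile` and the summary statistics are illustrative, not from the paper.

```python
# Minimal diagnostic sketch: summarize how differently the two residual updates
# point (direction) and scale (magnitude) per column of a shared weight matrix.
import torch

def heterogeneity_profile(w_base, w_multiling, w_multimodal, eps: float = 1e-8):
    d_ml = w_multiling - w_base          # multilingual residual update
    d_mm = w_multimodal - w_base         # multimodal residual update
    cos = torch.cosine_similarity(d_ml, d_mm, dim=0)         # per-column direction agreement
    ratio = d_ml.norm(dim=0) / (d_mm.norm(dim=0) + eps)      # per-column magnitude ratio
    return {
        "cos_mean": float(cos.mean()),
        "cos_frac_negative": float((cos < 0).float().mean()),
        "log_ratio_std": float(ratio.clamp_min(eps).log().std()),
    }

# Broad, near-zero direction agreement and dispersed magnitude ratios would be
# consistent with the premise that the two updates occupy largely
# non-interfering dimensions; strong agreement everywhere would not.
```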
What would settle it
If the DiM³-merged model underperforms the original multimodal model on multimodal tasks, or underperforms a comparable multilingual model on language-only tasks, that would indicate the merging does not work as claimed.
Original abstract
Towards more general and human-like intelligence, large language models should seamlessly integrate both multilingual and multimodal capabilities; however, extending an existing multimodal model to many languages typically requires expensive multilingual multimodal data construction and repeated end-to-end retraining. We study a training-free alternative: injecting multilingual capability into an existing multimodal model by composing residual updates in the shared language model backbone. The key challenge is that multilingual and multimodal updates are heterogeneous, reflecting different functional roles in the shared model. To address this, we propose Direction- and Magnitude-aware Multilingual Multimodal merging (DiM3), which selectively composes the two updates at each parameter dimension while preserving the original vision encoder and multimodal projector. Experiments on multilingual benchmarks in both text-only and vision-language settings, covering 57 languages across LLaVA- and Qwen-based backbones, show that DiM3 consistently outperforms existing merging baselines, substantially improves multilingual performance over the original multimodal model, and remains competitive with dedicated multilingual multimodal fine-tuning while largely retaining general multimodal ability. We further show that DiM3 can be directly applied to already trained multilingual multimodal models and still yield additional gains. Further interpretability analysis shows that DiM3 primarily reshapes intermediate-layer semantic representations, strengthening cross-lingual alignment under both text-only and multimodal inputs while preserving higher-layer task-sensitive structure. Our repository is on https://github.com/wzj1718/DiM3.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces DiM³, a training-free method for merging multilingual and multimodal residual updates in the shared language model backbone of existing multimodal models. By selectively composing updates per parameter dimension based on direction (cosine similarity) and magnitude (norm) signals, DiM³ aims to preserve the original vision encoder and multimodal projector while enhancing multilingual capabilities. Experiments across LLaVA- and Qwen-based backbones on benchmarks covering 57 languages demonstrate that DiM³ outperforms existing merging baselines, substantially improves multilingual performance over the original model, remains competitive with dedicated fine-tuning, and can be applied to already multilingual multimodal models for additional gains. Interpretability analysis indicates that DiM³ primarily affects intermediate-layer semantic representations to strengthen cross-lingual alignment under text-only and multimodal inputs.
Significance. If the results hold, DiM³ provides an efficient, training-free approach to bridge multilingual and multimodal capabilities in large models, reducing the need for expensive data construction and retraining. The method's applicability to multiple backbones and its ability to retain general multimodal abilities while improving multilingual performance make it potentially impactful for developing more general AI systems. The inclusion of interpretability analysis and extension to pre-trained multilingual multimodal models adds to its value, offering insights into how selective merging affects model representations.
major comments (1)
- [Experiments and Interpretability Analysis] The central assumption that multilingual and multimodal residual updates are sufficiently heterogeneous to allow per-dimension direction/magnitude selection without destructive interference on multimodal inputs is load-bearing but only indirectly supported. The interpretability analysis reports improved cross-lingual alignment in intermediate layers, yet no ablation replaces the selection rule with uniform averaging and measures representation drift or task performance specifically on combined text+image inputs (see Experiments and Interpretability sections).
minor comments (1)
- [Abstract] Notation for the method name is rendered as DiM³ in the title but appears as DiM^{3} in the abstract; ensure consistent superscript formatting throughout.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. The central concern about directly validating the heterogeneity assumption via ablation on multimodal inputs is well-taken, and we will strengthen the manuscript accordingly while preserving the original claims supported by existing results.
Point-by-point responses
-
Referee: [Experiments and Interpretability Analysis] The central assumption that multilingual and multimodal residual updates are sufficiently heterogeneous to allow per-dimension direction/magnitude selection without destructive interference on multimodal inputs is load-bearing but only indirectly supported. The interpretability analysis reports improved cross-lingual alignment in intermediate layers, yet no ablation replaces the selection rule with uniform averaging and measures representation drift or task performance specifically on combined text+image inputs (see Experiments and Interpretability sections).
Authors: We agree that an explicit ablation replacing the direction/magnitude selection with uniform averaging, evaluated on representation drift and task performance under combined text+image inputs, would provide more direct support for the heterogeneity assumption. The current manuscript shows that DiM³ largely retains multimodal benchmark performance (VQA, captioning, etc.) while improving multilingual results across 57 languages, and the interpretability analysis indicates preserved higher-layer task structure; these results are consistent with limited destructive interference. To address the point directly, we will add the requested ablation in the revised Experiments and Interpretability sections, reporting cosine similarity drift and downstream multimodal task metrics for both the selective rule and uniform averaging.
Revision: yes
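A hedged sketch of the comparison the authors agree to add, under the assumption that representation drift is measured as the mean per-example cosine distance between hidden states of the merged and original models on the same text+image inputs. The helper names `uniform_average` and `representation_drift` are illustrative, not from the paper.

```python
# Illustrative ablation harness: the selective rule versus uniform averaging,
# scored by representation drift on combined text+image inputs. Hidden-state
# extraction is model-specific and omitted here.
import torch

def uniform_average(w_base: torch.Tensor,
                    delta_a: torch.Tensor,
                    delta_b: torch.Tensor) -> torch.Tensor:
    # Baseline: replace the per-dimension selective rule with plain 0.5/0.5 averaging.
    return w_base + 0.5 * (delta_a + delta_b)

def representation_drift(h_merged: torch.Tensor, h_original: torch.Tensor) -> float:
    # Mean cosine distance between hidden states of the merged model and the
    # original multimodal model, one row per text+image example.
    cos = torch.cosine_similarity(h_merged, h_original, dim=-1)
    return float((1.0 - cos).mean())

# For the revised ablation, report representation_drift and the downstream
# multimodal task metrics side by side for both merging rules.
```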
Circularity Check
No significant circularity in derivation chain
Full rationale
The paper defines DiM³ as an independent merging procedure that selectively composes residual updates using per-dimension cosine similarity and norm signals. This construction is stated directly from the heterogeneity assumption and is validated on external multilingual and multimodal benchmarks (57 languages, LLaVA/Qwen backbones) without any reduction of the central claim to fitted parameters, self-citations, or renamed known results. No equations or steps in the provided text equate outputs to inputs by construction; the method remains self-contained against external evaluation.
Axiom & Free-Parameter Ledger
free parameters (1)
- direction and magnitude selection thresholds
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · tag: unclear
The relation between the paper passage and the cited Recognition theorem is unclear.
Paper passage: "DiM³ performs selective composition rather than uniform aggregation, adaptively assigning their contributions based on local geometric importance in parameter space... $\omega_{k,j} = \tfrac{1}{2}\bigl(s^{\mathrm{mag}}_{k,j} + s^{\mathrm{dir}}_{k,j}\bigr)$"
-
IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean · costAlphaLog_fourth_deriv_at_zero · tag: unclear
The relation between the paper passage and the cited Recognition theorem is unclear.
Paper passage: "Proposition 3.1 (Residual geometry)... $\|W_{k,:,j} - W_{N,:,j}\|_2^2 = \bigl(\delta^{\mathrm{mag}}_{k,j}\bigr)^2 + 2\,m_k(j)\,m_N(j)\,\delta^{\mathrm{dir}}_{k,j}$"
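The quoted identity follows from elementary vector geometry once each column is split into a magnitude and a unit direction. A short derivation, assuming $\delta^{\mathrm{mag}}_{k,j}$ is the difference of column norms and $\delta^{\mathrm{dir}}_{k,j}$ is one minus the columns' cosine similarity (a reading consistent with the quoted form; the paper's own definitions should be checked):

```latex
% Write each column as magnitude times unit direction:
%   W_{k,:,j} = m_k(j)\,\hat{u}_{k,j}, \qquad W_{N,:,j} = m_N(j)\,\hat{u}_{N,j}.
\|W_{k,:,j} - W_{N,:,j}\|_2^2
  = m_k(j)^2 + m_N(j)^2 - 2\,m_k(j)\,m_N(j)\,\hat{u}_{k,j}^{\top}\hat{u}_{N,j}
  = \underbrace{\bigl(m_k(j) - m_N(j)\bigr)^2}_{(\delta^{\mathrm{mag}}_{k,j})^2}
    + 2\,m_k(j)\,m_N(j)\,
      \underbrace{\bigl(1 - \hat{u}_{k,j}^{\top}\hat{u}_{N,j}\bigr)}_{\delta^{\mathrm{dir}}_{k,j}}.
```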
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.