DiM³: Bridging Multilingual and Multimodal Models via Direction- and Magnitude-Aware Merging

Daling Wang; Ercong Nie; Hinrich Schütze; Mengjie Zhao; Mingyang Wang; Shi Feng; Xiaocui Yang; Yongkang Liu; Zijing Wang

arxiv: 2605.12960 · v2 · pith:KD37OJPHnew · submitted 2026-05-13 · 💻 cs.CL


Pith reviewed 2026-05-21 09:07 UTC · model grok-4.3


keywords multilingual multimodal merging · direction and magnitude aware · training-free adaptation · residual update composition · cross-lingual alignment · vision-language models · model merging · parameter selective composition


The pith

DiM3 adds multilingual capabilities to multimodal models by selectively merging residual updates based on direction and magnitude at each parameter.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes DiM3 as a training-free approach to equip existing multimodal models with support for many languages. It does this by composing updates from multilingual and multimodal training in the shared language-model backbone while leaving the vision encoder and projector unchanged. The method decides how to combine the updates at every parameter dimension according to their directions and magnitudes to limit interference. Experiments across LLaVA- and Qwen-based models and 57 languages show gains over standard merging methods, better multilingual results than the original multimodal model, and performance close to full multilingual multimodal fine-tuning. The same procedure can be applied to already multilingual multimodal models for extra improvement and primarily affects intermediate-layer representations to strengthen cross-lingual alignment.
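To make the setup concrete, here is a minimal sketch of how the two residual updates on the shared backbone might be formed. It is not the authors' code. It assumes each checkpoint is a dict of NumPy arrays, that the multilingual and multimodal models were both derived from the same base language model, and that backbone tensors are identified by a hypothetical `language_model.` key prefix. The function names and the prefix are illustrative assumptions, not names from the paper or its repository.

```python
import numpy as np

BACKBONE_PREFIX = "language_model."  # hypothetical naming convention, not from the paper


def residual_updates(base, multilingual, multimodal, prefix=BACKBONE_PREFIX):
    """Return (delta_ml, delta_mm) for backbone tensors present in all three checkpoints."""
    delta_ml, delta_mm = {}, {}
    for name, w_base in base.items():
        if not name.startswith(prefix):
            continue  # only the shared language-model backbone is considered
        if name not in multilingual or name not in multimodal:
            continue
        delta_ml[name] = multilingual[name] - w_base
        delta_mm[name] = multimodal[name] - w_base
    return delta_ml, delta_mm


def apply_to_multimodal(multimodal, merged_backbone):
    """Copy the multimodal checkpoint and overwrite only the merged backbone tensors.

    Anything outside `merged_backbone` (for example a vision encoder or projector)
    is carried over unchanged.
    """
    out = {name: w.copy() for name, w in multimodal.items()}
    for name, w in merged_backbone.items():
        out[name] = w
    return out
```

Keeping the non-backbone tensors as untouched copies mirrors the statement above that the vision encoder and projector are left unchanged.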

Core claim

DiM3 bridges multilingual and multimodal models by direction- and magnitude-aware merging of residual updates in the language model backbone. This selective composition at each parameter dimension preserves the vision encoder and projector while enhancing multilingual performance across text-only and vision-language tasks, as shown in experiments on LLaVA- and Qwen-based models covering 57 languages.

What carries the argument

Direction- and Magnitude-aware Multilingual Multimodal merging (DiM3), which analyzes the direction and magnitude of the two residual updates to set per-dimension composition weights and thereby reduce destructive interference in the shared parameters.
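The following sketch illustrates one way a per-column direction- and magnitude-aware weighting could look. It is an assumption-laden illustration, not the paper's rule. The only parts taken from this page are the averaged weight ω = ½(s_mag + s_dir) and the composition "base column plus ω-weighted residual columns" from the passage quoted in the Lean section. Everything else is hypothetical: measuring magnitude change as the change in column norm, measuring direction change as one minus the cosine to the base column, and normalizing each score across sources so the shares sum to one.

```python
import numpy as np

EPS = 1e-12


def column_norms(w):
    """L2 norm of every column of a 2-D weight matrix."""
    return np.linalg.norm(w, axis=0)


def salience_shares(scores):
    """Normalize non-negative scores across sources (axis 0) to sum to 1 per column.

    Assumption: columns that no source changes fall back to equal shares.
    """
    scores = np.asarray(scores, dtype=float)
    total = scores.sum(axis=0, keepdims=True)
    n_sources = scores.shape[0]
    return np.where(total > EPS, scores / np.maximum(total, EPS), 1.0 / n_sources)


def merge_columns(w_base, residuals):
    """Compose residuals column by column; returns (merged weights, omega)."""
    base_norm = column_norms(w_base)
    base_dir = w_base / np.maximum(base_norm, EPS)
    d_mag, d_dir = [], []
    for delta in residuals:
        w_k = w_base + delta
        norm_k = column_norms(w_k)
        dir_k = w_k / np.maximum(norm_k, EPS)
        d_mag.append(np.abs(norm_k - base_norm))       # hypothetical magnitude score
        cos = np.sum(base_dir * dir_k, axis=0)
        d_dir.append(1.0 - np.clip(cos, -1.0, 1.0))    # hypothetical direction score
    # omega has shape (n_sources, n_columns): the average of the two shares.
    omega = 0.5 * (salience_shares(d_mag) + salience_shares(d_dir))
    merged = w_base.astype(float)
    for k, delta in enumerate(residuals):
        merged = merged + omega[k][None, :] * delta
    return merged, omega
```

Under these assumptions a source that only rescales a column gets more weight on that column than a source that leaves it untouched, and a source that only rotates a column is favored there in the same way. Whether the paper normalizes its scores like this is not stated on this page.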

Load-bearing premise

Multilingual and multimodal residual updates differ enough in direction and magnitude that a selective per-dimension rule can combine them without destructive interference.

What would settle it

Running the same benchmarks with uniform averaging of the two residual updates instead of the direction-and-magnitude rule and obtaining equal or higher multilingual accuracy while preserving multimodal scores would falsify the value of the selective rule.
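The control in that test is simple to state in code. A minimal sketch, assuming the residuals are NumPy arrays of the same shape as the base weight; the function name is illustrative and not taken from the paper:

```python
import numpy as np


def uniform_merge(w_base, residuals):
    """Add the plain element-wise average of the residuals to the base weights."""
    return w_base + np.mean(np.stack(residuals), axis=0)
```

Every parameter receives the same weight 1/K for each of the K residuals here, so any benchmark gap between this control and a selective rule can be attributed to the per-dimension weighting rather than to the residuals themselves.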

Figures

Figures reproduced from arXiv: 2605.12960 by Daling Wang, Ercong Nie, Hinrich Schütze, Mengjie Zhao, Mingyang Wang, Shi Feng, Xiaocui Yang, Yongkang Liu, Zijing Wang.

**Figure 1.** Residual heterogeneity in the shared language model backbone. The panels show residual norm, base-relative reorientation, and cross-residual alignment for ∆ml and ∆mm across layers and modules. Together, these diagnostics reveal that multilingual and multimodal adaptations differ in both update magnitude and geometry, motivating selective rather than uniform composition.

**Figure 2.** Results on three general multimodal benchmarks

**Figure 3.** t-SNE visualizations of average-pooled hidden states on multilingual text inputs from XNLI

**Figure 4.** Layer-wise silhouette scores of multilingual hidden-state representations under multilingual

**Figure 5.** t-SNE visualizations of average-pooled hidden states for the question spans in multilingual

**Figure 6.** Additional t-SNE visualizations of average-pooled hidden states on multilingual text inputs

**Figure 7.** Full layer-wise t-SNE visualizations of average-pooled hidden states on multilingual text

**Figure 8.** Full layer-wise t-SNE visualizations of average-pooled hidden states on multilingual text

**Figure 9.** Full layer-wise t-SNE visualizations of average-pooled hidden states on multilingual text

Original abstract

Towards more general and human-like intelligence, large language models should seamlessly integrate both multilingual and multimodal capabilities; however, extending an existing multimodal model to many languages typically requires expensive multilingual multimodal data construction and repeated end-to-end retraining. We study a training-free alternative: injecting multilingual capability into an existing multimodal model by composing residual updates in the shared language model backbone. The key challenge is that multilingual and multimodal updates are heterogeneous, reflecting different functional roles in the shared model. To address this, we propose Direction- and Magnitude-aware Multilingual Multimodal merging (DiM3), which selectively composes the two updates at each parameter dimension while preserving the original vision encoder and multimodal projector. Experiments on multilingual benchmarks in both text-only and vision-language settings, covering 57 languages across LLaVA- and Qwen-based backbones, show that DiM3 consistently outperforms existing merging baselines, substantially improves multilingual performance over the original multimodal model, and remains competitive with dedicated multilingual multimodal fine-tuning while largely retaining general multimodal ability. We further show that DiM3 can be directly applied to already trained multilingual multimodal models and still yield additional gains. Further interpretability analysis shows that DiM3 primarily reshapes intermediate-layer semantic representations, strengthening cross-lingual alignment under both text-only and multimodal inputs while preserving higher-layer task-sensitive structure. Our repository is on https://github.com/wzj1718/DiM3.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

DiM3 gives a practical per-dimension merging rule that beats standard baselines when adding multilingual updates to multimodal backbones, with solid but not fully detailed experiments.


Hi colleague, the main thing to know is that DiM3 uses direction and magnitude to decide how to combine multilingual and multimodal residual updates at each parameter dimension, and this selective rule produces better multilingual results than the merging methods it compares against.

The paper does a good job running the idea on LLaVA and Qwen backbones across 57 languages in both text-only and vision-language settings. It reports consistent gains over baselines, clear improvement over the original multimodal model, and performance close to dedicated multilingual multimodal fine-tuning while mostly keeping general multimodal ability. They also show extra gains when the method is applied on top of already-trained multilingual multimodal models. The interpretability section on how it reshapes intermediate-layer representations adds some useful insight, and the code release makes the work easier to check.

The soft spots are mostly in the reporting and scope. There are no details on run-to-run variance or statistical significance, which leaves the reliability of the gains a little unclear. The approach assumes the updates are heterogeneous enough to merge without much interference, and the results support that here, but more cases where the assumption breaks would help. Tests stay within two model families, so broader applicability is not fully mapped.

This paper is useful for people working on model merging or cheap ways to extend multimodal models to more languages. A reader focused on efficient adaptation techniques would get value from the direct comparisons and the concrete merging rule. I would send it for peer review because the empirical claims are grounded in held-out benchmarks and the method is clearly specified.

Referee Report

2 major / 2 minor

Summary. The paper proposes DiM³, a training-free method for composing multilingual and multimodal residual updates via a per-dimension direction- and magnitude-aware rule applied to the shared language-model backbone of models such as LLaVA and Qwen. Experiments across text-only and vision-language multilingual benchmarks spanning 57 languages show that DiM³ outperforms existing merging baselines, substantially improves multilingual performance relative to the original multimodal model, remains competitive with dedicated multilingual multimodal fine-tuning, largely retains general multimodal ability, and can yield further gains when applied to already-trained multilingual multimodal models. Interpretability analysis indicates that the method primarily reshapes intermediate-layer semantic representations to strengthen cross-lingual alignment while preserving higher-layer task structure.

Significance. If the empirical results hold, the work demonstrates a practical, low-cost route to extending multimodal models to many languages without constructing large multilingual multimodal datasets or performing end-to-end retraining. The breadth of evaluation (multiple backbones, 57 languages, both text-only and vision-language settings, plus comparisons to baselines and full fine-tuning) and the public code repository constitute clear strengths for reproducibility and adoption.

major comments (2)

[Abstract / Experiments] Abstract and experimental results: the reported performance gains are presented without accompanying information on statistical significance, standard deviations across multiple runs, or the precise hyperparameter values used for the merging coefficients; these omissions make it difficult to judge the robustness of the central claim that DiM³ consistently outperforms baselines across 57 languages.
[Method] Method description: the selective per-dimension composition rule presupposes sufficient heterogeneity between multilingual and multimodal residual updates to avoid destructive interference, yet the manuscript provides no quantitative diagnostic (e.g., cosine similarity or magnitude histograms per layer) that would allow readers to verify when this premise holds.

minor comments (2)

[Interpretability analysis] Figure captions and axis labels in the interpretability plots could be expanded to clarify which layers correspond to the reported semantic-alignment improvements.
[Method] A brief statement of the exact number of parameters updated by each residual (multilingual vs. multimodal) would help readers assess the scale of the merging operation.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive evaluation and constructive comments. We address the two major comments point by point below, with revisions planned where they strengthen the manuscript without altering its core contributions.


Referee: [Abstract / Experiments] Abstract and experimental results: the reported performance gains are presented without accompanying information on statistical significance, standard deviations across multiple runs, or the precise hyperparameter values used for the merging coefficients; these omissions make it difficult to judge the robustness of the central claim that DiM³ consistently outperforms baselines across 57 languages.

Authors: We thank the referee for highlighting this. DiM³ is a fully deterministic, training-free procedure: given fixed residual updates and fixed coefficients, the output is identical across runs, so standard deviations from repeated executions do not apply in the manner they do for stochastic fine-tuning. We will nevertheless strengthen the presentation by adding a table (or appendix) that reports the exact numerical values of all merging coefficients (direction- and magnitude-scaling factors) used in every experiment and backbone. For statistical significance, the breadth of the 57-language evaluation already shows consistent directional gains; we will add a short note on cross-language consistency in the revised text. These changes address the robustness concern while remaining proportionate to a minor revision. revision: partial
Referee: [Method] Method description: the selective per-dimension composition rule presupposes sufficient heterogeneity between multilingual and multimodal residual updates to avoid destructive interference, yet the manuscript provides no quantitative diagnostic (e.g., cosine similarity or magnitude histograms per layer) that would allow readers to verify when this premise holds.

Authors: We agree that explicit diagnostics would help readers assess the heterogeneity premise. In the revised manuscript we will insert a short analysis subsection (or appendix) that reports (i) layer-wise cosine similarities between the multilingual and multimodal residual vectors and (ii) per-layer magnitude histograms or summary statistics. These quantities are already computable from the updates we used and will be added without new experiments. The added material will directly illustrate the degree of directional and magnitude divergence that motivates the per-dimension rule. revision: yes
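The diagnostics named in this exchange can be sketched in a few lines. This is an illustrative example only, assuming the two residual updates are available as dicts that map a layer or module name to a NumPy array. The function name and the choice of summary statistics (Frobenius norms and their ratio instead of full histograms) are assumptions, not the authors' planned analysis.

```python
import numpy as np


def layer_diagnostics(delta_ml, delta_mm, eps=1e-12):
    """Per-layer cosine similarity and norm summary for two residual updates."""
    rows = {}
    for name in sorted(delta_ml.keys() & delta_mm.keys()):
        a = np.asarray(delta_ml[name], dtype=float).ravel()
        b = np.asarray(delta_mm[name], dtype=float).ravel()
        norm_a = float(np.linalg.norm(a))
        norm_b = float(np.linalg.norm(b))
        rows[name] = {
            # Cosine near 0: the updates are close to orthogonal in this layer.
            "cosine": float(a @ b) / max(norm_a * norm_b, eps),
            "norm_ml": norm_a,
            "norm_mm": norm_b,
            # Ratio far from 1: the updates differ strongly in magnitude.
            "norm_ratio": norm_a / max(norm_b, eps),
        }
    return rows
```

Layers where the cosine is strongly negative are the ones where naive addition would cancel the most, so a table of these rows would let a reader see where a selective rule has the most room to help.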

Circularity Check

0 steps flagged

No significant circularity identified


The paper proposes an explicit, training-free merging rule (DiM3) that selectively composes independently obtained residual updates based on per-dimension direction and magnitude. All performance claims are empirical measurements on held-out benchmarks (multilingual text and vision-language tasks across 57 languages, LLaVA/Qwen backbones) rather than quantities derived from the merging equations themselves. No self-citation chains, fitted parameters renamed as predictions, or self-definitional loops appear in the method definition or central results. The composition rule is defined directly from the updates and tested independently, making the derivation self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The method relies on the empirical observation that multilingual and multimodal updates occupy different directions and magnitudes in parameter space; no new mathematical axioms or invented physical entities are introduced. The only potential free parameters are the per-layer or per-dimension weighting coefficients that implement the selective composition, but these are not enumerated in the abstract.


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

Relation between the paper passage and the cited Recognition theorem.

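We use both $\delta^{\mathrm{mag}}_{k,j}$ and $\delta^{\mathrm{dir}}_{k,j}$ to estimate source salience... $\omega_{k,j} = \tfrac{1}{2}\left(s^{\mathrm{mag}}_{k,j} + s^{\mathrm{dir}}_{k,j}\right)$ ... $\widetilde{W}_{:,j} = W_{N,:,j} + \sum_{k} \omega_{k,j}\, \Delta^{(W)}_{k,:,j}$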

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.


work page arXiv 2025
[47]

Scaling intelligence through model merging: A comprehensive survey.Authorea Preprints, 2025

Zijing Wang, Yongkang Liu, Yingfeng Luo, Ming Wang, Zhen Song, Shi Feng, Xiaocui Yang, Dingyang Lin, Daling Wang, Yifei Zhang, et al. Scaling intelligence through model merging: A comprehensive survey.Authorea Preprints, 2025

work page 2025
[48]

Localize-and-stitch: Efficient model merging via sparse task arithmetic.arXiv preprint arXiv:2408.13656, 2024

Yifei He, Yuzheng Hu, Yong Lin, Tong Zhang, and Han Zhao. Localize-and-stitch: Efficient model merging via sparse task arithmetic.arXiv preprint arXiv:2408.13656, 2024

work page arXiv 2024
[49]

Whoever started the interference should end it: Guiding data-free model merging via task vectors

Runxi Cheng, Feng Xiong, Yongxian Wei, Wanyun Zhu, and Chun Yuan. Whoever started the interference should end it: Guiding data-free model merging via task vectors. InForty-second International Conference on Machine Learning

work page
[50]

Adamms: Model merging for heterogeneous multimodal large language models with unsupervised coefficient optimization

Yiyang Du, Xiaochen Wang, Chi Chen, Jiabo Ye, Yiru Wang, Peng Li, Ming Yan, Ji Zhang, Fei Huang, Zhifang Sui, et al. Adamms: Model merging for heterogeneous multimodal large language models with unsupervised coefficient optimization. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 9413–9422, 2025

work page 2025
[51]

Model breadcrumbs: Scaling multi-task model merging with sparse masks

MohammadReza Davari and Eugene Belilovsky. Model breadcrumbs: Scaling multi-task model merging with sparse masks. InEuropean Conference on Computer Vision, pages 270–287. Springer, 2024

work page 2024
[52]

Magmax: Leveraging model merging for seamless continual learning

Daniel Marczak, Bartłomiej Twardowski, Tomasz Trzci´nski, and Sebastian Cygert. Magmax: Leveraging model merging for seamless continual learning. InEuropean Conference on Computer Vision, pages 379–395. Springer, 2024

work page 2024
[53]

Parameter competition balancing for model merging

Guodong Du, Junlin Lee, Jing Li, Runhua Jiang, Yifei Guo, Shuyang Yu, Hanting Liu, Sim K Goh, Ho-Kin Tang, Daojing He, et al. Parameter competition balancing for model merging. Advances in Neural Information Processing Systems, 37:84746–84776, 2024

work page 2024
[54]

Superpose task-specific features for model merging

Haiquan Qiu, You Wu, Dong Li, Jianmin Guo, and Quanming Yao. Superpose task-specific features for model merging. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 4200–4214, 2025. 13

work page 2025
[55]

To see a world in a spark of neuron: Disentangling multi-task interference for training-free model merging

Zitao Fang, Guodong Du, Shuyang Yu, Yifei Guo, Yiwei Zhang, Yiyao Cao, Jing Li, Ho-Kin Tang, and Sim Kuan Goh. To see a world in a spark of neuron: Disentangling multi-task interference for training-free model merging. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 15731–15751, 2025

work page 2025
[56]

Optmerge: Unifying multimodal llm capabilities and modalities via model merging.arXiv preprint arXiv:2505.19892, 2025

Yongxian Wei, Runxi Cheng, Weike Jin, Enneng Yang, Li Shen, Lu Hou, Sinan Du, Chun Yuan, Xiaochun Cao, and Dacheng Tao. Optmerge: Unifying multimodal llm capabilities and modalities via model merging.arXiv preprint arXiv:2505.19892, 2025

work page arXiv 2025
[57]

LLaMA: Open and Efficient Foundation Language Models

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timo- thée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models.arXiv preprint arXiv:2302.13971, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[58]

Emma-500: Enhancing massively multilingual adaptation of large language models.arXiv preprint arXiv:2409.17892, 2024

Shaoxiong Ji, Zihao Li, Jaakko Paavola, Peiqin Lin, Pinzhen Chen, Dayyán O’Brien, Hengyu Luo, Hinrich Schütze, Jörg Tiedemann, and Barry Haddow. Emma-500: Enhancing massively multilingual adaptation of large language models.arXiv preprint arXiv:2409.17892, 2024

work page arXiv 2024
[59]

Qwen2 technical report. 2024

work page 2024
[60]

Qwen3-VL Technical Report

Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-vl technical report.arXiv preprint arXiv:2511.21631, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[61]

AfriqueLLM: How Data Mixing and Model Architecture Impact Continued Pre-training for African Languages

Hao Yu, Tianyi Xu, Michael A Hedderich, Wassim Hamidouche, Syed Waqas Zamir, and David Ifeoluwa Adelani. Afriquellm: How data mixing and model architecture impact continued pre-training for african languages.arXiv preprint arXiv:2601.06395, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[62]

Xcopa: A multilingual dataset for causal commonsense reasoning

Edoardo Maria Ponti, Goran Glavaš, Olga Majewska, Qianchu Liu, Ivan Vuli ´c, and Anna Korhonen. Xcopa: A multilingual dataset for causal commonsense reasoning. InProceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 2362–2376, 2020

work page 2020
[63]

Few-shot learning with multilingual generative language models

Xi Victoria Lin, Todor Mihaylov, Mikel Artetxe, Tianlu Wang, Shuohui Chen, Daniel Simig, Myle Ott, Naman Goyal, Shruti Bhosale, Jingfei Du, et al. Few-shot learning with multilingual generative language models. InProceedings of the 2022 conference on empirical methods in natural language processing, pages 9019–9052, 2022

work page 2022
[64]

Xnli: Evaluating cross-lingual sentence representations

Alexis Conneau, Ruty Rinott, Guillaume Lample, Adina Williams, Samuel Bowman, Holger Schwenk, and Veselin Stoyanov. Xnli: Evaluating cross-lingual sentence representations. In Proceedings of the 2018 conference on empirical methods in natural language processing, pages 2475–2485, 2018

work page 2018
[65]

Visually grounded reasoning across languages and cultures

Fangyu Liu, Emanuele Bugliarello, Edoardo Maria Ponti, Siva Reddy, Nigel Collier, and Desmond Elliott. Visually grounded reasoning across languages and cultures. InProceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 10467– 10485, 2021

work page 2021
[66]

Afri-mcqa: Multimodal cultural question answering for african languages.arXiv preprint arXiv:2601.05699, 2026

Atnafu Lambebo Tonja, Srija Anand, Emilio Villa-Cueva, Israel Abebe Azime, Jesu- joba Oluwadara Alabi, Muhidin A Mohamed, Debela Desalegn Yadeta, Negasi Haile Abadi, Abigail Oppong, Nnaemeka Casmir Obiefuna, et al. Afri-mcqa: Multimodal cultural question answering for african languages.arXiv preprint arXiv:2601.05699, 2026

work page arXiv 2026
[67]

Are we on the right way for evaluating large vision- language models?Advances in Neural Information Processing Systems, 37:27056–27087, 2024

Lin Chen, Jinsong Li, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Zehui Chen, Haodong Duan, Jiaqi Wang, Yu Qiao, Dahua Lin, et al. Are we on the right way for evaluating large vision- language models?Advances in Neural Information Processing Systems, 37:27056–27087, 2024

work page 2024
[68]

Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi

Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, et al. Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9556–9567, 2024. 14

work page 2024
[69]

Seed-bench-2-plus: Benchmarking multimodal large language models with text-rich visual comprehension.arXiv preprint arXiv:2404.16790, 2024

Bohao Li, Yuying Ge, Yi Chen, Yixiao Ge, Ruimao Zhang, and Ying Shan. Seed-bench-2-plus: Benchmarking multimodal large language models with text-rich visual comprehension.arXiv preprint arXiv:2404.16790, 2024

work page arXiv 2024
[70]

Lessons from the Trenches on Reproducible Evaluation of Language Models

Stella Biderman, Hailey Schoelkopf, Lintang Sutawika, Leo Gao, Jonathan Tow, Baber Ab- basi, Alham Fikri Aji, Pawan Sasanka Ammanamanchi, Sidney Black, Jordan Clive, et al. Lessons from the trenches on reproducible evaluation of language models.arXiv preprint arXiv:2405.14782, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[71]

Lmms-eval: Reality check on the evaluation of large multimodal models

Kaichen Zhang, Bo Li, Peiyuan Zhang, Fanyi Pu, Joshua Adrian Cahyono, Kairui Hu, Shuai Liu, Yuanhan Zhang, Jingkang Yang, Chunyuan Li, et al. Lmms-eval: Reality check on the evaluation of large multimodal models. InFindings of the Association for Computational Linguistics: NAACL 2025, pages 881–916, 2025. 15 A Baselines and Benchmarks All experiments were...

work page arXiv 2025

[36] [36]

Alignx: Advancing multilingual large language models with multilingual representation alignment

Mengyu Bu, Shaolei Zhang, Zhongjun He, Hua Wu, and Yang Feng. Alignx: Advancing multilingual large language models with multilingual representation alignment. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 6471– 6500, 2025

work page 2025

[37] [37]

How do large language models handle multilingualism?Advances in Neural Information Processing Systems, 37:15296–15319, 2024

Yiran Zhao, Wenxuan Zhang, Guizhen Chen, Kenji Kawaguchi, and Lidong Bing. How do large language models handle multilingualism?Advances in Neural Information Processing Systems, 37:15296–15319, 2024

work page 2024

[38] [38]

Flamingo: a visual language model for few-shot learning.Advances in neural information processing systems, 35:23716–23736, 2022

Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning.Advances in neural information processing systems, 35:23716–23736, 2022

work page 2022

[39] [39]

Plam: Training-free plateau-guided model merging for better visual grounding in mllms.arXiv preprint arXiv:2601.07645, 2026

Zijing Wang, Yongkang Liu, Mingyang Wang, Ercong Nie, Deyuan Chen, Zhengjie Zhao, Shi Feng, Daling Wang, Xiaocui Yang, Yifei Zhang, et al. Plam: Training-free plateau-guided model merging for better visual grounding in mllms.arXiv preprint arXiv:2601.07645, 2026

work page arXiv 2026

[40] [40]

MaLoRA: Gated Modality LoRA for Key-Space Alignment in Multimodal LLM Fine-Tuning

Xinhan Zheng, Huyu Wu, Xueting Wang, and Haiyun Jiang. Unveiling intrinsic text bias in multimodal large language models through attention key-space analysis.arXiv preprint arXiv:2510.26721, 2025. 12

work page internal anchor Pith review Pith/arXiv arXiv 2025

[41] [41]

Palo: A polyglot large multimodal model for 5b people

Hanoona Rasheed, Muhammad Maaz, Abdelrahman Shaker, Salman Khan, Hisham Cholakkal, Rao M Anwer, Tim Baldwin, Michael Felsberg, and Fahad S Khan. Palo: A polyglot large multimodal model for 5b people. InProceedings of the Winter Conference on Applications of Computer Vision, pages 1745–1754, 2025

work page 2025

[42] [42]

Centurio: On drivers of multilingual ability of large vision- language model

Gregor Geigle, Florian Schneider, Carolin Holtermann, Chris Biemann, Radu Timofte, Anne Lauscher, and Goran Glavaš. Centurio: On drivers of multilingual ability of large vision- language model. InProceedings of the 63rd Annual Meeting of the Association for Computa- tional Linguistics (Volume 1: Long Papers), pages 2831–2881, 2025

work page 2025

[43] [43]

Breaking language barriers in visual language models via multilingual textual regularization

Iñigo Pikabea, Iñaki Lacunza, Oriol Pareras Velasco, Carlos Escolano, Aitor Gonzalez-Agirre, Javier Hernando, and Marta Villegas. Breaking language barriers in visual language models via multilingual textual regularization. InProceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Ch...

work page 2025

[44] [44]

Language-specific layer matters: Efficient multilingual enhancement for large vision-language models

Yuchun Fan, Yilin Wang, Yongyu Mu, Lei Huang, Bei Li, Xiaocheng Feng, Tong Xiao, and Jingbo Zhu. Language-specific layer matters: Efficient multilingual enhancement for large vision-language models. InFindings of the Association for Computational Linguistics: EMNLP 2025, pages 12473–12500, 2025

work page 2025

[45] [45]

Model merging in llms, mllms, and beyond: Methods, theories, applications, and opportu- nities.ACM Computing Surveys, 58(8):1–41, 2026

Enneng Yang, Li Shen, Guibing Guo, Xingwei Wang, Xiaochun Cao, Jie Zhang, and Dacheng Tao. Model merging in llms, mllms, and beyond: Methods, theories, applications, and opportu- nities.ACM Computing Surveys, 58(8):1–41, 2026

work page 2026

[46] [46]

Why do more experts fail? a theoretical analysis of model merging.arXiv preprint arXiv:2505.21226, 2025

Zijing Wang, Xingle Xu, Yongkang Liu, Yiqun Zhang, Peiqin Lin, Shi Feng, Xiaocui Yang, Daling Wang, and Hinrich Schütze. Why do more experts fail? a theoretical analysis of model merging.arXiv preprint arXiv:2505.21226, 2025

work page arXiv 2025

[47] [47]

Scaling intelligence through model merging: A comprehensive survey.Authorea Preprints, 2025

Zijing Wang, Yongkang Liu, Yingfeng Luo, Ming Wang, Zhen Song, Shi Feng, Xiaocui Yang, Dingyang Lin, Daling Wang, Yifei Zhang, et al. Scaling intelligence through model merging: A comprehensive survey.Authorea Preprints, 2025

work page 2025

[48] [48]

Localize-and-stitch: Efficient model merging via sparse task arithmetic.arXiv preprint arXiv:2408.13656, 2024

Yifei He, Yuzheng Hu, Yong Lin, Tong Zhang, and Han Zhao. Localize-and-stitch: Efficient model merging via sparse task arithmetic.arXiv preprint arXiv:2408.13656, 2024

work page arXiv 2024

[49] [49]

Whoever started the interference should end it: Guiding data-free model merging via task vectors

Runxi Cheng, Feng Xiong, Yongxian Wei, Wanyun Zhu, and Chun Yuan. Whoever started the interference should end it: Guiding data-free model merging via task vectors. InForty-second International Conference on Machine Learning

work page

[50] [50]

Adamms: Model merging for heterogeneous multimodal large language models with unsupervised coefficient optimization

Yiyang Du, Xiaochen Wang, Chi Chen, Jiabo Ye, Yiru Wang, Peng Li, Ming Yan, Ji Zhang, Fei Huang, Zhifang Sui, et al. Adamms: Model merging for heterogeneous multimodal large language models with unsupervised coefficient optimization. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 9413–9422, 2025

work page 2025

[51] [51]

Model breadcrumbs: Scaling multi-task model merging with sparse masks

MohammadReza Davari and Eugene Belilovsky. Model breadcrumbs: Scaling multi-task model merging with sparse masks. InEuropean Conference on Computer Vision, pages 270–287. Springer, 2024

work page 2024

[52] [52]

Magmax: Leveraging model merging for seamless continual learning

Daniel Marczak, Bartłomiej Twardowski, Tomasz Trzci´nski, and Sebastian Cygert. Magmax: Leveraging model merging for seamless continual learning. InEuropean Conference on Computer Vision, pages 379–395. Springer, 2024

work page 2024

[53] [53]

Parameter competition balancing for model merging

Guodong Du, Junlin Lee, Jing Li, Runhua Jiang, Yifei Guo, Shuyang Yu, Hanting Liu, Sim K Goh, Ho-Kin Tang, Daojing He, et al. Parameter competition balancing for model merging. Advances in Neural Information Processing Systems, 37:84746–84776, 2024

work page 2024

[54] [54]

Superpose task-specific features for model merging

Haiquan Qiu, You Wu, Dong Li, Jianmin Guo, and Quanming Yao. Superpose task-specific features for model merging. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 4200–4214, 2025. 13

work page 2025

[55] [55]

To see a world in a spark of neuron: Disentangling multi-task interference for training-free model merging

Zitao Fang, Guodong Du, Shuyang Yu, Yifei Guo, Yiwei Zhang, Yiyao Cao, Jing Li, Ho-Kin Tang, and Sim Kuan Goh. To see a world in a spark of neuron: Disentangling multi-task interference for training-free model merging. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 15731–15751, 2025

work page 2025

[56] [56]

Optmerge: Unifying multimodal llm capabilities and modalities via model merging.arXiv preprint arXiv:2505.19892, 2025

Yongxian Wei, Runxi Cheng, Weike Jin, Enneng Yang, Li Shen, Lu Hou, Sinan Du, Chun Yuan, Xiaochun Cao, and Dacheng Tao. Optmerge: Unifying multimodal llm capabilities and modalities via model merging.arXiv preprint arXiv:2505.19892, 2025

[57]

LLaMA: Open and Efficient Foundation Language Models

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023

[58]

Emma-500: Enhancing massively multilingual adaptation of large language models. arXiv preprint arXiv:2409.17892, 2024

Shaoxiong Ji, Zihao Li, Jaakko Paavola, Peiqin Lin, Pinzhen Chen, Dayyán O’Brien, Hengyu Luo, Hinrich Schütze, Jörg Tiedemann, and Barry Haddow. Emma-500: Enhancing massively multilingual adaptation of large language models. arXiv preprint arXiv:2409.17892, 2024

[59]

Qwen2 technical report. 2024

[60]

Qwen3-VL Technical Report

Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-vl technical report. arXiv preprint arXiv:2511.21631, 2025

[61]

AfriqueLLM: How Data Mixing and Model Architecture Impact Continued Pre-training for African Languages

Hao Yu, Tianyi Xu, Michael A Hedderich, Wassim Hamidouche, Syed Waqas Zamir, and David Ifeoluwa Adelani. Afriquellm: How data mixing and model architecture impact continued pre-training for african languages. arXiv preprint arXiv:2601.06395, 2026

[62]

Xcopa: A multilingual dataset for causal commonsense reasoning

Edoardo Maria Ponti, Goran Glavaš, Olga Majewska, Qianchu Liu, Ivan Vulić, and Anna Korhonen. Xcopa: A multilingual dataset for causal commonsense reasoning. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 2362–2376, 2020

[63]

Few-shot learning with multilingual generative language models

Xi Victoria Lin, Todor Mihaylov, Mikel Artetxe, Tianlu Wang, Shuohui Chen, Daniel Simig, Myle Ott, Naman Goyal, Shruti Bhosale, Jingfei Du, et al. Few-shot learning with multilingual generative language models. In Proceedings of the 2022 conference on empirical methods in natural language processing, pages 9019–9052, 2022

[64]

Xnli: Evaluating cross-lingual sentence representations

Alexis Conneau, Ruty Rinott, Guillaume Lample, Adina Williams, Samuel Bowman, Holger Schwenk, and Veselin Stoyanov. Xnli: Evaluating cross-lingual sentence representations. In Proceedings of the 2018 conference on empirical methods in natural language processing, pages 2475–2485, 2018

[65]

Visually grounded reasoning across languages and cultures

Fangyu Liu, Emanuele Bugliarello, Edoardo Maria Ponti, Siva Reddy, Nigel Collier, and Desmond Elliott. Visually grounded reasoning across languages and cultures. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 10467–10485, 2021

[66]

Afri-mcqa: Multimodal cultural question answering for african languages. arXiv preprint arXiv:2601.05699, 2026

Atnafu Lambebo Tonja, Srija Anand, Emilio Villa-Cueva, Israel Abebe Azime, Jesujoba Oluwadara Alabi, Muhidin A Mohamed, Debela Desalegn Yadeta, Negasi Haile Abadi, Abigail Oppong, Nnaemeka Casmir Obiefuna, et al. Afri-mcqa: Multimodal cultural question answering for african languages. arXiv preprint arXiv:2601.05699, 2026

[67]

Are we on the right way for evaluating large vision-language models? Advances in Neural Information Processing Systems, 37:27056–27087, 2024

Lin Chen, Jinsong Li, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Zehui Chen, Haodong Duan, Jiaqi Wang, Yu Qiao, Dahua Lin, et al. Are we on the right way for evaluating large vision-language models? Advances in Neural Information Processing Systems, 37:27056–27087, 2024

[68]

Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi

Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, et al. Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9556–9567, 2024.

[69]

Seed-bench-2-plus: Benchmarking multimodal large language models with text-rich visual comprehension. arXiv preprint arXiv:2404.16790, 2024

Bohao Li, Yuying Ge, Yi Chen, Yixiao Ge, Ruimao Zhang, and Ying Shan. Seed-bench-2-plus: Benchmarking multimodal large language models with text-rich visual comprehension. arXiv preprint arXiv:2404.16790, 2024

[70]

Lessons from the Trenches on Reproducible Evaluation of Language Models

Stella Biderman, Hailey Schoelkopf, Lintang Sutawika, Leo Gao, Jonathan Tow, Baber Abbasi, Alham Fikri Aji, Pawan Sasanka Ammanamanchi, Sidney Black, Jordan Clive, et al. Lessons from the trenches on reproducible evaluation of language models. arXiv preprint arXiv:2405.14782, 2024

[71]

Lmms-eval: Reality check on the evaluation of large multimodal models

Kaichen Zhang, Bo Li, Peiyuan Zhang, Fanyi Pu, Joshua Adrian Cahyono, Kairui Hu, Shuai Liu, Yuanhan Zhang, Jingkang Yang, Chunyuan Li, et al. Lmms-eval: Reality check on the evaluation of large multimodal models. In Findings of the Association for Computational Linguistics: NAACL 2025, pages 881–916, 2025.

A Baselines and Benchmarks All experiments were...
