Recognition: 2 Lean theorem links
One Model to Translate Them All? A Journey to Mount Doom for Multilingual Model Merging
Pith reviewed 2026-05-13 20:01 UTC · model grok-4.3
The pith
Merging fine-tuned multilingual translation models degrades performance because fine-tuning redistributes language selectivity instead of sharpening it.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Our experiments reveal that merging degrades performance, especially when target languages differ. Critically, fine-tuning redistributes rather than sharpens language selectivity: neurons for supervised and related languages become less exclusive, while those for unsupervised languages grow more isolated. This redistribution increases representational divergence in higher layers that govern generation.
What carries the argument
Span-conditioned neuron selectivity combined with layer-wise centered kernel alignment (CKA), which tracks how language specificity redistributes during fine-tuning and measures resulting representational divergence.
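Linear CKA, the second half of this machinery, is compact enough to sketch. Below is a minimal pure-Python version for two activation matrices over the same inputs (rows = samples, columns = neurons); the paper's exact kernel and preprocessing choices are not specified here, so treat this as an illustrative baseline rather than the authors' implementation.

```python
import math

def center(X):
    # Column-center: subtract each neuron's mean activation across samples.
    n = len(X)
    means = [sum(row[j] for row in X) / n for j in range(len(X[0]))]
    return [[x - m for x, m in zip(row, means)] for row in X]

def gram(X):
    # Linear kernel K = X X^T: sample-by-sample similarity matrix.
    return [[sum(a * b for a, b in zip(r1, r2)) for r2 in X] for r1 in X]

def frob_inner(A, B):
    # Frobenius inner product <A, B>_F = sum_ij A_ij * B_ij.
    return sum(a * b for ra, rb in zip(A, B) for a, b in zip(ra, rb))

def linear_cka(X, Y):
    # CKA(X, Y) = <Kx, Ky>_F / (||Kx||_F * ||Ky||_F) on centered activations.
    Kx, Ky = gram(center(X)), gram(center(Y))
    return frob_inner(Kx, Ky) / math.sqrt(frob_inner(Kx, Kx) * frob_inner(Ky, Ky))
```

Running `linear_cka` on the same activations (or any isotropic rescaling of them) returns 1.0; divergence between two fine-tuned models at a given layer shows up as values well below 1.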
If this is right
- Performance degradation is more pronounced for dissimilar target languages.
- Language-specific neurons are primarily in embedding layers and upper transformer blocks.
- Fine-tuning reduces the exclusivity of neurons associated with supervised languages.
- Increased divergence in higher layers correlates with poorer generation after merging.
Where Pith is reading between the lines
- Alternative merging techniques that account for layer-specific divergence may be needed for multilingual settings.
- The redistribution effect could limit merging success in other sequence generation tasks involving multiple languages.
- Future work might explore selective merging of only lower layers where representations remain shared.
Load-bearing premise
The redistribution of neuron selectivity observed through selectivity measures and CKA is the primary driver of why merging fails, as opposed to other possible factors like training data differences or optimization details.
What would settle it
Measuring merging performance on models where selectivity redistribution is artificially prevented or matched, and seeing whether degradation still occurs.
Original abstract
Weight-space model merging combines independently fine-tuned models without accessing original training data, offering a practical alternative to joint training. While merging succeeds in multitask settings, its behavior in multilingual contexts remains poorly understood. We systematically study weight-space merging for multilingual machine translation by fully fine-tuning language models on large-scale bilingual corpora and evaluating standard merging strategies. Our experiments reveal that merging degrades performance, especially when target languages differ. To explain this failure, we analyze internal representations using span-conditioned neuron selectivity and layer-wise centered kernel alignment. We find that language-specific neurons concentrate in embedding layers and upper transformer blocks, while intermediate layers remain largely shared across languages. Critically, fine-tuning redistributes rather than sharpens language selectivity: neurons for supervised and related languages become less exclusive, while those for unsupervised languages grow more isolated. This redistribution increases representational divergence in higher layers that govern generation. These findings suggest that multilingual fine-tuning may reshape geometry in ways that reduce compatibility with standard weight-space merging assumptions. Our work thus provides an explanation for why merging fails in multilingual translation scenarios.
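As background for the "standard merging strategies" the abstract mentions, here is a minimal sketch of two common ones, uniform averaging (model soups) and task arithmetic, with parameters reduced to plain floats for illustration; real implementations operate on full tensors per named parameter, and the coefficient `lam` is an illustrative choice, not a value from the paper.

```python
def soup(state_dicts):
    # Model soup: uniform average of each named parameter across models.
    n = len(state_dicts)
    return {name: sum(sd[name] for sd in state_dicts) / n
            for name in state_dicts[0]}

def task_arithmetic(base, finetuned, lam=0.5):
    # Task arithmetic (Ilharco et al., 2022): add scaled task vectors
    # (theta_ft - theta_base) for each fine-tuned model to the base weights.
    merged = dict(base)
    for ft in finetuned:
        for name in merged:
            merged[name] += lam * (ft[name] - base[name])
    return merged
```

Both strategies assume the models' weights live in a shared, compatible region of parameter space, which is exactly the assumption the paper argues multilingual fine-tuning breaks.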
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper examines weight-space merging of independently fine-tuned multilingual machine translation models. It reports systematic performance degradation after merging, especially when target languages differ, and uses span-conditioned neuron selectivity and layer-wise CKA to argue that fine-tuning redistributes rather than sharpens language selectivity—making supervised-language neurons less exclusive and unsupervised ones more isolated—which increases representational divergence in higher layers that control generation.
Significance. If the central empirical observations hold, the work supplies a concrete explanation for why standard merging techniques fail in multilingual MT, highlighting how multilingual fine-tuning alters transformer geometry in ways incompatible with current weight-space assumptions. The systematic comparison across merging strategies and the focus on internal representations provide useful diagnostic tools for future merging research.
major comments (2)
- [§4] Merging experiments: Performance degradation is documented, but the manuscript provides insufficient detail on control conditions, joint-training baselines, and statistical tests for the reported drops; without these it is difficult to quantify how much of the failure is attributable to merging versus other factors.
- [§5] Representation analysis: The claim that neuron-selectivity redistribution is the primary causal driver of increased higher-layer divergence and merging failure rests entirely on post-hoc correlations from span-conditioned selectivity and CKA; no intervention, ablation, or controlled comparison isolates this mechanism from confounds such as language-specific data volume, optimization trajectories, or gradient conflicts.
minor comments (2)
- [§5.1] The exact operational definition and hyper-parameters of 'span-conditioned neuron selectivity' should be stated more explicitly so that the measure can be reproduced.
- [Figure 3] Figure captions and axis labels for the CKA heatmaps could be expanded to indicate the precise layer ranges and language pairs being compared.
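The reproducibility concern in the first minor comment can be made concrete. Since the paper's operational definition is not reproduced in this review, the following is a hypothetical contrast-style selectivity score for a single neuron, one plausible instantiation rather than the authors' measure.

```python
def selectivity(acts_by_lang, lang):
    # Hypothetical exclusivity score for one neuron: mean activation on spans
    # of `lang` minus the mean over spans of all other languages, normalized
    # to [-1, 1]. (The paper's exact operational definition may differ.)
    inside = sum(acts_by_lang[lang]) / len(acts_by_lang[lang])
    others = [a for l, xs in acts_by_lang.items() if l != lang for a in xs]
    outside = sum(others) / len(others)
    denom = abs(inside) + abs(outside)
    return (inside - outside) / denom if denom else 0.0
```

Under a definition like this, the paper's "redistribution" claim corresponds to supervised-language scores drifting toward 0 after fine-tuning while unsupervised-language scores move toward the extremes.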
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below and have revised the manuscript where feasible to improve clarity and rigor.
Point-by-point responses
- Referee [§4]: Performance degradation is documented, but the manuscript provides insufficient detail on control conditions, joint-training baselines, and statistical tests for the reported drops; without these it is difficult to quantify how much of the failure is attributable to merging versus other factors.
Authors: We agree that additional controls and baselines are necessary to better isolate the contribution of merging. In the revised §4 we now include joint-training baselines in which a single model is trained on the concatenated bilingual data for the relevant language pairs. We also report all merging results with standard deviations computed over three independent runs and include paired t-tests to evaluate the statistical significance of the performance drops relative to the unmerged models. These additions clarify the extent to which degradation is attributable to merging rather than data volume or training procedure. Revision: yes.
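The paired t-test the authors describe is worth pinning down. A minimal sketch over matched runs, one merged-vs-unmerged score pair per random seed (the three-run setup is the authors'; the helper name and pure-Python form are illustrative):

```python
import math

def paired_t(merged_scores, baseline_scores):
    # Paired t statistic over matched runs (e.g. merged vs. unmerged BLEU,
    # one pair per seed). Compare against t(n-1) critical values.
    diffs = [a - b for a, b in zip(merged_scores, baseline_scores)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)  # sample variance
    return mean / math.sqrt(var / n)
```

With only three runs the test has few degrees of freedom, so pairing by seed (rather than an unpaired comparison) is what makes the reported significance credible.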
- Referee [§5]: The claim that neuron-selectivity redistribution is the primary causal driver of increased higher-layer divergence and merging failure rests entirely on post-hoc correlations from span-conditioned selectivity and CKA; no intervention, ablation, or controlled comparison isolates this mechanism from confounds such as language-specific data volume, optimization trajectories, or gradient conflicts.
Authors: We concur that the current evidence is correlational. While we cannot introduce new interventional ablations within the scope of this revision, we have strengthened §5 by (i) reporting quantitative correlation coefficients between changes in span-conditioned selectivity and layer-wise CKA divergence, (ii) adding a controlled subsampling experiment in the appendix that equalizes data volume across languages, and (iii) explicitly discussing alternative explanations such as optimization trajectories and gradient conflicts in an expanded limitations paragraph. These steps provide tighter observational support but do not fully isolate causality. Revision: partial.
- Full causal isolation of neuron-selectivity redistribution from all listed confounds would require targeted interventions or ablations that are computationally prohibitive in the present revision cycle.
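The correlation coefficients promised in (i) presumably pair per-layer selectivity shifts with per-layer CKA divergence, for which a plain Pearson coefficient suffices. A self-contained sketch (the choice of inputs to correlate is an assumption):

```python
import math

def pearson(xs, ys):
    # Pearson correlation between, e.g., per-layer change in selectivity and
    # per-layer CKA divergence (1 - CKA) after fine-tuning.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)
```

A strong positive coefficient here would support, but still not prove, the redistribution-drives-divergence reading the referee flags as correlational.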
Circularity Check
No significant circularity: the central claims rest on empirical observations from merging experiments and representation analysis.
Full rationale
The paper reports results from fine-tuning multilingual models on bilingual corpora, applying standard merging strategies, and measuring performance degradation plus internal representations via span-conditioned neuron selectivity and layer-wise CKA. No equations, fitted parameters renamed as predictions, self-definitional constructs, or load-bearing self-citations appear in the derivation chain. Central claims rest on direct experimental measurements rather than any reduction to inputs by construction. The analysis is self-contained against external benchmarks of model behavior.
Axiom & Free-Parameter Ledger
axioms (2)
- Domain assumption: span-conditioned neuron selectivity reliably identifies language-specific neurons.
- Standard math: layer-wise centered kernel alignment measures meaningful representational similarity across languages.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · reality_from_one_distinction · unclear
Relation between the paper passage and the cited Recognition theorem is unclear.
Paper passage: "fine-tuning redistributes rather than sharpens language selectivity: neurons for supervised and related languages become less exclusive, while those for unsupervised languages grow more isolated. This redistribution increases representational divergence in higher layers"
- IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean · J_uniquely_calibrated_via_higher_derivative · unclear
Relation between the paper passage and the cited Recognition theorem is unclear.
Paper passage: "layer-wise centered kernel alignment... principal angles between their representation subspaces"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[3]
Samuel Ainsworth, Jonathan Hayase, and Siddhartha Srinivasa. 2023. https://openreview.net/forum?id=CQsmMYmlP5T Git re-basin: Merging models modulo permutation symmetries . In The Eleventh International Conference on Learning Representations
- [4]
-
[5]
Dan Biderman, Jacob Portes, Jose Javier Gonzalez Ortiz, Mansheej Paul, Philip Greengard, Connor Jennings, Daniel King, Sam Havens, Vitaliy Chiley, Jonathan Frankle, Cody Blakeney, and John Patrick Cunningham. 2024. https://openreview.net/forum?id=aloEru2qCG LoRA learns less and forgets less. Transactions on Machine Learning Research. Featured Certification
-
[6]
Tyler A. Chang, Catherine Arnett, Zhuowen Tu, and Benjamin K. Bergen. 2024. https://doi.org/10.18653/v1/2024.emnlp-main.236 When is multilinguality a curse? language modeling for 250 high- and low-resource languages . In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 4074--4096, Miami, Florida, USA. Associati...
-
[7]
Sanjay Chouhan, Shubha Brata Nath, and Aparajita Dutta. 2024. https://doi.org/10.1007/978-3-031-78172-8_17 Hindillm: Large language model for hindi . In Pattern Recognition: 27th International Conference, ICPR 2024, Kolkata, India, December 1–5, 2024, Proceedings, Part VI, page 255–270, Berlin, Heidelberg. Springer-Verlag
- [8]
-
[9]
Baban Gain, Dibyanayan Bandyopadhyay, Asif Ekbal, and Trilok Nath Singh. 2026. http://arxiv.org/abs/2504.01919 Bridging the linguistic divide: A survey on leveraging large language models for machine translation
- [10]
-
[11]
Mor Geva, Roei Schuster, Jonathan Berant, and Omer Levy. 2021. https://doi.org/10.18653/v1/2021.emnlp-main.446 Transformer feed-forward layers are key-value memories . In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 5484--5495, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics
-
[12]
Charles Goddard, Shamane Siriwardhana, Malikeh Ehghaghi, Luke Meyers, Vladimir Karpukhin, Brian Benedict, Mark McQuade, and Jacob Solawetz. 2024. https://doi.org/10.18653/v1/2024.emnlp-industry.36 Arcee's MergeKit: A toolkit for merging large language models. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: ...
-
[13]
Arthur Gretton, Olivier Bousquet, Alex Smola, and Bernhard Schölkopf. 2005. https://doi.org/10.1007/11564089_7 Measuring statistical dependence with Hilbert-Schmidt norms. In Proceedings of the 16th International Conference on Algorithmic Learning Theory, ALT'05, page 63–77, Berlin, Heidelberg. Springer-Verlag
-
[14]
Daniil Gurgurov, Katharina Trinley, Yusser Al Ghussin, Tanja Baeumel, Josef Van Genabith, and Simon Ostermann. 2025. https://aclanthology.org/2025.ijcnlp-long.156/ Language arithmetics: Towards systematic language neuron identification and manipulation . In Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th C...
- [15]
- [16]
- [17]
-
[18]
Chenyu Huang, Peng Ye, Tao Chen, Tong He, Xiangyu Yue, and Wanli Ouyang. 2024. https://openreview.net/forum?id=lYdjzx3DYu EMR-merging: Tuning-free high-performance model merging. In The Thirty-eighth Annual Conference on Neural Information Processing Systems
-
[19]
Gabriel Ilharco, Marco Tulio Ribeiro, Mitchell Wortsman, Suchin Gururangan, Ludwig Schmidt, Hannaneh Hajishirzi, and Ali Farhadi. 2022. Editing models with task arithmetic. arXiv preprint arXiv:2212.04089
-
[20]
Juyong Jiang, Fan Wang, Jiasi Shen, Sungju Kim, and Sunghun Kim. 2026. https://doi.org/10.1145/3747588 A survey on large language models for code generation . ACM Trans. Softw. Eng. Methodol., 35(2)
- [21]
-
[22]
Simon Kornblith, Mohammad Norouzi, Honglak Lee, and Geoffrey Hinton. 2019. https://proceedings.mlr.press/v97/kornblith19a.html Similarity of neural network representations revisited . In Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 3519--3529. PMLR
-
[23]
Ilya Loshchilov and Frank Hutter. 2019. https://openreview.net/forum?id=Bkg6RiCqY7 Decoupled weight decay regularization . In International Conference on Learning Representations
-
[24]
Michael S Matena and Colin A Raffel. 2022. Merging models with fisher-weighted averaging. Advances in Neural Information Processing Systems, 35:17703--17716
-
[25]
Kevin Meng, David Bau, Alex J Andonian, and Yonatan Belinkov. 2022. https://openreview.net/forum?id=-h6WAS6eE4 Locating and editing factual associations in GPT . In Advances in Neural Information Processing Systems
-
[26]
Soumen Kumar Mondal, Sayambhu Sen, Abhishek Singhania, and Preethi Jyothi. 2025. https://doi.org/10.18653/v1/2025.insights-1.6 Language-specific neurons do not facilitate cross-lingual transfer . In The Sixth Workshop on Insights from Negative Results in NLP, pages 46--62, Albuquerque, New Mexico. Association for Computational Linguistics
- [27]
-
[28]
Xingyu Qu and Samuel Horváth. 2025. https://proceedings.mlr.press/v280/qu25a.html Vanishing feature: Diagnosing model merging and beyond. In Conference on Parsimony and Learning, volume 280 of Proceedings of Machine Learning Research, pages 1051--1086. PMLR
-
[29]
Nishat Raihan and Marcos Zampieri. 2025. https://doi.org/10.18653/v1/2025.acl-short.69 TigerLLM - a family of Bangla large language models. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 887--896, Vienna, Austria. Association for Computational Linguistics
-
[30]
Gowtham Ramesh, Sumanth Doddapaneni, Aravinth Bheemaraj, Mayank Jobanputra, Raghavan AK, Ajitesh Sharma, Sujit Sahoo, Harshita Diddee, Mahalakshmi J, Divyanshu Kakwani, Navneet Kumar, Aswin Pradeep, Srihari Nagaraj, Kumar Deepak, Vivek Raghavan, Anoop Kunchukuttan, Pratyush Kumar, and Mitesh Shantadevi Khapra. 2022. https://doi.org/10.1162/tacl_a_00452 Sa...
- [31]
-
[32]
Skyler Seto, Maartje Ter Hoeve, Maureen de Seyssel, and David Grangier. 2025. https://doi.org/10.18653/v1/2025.findings-emnlp.1236 Assessing the role of data quality in training bilingual language models . In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 22694--22720, Suzhou, China. Association for Computational Linguistics
- [33]
-
[34]
Tianyi Tang, Wenyang Luo, Haoyang Huang, Dongdong Zhang, Xiaolei Wang, Xin Zhao, Furu Wei, and Ji-Rong Wen. 2024. https://doi.org/10.18653/v1/2024.acl-long.309 Language-specific neurons: The key to multilingual capabilities in large language models . In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long...
-
[35]
Mingxu Tao, Chen Zhang, Quzhe Huang, Tianyao Ma, Songfang Huang, Dongyan Zhao, and Yansong Feng. 2024. https://doi.org/10.18653/v1/2024.findings-emnlp.508 Unlocking the potential of model merging for low-resource languages . In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 8705--8720, Miami, Florida, USA. Association for Com...
-
[36]
NLLB Team, Marta R. Costa-jussà, James Cross, Onur Çelebi, Maha Elbayad, Kenneth Heafield, Kevin Heffernan, Elahe Kalbassi, Janice Lam, Daniel Licht, Jean Maillard, Anna Sun, Skyler Wang, Guillaume Wenzek, Al Youngblood, Bapi Akula, Loic Barrault, Gabriel Mejia Gonzalez, Prangthip Hansanti, John Hoffman, Semarley Jarrett, Kaushik Ram Sadagopan, Dirk Rowe,...
-
[37]
Elena Voita, Javier Ferrando, and Christoforos Nalmpantis. 2024. https://doi.org/10.18653/v1/2024.findings-acl.75 Neurons in large language models: Dead, n-gram, positional . In Findings of the Association for Computational Linguistics: ACL 2024, pages 1288--1301, Bangkok, Thailand. Association for Computational Linguistics
-
[38]
Fanqi Wan, Longguang Zhong, Ziyi Yang, Ruijun Chen, and Xiaojun Quan. 2025. https://doi.org/10.18653/v1/2025.emnlp-main.1096 FuseChat: Knowledge fusion of chat models. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 21618--21642, Suzhou, China. Association for Computational Linguistics
-
[39]
Mitchell Wortsman, Gabriel Ilharco, Samir Ya Gadre, Rebecca Roelofs, Raphael Gontijo-Lopes, Ari S Morcos, Hongseok Namkoong, Ali Farhadi, Yair Carmon, Simon Kornblith, et al. 2022. Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time. In International conference on machine learning, pages 23965--...
-
[40]
Yuxin Xiao, Zhen Huang, Wenxiao Wang, Binbin Lin, Xiaofei He, Xu Shen, and Jieping Ye. 2026. https://openreview.net/forum?id=HIXPyQ1aMq How do language models speak languages? a case study on unintended code-switching
-
[41]
Prateek Yadav, Derek Tam, Leshem Choshen, Colin Raffel, and Mohit Bansal. 2023. Ties-merging: resolving interference when merging models. In Proceedings of the 37th International Conference on Neural Information Processing Systems, NIPS '23, Red Hook, NY, USA. Curran Associates Inc
-
[42]
Enneng Yang, Li Shen, Guibing Guo, Xingwei Wang, Xiaochun Cao, Jie Zhang, and Dacheng Tao. 2026. https://doi.org/10.1145/3787849 Model merging in LLMs, MLLMs, and beyond: Methods, theories, applications, and opportunities. ACM Comput. Surv., 58(8)
-
[43]
Le Yu, Bowen Yu, Haiyang Yu, Fei Huang, and Yongbin Li. 2024. Language models are super mario: Absorbing abilities from homologous models as a free lunch. In Forty-first International Conference on Machine Learning
-
[44]
Yang Zhang, Hanlei Jin, Dan Meng, Jun Wang, and Jinghua Tan. 2026. https://doi.org/10.1016/j.neucom.2025.131928 A comprehensive survey on automatic text summarization with exploration of LLM-based methods. Neurocomputing, 663:131928
-
[45]
Yiran Zhao, Wenxuan Zhang, Huiming Wang, Kenji Kawaguchi, and Lidong Bing. 2025. https://doi.org/10.18653/v1/2025.naacl-long.493 AdaMergeX: Cross-lingual transfer with large language models via adaptive adapter merging. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Huma...
-
[46]
Yaowei Zheng, Richong Zhang, Junhao Zhang, Yanhan Ye, and Zheyan Luo. 2024. https://doi.org/10.18653/v1/2024.acl-demos.38 LlamaFactory: Unified efficient fine-tuning of 100+ language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations), pages 400--410, Bangkok, Thailand. A...