On the Limits of Model Merging for Multilinguality in Pre-Training

Aleksandr Umnov; Christof Monz; Fedor Vitiugin; Khalil Sima'an; Seth Aycock

arxiv: 2605.25846 · v1 · pith:652SEMUFnew · submitted 2026-05-25 · 💻 cs.CL

Seth Aycock, Fedor Vitiugin, Aleksandr Umnov, Christof Monz, Khalil Sima'an

Pith reviewed 2026-06-29 21:18 UTC · model grok-4.3

classification 💻 cs.CL

keywords: model merging, multilingual pre-training, performance collapse, representational similarity, interference, monolingual models, fine-tuning

The pith

Merging any combination of monolingual pre-trained models leads to multilingual performance collapse due to interference.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether post-training model merging can turn separate monolingual pre-trained models into a single multilingual one. It finds that monolingual pre-training produces strong in-language results, yet any merge of those models triggers sharp drops in performance across languages. The cause is interference arising from dissimilar internal representations. This shows that the merging approach known to work in fine-tuning does not transfer to the pre-training stage for building multilinguality.
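To make "merging" concrete, here is a minimal sketch of one common form, a weighted element-wise average of parameters. This page does not say which merging methods the paper evaluates in full, so treat linear averaging as an assumption; the parameter names and values below are toy stand-ins, not the paper's checkpoints.

```python
from typing import Dict, List

Params = Dict[str, List[float]]  # parameter name -> flattened weights


def linear_merge(models: List[Params], weights: List[float]) -> Params:
    """Weighted element-wise average over the parameters of the first model.

    Every other model must provide the same names with the same lengths.
    """
    if not models or len(models) != len(weights):
        raise ValueError("need one mixing weight per model")
    total = sum(weights)
    merged: Params = {}
    for name, ref in models[0].items():
        for m in models:
            if name not in m or len(m[name]) != len(ref):
                raise ValueError(f"parameter {name!r} is missing or mis-shaped")
        merged[name] = [
            sum(w * m[name][i] for w, m in zip(weights, models)) / total
            for i in range(len(ref))
        ]
    return merged


# Hypothetical toy "checkpoints" for two languages.
model_en = {"layer.weight": [1.0, -2.0, 0.5]}
model_de = {"layer.weight": [3.0, 2.0, -0.5]}
merged = linear_merge([model_en, model_de], weights=[0.5, 0.5])
print(merged)  # {'layer.weight': [2.0, 0.0, 0.0]}
```

In this toy case two of the three averaged weights cancel to zero, which gives a rough intuition for how averaging parameters that encode different solutions can erase what each model learned.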

Core claim

Monolingual pre-training results in strong in-language performance, but merging any combination of monolingual models leads to performance collapse due to interference. Representational similarity is a prerequisite for model merging. The flexibility of merging in fine-tuning therefore does not extend trivially to language-specific pre-training.

What carries the argument

Representational similarity between models, required to prevent interference during merging.
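The figure captions on this page name CKA as the similarity measure. A minimal sketch of linear CKA between two activation matrices follows, assuming rows are the same inputs fed to both models and columns are hidden features; the function names are illustrative, and the paper's exact CKA variant and layer selection are not stated here.

```python
import math
from typing import List

Matrix = List[List[float]]  # rows = examples, columns = features


def center_columns(x: Matrix) -> Matrix:
    """Subtract each column's mean so every feature has zero mean."""
    n = len(x)
    means = [sum(col) / n for col in zip(*x)]
    return [[v - m for v, m in zip(row, means)] for row in x]


def cross_frobenius_sq(x: Matrix, y: Matrix) -> float:
    """Squared Frobenius norm of X^T Y, summed column pair by column pair."""
    total = 0.0
    for xc in zip(*x):
        for yc in zip(*y):
            total += sum(a * b for a, b in zip(xc, yc)) ** 2
    return total


def linear_cka(x: Matrix, y: Matrix) -> float:
    """Linear CKA: ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F) on centered data."""
    if len(x) != len(y):
        raise ValueError("both matrices need one row per shared example")
    xc, yc = center_columns(x), center_columns(y)
    den = math.sqrt(cross_frobenius_sq(xc, xc) * cross_frobenius_sq(yc, yc))
    if den == 0.0:
        raise ValueError("at least one matrix has no variance")
    return cross_frobenius_sq(xc, yc) / den


# Toy activations: y is x rotated by 90 degrees and scaled by 2.
x = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0], [2.0, 1.0]]
y = [[-2.0 * b, 2.0 * a] for a, b in x]
print(round(linear_cka(x, y), 6))  # close to 1.0: CKA ignores rotation and scale
```

Linear CKA is invariant to rotations and uniform scaling of the feature space, so a high score means two models organize the same inputs similarly even when individual neurons do not line up.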

If this is right

Mixed pre-training data is required to achieve consistent multilingual performance.
Model merging works only when the base models already share similar representations.
Isolated monolingual pre-training produces models that cannot be merged without loss.
Post-training merging cannot substitute for joint pre-training when building multilingual capabilities.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Merging may succeed after models have undergone some shared pre-training that aligns their representations.
The result suggests that fine-tuning merging benefits from prior convergence on common features that pre-training merging lacks.
Scaling multilingual systems may need to prioritize joint data mixing over modular merging strategies.

Load-bearing premise

The monolingual models were trained in complete isolation with no shared data, vocabulary overlap, or initialization that could reduce representational differences.

What would settle it

A controlled experiment in which merging two monolingual models succeeds without collapse once the models are made to share high representational similarity through joint initialization or overlapping data.

Figures

Figures reproduced from arXiv: 2605.25846 by Aleksandr Umnov, Christof Monz, Fedor Vitiugin, Khalil Sima'an, Seth Aycock.

**Figure 2.** MultiBLiMP accuracy per-language for linear …

**Figure 3.** Mean layer-wise CKA between monolingual HPLT1 models significantly correlates (in terms of Spearman’s ρ and Pearson’s r) with smaller merge performance drop ∆ from monolingual to bilingual merged models. This suggests increasing representational similarity improves merge success.
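The kind of analysis the Figure 3 caption describes can be sketched as below: compute Pearson's r and Spearman's ρ between a per-pair similarity score and a per-pair performance drop. The numbers are invented for illustration and are not taken from the paper; the drop is assumed to be recorded as a positive magnitude, so "higher similarity, smaller drop" shows up as a negative correlation.

```python
import math
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = shared
        i = j + 1
    return result


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rho, computed as Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Invented values for five hypothetical language pairs (not paper data).
mean_cka = [0.20, 0.35, 0.50, 0.65, 0.80]
merge_drop = [40.0, 31.0, 33.0, 18.0, 9.0]  # accuracy points lost after merging

print(round(spearman(mean_cka, merge_drop), 3))  # -0.9
print(pearson(mean_cka, merge_drop) < 0)         # True
```

Reporting both coefficients is a reasonable design choice because Spearman's ρ only assumes a monotonic relationship, while Pearson's r tests for a linear one.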

Original abstract

Endowing models with consistent multilingual performance can be achieved by mixing pre-training data, or post-training approaches such as language-specific model merging. In this work, we test whether merging can be applied to monolingually pre-trained models. We conduct a controlled study on the efficacy of mixed, merged, and monolingual pre-training setups. We find that while monolingual pre-training results in strong in-language performance, merging any combination of monolingual models leads to performance collapse due to interference. Our analysis suggests representational similarity is a prerequisite for model merging. We therefore conclude that the flexibility of merging in fine-tuning does not extend trivially to language-specific pre-training.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

Merging monolingual pre-trained models collapses due to interference while mixed pre-training holds up, but the isolation of the models is not clearly established.

The key point here is that separate monolingual pre-training followed by merging fails badly, while training on mixed data from the start does not. The authors position this as evidence that merging works in fine-tuning but does not transfer to the pre-training stage because the models lack sufficient representational similarity.
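One well-known way that independently trained networks can end up with mismatched representations is permutation symmetry: hidden units can be reordered without changing the function. The toy below is an illustration of that general mechanism under invented weights, not a reconstruction of the paper's analysis. Two tiny ReLU networks both compute |x|, yet their weight average computes the zero function.

```python
from typing import List


def relu(z: float) -> float:
    return max(0.0, z)


def forward(w_in: List[float], w_out: List[float], x: float) -> float:
    """One-hidden-layer ReLU network on a scalar input, without biases."""
    return sum(o * relu(i * x) for i, o in zip(w_in, w_out))


def average(a: List[float], b: List[float]) -> List[float]:
    """Element-wise mean of two weight vectors."""
    return [(p + q) / 2 for p, q in zip(a, b)]


# Model A computes relu(x) + relu(-x) = |x|.
a_in, a_out = [1.0, -1.0], [1.0, 1.0]
# Model B is model A with its two hidden units swapped: same function.
b_in, b_out = [-1.0, 1.0], [1.0, 1.0]

# Naive merge: the input weights cancel, so every hidden unit goes silent.
m_in, m_out = average(a_in, b_in), average(a_out, b_out)

for x in (-2.0, 0.5, 3.0):
    print(x, forward(a_in, a_out, x), forward(b_in, b_out, x), forward(m_in, m_out, x))
```

Both parents are functionally identical here, so the collapse comes purely from misaligned internal coordinates. Monolingual models trained on different data can differ in more ways than a permutation, which is why this should be read as intuition only.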

The controlled comparison between the three setups is the main contribution. They show strong in-language results from monolingual training and then document the collapse after merging. That negative finding is useful for anyone considering merging as a compute-saving step at pre-training scale.

The main weakness is the lack of any experimental detail in the abstract and the open question about how isolated the monolingual models actually were. If the models shared a tokenizer, vocabulary, or random seed, the interference could be an artifact of that overlap rather than a general limit on merging. The paper would need to show the training configurations clearly to rule this out. Without those specifics it is difficult to judge how far the result generalizes.

This is relevant for people building multilingual models who are weighing data mixing against post-hoc merging. A reader already working on model merging or multilingual pre-training would find the comparison worth seeing, even if the current version is light on numbers and methods.

It is worth sending to referees. The question is practical and the setup is a reasonable test, but any review would need to press for full experimental transparency and checks on shared components.

Referee Report

1 major / 1 minor

Summary. The paper claims that while monolingual pre-training yields strong in-language performance, merging any combination of monolingual models leads to performance collapse due to interference. It concludes from a controlled study of mixed, merged, and monolingual pre-training setups that representational similarity is a prerequisite for model merging, and that the flexibility of merging observed in fine-tuning does not extend trivially to language-specific pre-training.

Significance. If the empirical result holds under properly isolated conditions, it would indicate a fundamental limit on using post-hoc model merging to achieve multilinguality at pre-training scale, thereby reinforcing the necessity of mixed-data pre-training over merging-based alternatives.

major comments (1)

[Abstract] The central claim that 'merging any combination of monolingual models leads to performance collapse due to interference' is load-bearing for the conclusion, yet the abstract (and by extension the controlled study description) supplies no information on whether the monolingual models shared a tokenizer, BPE vocabulary, or random initialization. This directly affects whether the observed collapse can be attributed to interference rather than insufficient isolation between the models.

minor comments (1)

[Abstract] The abstract would benefit from explicit mention of the metrics, baselines, model sizes, and languages used in the controlled study to allow immediate assessment of the strength of the negative result.

Simulated Authors' Rebuttal

1 response · 0 unresolved

We thank the referee for their careful reading and constructive feedback. The major comment identifies a clarity issue in the abstract that we will address through revision.

Referee: [Abstract] The central claim that 'merging any combination of monolingual models leads to performance collapse due to interference' is load-bearing for the conclusion, yet the abstract (and by extension the controlled study description) supplies no information on whether the monolingual models shared a tokenizer, BPE vocabulary, or random initialization. This directly affects whether the observed collapse can be attributed to interference rather than insufficient isolation between the models.

Authors: We agree that the abstract should explicitly state these experimental conditions to strengthen the attribution to interference. Each monolingual model was trained independently from a distinct random initialization on language-specific data; a shared BPE vocabulary was used across models to enable merging. We will revise the abstract to include this information. The methods section of the paper already details the independent training procedure, but we will ensure the abstract summary is self-contained on this point.

Revision: yes
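The remark that a shared vocabulary "enables merging" can be illustrated with a shape check: element-wise merging needs every parameter to exist in both models with identical dimensions, and the embedding matrix has one row per vocabulary entry. The sketch below uses hypothetical parameter names and sizes, none of which come from the paper.

```python
from typing import Dict, List, Tuple

Shapes = Dict[str, Tuple[int, ...]]  # parameter name -> tensor shape


def merge_blockers(a: Shapes, b: Shapes) -> List[str]:
    """List the reasons two models cannot be merged element-wise."""
    problems: List[str] = []
    for name in sorted(set(a) | set(b)):
        if name not in a or name not in b:
            problems.append(f"{name}: present in only one model")
        elif a[name] != b[name]:
            problems.append(f"{name}: shape {a[name]} vs {b[name]}")
    return problems


# Hypothetical models: hidden size 512, vocabulary size sets the embedding rows.
shared_vocab_en = {"embed.weight": (32000, 512), "mlp.weight": (512, 2048)}
shared_vocab_de = {"embed.weight": (32000, 512), "mlp.weight": (512, 2048)}
own_vocab_de = {"embed.weight": (50000, 512), "mlp.weight": (512, 2048)}

print(merge_blockers(shared_vocab_en, shared_vocab_de))  # []
print(merge_blockers(shared_vocab_en, own_vocab_de))
```

Matching shapes are necessary but not sufficient: with separately trained tokenizers of equal size, row i of each embedding matrix would still stand for a different token, so a shared vocabulary also keeps the rows semantically aligned.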

Circularity Check

0 steps flagged

Empirical claims with no derivation chain or self-referential reduction

The paper reports results from a controlled empirical study comparing mixed, merged, and monolingual pre-training setups. The central finding—that merging monolingual models leads to performance collapse—is presented as an observation from experiments rather than a mathematical derivation. No equations, fitted parameters renamed as predictions, or self-citation chains that bear the load of the main claim appear in the abstract or described structure. The analysis of representational similarity as a prerequisite is framed as a post-hoc suggestion from the data, not a self-definitional or ansatz-smuggled result. The study is self-contained against external benchmarks via its experimental controls, with no load-bearing steps that reduce by construction to the inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no information on free parameters, background axioms, or new entities introduced by the paper.


[1] [1]

Aakanksha, Arash Ahmadian, Seraphina Goldfarb-Tarrant, Beyza Ermis, Marzieh Fadaee, and Sara Hooker. 2024. https://doi.org/10.48550/arXiv.2410.10801 Mix Data or Merge Models ? Optimizing for Diverse Multi - Task Learning . arXiv preprint. ArXiv:2410.10801 [cs]

work page doi:10.48550/arxiv.2410.10801 2024

[2] [2]

Divyanshu Aggarwal, Sankarshan Damle, Navin Goyal, Satya Lokam, and Sunayana Sitaram. 2024. https://openreview.net/forum?id=lUl3Iz4k64 Towards exploring continual fine-tuning for enhancing language ability in large language model . In NeurIPS 2024 Workshop on Scalable Continual Learning for Lifelong Foundation Models

2024

[3] [3]

Samuel Ainsworth, Jonathan Hayase, and Siddhartha Srinivasa. 2023. https://openreview.net/forum?id=CQsmMYmlP5T Git re-basin: Merging models modulo permutation symmetries . In The Eleventh International Conference on Learning Representations

2023

[4] [4]

Nikolay Arefyev, Mikko Aulamo, Marta Ba \ n \'o n, Laurie Burchell, Pinzhen Chen, Mariia Fedorova, Ona de Gibert, Liane Guillou, Barry Haddow, Jan Haji c , Jind r ich Helcl, Erik Henriksson, Andrey Kutuzov, Veronika Laippala, Bhavitvya Malik, Farrokh Mehryary, Vladislav Mikhailov, Amanda Myntti, Dayy \'a n O ' Brien, and 8 others. 2025. https://aclantholo...

2025

[5] [5]

Lucas Bandarkar, Davis Liang, Benjamin Muller, Mikel Artetxe, Satya Narayan Shukla, Donald Husa, Naman Goyal, Abhinandan Krishnan, Luke Zettlemoyer, and Madian Khabsa. 2024. https://doi.org/10.18653/v1/2024.acl-long.44 The Belebele Benchmark : a Parallel Reading Comprehension Dataset in 122 Language Variants . In Proceedings of the 62nd Annual Meeting of ...

work page doi:10.18653/v1/2024.acl-long.44 2024

[6] [6]

Lucas Bandarkar and Nanyun Peng. 2025. https://doi.org/10.18653/v1/2025.mrl-main.10 The Unreasonable Effectiveness of Model Merging for Cross - Lingual Transfer in LLMs . In Proceedings of the 5th Workshop on Multilingual Representation Learning ( MRL 2025) , pages 131--148, Suzhuo, China. Association for Computational Linguistics

work page doi:10.18653/v1/2025.mrl-main.10 2025

[7] [7]

Laurie Burchell, Ona de Gibert, Nikolay Arefyev, Mikko Aulamo, Marta Bañón, Pinzhen Chen, Mariia Fedorova, Liane Guillou, Barry Haddow, Jan Hajič, Jindřich Helcl, Erik Henriksson, Mateusz Klimaszewski, Ville Komulainen, Andrey Kutuzov, Joona Kytöniemi, Veronika Laippala, Petter Mæhlum, Bhavitvya Malik, and 16 others. 2025. https://doi.org/10.18653/v1/2025...

work page doi:10.18653/v1/2025.acl-long.854 2025

[8] [8]

Chang, Catherine Arnett, Zhuowen Tu, and Benjamin K

Tyler A. Chang, Catherine Arnett, Zhuowen Tu, and Benjamin K. Bergen. 2024. https://doi.org/10.18653/v1/2024.emnlp-main.236 When Is Multilinguality a Curse ? Language Modeling for 250 High - and Low - Resource Languages . In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing , pages 4074--4096, Miami, Florida, USA. Asso...

work page doi:10.18653/v1/2024.emnlp-main.236 2024

[9]

Tyler A. Chang, Catherine Arnett, Zhuowen Tu, and Benjamin K. Bergen. 2026. https://doi.org/10.48550/arXiv.2408.10441 Goldfish: Monolingual Language Models for 350 Languages. arXiv preprint. ArXiv:2408.10441 [cs]

[10]

Shilian Chen, Jie Zhou, Qin Chen, Wen Wu, Xin Li, Qi Feng, and Liang He. 2026. https://doi.org/10.48550/arXiv.2604.01674 Can Heterogeneous Language Models Be Fused? arXiv preprint. ArXiv:2604.01674 [cs]

[11]

Alexandra Chronopoulou, Jonas Pfeiffer, Joshua Maynez, Xinyi Wang, Sebastian Ruder, and Priyanka Agrawal. 2024. https://doi.org/10.18653/v1/2024.mrl-1.7 Language and Task Arithmetic with Parameter-Efficient Layers for Zero-Shot Summarization. In Proceedings of the Fourth Workshop on Multilingual Representation Learning (MRL 2024), pages 114–126, M...

[12]

Team Cohere, Aakanksha, Arash Ahmadian, Marwan Ahmed, Jay Alammar, Milad Alizadeh, Yazeed Alnumay, Sophia Althammer, Arkady Arkhangorodsky, Viraat Aryabumi, Dennis Aumiller, Raphaël Avalos, Zahara Aviv, Sammie Bae, Saurabh Baji, Alexandre Barbet, Max Bartolo, Björn Bebensee, Neeral Beladia, and 210 others. 2025. https://doi.org/10.48550/arXiv.2504.00698 C...

[13]

Marta R. Costa-jussà, James Cross, Onur Çelebi, Maha Elbayad, Kenneth Heafield, Kevin Heffernan, Elahe Kalbassi, Janice Lam, Daniel Licht, Jean Maillard, Anna Sun, Skyler Wang, Guillaume Wenzek, Al Youngblood, Bapi Akula, Loic Barrault, Gabriel Mejia Gonzalez, Prangthip Hansanti, John Hoffman, and 20 others. 2024. https://doi.org/10.1038/s41586-024-07335-...

[14]

Ona de Gibert, Graeme Nail, Nikolay Arefyev, Marta Bañón, Jelmer van der Linde, Shaoxiong Ji, Jaume Zaragoza-Bernabeu, Mikko Aulamo, Gema Ramírez-Sánchez, Andrey Kutuzov, Sampo Pyysalo, Stephan Oepen, and Jörg Tiedemann. 2024. https://aclanthology.org/2024.lrec-main.100/ A New Massive Multilingual Dataset for High-Performance Language Technologies. In ...

[15]

Mariia Fedorova, Nikolay Arefyev, Maja Buljan, Jindřich Helcl, Stephan Oepen, Egil Rønningstad, and Yves Scherrer. 2026. https://doi.org/10.48550/arXiv.2602.13139 OpenLID-v3: Improving the Precision of Closely Related Language Identification – An Experience Report. arXiv preprint. ArXiv:2602.13139 [cs] version: 2

[16]

Negar Foroutan, Paul Teiletche, Ayush Kumar Tarun, and Antoine Bosselut. 2025. https://doi.org/10.48550/arXiv.2510.25947 Revisiting Multilingual Data Mixtures in Language Model Pretraining. arXiv preprint. ArXiv:2510.25947 [cs]

[17]

Baban Gain, Asif Ekbal, and Trilok Nath Singh. 2026. https://doi.org/10.48550/arXiv.2604.02881 One Model to Translate Them All? A Journey to Mount Doom for Multilingual Model Merging. arXiv preprint. ArXiv:2604.02881 [cs]

[18]

Kevin Glocker, Kätriin Kukk, Romina Oji, Marcel Bollmann, Marco Kuhlmann, and Jenny Kunz. 2025. https://doi.org/10.48550/arXiv.2512.10772 Grow Up and Merge: Scaling Strategies for Efficient Language Adaptation. arXiv preprint. ArXiv:2512.10772 [cs]

[19]

Charles Goddard, Shamane Siriwardhana, Malikeh Ehghaghi, Luke Meyers, Vladimir Karpukhin, Brian Benedict, Mark McQuade, and Jacob Solawetz. 2024. https://doi.org/10.18653/v1/2024.emnlp-industry.36 Arcee's MergeKit: A Toolkit for Merging Large Language Models. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Ind...

[20]

Gabriel Ilharco, Marco Tulio Ribeiro, Mitchell Wortsman, Ludwig Schmidt, Hannaneh Hajishirzi, and Ali Farhadi. 2023. https://openreview.net/forum?id=6t0Kwf8-jrj Editing models with task arithmetic. In The Eleventh International Conference on Learning Representations

[21]

Jaap Jumelet, Leonie Weissweiler, Joakim Nivre, and Arianna Bisazza. 2026. https://doi.org/10.1162/TACL.a.600 MultiBLiMP 1.0: A Massively Multilingual Benchmark of Linguistic Minimal Pairs. Transactions of the Association for Computational Linguistics, 14:193–216

[22]

Max Klabunde, Tobias Schumacher, Markus Strohmaier, and Florian Lemmerich. 2025. https://doi.org/10.1145/3728458 Similarity of Neural Network Models: A Survey of Functional and Representational Measures. ACM Comput. Surv., 57(9):242:1–242:52

[23]

Simon Kornblith, Mohammad Norouzi, Honglak Lee, and Geoffrey Hinton. 2019. https://proceedings.mlr.press/v97/kornblith19a.html Similarity of Neural Network Representations Revisited. In Proceedings of the 36th International Conference on Machine Learning, pages 3519–3529. PMLR

[24]

Viet Lai, Chien Nguyen, Nghia Ngo, Thuat Nguyen, Franck Dernoncourt, Ryan Rossi, and Thien Nguyen. 2023. https://doi.org/10.18653/v1/2023.emnlp-demo.28 Okapi: Instruction-tuned Large Language Models in Multiple Languages with Reinforcement Learning from Human Feedback. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Proces...

[25]

Guillaume Lample, Alexis Conneau, Marc'Aurelio Ranzato, Ludovic Denoyer, and Hervé Jégou. 2018. https://openreview.net/forum?id=H196sainb Word translation without parallel data. In International Conference on Learning Representations

[26]

Margaret Li, Suchin Gururangan, Tim Dettmers, Mike Lewis, Tim Althoff, Noah A. Smith, and Luke Zettlemoyer. 2022. https://openreview.net/forum?id=SQgVgE2Sq4 Branch-train-merge: Embarrassingly parallel training of expert language models. In First Workshop on Interpolation Regularizers and Beyond at NeurIPS 2022

[27]

Bill Yuchen Lin, Seyeon Lee, Xiaoyang Qiao, and Xiang Ren. 2021. https://doi.org/10.18653/v1/2021.acl-long.102 Common Sense Beyond English: Evaluating and Improving Multilingual Language Models for Commonsense Reasoning. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference...

[28]

Patrick Littell, David R. Mortensen, Ke Lin, Katherine Kairis, Carlisle Turner, and Lori Levin. 2017. https://aclanthology.org/E17-2002/ URIEL and lang2vec: Representing languages as typological, geographical, and phylogenetic vectors. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume ...

[29]

Shayne Longpre, Sneha Kudugunta, Niklas Muennighoff, I-Hung Hsu, Isaac Rayburn Caswell, Alex Pentland, Sercan O Arik, Chen-Yu Lee, and Sayna Ebrahimi. 2026. https://openreview.net/forum?id=0BkvUY61MX ATLAS: Adaptive transfer scaling laws for multilingual pretraining, finetuning, and decoding the curse of multilinguality. In The Fourteenth International ...

[30]

Pedro Henrique Martins, Patrick Fernandes, João Alves, Nuno M. Guerreiro, Ricardo Rei, Duarte M. Alves, José Pombal, Amin Farajian, Manuel Faysse, Mateusz Klimaszewski, Pierre Colombo, Barry Haddow, José G. C. de Souza, Alexandra Birch, and André F. T. Martins. 2024. https://doi.org/10.48550/arXiv.2409.16235 EuroLLM: Multilingual Language Models for Euro...

[31]

Natalia Moskvina, Raquel Montero, Masaya Yoshida, Ferdy Hubers, Paolo Morosi, Walid Irhaymi, Jin Yan, Tamara Serrano, Elena Pagliarini, Fritz Günther, and Evelina Leivada. 2026. https://doi.org/10.48550/arXiv.2602.20065 Multilingual Large Language Models do not comprehend all natural languages to equal degrees. arXiv preprint. ArXiv:2602.20065 [cs]

[32]

OpenEuroLLM. 2025. https://openeurollm.eu/blog/hplt-oellm-38-reference-models Release of 38 Monolingual 2.15B LLMs Trained on HPLT v2

[33]

Marinela Parović, Ivan Vulić, and Anna Korhonen. 2024. https://doi.org/10.18653/v1/2024.eacl-short.12 Investigating the Potential of Task Arithmetic for Cross-Lingual Transfer. In Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 2: Short Papers), pages 124–137, St. Julian's, Malta. ...

[34]

Maja Popović. 2017. https://doi.org/10.18653/v1/W17-4770 chrF++: words helping character n-grams. In Proceedings of the Second Conference on Machine Translation, pages 612–618, Copenhagen, Denmark. Association for Computational Linguistics

[35]

Morgane Riviere, Shreya Pathak, Pier Giuseppe Sessa, Cassidy Hardin, Surya Bhupatiraju, Léonard Hussenot, Thomas Mesnard, Bobak Shahriari, Alexandre Ramé, Johan Ferret, Peter Liu, Pouya Tafti, Abe Friesen, Michelle Casbon, Sabela Ramos, Ravin Kumar, Charline Le Lan, Sammy Jerome, Anton Tsitsulin, and 178 others. 2024. https://doi.org/10.48550/arXiv.2408.0...

[36]

Alejandro R. Salamanca, Diana Abagyan, Daniel D'souza, Ammar Khairi, David Mora, Saurabh Dash, Viraat Aryabumi, Sara Rajaee, Mehrnaz Mofakhami, Ananya Sahu, Thomas Euyang, Brittawnya Prince, Madeline Smith, Hangyu Lin, Acyr Locatelli, Sara Hooker, Tom Kocmi, Aidan Gomez, Ivan Zhang, and 7 others. 2026. https://doi.org/10.48550/arXiv.2603.11510 Tiny Aya: ...

[37]

Nour Shaheen, Sarath Chandar, Boris Knyazev, and Ekaterina Lobacheva. 2026. https://openreview.net/forum?id=FGbtxnaWk4 Is depth heterogeneity a barrier to model merging? In Third Workshop on Test-Time Updates (Main Track)

[38]

Chen Shani, Yuval Reif, Nathan Roll, Dan Jurafsky, and Ekaterina Shutova. 2026. https://doi.org/10.48550/arXiv.2601.07220 The Roots of Performance Disparity in Multilingual Language Models: Intrinsic Modeling Difficulty or Design Choices? arXiv preprint. ArXiv:2601.07220 [cs] version: 2

[39]

Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. 2020. https://doi.org/10.48550/arXiv.1909.08053 Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism. arXiv preprint. ArXiv:1909.08053 [cs]

[40]

Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, and 49 others. 2023. https://doi.org/10.48550/arXiv.2307.09288 Llama 2...

[41]

Rob van der Goot, Esther Ploeger, Verena Blaschke, and Tanja Samardzic. 2025. https://doi.org/10.18653/v1/2025.emnlp-demos.23 DistaLs: a comprehensive collection of language distance measures. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 307–318, Suzhou, China. Association for...

[42]

Mitchell Wortsman, Gabriel Ilharco, Samir Ya Gadre, Rebecca Roelofs, Raphael Gontijo-Lopes, Ari S. Morcos, Hongseok Namkoong, Ali Farhadi, Yair Carmon, Simon Kornblith, and Ludwig Schmidt. 2022. https://proceedings.mlr.press/v162/wortsman22a.html Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference ti...

[43]

Prateek Yadav, Derek Tam, Leshem Choshen, Colin A. Raffel, and Mohit Bansal. 2023. https://proceedings.neurips.cc/paper_files/paper/2023/hash/1644c9af28ab7916874f6fd6228a9bcf-Abstract-Conference.html TIES-Merging: Resolving Interference When Merging Models. Advances in Neural Information Processing Systems, 36:7093–7115

[44]

Enneng Yang, Li Shen, Guibing Guo, Xingwei Wang, Xiaochun Cao, Jie Zhang, and Dacheng Tao. 2024. https://doi.org/10.48550/arXiv.2408.07666 Model Merging in LLMs, MLLMs, and Beyond: Methods, Theories, Applications and Opportunities. arXiv preprint. ArXiv:2408.07666 [cs]

[45]

Jinluan Yang, Dingnan Jin, Anke Tang, Li Shen, Didi Zhu, Zhengyu Chen, Ziyu Zhao, Daixin Wang, Qing Cui, Zhiqiang Zhang, Jun Zhou, Fei Wu, and Kun Kuang. 2025. https://doi.org/10.48550/arXiv.2502.06876 Mix Data or Merge Models? Balancing the Helpfulness, Honesty, and Harmlessness of Large Language Model via Model Merging. arXiv preprint. ArXiv:2502.06876 [cs]

[46]

Le Yu, Bowen Yu, Haiyang Yu, Fei Huang, and Yongbin Li. 2024. https://openreview.net/forum?id=fq0NaiU8Ex Language models are super mario: Absorbing abilities from homologous models as a free lunch. In Forty-first International Conference on Machine Learning

[47]

Siqi Zeng, Yifei He, Weiqiu You, Yifan Hao, Yao-Hung Hubert Tsai, Makoto Yamada, and Han Zhao. 2025. https://doi.org/10.48550/arXiv.2502.01015 Efficient Model Editing with Task Vector Bases: A Theoretical Framework and Scalable Approach. arXiv preprint. ArXiv:2502.01015 [cs]
