pith. machine review for the scientific record.

arxiv: 2605.01580 · v1 · submitted 2026-05-02 · 💻 cs.LG · cs.AI

Recognition: unknown

Model Merging: Foundations and Algorithms

Authors on Pith: no claims yet

Pith reviewed 2026-05-09 14:43 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords model merging · task vectors · task arithmetic · weight space alignment · neural network composition · multi-task learning · parameter space merging

The pith

Independently trained neural networks can be merged directly in weight space to compose capabilities with little optimization or data access.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The thesis treats model merging as an alternative paradigm to training separate networks for each purpose, instead combining them in parameter space. For models sharing an objective but differing in initialization it introduces cycle-consistent alignment that makes averaging reference-free. For models fine-tuned on distinct tasks from a shared base it interprets task vectors as approximate gradients, decomposes them via low-rank singular vectors to cut interference, and adds adaptive routing plus low-cost evolutionary search. A reader cares because the approach supports reusing and extending learned behaviors across models rather than retraining from scratch each time a new requirement appears.
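As a point of reference for the multi-task regime the thesis builds on, below is a minimal sketch of plain task arithmetic on toy weight dictionaries: a task vector is the parameter-space difference between a fine-tuned model and its pretrained base, and a merged model adds scaled task vectors back to the base. The helper names, the uniform scaling coefficient, and the toy shapes are illustrative assumptions, not the thesis's algorithms.

```python
# Minimal sketch (not the thesis's algorithms): plain task arithmetic on toy
# weight dictionaries. A task vector is the difference between fine-tuned and
# pretrained weights; merging adds scaled task vectors back onto the base.
import numpy as np

def task_vector(finetuned: dict, base: dict) -> dict:
    """tau_t = theta_finetuned - theta_base, per parameter tensor."""
    return {k: finetuned[k] - base[k] for k in base}

def merge_by_task_arithmetic(base: dict, task_vectors: list, lam: float = 0.3) -> dict:
    """theta_merged = theta_base + lam * sum_t tau_t (uniform scaling for simplicity)."""
    merged = {k: v.copy() for k, v in base.items()}
    for tau in task_vectors:
        for k in merged:
            merged[k] += lam * tau[k]
    return merged

# Toy example: one 4x4 weight matrix and two "fine-tuned" variants of it.
rng = np.random.default_rng(0)
base = {"layer.weight": rng.normal(size=(4, 4))}
ft_a = {"layer.weight": base["layer.weight"] + 0.1 * rng.normal(size=(4, 4))}
ft_b = {"layer.weight": base["layer.weight"] + 0.1 * rng.normal(size=(4, 4))}
taus = [task_vector(ft_a, base), task_vector(ft_b, base)]
merged = merge_by_task_arithmetic(base, taus)
print(merged["layer.weight"].shape)  # (4, 4)
```

The thesis's contributions sit on top of this recipe: they ask when the simple sum fails (interference) and how alignment, low-rank decomposition, and routing repair it.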

Core claim

Model merging succeeds as a data-free, low-optimization route to capability composition when single-task networks are aligned through cycle-consistent Frank-Wolfe optimization and multi-task networks are handled by viewing their task vectors as approximate gradients that admit low-rank singular-vector decomposition for compression and interference reduction.

What carries the argument

Cycle-consistent merging (C²M³) together with Task Singular Vectors (TSV) that exploit the low-rank structure of task vectors viewed as approximate gradients.
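To make the single-task side concrete, here is a deliberately simplified sketch of why alignment precedes averaging: one two-layer MLP is aligned to another with a single permutation of hidden units (Hungarian matching on first-layer weight similarity) before the weights are averaged. C²M³ as described is cycle-consistent and reference-free across many models and uses Frank-Wolfe; this toy instead privileges one model as the reference, handles only two networks, and its similarity criterion and names are assumptions for illustration.

```python
# Simplified illustration only: permutation-align one 2-layer MLP to another,
# then average. The thesis's C^2M^3 aligns many models cycle-consistently in a
# reference-free space via Frank-Wolfe; this sketch just shows why element-wise
# averaging without any alignment mixes unrelated hidden units.
import numpy as np
from scipy.optimize import linear_sum_assignment

def align_hidden_units(model_a: dict, model_b: dict) -> dict:
    """Permute model_b's hidden units to best match model_a's, using Hungarian
    matching on the similarity of first-layer weight rows."""
    sim = model_a["W1"] @ model_b["W1"].T        # (h, h) similarity between hidden units
    _, perm = linear_sum_assignment(-sim)        # maximize total matched similarity
    return {
        "W1": model_b["W1"][perm],               # reorder rows of layer 1
        "b1": model_b["b1"][perm],
        "W2": model_b["W2"][:, perm],            # reorder the matching columns of layer 2
        "b2": model_b["b2"],
    }

def average(model_a: dict, model_b: dict) -> dict:
    return {k: 0.5 * (model_a[k] + model_b[k]) for k in model_a}

rng = np.random.default_rng(1)
d, h, o = 8, 16, 3
def make_mlp() -> dict:
    return {"W1": rng.normal(size=(h, d)), "b1": rng.normal(size=h),
            "W2": rng.normal(size=(o, h)), "b2": rng.normal(size=o)}

A, B = make_mlp(), make_mlp()
merged = average(A, align_hidden_units(A, B))
print(merged["W1"].shape)  # (16, 8)
```

Permuting a layer's hidden units together with the matching columns of the next layer is an exact symmetry of the network's function, which is what makes averaging in the aligned coordinates meaningful.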

If this is right

  • Weight averaging produces a meaningful combined model once multiple networks are aligned into one shared, reference-free parameter space.
  • Task vectors inherit the low-rank structure of gradients, so singular-vector decomposition compresses them and reduces interference in the merged result.
  • Geometry of the task singular vectors supplies an input-adaptive routing rule that selects relevant subspaces at inference time (a toy sketch of the decomposition and this routing follows the list).
  • Item Response Theory can cut the cost of evaluating candidate merges by up to 50 times while preserving solution quality.
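As referenced in the list above, here is a toy rendering of the low-rank and routing bullets, under stated assumptions: each layer's task vector is treated as a matrix and truncated to a few singular directions, and an input is routed to the task whose retained subspace captures the most of its feature energy. The ranks, the projection-norm routing score, and the synthetic data are illustrative choices; TSV-Merge and MASS as described in the abstract are more involved.

```python
# Toy sketch (assumptions flagged in the text): truncated SVD of per-layer task
# vectors for compression, plus a very simple input-adaptive routing rule based
# on projection energy in each task's retained right-singular subspace.
import numpy as np

def truncated_task_vector(tau: np.ndarray, rank: int):
    """Keep only the top-`rank` singular directions of a layer's task vector."""
    U, S, Vt = np.linalg.svd(tau, full_matrices=False)
    return U[:, :rank], S[:rank], Vt[:rank]

def reconstruct(U: np.ndarray, S: np.ndarray, Vt: np.ndarray) -> np.ndarray:
    return (U * S) @ Vt                           # low-rank surrogate of the task vector

def route(feature: np.ndarray, task_subspaces: list) -> int:
    """Select the task whose retained subspace captures the most feature energy."""
    scores = [np.linalg.norm(Vt @ feature) for Vt in task_subspaces]
    return int(np.argmax(scores))

rng = np.random.default_rng(2)
d = 32
# Two synthetic low-rank-plus-noise task vectors for one d x d layer.
taus = [rng.normal(size=(d, 2)) @ rng.normal(size=(2, d)) + 0.01 * rng.normal(size=(d, d))
        for _ in range(2)]
factors = [truncated_task_vector(t, rank=2) for t in taus]
compressed = [reconstruct(*f) for f in factors]   # what a low-rank merge would combine
x = rng.normal(size=d)
print(route(x, [f[2] for f in factors]))          # index of the task selected for x
```

If task vectors really inherit near-gradient low-rank structure, the truncation discards mostly noise, and overlap between different tasks' retained subspaces is one way to quantify the interference a merge must manage.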

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Treating models as composable weight-space objects could shift development from end-to-end retraining toward assembly of pre-existing components.
  • The same low-rank view might extend to other modular settings such as combining vision and language adapters without joint fine-tuning.
  • If the shared-structure premise holds across domains, libraries of merged checkpoints could become a practical alternative to storing every fine-tuned variant separately.

Load-bearing premise

Independently trained models share enough weight-space structure that they can be aligned or decomposed meaningfully without access to training data or further optimization.

What would settle it

If merging two models that share an objective but start from different initializations yields accuracy no higher than the better of the two originals on the shared task, the core claim fails.

read the original abstract

Modern deep learning usually treats models as separate artifacts: trained independently, specialized for particular purposes, and replaced when improved versions appear. This thesis studies model merging as an alternative paradigm: combining independently trained neural networks directly in weight space, with little or no optimization and without requiring access to the original training data. The thesis considers two main regimes. In the single-task setting, where models share an objective but differ in initialization, we introduce C²M³, a cycle-consistent merging algorithm based on Frank-Wolfe optimization. C²M³ aligns multiple networks into a shared, reference-free parameter space, making weight averaging meaningful without privileging any individual model. In the multi-task setting, where models are fine-tuned for different downstream tasks from a common pretrained initialization, we first develop a theoretical account of task vectors as approximate gradients. This explains both the effectiveness and the limitations of task arithmetic. Building on this view, we show that task vectors inherit the low-rank structure of gradients and introduce Task Singular Vectors (TSV), a decomposition that enables compression and interference reduction through TSV-Merge. We then present MASS, an input-adaptive routing method that uses TSV geometry to select task-relevant subspaces at inference time. Finally, we introduce MERGE³, an evolutionary merging framework that uses Item Response Theory to reduce evaluation costs by up to 50× while preserving solution quality. Together, these contributions provide theoretical and algorithmic foundations for model merging, supporting a paradigm in which learned capabilities can be composed, reused, and extended across models.

Editorial analysis

A structured set of objections, weighed in public.

A referee report, a simulated author's rebuttal, a circularity audit, and an axiom and free-parameter ledger. Tearing a paper down is the easy half of reading it: the pith above is the substance; this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript studies model merging as an alternative to training separate models. For the single-task regime it proposes C²M³, a cycle-consistent Frank-Wolfe method that aligns models sharing an objective but differing in initialization. For the multi-task regime (fine-tuning from a common pretrained model), it develops a theoretical account of task vectors as approximate gradients, introduces Task Singular Vectors (TSV) to exploit the inherited low-rank structure for compression and interference reduction via TSV-Merge, adds the input-adaptive MASS router based on TSV geometry, and presents MERGE³, an evolutionary merging framework that uses Item Response Theory to cut evaluation costs by up to 50×.

Significance. If the gradient approximation for task vectors holds with sufficient accuracy and the algorithms are validated, the work could establish practical foundations for composing and reusing capabilities across models without data access or further optimization, advancing efficient adaptation paradigms in deep learning.

major comments (1)
  1. [multi-task setting / theoretical account of task vectors] Multi-task theoretical account: the claim that task vectors are meaningfully approximable as gradients (enabling TSV decomposition and justifying TSV-Merge/MASS interference reduction) lacks an explicit error bound or validity regime (e.g., relative to step size and curvature). This approximation is load-bearing for the central multi-task contributions, as the low-rank structure could reflect noise rather than signal.
minor comments (2)
  1. [Abstract] Abstract: the 50× evaluation-cost reduction for MERGE³ is stated without specifying the baseline method, task suite, or conditions under which it holds.
  2. [Abstract / multi-task contributions] Notation and flow: Task Singular Vectors (TSV) and related terms are introduced without a concise upfront definition or reference to the gradient approximation that motivates them.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive comments on our work. We address the major comment on the multi-task theoretical account below.

read point-by-point responses
  1. Referee: Multi-task theoretical account: the claim that task vectors are meaningfully approximable as gradients (enabling TSV decomposition and justifying TSV-Merge/MASS interference reduction) lacks an explicit error bound or validity regime (e.g., relative to step size and curvature). This approximation is load-bearing for the central multi-task contributions, as the low-rank structure could reflect noise rather than signal.

    Authors: We appreciate the referee's observation that our theoretical account of task vectors as approximate gradients would benefit from a more explicit validity regime. In Section 3, we derive this approximation by viewing fine-tuning as gradient descent on the downstream loss starting from the pretrained model, showing that the task vector satisfies τ ≈ −η ∇_θ L_task(θ_pre) for small step sizes η, i.e., it is proportional to the negative gradient at the pretrained point. The error term arises from higher-order curvature effects and is bounded by O(η² ||H||) under a Lipschitz-Hessian assumption, though we acknowledge that a complete derivation of this bound was not included. We will revise the manuscript to state the validity regime explicitly (η small relative to the inverse curvature) and to add a remark that the observed low-rank structure is not noise: our experiments show that TSV-Merge outperforms standard merging and that the top singular vectors correlate with task-performance improvements. This addresses the concern that the structure could be spurious. Revision: yes.
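For concreteness, here is one hedged reconstruction of the kind of bound the rebuttal gestures at, for K steps of gradient descent with step size η on a loss with bounded gradients and a Lipschitz Hessian; the constants and assumptions are editorial, not necessarily those of the thesis's Section 3.

```latex
% Hedged reconstruction, not the thesis's derivation: why a task vector looks like
% a scaled negative gradient at the pretrained point, with a curvature-controlled error.
\begin{aligned}
\theta_{k+1} &= \theta_k - \eta\,\nabla L(\theta_k), \qquad
\tau = \theta_K - \theta_0 = -\eta \sum_{k=0}^{K-1} \nabla L(\theta_k),\\
\nabla L(\theta_k) &= \nabla L(\theta_0) + H(\theta_0)\,(\theta_k - \theta_0)
  + O\!\big(\|\theta_k - \theta_0\|^2\big),\\
\|\theta_k - \theta_0\| &\le k\,\eta\,G \quad \text{(gradient norms bounded by } G \text{ along the path)},\\
\tau &= -K\eta\,\nabla L(\theta_0) + O\!\big(K^2 \eta^2\,\|H(\theta_0)\|\,G\big) + \text{higher-order terms}.
\end{aligned}
```

On this reading, τ is proportional to the negative gradient at the pretrained point up to an error that is second order in the step size and grows with curvature and path length, which is the validity regime the referee asks to see stated explicitly.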

Circularity Check

0 steps flagged

No significant circularity; derivations introduce independent theoretical accounts and algorithms.

full rationale

The abstract and context describe new contributions: C²M³ for cycle-consistent merging, a theoretical account of task vectors as approximate gradients, the TSV decomposition, TSV-Merge, MASS routing, and the MERGE³ framework. No quoted equation or step in the provided text reduces a claimed prediction or first-principles result to its own inputs, whether by construction, by fitted parameters renamed as outputs, or by load-bearing self-citation. The gradient-based view and the low-rank inheritance are presented as explanatory developments rather than tautological redefinitions, and the low circularity score is consistent with minor or absent circular elements. The argument chain is grounded in external benchmarks rather than in itself.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

The abstract alone provides no explicit free parameters, background axioms, or derivation details; new concepts such as Task Singular Vectors are introduced as part of the algorithmic contributions.

invented entities (1)
  • Task Singular Vectors (TSV) · no independent evidence
    purpose: Decomposition of task vectors to enable compression and reduce task interference
    Proposed based on claimed low-rank structure inherited from gradients; no independent evidence supplied in abstract.

pith-pipeline@v0.9.0 · 5576 in / 1036 out tokens · 64701 ms · 2026-05-09T14:43:48.208002+00:00 · methodology

