pith. machine review for the scientific record.

arxiv: 2605.01580 · v1 · submitted 2026-05-02 · 💻 cs.LG · cs.AI

Recognition: unknown

Model Merging: Foundations and Algorithms

Authors on Pith: no claims yet

Pith reviewed 2026-05-09 14:43 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords model merging · task vectors · task arithmetic · weight space alignment · neural network composition · multi-task learning · parameter space merging

The pith

Independently trained neural networks can be merged directly in weight space to compose capabilities with little optimization or data access.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The thesis treats model merging as an alternative paradigm to training separate networks for each purpose, instead combining them in parameter space. For models sharing an objective but differing in initialization it introduces cycle-consistent alignment that makes averaging reference-free. For models fine-tuned on distinct tasks from a shared base it interprets task vectors as approximate gradients, decomposes them via low-rank singular vectors to cut interference, and adds adaptive routing plus low-cost evolutionary search. A reader cares because the approach supports reusing and extending learned behaviors across models rather than retraining from scratch each time a new requirement appears.
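As a point of reference for the multi-task regime the thesis builds on, below is a minimal sketch of plain task arithmetic on toy weight dictionaries: a task vector is the parameter-space difference between a fine-tuned model and its pretrained base, and a merged model adds scaled task vectors back to the base. The helper names, the uniform scaling coefficient, and the toy shapes are illustrative assumptions, not the thesis's algorithms.

```python
# Minimal sketch (not the thesis's algorithms): plain task arithmetic on toy
# weight dictionaries. A task vector is the difference between fine-tuned and
# pretrained weights; merging adds scaled task vectors back onto the base.
import numpy as np

def task_vector(finetuned: dict, base: dict) -> dict:
    """tau_t = theta_finetuned - theta_base, per parameter tensor."""
    return {k: finetuned[k] - base[k] for k in base}

def merge_by_task_arithmetic(base: dict, task_vectors: list, lam: float = 0.3) -> dict:
    """theta_merged = theta_base + lam * sum_t tau_t (uniform scaling for simplicity)."""
    merged = {k: v.copy() for k, v in base.items()}
    for tau in task_vectors:
        for k in merged:
            merged[k] += lam * tau[k]
    return merged

# Toy example: one 4x4 weight matrix and two "fine-tuned" variants of it.
rng = np.random.default_rng(0)
base = {"layer.weight": rng.normal(size=(4, 4))}
ft_a = {"layer.weight": base["layer.weight"] + 0.1 * rng.normal(size=(4, 4))}
ft_b = {"layer.weight": base["layer.weight"] + 0.1 * rng.normal(size=(4, 4))}
taus = [task_vector(ft_a, base), task_vector(ft_b, base)]
merged = merge_by_task_arithmetic(base, taus)
print(merged["layer.weight"].shape)  # (4, 4)
```

The thesis's contributions sit on top of this recipe: they ask when the simple sum fails (interference) and how alignment, low-rank decomposition, and routing repair it.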

Core claim

Model merging succeeds as a data-free, low-optimization route to capability composition when single-task networks are aligned through cycle-consistent Frank-Wolfe optimization and multi-task networks are handled by viewing their task vectors as approximate gradients that admit low-rank singular-vector decomposition for compression and interference reduction.

What carries the argument

Cycle-consistent merging (C²M³) together with Task Singular Vectors (TSV) that exploit the low-rank structure of task vectors viewed as approximate gradients.
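To make the single-task side concrete, here is a deliberately simplified sketch of why alignment precedes averaging: one two-layer MLP is aligned to another with a single permutation of hidden units (Hungarian matching on first-layer weight similarity) before the weights are averaged. C²M³ as described is cycle-consistent and reference-free across many models and uses Frank-Wolfe; this toy instead privileges one model as the reference, handles only two networks, and its similarity criterion and names are assumptions for illustration.

```python
# Simplified illustration only: permutation-align one 2-layer MLP to another,
# then average. The thesis's C^2M^3 aligns many models cycle-consistently in a
# reference-free space via Frank-Wolfe; this sketch just shows why element-wise
# averaging without any alignment mixes unrelated hidden units.
import numpy as np
from scipy.optimize import linear_sum_assignment

def align_hidden_units(model_a: dict, model_b: dict) -> dict:
    """Permute model_b's hidden units to best match model_a's, using Hungarian
    matching on the similarity of first-layer weight rows."""
    sim = model_a["W1"] @ model_b["W1"].T        # (h, h) similarity between hidden units
    _, perm = linear_sum_assignment(-sim)        # maximize total matched similarity
    return {
        "W1": model_b["W1"][perm],               # reorder rows of layer 1
        "b1": model_b["b1"][perm],
        "W2": model_b["W2"][:, perm],            # reorder the matching columns of layer 2
        "b2": model_b["b2"],
    }

def average(model_a: dict, model_b: dict) -> dict:
    return {k: 0.5 * (model_a[k] + model_b[k]) for k in model_a}

rng = np.random.default_rng(1)
d, h, o = 8, 16, 3
def make_mlp() -> dict:
    return {"W1": rng.normal(size=(h, d)), "b1": rng.normal(size=h),
            "W2": rng.normal(size=(o, h)), "b2": rng.normal(size=o)}

A, B = make_mlp(), make_mlp()
merged = average(A, align_hidden_units(A, B))
print(merged["W1"].shape)  # (16, 8)
```

Permuting a layer's hidden units together with the matching columns of the next layer is an exact symmetry of the network's function, which is what makes averaging in the aligned coordinates meaningful.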

If this is right

  • Weight averaging produces a meaningful combined model once multiple networks are aligned into one shared, reference-free parameter space.
  • Task vectors inherit the low-rank structure of gradients, so singular-vector decomposition compresses them and reduces interference in the merged result.
  • Geometry of the task singular vectors supplies an input-adaptive routing rule that selects relevant subspaces at inference time (a toy sketch of the decomposition and this routing follows the list).
  • Item Response Theory can cut the cost of evaluating candidate merges by up to 50 times while preserving solution quality.
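As referenced in the list above, here is a toy rendering of the low-rank and routing bullets, under stated assumptions: each layer's task vector is treated as a matrix and truncated to a few singular directions, and an input is routed to the task whose retained subspace captures the most of its feature energy. The ranks, the projection-norm routing score, and the synthetic data are illustrative choices; TSV-Merge and MASS as described in the abstract are more involved.

```python
# Toy sketch (assumptions flagged in the text): truncated SVD of per-layer task
# vectors for compression, plus a very simple input-adaptive routing rule based
# on projection energy in each task's retained right-singular subspace.
import numpy as np

def truncated_task_vector(tau: np.ndarray, rank: int):
    """Keep only the top-`rank` singular directions of a layer's task vector."""
    U, S, Vt = np.linalg.svd(tau, full_matrices=False)
    return U[:, :rank], S[:rank], Vt[:rank]

def reconstruct(U: np.ndarray, S: np.ndarray, Vt: np.ndarray) -> np.ndarray:
    return (U * S) @ Vt                           # low-rank surrogate of the task vector

def route(feature: np.ndarray, task_subspaces: list) -> int:
    """Select the task whose retained subspace captures the most feature energy."""
    scores = [np.linalg.norm(Vt @ feature) for Vt in task_subspaces]
    return int(np.argmax(scores))

rng = np.random.default_rng(2)
d = 32
# Two synthetic low-rank-plus-noise task vectors for one d x d layer.
taus = [rng.normal(size=(d, 2)) @ rng.normal(size=(2, d)) + 0.01 * rng.normal(size=(d, d))
        for _ in range(2)]
factors = [truncated_task_vector(t, rank=2) for t in taus]
compressed = [reconstruct(*f) for f in factors]   # what a low-rank merge would combine
x = rng.normal(size=d)
print(route(x, [f[2] for f in factors]))          # index of the task selected for x
```

If task vectors really inherit near-gradient low-rank structure, the truncation discards mostly noise, and overlap between different tasks' retained subspaces is one way to quantify the interference a merge must manage.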

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Treating models as composable weight-space objects could shift development from end-to-end retraining toward assembly of pre-existing components.
  • The same low-rank view might extend to other modular settings such as combining vision and language adapters without joint fine-tuning.
  • If the shared-structure premise holds across domains, libraries of merged checkpoints could become a practical alternative to storing every fine-tuned variant separately.

Load-bearing premise

Independently trained models share enough weight-space structure that they can be aligned or decomposed meaningfully without access to training data or further optimization.

What would settle it

If merging two models that share an objective but start from different initializations yields accuracy no higher than the better of the two originals on the shared task, the core claim fails.

read the original abstract

Modern deep learning usually treats models as separate artifacts: trained independently, specialized for particular purposes, and replaced when improved versions appear. This thesis studies model merging as an alternative paradigm: combining independently trained neural networks directly in weight space, with little or no optimization and without requiring access to the original training data. The thesis considers two main regimes. In the single-task setting, where models share an objective but differ in initialization, we introduce C²M³, a cycle-consistent merging algorithm based on Frank-Wolfe optimization. C²M³ aligns multiple networks into a shared, reference-free parameter space, making weight averaging meaningful without privileging any individual model. In the multi-task setting, where models are fine-tuned for different downstream tasks from a common pretrained initialization, we first develop a theoretical account of task vectors as approximate gradients. This explains both the effectiveness and the limitations of task arithmetic. Building on this view, we show that task vectors inherit the low-rank structure of gradients and introduce Task Singular Vectors (TSV), a decomposition that enables compression and interference reduction through TSV-Merge. We then present MASS, an input-adaptive routing method that uses TSV geometry to select task-relevant subspaces at inference time. Finally, we introduce MERGE³, an evolutionary merging framework that uses Item Response Theory to reduce evaluation costs by up to 50× while preserving solution quality. Together, these contributions provide theoretical and algorithmic foundations for model merging, supporting a paradigm in which learned capabilities can be composed, reused, and extended across models.

Editorial analysis

A structured set of objections, weighed in public.

A referee report, a simulated author's rebuttal, a circularity audit, and an axiom and free-parameter ledger. Tearing a paper down is the easy half of reading it: the pith above is the substance; this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript studies model merging as an alternative to training separate models. For the single-task regime it proposes C²M³, a cycle-consistent Frank-Wolfe method that aligns models sharing an objective but differing in initialization. For the multi-task regime (fine-tuning from a common pretrained model), it develops a theoretical account of task vectors as approximate gradients, introduces Task Singular Vectors (TSV) to exploit the inherited low-rank structure for compression and interference reduction via TSV-Merge, adds the input-adaptive MASS router based on TSV geometry, and presents MERGE³, an evolutionary merging framework that uses Item Response Theory to cut evaluation costs by up to 50×.

Significance. If the gradient approximation for task vectors holds with sufficient accuracy and the algorithms are validated, the work could establish practical foundations for composing and reusing capabilities across models without data access or further optimization, advancing efficient adaptation paradigms in deep learning.

major comments (1)
  1. [multi-task setting / theoretical account of task vectors] Multi-task theoretical account: the claim that task vectors are meaningfully approximable as gradients (enabling TSV decomposition and justifying TSV-Merge/MASS interference reduction) lacks an explicit error bound or validity regime (e.g., relative to step size and curvature). This approximation is load-bearing for the central multi-task contributions, as the low-rank structure could reflect noise rather than signal.
minor comments (2)
  1. [Abstract] Abstract: the 50× evaluation-cost reduction for MERGE³ is stated without specifying the baseline method, task suite, or conditions under which it holds.
  2. [Abstract / multi-task contributions] Notation and flow: Task Singular Vectors (TSV) and related terms are introduced without a concise upfront definition or reference to the gradient approximation that motivates them.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive comments on our work. We address the major comment on the multi-task theoretical account below.

read point-by-point responses
  1. Referee: Multi-task theoretical account: the claim that task vectors are meaningfully approximable as gradients (enabling TSV decomposition and justifying TSV-Merge/MASS interference reduction) lacks an explicit error bound or validity regime (e.g., relative to step size and curvature). This approximation is load-bearing for the central multi-task contributions, as the low-rank structure could reflect noise rather than signal.

    Authors: We appreciate the referee's observation that our theoretical account of task vectors as approximate gradients would benefit from a more explicit validity regime. In Section 3, we derive this approximation by viewing fine-tuning as gradient descent on the downstream loss starting from the pretrained model, showing that the task vector satisfies τ ≈ −η ∇_θ L_task(θ_pre) for small step sizes η, i.e., it is proportional to the negative gradient at the pretrained point. The error term arises from higher-order curvature effects and is bounded by O(η² ||H||) under a Lipschitz-Hessian assumption, though we acknowledge that a complete derivation of this bound was not included. We will revise the manuscript to state the validity regime explicitly (η small relative to the inverse curvature) and to add a remark that the observed low-rank structure is not noise: our experiments show that TSV-Merge outperforms standard merging and that the top singular vectors correlate with task-performance improvements. This addresses the concern that the structure could be spurious. Revision: yes.
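For concreteness, here is one hedged reconstruction of the kind of bound the rebuttal gestures at, for K steps of gradient descent with step size η on a loss with bounded gradients and a Lipschitz Hessian; the constants and assumptions are editorial, not necessarily those of the thesis's Section 3.

```latex
% Hedged reconstruction, not the thesis's derivation: why a task vector looks like
% a scaled negative gradient at the pretrained point, with a curvature-controlled error.
\begin{aligned}
\theta_{k+1} &= \theta_k - \eta\,\nabla L(\theta_k), \qquad
\tau = \theta_K - \theta_0 = -\eta \sum_{k=0}^{K-1} \nabla L(\theta_k),\\
\nabla L(\theta_k) &= \nabla L(\theta_0) + H(\theta_0)\,(\theta_k - \theta_0)
  + O\!\big(\|\theta_k - \theta_0\|^2\big),\\
\|\theta_k - \theta_0\| &\le k\,\eta\,G \quad \text{(gradient norms bounded by } G \text{ along the path)},\\
\tau &= -K\eta\,\nabla L(\theta_0) + O\!\big(K^2 \eta^2\,\|H(\theta_0)\|\,G\big) + \text{higher-order terms}.
\end{aligned}
```

On this reading, τ is proportional to the negative gradient at the pretrained point up to an error that is second order in the step size and grows with curvature and path length, which is the validity regime the referee asks to see stated explicitly.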

Circularity Check

0 steps flagged

No significant circularity; derivations introduce independent theoretical accounts and algorithms.

full rationale

The abstract and context describe new contributions: C²M³ for cycle-consistent merging, a theoretical account of task vectors as approximate gradients, the TSV decomposition, TSV-Merge, MASS routing, and the MERGE³ framework. No quoted equation or step in the provided text reduces a claimed prediction or first-principles result to its own inputs, whether by construction, by fitted parameters renamed as outputs, or by load-bearing self-citation. The gradient-based view and the low-rank inheritance are presented as explanatory developments rather than tautological redefinitions, and the low circularity score is consistent with minor or absent circular elements. The argument chain is grounded in external benchmarks rather than in itself.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

The abstract alone provides no explicit free parameters, background axioms, or derivation details; new concepts such as Task Singular Vectors are introduced as part of the algorithmic contributions.

invented entities (1)
  • Task Singular Vectors (TSV) · no independent evidence
    purpose: Decomposition of task vectors to enable compression and reduce task interference
    Proposed based on claimed low-rank structure inherited from gradients; no independent evidence supplied in abstract.

pith-pipeline@v0.9.0 · 5576 in / 1036 out tokens · 64701 ms · 2026-05-09T14:43:48.208002+00:00 · methodology

