Model Merging: Foundations and Algorithms
Pith reviewed 2026-05-09 14:43 UTC · model grok-4.3
The pith
Independently trained neural networks can be merged directly in weight space to compose capabilities with little optimization or data access.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Model merging succeeds as a data-free, low-optimization route to capability composition when single-task networks are aligned through cycle-consistent Frank-Wolfe optimization, and when multi-task networks are handled by viewing their task vectors as approximate gradients that admit a low-rank singular-vector decomposition for compression and interference reduction.
What carries the argument
Cycle-consistent merging (C²M³) together with Task Singular Vectors (TSV) that exploit the low-rank structure of task vectors viewed as approximate gradients.
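One way to unpack "cycle-consistent": the pairwise permutation maps between networks should compose. A hedged sketch of the constraint, using the standard synchronization factorization through a shared frame (whether C²M³ parametrizes it exactly this way is not shown in the excerpt):

```latex
P_{ij}\,P_{jk} = P_{ik} \quad \text{for all models } i,j,k,
\qquad \text{which holds automatically when} \qquad
P_{ij} = P_j^{\top} P_i ,
```

where each P_i permutes model i's neurons into a shared, reference-free frame; weight averaging is then performed in that frame rather than relative to any single model.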
If this is right
- Weight averaging produces a meaningful combined model once multiple networks are aligned into one shared, reference-free parameter space.
- Task vectors inherit the low-rank structure of gradients, so singular-vector decomposition compresses them and reduces interference in the merged result (see the sketch after this list).
- The geometry of the task singular vectors supplies an input-adaptive routing rule that selects relevant subspaces at inference time.
- Item Response Theory can cut the cost of evaluating candidate merges by up to 50 times while preserving solution quality.
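A minimal sketch of the low-rank view in the second bullet, assuming per-layer task vectors stored as 2-D weight deltas; the truncation rank k, the uniform sum, and the scaling alpha are illustrative choices, not the thesis's exact TSV-Merge procedure.

```python
import numpy as np

def low_rank_task_vector(delta, k):
    """Truncate a per-layer task vector (a weight delta) to its top-k
    singular directions, following the low-rank-gradient view."""
    U, s, Vt = np.linalg.svd(delta, full_matrices=False)
    return (U[:, :k] * s[:k]) @ Vt[:k, :]

def merge_task_vectors(theta_pre, deltas, k=8, alpha=1.0):
    """Compress each task vector to rank k (reducing cross-task
    interference) and add the scaled sum back onto the pretrained weights."""
    merged = sum(low_rank_task_vector(d, k) for d in deltas)
    return theta_pre + alpha * merged

# Toy usage: two synthetic low-rank "task vectors" on shared weights.
rng = np.random.default_rng(0)
theta_pre = rng.standard_normal((64, 64))
deltas = [0.01 * rng.standard_normal((64, 8)) @ rng.standard_normal((8, 64))
          for _ in range(2)]
theta_merged = merge_task_vectors(theta_pre, deltas, k=8)
```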
Where Pith is reading between the lines
- Treating models as composable weight-space objects could shift development from end-to-end retraining toward assembly of pre-existing components.
- The same low-rank view might extend to other modular settings such as combining vision and language adapters without joint fine-tuning.
- If the shared-structure premise holds across domains, libraries of merged checkpoints could become a practical alternative to storing every fine-tuned variant separately.
Load-bearing premise
Independently trained models share enough weight-space structure that they can be aligned or decomposed meaningfully without access to training data or further optimization.
What would settle it
Merging two models that share an objective but start from different initializations yields accuracy no higher than the better of the two originals on the shared task.
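A minimal sketch of how that test could be run, with `align` standing in for a C²M³-style alignment into a shared frame and `accuracy` a scalar evaluator; all signatures here are hypothetical.

```python
def merging_falsified(model_a, model_b, align, average, accuracy, test_set):
    """Operationalize the falsifier above: merge two same-task models trained
    from different initializations, then compare to the better original."""
    a_aligned, b_aligned = align(model_a, model_b)  # map both into a shared frame
    merged = average(a_aligned, b_aligned)          # plain weight averaging
    acc_merged = accuracy(merged, test_set)
    acc_best = max(accuracy(model_a, test_set), accuracy(model_b, test_set))
    # If this holds consistently across seeds and tasks, the claim is in trouble.
    return acc_merged <= acc_best
```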
read the original abstract
Modern deep learning usually treats models as separate artifacts: trained independently, specialized for particular purposes, and replaced when improved versions appear. This thesis studies model merging as an alternative paradigm: combining independently trained neural networks directly in weight space, with little or no optimization and without requiring access to the original training data. The thesis considers two main regimes. In the single-task setting, where models share an objective but differ in initialization, we introduce C²M³, a cycle-consistent merging algorithm based on Frank-Wolfe optimization. C²M³ aligns multiple networks into a shared, reference-free parameter space, making weight averaging meaningful without privileging any individual model. In the multi-task setting, where models are fine-tuned for different downstream tasks from a common pretrained initialization, we first develop a theoretical account of task vectors as approximate gradients. This explains both the effectiveness and the limitations of task arithmetic. Building on this view, we show that task vectors inherit the low-rank structure of gradients and introduce Task Singular Vectors (TSV), a decomposition that enables compression and interference reduction through TSV-Merge. We then present MASS, an input-adaptive routing method that uses TSV geometry to select task-relevant subspaces at inference time. Finally, we introduce MERGE³, an evolutionary merging framework that uses Item Response Theory to reduce evaluation costs by up to 50× while preserving solution quality. Together, these contributions provide theoretical and algorithmic foundations for model merging, supporting a paradigm in which learned capabilities can be composed, reused, and extended across models.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript studies model merging as an alternative to training separate models. It proposes: C²M³, a cycle-consistent Frank-Wolfe merging algorithm for the single-task setting (models with a shared objective but different initializations); a theoretical account of task vectors as approximate gradients in the multi-task regime (fine-tuning from a common pretrained model); Task Singular Vectors (TSV), which exploit the inherited low-rank structure of gradients for compression and interference reduction via TSV-Merge; MASS, an input-adaptive router based on TSV geometry; and MERGE³, an evolutionary merging framework that uses Item Response Theory to cut evaluation costs by up to 50×.
Significance. If the gradient approximation for task vectors holds with sufficient accuracy and the algorithms are validated, the work could establish practical foundations for composing and reusing capabilities across models without data access or further optimization, advancing efficient adaptation paradigms in deep learning.
major comments (1)
- [multi-task setting / theoretical account of task vectors] Multi-task theoretical account: the claim that task vectors are meaningfully approximable as gradients (enabling TSV decomposition and justifying TSV-Merge/MASS interference reduction) lacks an explicit error bound or validity regime (e.g., relative to step size and curvature). This approximation is load-bearing for the central multi-task contributions, as the low-rank structure could reflect noise rather than signal.
minor comments (2)
- [Abstract] Abstract: the 50× evaluation-cost reduction for MERGE³ is stated without specifying the baseline method, task suite, or conditions under which it holds (see the IRT sketch after these comments).
- [Abstract / multi-task contributions] Notation and flow: Task Singular Vectors (TSV) and related terms are introduced without a concise upfront definition or reference to the gradient approximation that motivates them.
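For readers unfamiliar with the machinery behind the 50× figure, a minimal two-parameter-logistic IRT sketch; the item parameters and the subset-selection idea in the comment are illustrative, not MERGE³'s exact procedure.

```python
import numpy as np

def irt_2pl(theta, a, b):
    """Two-parameter-logistic IRT: probability that a model with ability
    theta answers an item with discrimination a and difficulty b correctly."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

# Once item parameters (a, b) are calibrated on a few anchor models, a
# candidate merge's ability theta can be estimated from a small, maximally
# informative subset of items rather than the full benchmark suite, which
# is where an evaluation-cost reduction of the claimed kind would arise.
example_p = irt_2pl(theta=0.5, a=np.array([1.2, 0.8]), b=np.array([0.0, 1.0]))
```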
Simulated Author's Rebuttal
We thank the referee for their constructive comments on our work. We address the major comment on the multi-task theoretical account below.
read point-by-point responses
-
Referee: Multi-task theoretical account: the claim that task vectors are meaningfully approximable as gradients (enabling TSV decomposition and justifying TSV-Merge/MASS interference reduction) lacks an explicit error bound or validity regime (e.g., relative to step size and curvature). This approximation is load-bearing for the central multi-task contributions, as the low-rank structure could reflect noise rather than signal.
Authors: We appreciate the referee's observation that our theoretical account of task vectors as approximate gradients would benefit from a more explicit validity regime. In Section 3, we derive this approximation by viewing fine-tuning as a gradient descent step on the downstream loss starting from the pretrained model, showing that the task vector satisfies τ ≈ −η ∇_θ L_task(θ_pre) for small step sizes η, i.e., it is proportional to the negative gradient at the pretrained weights. The error term arises from higher-order curvature effects and is bounded by O(η² ||H||) under Lipschitz Hessian assumptions, though we acknowledge that a complete derivation of this bound was not included. We will revise the manuscript to explicitly state the validity regime (small η relative to the inverse curvature) and add a remark that the low-rank structure is not noise: our experiments demonstrate that TSV-Merge outperforms standard merging and that the top singular vectors correlate with task performance improvements. This addresses the concern that the structure could be spurious.
Revision: yes
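For concreteness, the one-step argument the rebuttal gestures at, written out as a hedged reconstruction (η_eff and the constants in the bound are illustrative):

```latex
\theta_{\mathrm{ft}}
  = \theta_{\mathrm{pre}} - \eta\,\nabla_\theta L_{\mathrm{task}}(\theta_{\mathrm{pre}})
\;\Longrightarrow\;
\tau := \theta_{\mathrm{ft}} - \theta_{\mathrm{pre}}
  = -\eta\,\nabla_\theta L_{\mathrm{task}}(\theta_{\mathrm{pre}});
\qquad
\tau = -\eta_{\mathrm{eff}}\,\nabla_\theta L_{\mathrm{task}}(\theta_{\mathrm{pre}})
  + O\!\left(\eta^{2}\,\lVert H \rVert\right)
  \ \text{over several small steps.}
```

Here H is the Hessian of L_task at θ_pre; the approximation is trustworthy only when η is small relative to the inverse curvature, matching the validity regime the rebuttal promises to state explicitly.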
Circularity Check
No significant circularity; derivations introduce independent theoretical accounts and algorithms.
full rationale
The abstract and context describe new contributions including C²M³ for cycle-consistent merging, a theoretical account of task vectors as approximate gradients, the TSV decomposition, TSV-Merge, MASS routing, and the MERGE³ framework. No quoted equations or steps in the provided text reduce any claimed prediction or first-principles result to its own inputs by construction, to fitted parameters renamed as outputs, or to load-bearing self-citations. The gradient-based view and low-rank inheritance are presented as explanatory developments rather than tautological redefinitions; circular elements, if any, are minor. The chain remains self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
invented entities (1)
- Task Singular Vectors (TSV): no independent evidence