
arxiv: 2412.19098 · v4 · pith:W754LCWPnew · submitted 2024-12-26 · 💻 cs.LG

SyMerge: From Non-Interference to Synergistic Merging via Single-Layer Adaptation

Pith reviewed 2026-05-25 08:03 UTC · model grok-4.3

classification 💻 cs.LG
keywords model merging · task synergy · single-layer adaptation · multi-task learning · self-labeling objective · vision benchmarks · NLP benchmarks

The pith

Adapting only a single task-specific layer during merging induces task synergy that improves performance across tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that model merging can move beyond merely avoiding interference between tasks and instead create active synergy, where one task boosts another's performance. It treats cross-task compatibility between encoders and predictors as the signal of a successful merge. The proposed approach shows that optimizing merge coefficients together with just one extra task-specific layer, guided by expert self-labeling, produces merged models that outperform prior merging techniques. This holds on vision, dense prediction, and NLP benchmarks, and it even succeeds when the source models were trained from different random initializations, a setting where standard merging fails.

Core claim

The central claim is that joint optimization of merging coefficients and a single task-specific layer, using an expert-guided self-labeling objective for stable supervision, is enough to turn non-interfering merges into synergistic ones, yielding state-of-the-art multi-task results and working even for models trained from different initializations.

What carries the argument

Single task-specific layer adaptation, jointly optimized with merging coefficients under an expert-guided self-labeling objective.
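
A minimal sketch of the merging step this mechanism builds on, assuming the common task-arithmetic form in which the merged weights are the pretrained weights plus a coefficient-weighted sum of task vectors. The page does not state the paper's exact parameterization, so `merge_weights`, the flat weight lists, and the coefficient values are illustrative assumptions, not the authors' code. In the method described, the coefficients would be learned jointly with one task-specific layer rather than fixed by hand.

```python
def merge_weights(pretrained, finetuned_list, coeffs):
    """Assumed form: theta = theta_0 + sum_k lambda_k * (theta_k - theta_0).

    pretrained      -- flat list of pretrained weights (theta_0)
    finetuned_list  -- one flat weight list per task expert (theta_k)
    coeffs          -- one merging coefficient per expert (lambda_k)
    """
    merged = list(pretrained)
    for theta_k, lam in zip(finetuned_list, coeffs):
        for i, (w_k, w_0) in enumerate(zip(theta_k, pretrained)):
            # Add the scaled task vector (expert minus pretrained).
            merged[i] += lam * (w_k - w_0)
    return merged


# Hypothetical toy weights, chosen only to make the arithmetic easy to follow.
pretrained = [0.0, 1.0, -1.0]
experts = [[1.0, 1.0, -1.0], [0.0, 3.0, -1.0]]

print(merge_weights(pretrained, experts, [0.5, 0.25]))
```

Setting one coefficient to 1 and the rest to 0 recovers that expert's weights, which is a quick sanity check for any implementation of this form.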

If this is right

  • Merged models can exceed the individual-task performance of the original single-task models.
  • Merging succeeds without requiring the source models to share the same random initialization.
  • The method remains lightweight while matching or beating heavier merging baselines on vision, dense prediction, and language tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Full fine-tuning of all layers may be unnecessary once a single adapted layer can unlock synergy.
  • The approach could be tested on merging more than a handful of tasks or on very large foundation models.
  • If synergy proves robust, it reframes merging as a cheap route to multi-task systems rather than a compromise.

Load-bearing premise

Cross-task compatibility between encoders and predictors reliably predicts how well the merged model will perform on each task.
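
One way to make this premise concrete is a cross-task accuracy matrix, in which each task's predictor is evaluated on features from every task's encoder. The sketch below is a toy illustration under stated assumptions: the encoders, predictors, and four-point datasets are hypothetical stand-ins, and the paper's actual evaluation protocol may differ.

```python
def accuracy(encoder, predictor, data):
    """Fraction of (x, y) pairs where predictor(encoder(x)) equals y."""
    correct = sum(predictor(encoder(x)) == y for x, y in data)
    return correct / len(data)


def cross_task_matrix(encoders, predictors, datasets):
    """matrix[i][j]: accuracy on task j when task j's predictor reads encoder i.

    Diagonal entries are ordinary single-task accuracy; off-diagonal entries
    measure cross-task compatibility.
    """
    return [
        [accuracy(enc, predictors[j], datasets[j]) for j in range(len(predictors))]
        for enc in encoders
    ]


# Hypothetical toy setup: task 0 asks "is x positive?", task 1 asks "is |x| > 1?".
def encoder_0(x):
    return (x, 0.0)        # only keeps the signed value


def encoder_1(x):
    return (0.0, abs(x))   # only keeps the magnitude


def predictor_0(features):
    return int(features[0] > 0)


def predictor_1(features):
    return int(features[1] > 1)


data_0 = [(-2.0, 0), (-0.5, 0), (0.5, 1), (2.0, 1)]
data_1 = [(-2.0, 1), (-0.5, 0), (0.5, 0), (2.0, 1)]

matrix = cross_task_matrix(
    [encoder_0, encoder_1], [predictor_0, predictor_1], [data_0, data_1]
)
for row in matrix:
    print(row)
```

In this toy case the off-diagonal entries are low because each encoder discards the feature the other task's predictor needs; the premise is that raising those entries goes hand in hand with a better merged model.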

What would settle it

A set of benchmarks where merged models using single-layer adaptation show no gain or clear losses compared with non-adaptive merging, even when the cross-task compatibility scores are high.

Figures

Figures reproduced from arXiv: 2412.19098 by Aecheon Jung, Dongyoon Han, Seunghwan Lee, Sungeun Hong.

Figure 1
Figure 1: Training-free methods collapse under corruption. Worse than test-time methods on clean data as well, and far more degraded under corruption. Limitations of training-free methods. Our first motivation arises from the limitation of training-free methods, which are vulnerable when adapting to unseen tasks or domain shifts. To examine whether existing merging methods lack robustness to distribution shifts,…
Figure 2
Figure 2: Cross-task vs. Merge Performance. Positive correlation observed across 20 vision tasks with regression fit and 95% confidence interval. Rethinking cross-task performance. Another motivation comes from the intuition that a model’s cross-task performance is closely tied to its merging performance. To examine this, we conducted a preliminary study on 20 vision tasks using ViT-B/32. We observed a significan…
Figure 3
Figure 3: Two-stage pilot study protocol and its results on 8 cross-tasks using ViT-B/32. (a) We first enhance a classifier’s functional alignment by training it on representations from a general-purpose merged encoder. We then measure this enhancement by evaluating the trained classifier’s cross-task performance when paired with the encoder of a different, individual task. (b) The heatmap shows the accuracy gain (%…
Figure 4
Figure 4: Spearman correlation of proxy losses with ground truth cross-entropy loss. We compare coefficients for Entropy and Ours using merged weights before and after training. A coefficient closer to +1 indicates a more reliable proxy for the true objective. On the choice of the objective function. An effective proxy objective for test-time adaptation must maintain a strong correlation with the ground-truth obj…
Figure 5
Figure 5: Impact of training task-specific layers for task synergistic merging. We compare partial trainings defined by specific layers with the coefficient (“Coef”) training. Single-layer/classifier training (with coefficients) works; multi-layer training fails. The results show a remarkable degree of transferability. When paired with the TA encoder, it boosts merged performance by 10.5%p and, critically, improves…
Figure 6
Figure 6: More analyses with merging coefficients. (a) Refined supervisory models merged under various coefficients are tested. Our design choice – using the unmerged individual models – performs near-optimally. (b) Our method shows strong robustness to different initial merging coefficients. More studies with merging coefficients. We study whether employing individual model predictions without any refinement is a …
Figure 7
Figure 7: Prediction discrepancies between merged and individual model. (a) The upper bars indicate predictions correctly classified by individual models but misclassified by the merged model; the lower (hatched) bars indicate the opposite. A lower upper portion and a higher lower portion indicate a positive impact on the merged model performance. (b) represents the overall difference between the upper and lower ba…
Original abstract

Model merging combines independently trained models into a single multi-task model. However, most existing approaches focus primarily on avoiding task interference. We argue that its greater potential lies in enabling task synergy, where tasks actively improve one another. We identify cross-task performance, defined by compatibility between encoders and predictors across tasks, as a key indicator of merge quality. We demonstrate that adapting only a single task-specific layer is sufficient to induce such synergy. This study proposes SyMerge, a lightweight framework that jointly optimizes merging coefficients and a single task-specific layer. We adopt an expert-guided self-labeling objective, providing stable supervision beyond entropy minimization. Intriguingly, we further show that SyMerge successfully merges models trained from different initializations, a regime where standard methods break down. Our minimalist yet principled method achieves state-of-the-art results across vision, dense prediction, and NLP benchmarks. Our code is available at https://aim-skku.github.io/SyMerge

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper proposes SyMerge, a lightweight model merging framework that shifts focus from avoiding task interference to enabling task synergy. It identifies cross-task performance (compatibility between encoders and predictors across tasks) as a key indicator of merge quality, claims that adapting only a single task-specific layer suffices to induce such synergy, and describes a method that jointly optimizes merging coefficients and this layer using an expert-guided self-labeling objective. The work reports state-of-the-art results across vision, dense prediction, and NLP benchmarks, plus successful merging of models trained from different initializations, with code released.

Significance. If the central empirical claims hold, the result would be significant for model merging research by offering a minimalist, principled route to synergistic multi-task models that works even across different initializations. The public code release is a clear strength for reproducibility. The approach could influence future merging techniques if the single-layer adaptation mechanism is shown to be both sufficient and mechanistically justified.

major comments (1)
  1. [Abstract] Abstract: The manuscript explicitly positions cross-task performance (compatibility between encoders and predictors) as 'a key indicator of merge quality' that justifies the single-layer adaptation design. No correlation analysis, ablation, or controlled comparison is referenced showing that gains in this compatibility measure predict or causally relate to actual merged-model accuracy on downstream tasks. This link is load-bearing for the sufficiency claim; without it the justification for why single-layer adaptation induces synergy (rather than merely fitting coefficients) remains unestablished.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback. We address the concern about the justification for single-layer adaptation below and will strengthen the manuscript with additional analysis.

Point-by-point responses
  1. Referee: [Abstract] Abstract: The manuscript explicitly positions cross-task performance (compatibility between encoders and predictors across tasks) as 'a key indicator of merge quality' that justifies the single-layer adaptation design. No correlation analysis, ablation, or controlled comparison is referenced showing that gains in this compatibility measure predict or causally relate to actual merged-model accuracy on downstream tasks. This link is load-bearing for the sufficiency claim; without it the justification for why single-layer adaptation induces synergy (rather than merely fitting coefficients) remains unestablished.

    Authors: We agree that the current version does not include explicit correlation analysis, ablations, or controlled comparisons linking gains in cross-task performance to downstream merged-model accuracy. This is a valid point regarding the load-bearing nature of the claim. In the revision, we will add a new subsection (with figures and tables) presenting: (i) Pearson/Spearman correlations between cross-task compatibility scores and final task accuracies across multiple merge settings; (ii) controlled ablations where we vary only the single-layer adaptation while holding merging coefficients fixed, showing the incremental benefit; and (iii) comparisons against coefficient-only baselines. These additions will directly establish the predictive and causal relationship, thereby justifying why single-layer adaptation induces synergy beyond mere coefficient fitting. revision: yes
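
The correlation analysis promised in item (i) could be computed along the following lines. This is a sketch only: the score lists are made-up placeholders, not results from the paper, and `pearson`, `ranks`, and `spearman` are hypothetical helper names written in plain Python so the block has no dependencies.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length lists (undefined for constant input)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (norm_x * norm_y)


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=values.__getitem__)
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(ranks(xs), ranks(ys))


# Placeholder numbers for illustration only; they are NOT taken from the paper.
cross_task_scores = [0.10, 0.20, 0.30, 0.40]
merged_accuracies = [0.50, 0.55, 0.70, 0.90]

print("Pearson:", pearson(cross_task_scores, merged_accuracies))
print("Spearman:", spearman(cross_task_scores, merged_accuracies))
```

Reporting both coefficients is a reasonable design choice here: Spearman only needs the relationship to be monotone, while Pearson additionally tests whether it is close to linear.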

Circularity Check

0 steps flagged

No circularity: empirical joint optimization with external benchmarks

Full rationale

The paper describes an empirical method (SyMerge) that jointly optimizes merging coefficients and one task-specific layer under an expert-guided self-labeling objective. No equations, derivations, or 'predictions' are presented that reduce by construction to fitted inputs or self-definitions. Cross-task compatibility is positioned as a hypothesized indicator whose validity is tested against downstream benchmarks rather than assumed by definition. The approach is self-contained against external vision/NLP/dense-prediction results and does not rely on load-bearing self-citations or uniqueness theorems imported from prior author work.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 0 invented entities

The approach rests on standard supervised optimization assumptions plus the domain claim that single-layer changes suffice for synergy; no new physical or mathematical entities are introduced.

free parameters (1)
  • merging coefficients
    Jointly optimized with the single task-specific layer; values are not pre-specified.
axioms (2)
  • domain assumption: Cross-task performance, defined by encoder-predictor compatibility, is a reliable indicator of merge quality.
    Explicitly identified in the abstract as a key indicator.
  • domain assumption: Expert-guided self-labeling supplies stable supervision beyond entropy minimization.
    Stated as the chosen objective in the abstract.
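
To see why entropy minimization alone can be an unstable signal, it helps to compare it with a loss that uses the expert's prediction as a pseudo-label. The sketch below assumes hard pseudo-labels taken from the expert's argmax; the page does not say whether the paper uses hard or soft targets, so `self_label_loss` is an illustrative stand-in, and the logits are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy_loss(merged_logits):
    """Shannon entropy of the merged model's own prediction (no expert involved)."""
    probs = softmax(merged_logits)
    return -sum(p * math.log(p) for p in probs if p > 0)


def self_label_loss(merged_logits, expert_logits):
    """Cross-entropy of the merged model against the expert's argmax label.

    Assumption: the pseudo-label is the expert's hard prediction.
    """
    pseudo_label = max(range(len(expert_logits)), key=expert_logits.__getitem__)
    probs = softmax(merged_logits)
    return -math.log(probs[pseudo_label])


# Hypothetical case: the merged model is confident in class 0,
# while the task expert predicts class 1.
confident_wrong = [4.0, 0.0, 0.0]
uncertain = [0.0, 0.0, 0.0]
expert = [0.0, 3.0, 0.0]

print("entropy:", entropy_loss(confident_wrong), entropy_loss(uncertain))
print("self-label:", self_label_loss(confident_wrong, expert),
      self_label_loss(uncertain, expert))
```

In this toy case the confidently wrong prediction has lower entropy than the uniform one, so entropy minimization would favor it, while the expert-guided loss penalizes it.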

pith-pipeline@v0.9.0 · 5706 in / 1282 out tokens · 28903 ms · 2026-05-25T08:03:54.865784+00:00 · methodology

