SyMerge: From Non-Interference to Synergistic Merging via Single-Layer Adaptation

Aecheon Jung; Dongyoon Han; Seunghwan Lee; Sungeun Hong

arxiv: 2412.19098 · v4 · pith:W754LCWPnew · submitted 2024-12-26 · 💻 cs.LG


Pith reviewed 2026-05-25 08:03 UTC · model grok-4.3

classification 💻 cs.LG

keywords: model merging, task synergy, single-layer adaptation, multi-task learning, self-labeling objective, vision benchmarks, NLP benchmarks


The pith

Adapting only a single task-specific layer during merging induces task synergy that improves performance across tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that model merging can move beyond merely avoiding interference between tasks and instead create active synergy where one task boosts another's performance. It treats cross-task compatibility between encoders and predictors as the signal of a successful merge. The proposed approach shows that optimizing merge coefficients together with just one extra task-specific layer, guided by expert self-labeling, produces merged models that outperform prior merging techniques. This holds on vision, dense prediction, and NLP benchmarks and even succeeds when the source models come from different random starts, a setting where standard merging fails.

Core claim

The central claim is that joint optimization of merging coefficients and a single task-specific layer, using an expert-guided self-labeling objective for stable supervision, is enough to turn non-interfering merges into synergistic ones, yielding state-of-the-art multi-task results and working even for models trained from different initializations.

What carries the argument

Single task-specific layer adaptation, jointly optimized with merging coefficients under an expert-guided self-labeling objective.
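As a rough illustration of what "merging coefficients" could mean in practice, the sketch below assumes a task-arithmetic-style parameterization: each fine-tuned expert contributes a task vector (its weights minus the shared pretrained weights) scaled by one coefficient. The summary above does not spell out SyMerge's exact parameterization, so the function name `merge_weights`, the flat-list weight format, and the toy numbers are illustrative assumptions only.

```python
from typing import Dict, List

# Hypothetical weight format: layer name -> flat list of parameters.
Weights = Dict[str, List[float]]


def merge_weights(pretrained: Weights, experts: List[Weights], coeffs: List[float]) -> Weights:
    """Return pretrained + sum_k coeffs[k] * (experts[k] - pretrained), layer by layer."""
    if len(experts) != len(coeffs):
        raise ValueError("need exactly one coefficient per expert")
    merged: Weights = {}
    for name, base in pretrained.items():
        out = list(base)
        for expert, lam in zip(experts, coeffs):
            for i, (w_expert, w_base) in enumerate(zip(expert[name], base)):
                # Add the scaled task vector entry for this expert.
                out[i] += lam * (w_expert - w_base)
        merged[name] = out
    return merged


# Toy numbers, made up for illustration.
pretrained = {"encoder.block0": [1.0, 2.0]}
expert_a = {"encoder.block0": [2.0, 2.0]}
expert_b = {"encoder.block0": [1.0, 4.0]}
merged = merge_weights(pretrained, [expert_a, expert_b], [0.5, 0.25])
print(merged)
```

In this parameterization a coefficient of 1.0 with a single expert recovers that expert exactly, and a coefficient of 0.0 falls back to the pretrained weights, which is why the coefficients are natural quantities to tune.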

If this is right

Merged models can exceed the individual-task performance of the original single-task models.
Merging succeeds without requiring the source models to share the same random initialization.
The method remains lightweight while matching or beating heavier merging baselines on vision, dense prediction, and language tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Full fine-tuning of all layers may be unnecessary once a single adapted layer can unlock synergy.
The approach could be tested on merging more than a handful of tasks or on very large foundation models.
If synergy proves robust, it reframes merging as a cheap route to multi-task systems rather than a compromise.

Load-bearing premise

Cross-task compatibility between encoders and predictors reliably predicts how well the merged model will perform on each task.
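One plausible way to read "cross-task compatibility" is as a table of accuracies obtained by pairing every encoder with every task's predictor. The sketch below assumes that reading; the scalar encoders, threshold predictors, and datasets are hypothetical stand-ins, not the paper's models or evaluation protocol.

```python
from typing import Callable, List, Sequence, Tuple

Encoder = Callable[[float], float]      # input -> feature (scalar toy)
Predictor = Callable[[float], int]      # feature -> class index
Dataset = Sequence[Tuple[float, int]]   # (input, label) pairs


def cross_task_accuracy(
    encoders: List[Encoder], predictors: List[Predictor], datasets: List[Dataset]
) -> List[List[float]]:
    """table[i][j]: accuracy on task j's data when encoder i feeds predictor j."""
    table = []
    for encode in encoders:
        row = []
        for predict, data in zip(predictors, datasets):
            correct = sum(predict(encode(x)) == y for x, y in data)
            row.append(correct / len(data))
        table.append(row)
    return table


# Toy setup: task A labels x > 0, task B labels x > 1.5.
encoders = [lambda x: x, lambda x: x - 1.5]
predictors = [lambda z: int(z > 0), lambda z: int(z > 0)]
data_a = [(1.0, 1), (-1.0, 0), (2.0, 1), (-2.0, 0)]
data_b = [(1.0, 0), (-1.0, 0), (2.0, 1), (-2.0, 0)]

table = cross_task_accuracy(encoders, predictors, [data_a, data_b])
print(table)
```

The diagonal holds each expert's own accuracy; the off-diagonal entries are the cross-task numbers that the premise treats as informative about merge quality.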

What would settle it

A set of benchmarks where merged models using single-layer adaptation show no gain or clear losses compared with non-adaptive merging, even when the cross-task compatibility scores are high.

Figures

Figures reproduced from arXiv: 2412.19098 by Aecheon Jung, Dongyoon Han, Seunghwan Lee, Sungeun Hong.

**Figure 1.** Training-free methods collapse under corruption. Worse than test-time methods on clean data as well, and far more degraded under corruption. Limitations of training-free methods. Our first motivation arises from the limitation of training-free methods, which are vulnerable when adapting to unseen tasks or domain shifts. To examine whether existing merging methods lack robustness to distribution shifts,…

**Figure 2.** Cross-task vs. Merge Performance. Positive correlation observed across 20 vision tasks with regression fit and 95% confidence interval. Rethinking cross-task performance. Another motivation comes from the intuition that a model’s cross-task performance is closely tied to its merging performance. To examine this, we conducted a preliminary study on 20 vision tasks using ViT-B/32. We observed a significan…

**Figure 3.** Two-stage pilot study protocol and its results on 8 cross-tasks using ViT-B/32. (a) We first enhance a classifier’s functional alignment by training it on representations from a general-purpose merged encoder. We then measure this enhancement by evaluating the trained classifier’s cross-task performance when paired with the encoder of a different, individual task. (b) The heatmap shows the accuracy gain (%…

**Figure 4.** Spearman correlation of proxy losses with ground truth cross-entropy loss. We compare coefficients for Entropy and Ours using merged weights before and after training. A coefficient closer to +1 indicates a more reliable proxy for the true objective. On the choice of the objective function. An effective proxy objective for test-time adaptation must maintain a strong correlation with the ground-truth obj…

**Figure 5.** Impact of training task-specific layers for task synergistic merging. We compare partial trainings defined by specific layers with the coefficient (“Coef”) training. Single-layer/classifier training (with coefficients) works; multi-layer training fails. The results show a remarkable degree of transferability. When paired with the TA encoder, it boosts merged performance by 10.5%p and, critically, improves…

**Figure 6.** More analyses with merging coefficients. (a) Refined supervisory models merged under various coefficients are tested. Our design choice, using the unmerged individual models, performs near-optimally. (b) Our method shows strong robustness to different initial merging coefficients. More studies with merging coefficients. We study whether employing individual model predictions without any refinement is a …

**Figure 7.** Prediction discrepancies between merged and individual model. (a) The upper bars indicate predictions correctly classified by individual models but misclassified by the merged model; the lower (hatched) bars indicate the opposite. A lower upper portion and a higher lower portion indicate a positive impact on the merged model performance. (b) represents the overall difference between the upper and lower ba…
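The discrepancy counts described in the Figure 7 caption can be computed with a few lines of bookkeeping. The following is a minimal sketch under the assumption that predictions and labels are plain class indices; the function name `discrepancy_counts` and the toy arrays are hypothetical.

```python
from typing import List, Tuple


def discrepancy_counts(
    labels: List[int], individual_preds: List[int], merged_preds: List[int]
) -> Tuple[int, int]:
    """Return (lost, gained).

    lost:   samples the individual model gets right but the merged model gets wrong.
    gained: samples the individual model gets wrong but the merged model gets right.
    """
    lost = gained = 0
    for y, p_ind, p_mrg in zip(labels, individual_preds, merged_preds):
        if p_ind == y and p_mrg != y:
            lost += 1
        elif p_ind != y and p_mrg == y:
            gained += 1
    return lost, gained


# Made-up predictions for five samples.
labels = [0, 1, 1, 0, 2]
individual_preds = [0, 1, 0, 0, 1]
merged_preds = [0, 0, 1, 0, 2]

lost, gained = discrepancy_counts(labels, individual_preds, merged_preds)
print(lost, gained, gained - lost)
```

A positive `gained - lost` means the merged model fixes more of the individual model's mistakes than it introduces, which is the situation the caption describes as a positive impact.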

Original abstract

Model merging combines independently trained models into a single multi-task model. However, most existing approaches focus primarily on avoiding task interference. We argue that its greater potential lies in enabling task synergy, where tasks actively improve one another. We identify cross-task performance, defined by compatibility between encoders and predictors across tasks, as a key indicator of merge quality. We demonstrate that adapting only a single task-specific layer is sufficient to induce such synergy. This study proposes SyMerge, a lightweight framework that jointly optimizes merging coefficients and a single task-specific layer. We adopt an expert-guided self-labeling objective, providing stable supervision beyond entropy minimization. Intriguingly, we further show that SyMerge successfully merges models trained from different initializations, a regime where standard methods break down. Our minimalist yet principled method achieves state-of-the-art results across vision, dense prediction, and NLP benchmarks. Our code is available at https://aim-skku.github.io/SyMerge
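To make the contrast between entropy minimization and expert-guided self-labeling concrete, here is a minimal sketch of the two losses on a single sample. It assumes the expert's argmax prediction is used as a hard pseudo-label for the merged model; the abstract does not say whether the paper uses hard or soft labels, so that choice, and the function names, are assumptions.

```python
import math
from typing import List


def log_softmax(logits: List[float]) -> List[float]:
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_norm = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_norm for v in logits]


def entropy_loss(merged_logits: List[float]) -> float:
    """Entropy of the merged model's own prediction (label-free proxy)."""
    log_probs = log_softmax(merged_logits)
    return -sum(math.exp(lp) * lp for lp in log_probs)


def expert_self_label_loss(merged_logits: List[float], expert_logits: List[float]) -> float:
    """Cross-entropy of the merged prediction against the expert's argmax pseudo-label."""
    pseudo_label = max(range(len(expert_logits)), key=lambda i: expert_logits[i])
    return -log_softmax(merged_logits)[pseudo_label]


# The merged model is confident in class 0, but the task expert prefers class 1.
merged_logits = [4.0, 0.0, 0.0]
expert_logits = [0.0, 3.0, 0.0]

print(entropy_loss(merged_logits))                           # small: the model is confident
print(expert_self_label_loss(merged_logits, expert_logits))  # large: confident but disagrees
```

The example shows why entropy alone can be a misleading signal: a confidently wrong merged model already has low entropy, so minimizing it further gives no corrective push, while the expert-labelled loss stays high until the prediction changes.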

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

SyMerge's single-layer synergy claim depends on an unshown link between its cross-task compatibility measure and actual merged-model gains.


The paper's main move is reframing merging as synergy creation instead of interference avoidance, then showing that joint optimization of merge coefficients plus one task-specific layer plus expert-guided self-labeling can produce it. The different-initialization result is the clearest potential contribution if the experiments hold; most prior merging work assumes shared initialization and breaks when that assumption is dropped. The minimalist framing and code release are also straightforward positives for anyone who wants to test the idea quickly.

The soft spot is the justification for the single-layer design. The abstract treats cross-task performance (encoder-predictor compatibility) as the central indicator of merge quality, yet supplies no correlation, ablation, or controlled check showing that this measure actually predicts downstream accuracy on the merged model. Without that link the rationale for why one layer is sufficient stays unanchored. The abstract also gives no experimental details, baselines, or ablation tables, so the SOTA claim cannot be evaluated from what is written.

This is the kind of paper that belongs in a reading group focused on model merging or multi-task efficiency, mainly to see whether the full experiments close the correlation gap and whether the different-init result replicates. It is worth sending to peer review because the different-initialization regime is practically relevant and the method is cheap enough to check; a referee can ask for the missing correlation analysis and the full ablation set without the paper needing major redesign.

Referee Report

1 major / 0 minor

Summary. The paper proposes SyMerge, a lightweight model merging framework that shifts focus from avoiding task interference to enabling task synergy. It identifies cross-task performance (compatibility between encoders and predictors across tasks) as a key indicator of merge quality, claims that adapting only a single task-specific layer suffices to induce such synergy, and describes a method that jointly optimizes merging coefficients and this layer using an expert-guided self-labeling objective. The work reports state-of-the-art results across vision, dense prediction, and NLP benchmarks, plus successful merging of models trained from different initializations, with code released.

Significance. If the central empirical claims hold, the result would be significant for model merging research by offering a minimalist, principled route to synergistic multi-task models that works even across different initializations. The public code release is a clear strength for reproducibility. The approach could influence future merging techniques if the single-layer adaptation mechanism is shown to be both sufficient and mechanistically justified.

major comments (1)

[Abstract] The manuscript explicitly positions cross-task performance (compatibility between encoders and predictors) as 'a key indicator of merge quality' that justifies the single-layer adaptation design. No correlation analysis, ablation, or controlled comparison is referenced showing that gains in this compatibility measure predict or causally relate to actual merged-model accuracy on downstream tasks. This link is load-bearing for the sufficiency claim; without it the justification for why single-layer adaptation induces synergy (rather than merely fitting coefficients) remains unestablished.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback. We address the concern about the justification for single-layer adaptation below and will strengthen the manuscript with additional analysis.


Referee: [Abstract] The manuscript explicitly positions cross-task performance (compatibility between encoders and predictors across tasks) as 'a key indicator of merge quality' that justifies the single-layer adaptation design. No correlation analysis, ablation, or controlled comparison is referenced showing that gains in this compatibility measure predict or causally relate to actual merged-model accuracy on downstream tasks. This link is load-bearing for the sufficiency claim; without it the justification for why single-layer adaptation induces synergy (rather than merely fitting coefficients) remains unestablished.

Authors: We agree that the current version does not include explicit correlation analysis, ablations, or controlled comparisons linking gains in cross-task performance to downstream merged-model accuracy. This is a valid point regarding the load-bearing nature of the claim. In the revision, we will add a new subsection (with figures and tables) presenting: (i) Pearson/Spearman correlations between cross-task compatibility scores and final task accuracies across multiple merge settings; (ii) controlled ablations where we vary only the single-layer adaptation while holding merging coefficients fixed, showing the incremental benefit; and (iii) comparisons against coefficient-only baselines. These additions will directly establish the predictive and causal relationship, thereby justifying why single-layer adaptation induces synergy beyond mere coefficient fitting.

Revision: yes
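The correlation check in item (i) is simple to run once compatibility scores and merged accuracies are tabulated. Below is a minimal standard-library sketch of Spearman rank correlation (Pearson correlation applied to average ranks). The scores are made up purely for illustration and are not results from the paper.

```python
import math
from typing import List


def average_ranks(values: List[float]) -> List[float]:
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(xs: List[float], ys: List[float]) -> float:
    """Pearson correlation; assumes neither input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)


def spearman(xs: List[float], ys: List[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Hypothetical per-task numbers: cross-task compatibility vs. merged accuracy.
compatibility = [0.61, 0.72, 0.55, 0.80]
merged_accuracy = [0.70, 0.78, 0.66, 0.85]
print(spearman(compatibility, merged_accuracy))
```

A coefficient near +1 would support the premise that compatibility tracks merged accuracy; a value near 0 would undercut it. Rank correlation is used here because it only asks for a monotone relationship, not a linear one.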

Circularity Check

0 steps flagged

No circularity: empirical joint optimization with external benchmarks


The paper describes an empirical method (SyMerge) that jointly optimizes merging coefficients and one task-specific layer under an expert-guided self-labeling objective. No equations, derivations, or 'predictions' are presented that reduce by construction to fitted inputs or self-definitions. Cross-task compatibility is positioned as a hypothesized indicator whose validity is tested against downstream benchmarks rather than assumed by definition. The approach is self-contained against external vision/NLP/dense-prediction results and does not rely on load-bearing self-citations or uniqueness theorems imported from prior author work.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 0 invented entities

The approach rests on standard supervised optimization assumptions plus the domain claim that single-layer changes suffice for synergy; no new physical or mathematical entities are introduced.

free parameters (1)

merging coefficients
Jointly optimized with the single task-specific layer; values are not pre-specified.
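What "jointly optimized" could look like can be shown with a deliberately tiny scalar toy: one merge coefficient scales a task vector inside the encoder, one scalar head plays the role of the single task-specific layer, and both are updated by gradient descent to match an expert's outputs. The squared-error loss is a stand-in for the self-labeling objective, and every name and number here is a hypothetical choice made for illustration, not the paper's training recipe.

```python
from typing import List, Tuple


def joint_adapt(
    w_pre: float, w_expert: float, h_init: float, h_expert: float,
    xs: List[float], lam: float = 0.3, lr: float = 0.1, steps: int = 200,
) -> Tuple[float, float, List[float]]:
    """Jointly fit merge coefficient `lam` and head weight `h` to expert outputs.

    Merged encoder weight: w = w_pre + lam * (w_expert - w_pre).
    Model output:          h * w * x.
    Target (from expert):  h_expert * w_expert * x.
    """
    tau = w_expert - w_pre  # task vector (scalar in this toy)
    h = h_init
    targets = [h_expert * w_expert * x for x in xs]

    def loss(lam_: float, h_: float) -> float:
        w = w_pre + lam_ * tau
        return sum((h_ * w * x - t) ** 2 for x, t in zip(xs, targets)) / len(xs)

    history = [loss(lam, h)]
    for _ in range(steps):
        w = w_pre + lam * tau
        grad_lam = grad_h = 0.0
        for x, t in zip(xs, targets):
            err = h * w * x - t
            grad_lam += 2 * err * h * tau * x  # d(loss)/d(lam)
            grad_h += 2 * err * w * x          # d(loss)/d(h)
        lam -= lr * grad_lam / len(xs)
        h -= lr * grad_h / len(xs)
        history.append(loss(lam, h))
    return lam, h, history


lam, h, history = joint_adapt(
    w_pre=1.0, w_expert=2.0, h_init=0.4, h_expert=0.5, xs=[1.0, -1.0, 0.5]
)
print(history[0], history[-1])
```

Both parameters move in the same loop, so the head can compensate for an imperfect coefficient and vice versa; freezing `h` in this toy would turn it into a coefficient-only baseline.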

axioms (2)

domain assumption: Cross-task performance, defined by encoder-predictor compatibility, is a reliable indicator of merge quality.
Explicitly identified in the abstract as a key indicator.
domain assumption: Expert-guided self-labeling supplies stable supervision beyond entropy minimization.
Stated as the chosen objective in the abstract.


Most results are obtained on an NVIDIA RTX 4090 GPU, while experiments involving ViT-L/14 are performed on an NVIDIA RTX A6000 GPU.

