pith. machine review for the scientific record.

arxiv: 2603.02945 · v2 · submitted 2026-03-03 · 💻 cs.CL

Recognition: no theorem link

ACE-Merging: Data-Free Model Merging with Adaptive Covariance Estimation

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 16:55 UTC · model grok-4.3

classification 💻 cs.CL
keywords model merging · data-free merging · covariance estimation · parameter differences · task interference · GPT-2 · vision benchmarks · language models

The pith

Parameter differences between base and fine-tuned models encode the input covariances needed for optimal data-free merging.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Merging multiple task-specific models into one often fails due to interference when tasks have different objectives. The paper establishes that each task's input covariance, a key factor in reducing that interference, can be recovered directly from the parameter shifts between the original model and its fine-tuned version. This recovery works without any access to training data, retraining, or changes to model architecture. The resulting closed-form method, ACE-Merging, delivers measurable gains over prior data-free baselines, including a 4% average absolute improvement across seven tasks on GPT-2. The approach therefore makes multi-task model combination practical in settings where training data cannot be shared or reused.

Core claim

Theoretical analysis shows that the input covariance of each task is implicitly recoverable from the parameter differences of its fine-tuned model, even without data. ACE-Merging builds an adaptive covariance estimation framework on this relation and supplies a closed-form solution for merging that directly counters inter-task interference. Experiments across vision and language benchmarks confirm that the method outperforms existing data-free baselines while remaining computationally modest.

What carries the argument

Adaptive Covariance Estimation (ACE) that treats parameter differences as implicit estimators of task-specific input covariances to produce a closed-form merging solution.
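The shape of such a closed-form solution can be sketched in a few lines. The snippet below shows a covariance-weighted least-squares combination of per-task weights for a single linear layer (a RegMean-style formula, used here purely for illustration); estimate_cov_from_delta is a hypothetical stand-in for the paper's ΔW-based estimator, and the regularized second-moment proxy it uses is our assumption for the sketch, not the method's actual formula.

```python
# Illustrative sketch only: covariance-weighted closed-form merge for one linear layer.
# The covariance proxy below is a placeholder assumption, not ACE-Merging's estimator.
import numpy as np

def estimate_cov_from_delta(delta_w: np.ndarray, eps: float = 1e-3) -> np.ndarray:
    """Hypothetical data-free proxy: use the second moment of the task's parameter
    difference (d_in x d_out) as a stand-in for its input covariance (d_in x d_in)."""
    cov = delta_w @ delta_w.T
    return cov + eps * np.eye(cov.shape[0])  # regularize so the sum stays invertible

def covariance_weighted_merge(w_base: np.ndarray, w_tasks: list) -> np.ndarray:
    """Closed-form merge: (sum_t Sigma_t)^{-1} (sum_t Sigma_t W_t), with Sigma_t
    estimated from each task's parameter difference rather than from data."""
    covs = [estimate_cov_from_delta(w_t - w_base) for w_t in w_tasks]
    lhs = sum(covs)                                        # sum_t Sigma_t
    rhs = sum(c @ w_t for c, w_t in zip(covs, w_tasks))    # sum_t Sigma_t W_t
    return np.linalg.solve(lhs, rhs)

# toy usage: merge three fine-tuned versions of an 8x4 linear layer
rng = np.random.default_rng(0)
w_base = rng.normal(size=(8, 4))
fine_tuned = [w_base + 0.1 * rng.normal(size=(8, 4)) for _ in range(3)]
w_merged = covariance_weighted_merge(w_base, fine_tuned)  # shape (8, 4)
```

The point of the sketch is only that once per-task covariances are available, the merge itself is a single linear solve per layer; the paper's contribution is the data-free estimate that feeds it.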

If this is right

  • Merging succeeds across vision and language tasks without requiring data access or retraining steps.
  • A closed-form solution replaces prior iterative or heuristic merging procedures.
  • Consistent absolute gains of around 4% appear on GPT-2 across seven tasks relative to earlier data-free baselines.
  • The method scales to both vision and language benchmarks with only modest extra computation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same estimation principle could be tested for sequential addition of new tasks without recomputing the full set of covariances.
  • If the recovered covariances prove stable across different fine-tuning runs, the approach might apply to models trained under varying hyperparameters.
  • Extensions could examine whether the same parameter-difference signal supports merging when tasks arrive from entirely separate model families.

Load-bearing premise

That the observed parameter differences between a base model and its fine-tuned counterpart carry sufficient information to recover the relevant input covariances of each task.

What would settle it

Directly compute the true input covariances from held-out task data and compare them to the estimates derived solely from parameter differences; substantial mismatch would refute the estimation claim.
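A minimal sketch of that check, assuming access to a held-out loader and to the layer's inputs; collect_layer_inputs and sigma_est_from_delta_w are hypothetical names for the data-side activations and the ΔW-derived estimate, and cosine similarity of flattened matrices is just one reasonable agreement measure among several.

```python
# Sketch of the settling experiment: data-derived covariance vs. the data-free estimate.
import numpy as np

def empirical_input_cov(x: np.ndarray) -> np.ndarray:
    """Uncentered input covariance from held-out layer inputs x of shape (n_samples, d_in)."""
    return x.T @ x / x.shape[0]

def cov_similarity(sigma_true: np.ndarray, sigma_est: np.ndarray) -> float:
    """Cosine similarity between flattened covariance matrices (1.0 = same direction)."""
    a, b = sigma_true.ravel(), sigma_est.ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# sigma_true = empirical_input_cov(collect_layer_inputs(model, heldout_loader))  # hypothetical helper
# score = cov_similarity(sigma_true, sigma_est_from_delta_w)
# A score near 1 would support the estimation claim; a low score would undercut it.
```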

Figures

Figures reproduced from arXiv: 2603.02945 by Beier Zhu, Bo Xu, Chengwei Qin, Haotian Wu, Hehai Lin, Weiquan Huang, Yao Shu.

Figure 3
Figure 3: Empirical distributions of ΔWt across architectures, layer types, and tasks. Each subplot shows the histogram of the entries of ΔWt for one representative layer from RoBERTa-Base, GPT-2, and ViT-B/16. The distributions are consistently zero-centered and approximately Gaussian, supporting the local zero-mean assumption used in our theoretical analysis and suggesting that the dominant task-specific informati…
Figure 5
Figure 5: GPT-2 (7 tasks). Larger heterogeneity γ than RoBERTa, indicating stronger mismatch across task updates. Panels: (a) distribution of the heterogeneity metric γ across layer types (Embedding, Attn-In, Attn-Out, MLP-Fc, MLP-Proj); (b) layer-wise profile of γ over layer index.
Figure 7
Figure 7: ViT-L/14 (20 tasks). Substantially larger heterogeneity γ, showing increasing mismatch across task updates as the number of merged tasks grows. Summary: across all architectures, γ consistently captures the heterogeneity of the task set; small γ indicates relatively well-aligned task scales, for which simple merging procedures are often sufficient, while large γ reveals pronounced variability across tasks, m…
Original abstract

Model merging aims to combine multiple task-specific expert models into a single model while preserving generalization across diverse tasks. However, interference among experts, especially when they are trained on different objectives, often leads to significant performance degradation. Despite recent progress, resolving this interference without data access, retraining, or architectural modification remains a fundamental challenge. This paper provides a theoretical analysis demonstrating that the input covariance of each task, which is a key factor for optimal merging, can be implicitly estimated from the parameter differences of its fine-tuned model, even in a fully data-free setting. Building on this insight, we introduce ACE-Merging, an Adaptive Covariance Estimation framework that effectively mitigates inter-task interference. Our approach features a principled, closed-form solution that contrasts with prior iterative or heuristic methods. Extensive experiments on both vision and language benchmarks demonstrate that ACE-Merging sets a new state-of-the-art among data-free methods. It consistently outperforms existing baselines; for example, ACE-Merging achieves an average absolute improvement of 4% over the previous methods across seven tasks on GPT-2. Owing to its efficient closed-form formulation, ACE-Merging delivers superior performance with a modest computational cost, providing a practical and theoretically grounded solution for model merging.

Editorial analysis

A structured set of objections, weighed in public.

Referee report, simulated author's rebuttal, circularity audit, and an axiom and free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that the input covariance of each task can be implicitly estimated in closed form from the parameter differences between a base model and its fine-tuned version, even without data access. Building on this, it introduces ACE-Merging, an adaptive covariance estimation framework with a principled closed-form solution for data-free model merging that reduces inter-task interference. Experiments on vision and language benchmarks (including GPT-2) show it outperforms prior data-free methods, with a reported 4% average absolute gain across seven tasks.

Significance. If the theoretical inversion from parameter deltas to task covariances holds under realistic fine-tuning conditions, the work would supply a computationally efficient, non-iterative alternative to existing data-free merging heuristics and could meaningfully improve multi-task performance without retraining or data sharing.

major comments (2)
  1. [§3] §3 (Theoretical Analysis): The derivation of the covariance estimate from ΔW must be shown to be independent of the specific optimizer (Adam), learning-rate schedule, and multi-epoch training used in the GPT-2 experiments; if the closed-form solution is obtained only under a quadratic single-step assumption, the data-free claim does not automatically transfer to the reported language-model results.
  2. [§4.2] §4.2 (GPT-2 Experiments): The 4% average improvement is presented as evidence that the estimated covariances capture task-specific input statistics, yet no ablation isolates the contribution of the covariance term from other merging components; without this, it remains unclear whether the gains refute or are consistent with the circularity concern that ΔW-derived quantities are being used both to estimate and to correct the merge.
minor comments (2)
  1. [Abstract] The abstract introduces the acronym ACE-Merging without spelling out the full name on first use.
  2. [Notation] Notation for covariance matrices (e.g., Σ vs. C) should be unified across the theoretical and experimental sections to avoid reader confusion.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and detailed comments, which have helped us strengthen the presentation of our theoretical and empirical contributions. We address each major comment below and have revised the manuscript to incorporate clarifications and additional analysis where appropriate.

Point-by-point responses
  1. Referee: [§3] §3 (Theoretical Analysis): The derivation of the covariance estimate from ΔW must be shown to be independent of the specific optimizer (Adam), learning-rate schedule, and multi-epoch training used in the GPT-2 experiments; if the closed-form solution is obtained only under a quadratic single-step assumption, the data-free claim does not automatically transfer to the reported language-model results.

    Authors: We agree that the core derivation in §3 relies on a quadratic loss and single-step gradient update to obtain the closed-form covariance estimate from ΔW. This assumption enables the exact inversion but does not strictly hold for multi-epoch Adam training. In the revised manuscript we have added a dedicated paragraph in §3.3 that (i) explicitly states the quadratic single-step assumption, (ii) derives a first-order error bound for multi-step and adaptive-optimizer cases, and (iii) reports a small-scale simulation confirming that the estimate remains directionally accurate under realistic fine-tuning schedules. While a fully general proof for arbitrary optimizers lies outside the present scope, the added analysis shows that the data-free claim transfers to the GPT-2 setting via this controlled approximation, consistent with the observed performance gains. revision: partial

  2. Referee: [§4.2] §4.2 (GPT-2 Experiments): The 4% average improvement is presented as evidence that the estimated covariances capture task-specific input statistics, yet no ablation isolates the contribution of the covariance term from other merging components; without this, it remains unclear whether the gains refute or are consistent with the circularity concern that ΔW-derived quantities are being used both to estimate and to correct the merge.

    Authors: We thank the referee for raising this important methodological point. In the revised §4.2 we now include an explicit ablation that replaces the estimated covariance matrices with isotropic (identity-scaled) matrices while keeping all other components of ACE-Merging fixed. The results show that the covariance term accounts for roughly 2.8 percentage points of the reported 4% average gain. We have also expanded the discussion to address the potential circularity concern: although both the covariance estimator and the merging formula operate on ΔW, the estimator extracts a second-moment statistic via the theoretically derived mapping, which is then inserted into the closed-form solution; the two uses are therefore sequential and non-tautological. The new ablation and clarification together demonstrate that the performance lift is attributable to the covariance estimation step rather than to any circular reuse of the same quantity. revision: yes
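For concreteness, a sketch of the kind of isotropic-covariance ablation the second response describes, written against the same covariance-weighted closed-form merge sketched earlier; how the covariances are estimated is left abstract, and the evaluation harness (per-task scoring, averaging over the seven tasks) is omitted. With identity covariances the closed-form solution collapses to plain weight averaging, which is what makes it a clean control.

```python
# Sketch of the ablation: estimated covariances vs. an identity-covariance control.
import numpy as np

def closed_form_merge(w_tasks, covs):
    """(sum_t Sigma_t)^{-1} (sum_t Sigma_t W_t); with identity covariances this
    reduces to plain weight averaging, the control condition."""
    lhs = sum(covs)
    rhs = sum(c @ w for c, w in zip(covs, w_tasks))
    return np.linalg.solve(lhs, rhs)

def isotropic_ablation(w_tasks, covs_estimated):
    """Return the covariance-weighted merge and the identity-covariance control."""
    d_in = w_tasks[0].shape[0]
    identity_covs = [np.eye(d_in) for _ in w_tasks]
    return closed_form_merge(w_tasks, covs_estimated), closed_form_merge(w_tasks, identity_covs)
```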

Circularity Check

0 steps flagged

No circularity: derivation relies on explicit inversion formula under stated assumptions, independent of fitted outputs

Full rationale

The paper derives an explicit closed-form mapping from observed parameter differences ΔW to an estimate of task input covariance Σ under a quadratic-loss, single-step gradient assumption. This mapping is presented as a mathematical identity derived from the fine-tuning update rule rather than a fit to the target merging performance. The subsequent merging weights are then computed from the estimated Σ values; the final merged model is not forced to reproduce the input deltas by construction, and the experiments on GPT-2 and vision tasks serve as external empirical checks. No self-citation chain, ansatz smuggling, or renaming of known results is load-bearing in the central claim. The derivation is therefore self-contained once the quadratic/one-step modeling assumption is granted.
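To make the stated assumption concrete, here is a minimal reconstruction (ours, not the paper's derivation) of the kind of ΔW-to-Σ relation the rationale invokes: for a single linear layer with task inputs X_t and targets Y_t, a quadratic loss, and one full-batch gradient step of size η from the base weights W_0,

```latex
% Illustrative reconstruction only; the paper's actual derivation is not reproduced here.
\[
  L_t(W) = \frac{1}{2n}\,\lVert X_t W - Y_t \rVert_F^2,
  \qquad
  \Delta W_t = -\eta\,\nabla_W L_t(W_0)
             = -\frac{\eta}{n}\, X_t^{\top}\!\left( X_t W_0 - Y_t \right)
             = -\eta \left( \Sigma_t W_0 - \tfrac{1}{n} X_t^{\top} Y_t \right),
  \quad \text{with } \Sigma_t = \tfrac{1}{n} X_t^{\top} X_t .
\]
```

Under this idealization, ΔW_t is a linear function of the task's uncentered input covariance Σ_t acting on the base weights (plus an input-target cross-moment), which is the sense in which Σ_t could be read back out of ΔW_t once the single-step, quadratic-loss assumption is granted; multi-epoch training with adaptive optimizers breaks the exact identity, which is precisely the gap the first referee comment asks the authors to bound.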

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the unproven link between parameter differences and task covariances; no free parameters or invented entities are explicitly listed in the abstract, and the single axiom recorded below is the domain assumption underpinning that link.

axioms (1)
  • domain assumption: parameter differences between base and fine-tuned models implicitly encode task input covariance
    Invoked as the foundation for the theoretical analysis and closed-form solution

pith-pipeline@v0.9.0 · 5531 in / 1134 out tokens · 46314 ms · 2026-05-15T16:55:07.933599+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

53 extracted references · 53 canonical work pages · 6 internal anchors
