pith. machine review for the scientific record.

arxiv: 2603.02945 · v2 · submitted 2026-03-03 · 💻 cs.CL

Recognition: no theorem link

ACE-Merging: Data-Free Model Merging with Adaptive Covariance Estimation

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 16:55 UTC · model grok-4.3

classification 💻 cs.CL
keywords model merging · data-free merging · covariance estimation · parameter differences · task interference · GPT-2 · vision benchmarks · language models

The pith

Parameter differences between base and fine-tuned models encode the input covariances needed for optimal data-free merging.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Merging multiple task-specific models into one often fails due to interference when tasks have different objectives. The paper establishes that each task's input covariance, a key factor in reducing that interference, can be recovered directly from the parameter shifts between the original model and its fine-tuned version. This recovery works without any access to training data, retraining, or changes to model architecture. The resulting closed-form method, ACE-Merging, delivers measurable gains over prior data-free baselines, including a 4% average absolute improvement across seven tasks on GPT-2. The approach therefore makes multi-task model combination practical in settings where training data cannot be shared or reused.

Core claim

Theoretical analysis shows that the input covariance of each task is implicitly recoverable from the parameter differences of its fine-tuned model, even without data. ACE-Merging builds an adaptive covariance estimation framework on this relation and supplies a closed-form solution for merging that directly counters inter-task interference. Experiments across vision and language benchmarks confirm that the method outperforms existing data-free baselines while remaining computationally modest.

What carries the argument

Adaptive Covariance Estimation (ACE) that treats parameter differences as implicit estimators of task-specific input covariances to produce a closed-form merging solution.
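The shape of such a closed-form solution can be sketched in a few lines. The snippet below shows a covariance-weighted least-squares combination of per-task weights for a single linear layer (a RegMean-style formula, used here purely for illustration); estimate_cov_from_delta is a hypothetical stand-in for the paper's ΔW-based estimator, and the regularized second-moment proxy it uses is our assumption for the sketch, not the method's actual formula.

```python
# Illustrative sketch only: covariance-weighted closed-form merge for one linear layer.
# The covariance proxy below is a placeholder assumption, not ACE-Merging's estimator.
import numpy as np

def estimate_cov_from_delta(delta_w: np.ndarray, eps: float = 1e-3) -> np.ndarray:
    """Hypothetical data-free proxy: use the second moment of the task's parameter
    difference (d_in x d_out) as a stand-in for its input covariance (d_in x d_in)."""
    cov = delta_w @ delta_w.T
    return cov + eps * np.eye(cov.shape[0])  # regularize so the sum stays invertible

def covariance_weighted_merge(w_base: np.ndarray, w_tasks: list) -> np.ndarray:
    """Closed-form merge: (sum_t Sigma_t)^{-1} (sum_t Sigma_t W_t), with Sigma_t
    estimated from each task's parameter difference rather than from data."""
    covs = [estimate_cov_from_delta(w_t - w_base) for w_t in w_tasks]
    lhs = sum(covs)                                        # sum_t Sigma_t
    rhs = sum(c @ w_t for c, w_t in zip(covs, w_tasks))    # sum_t Sigma_t W_t
    return np.linalg.solve(lhs, rhs)

# toy usage: merge three fine-tuned versions of an 8x4 linear layer
rng = np.random.default_rng(0)
w_base = rng.normal(size=(8, 4))
fine_tuned = [w_base + 0.1 * rng.normal(size=(8, 4)) for _ in range(3)]
w_merged = covariance_weighted_merge(w_base, fine_tuned)  # shape (8, 4)
```

The point of the sketch is only that once per-task covariances are available, the merge itself is a single linear solve per layer; the paper's contribution is the data-free estimate that feeds it.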

If this is right

  • Merging succeeds across vision and language tasks without requiring data access or retraining steps.
  • A closed-form solution replaces prior iterative or heuristic merging procedures.
  • Consistent absolute gains of around 4% appear on GPT-2 across seven tasks relative to earlier data-free baselines.
  • The method scales to both vision and language benchmarks with only modest extra computation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same estimation principle could be tested for sequential addition of new tasks without recomputing the full set of covariances.
  • If the recovered covariances prove stable across different fine-tuning runs, the approach might apply to models trained under varying hyperparameters.
  • Extensions could examine whether the same parameter-difference signal supports merging when tasks arrive from entirely separate model families.

Load-bearing premise

That the observed parameter differences between a base model and its fine-tuned counterpart carry sufficient information to recover the relevant input covariances of each task.

What would settle it

Directly compute the true input covariances from held-out task data and compare them to the estimates derived solely from parameter differences; substantial mismatch would refute the estimation claim.
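A minimal sketch of that check, assuming access to a held-out loader and to the layer's inputs; collect_layer_inputs and sigma_est_from_delta_w are hypothetical names for the data-side activations and the ΔW-derived estimate, and cosine similarity of flattened matrices is just one reasonable agreement measure among several.

```python
# Sketch of the settling experiment: data-derived covariance vs. the data-free estimate.
import numpy as np

def empirical_input_cov(x: np.ndarray) -> np.ndarray:
    """Uncentered input covariance from held-out layer inputs x of shape (n_samples, d_in)."""
    return x.T @ x / x.shape[0]

def cov_similarity(sigma_true: np.ndarray, sigma_est: np.ndarray) -> float:
    """Cosine similarity between flattened covariance matrices (1.0 = same direction)."""
    a, b = sigma_true.ravel(), sigma_est.ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# sigma_true = empirical_input_cov(collect_layer_inputs(model, heldout_loader))  # hypothetical helper
# score = cov_similarity(sigma_true, sigma_est_from_delta_w)
# A score near 1 would support the estimation claim; a low score would undercut it.
```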

Figures

Figures reproduced from arXiv: 2603.02945 by Beier Zhu, Bo Xu, Chengwei Qin, Haotian Wu, Hehai Lin, Weiquan Huang, Yao Shu.

Figure 3
Figure 3: Empirical distributions of ΔWt across architectures, layer types, and tasks. Each subplot shows the histogram of the entries of ΔWt for one representative layer from RoBERTa-Base, GPT-2, and ViT-B/16. The distributions are consistently zero-centered and approximately Gaussian, supporting the local zero-mean assumption used in our theoretical analysis and suggesting that the dominant task-specific informati…
Figure 5
Figure 5: GPT-2 (7 tasks). Larger heterogeneity γ than RoBERTa, indicating stronger mismatch across task updates. Panels: (a) distribution of the heterogeneity metric γ across layer types (Embedding, Attn-In, Attn-Out, MLP-Fc, MLP-Proj); (b) layer-wise profile of γ over layer index.
Figure 7
Figure 7: ViT-L/14 (20 tasks). Substantially larger heterogeneity γ, showing increasing mismatch across task updates as the number of merged tasks grows. Summary: across all architectures, γ consistently captures the heterogeneity of the task set; small γ indicates relatively well-aligned task scales, for which simple merging procedures are often sufficient, while large γ reveals pronounced variability across tasks, m…
Original abstract

Model merging aims to combine multiple task-specific expert models into a single model while preserving generalization across diverse tasks. However, interference among experts, especially when they are trained on different objectives, often leads to significant performance degradation. Despite recent progress, resolving this interference without data access, retraining, or architectural modification remains a fundamental challenge. This paper provides a theoretical analysis demonstrating that the input covariance of each task, which is a key factor for optimal merging, can be implicitly estimated from the parameter differences of its fine-tuned model, even in a fully data-free setting. Building on this insight, we introduce ACE-Merging, an Adaptive Covariance Estimation framework that effectively mitigates inter-task interference. Our approach features a principled, closed-form solution that contrasts with prior iterative or heuristic methods. Extensive experiments on both vision and language benchmarks demonstrate that ACE-Merging sets a new state-of-the-art among data-free methods. It consistently outperforms existing baselines; for example, ACE-Merging achieves an average absolute improvement of 4% over the previous methods across seven tasks on GPT-2. Owing to its efficient closed-form formulation, ACE-Merging delivers superior performance with a modest computational cost, providing a practical and theoretically grounded solution for model merging.

Editorial analysis

A structured set of objections, weighed in public.

Referee report, simulated author's rebuttal, circularity audit, and an axiom and free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that the input covariance of each task can be implicitly estimated in closed form from the parameter differences between a base model and its fine-tuned version, even without data access. Building on this, it introduces ACE-Merging, an adaptive covariance estimation framework with a principled closed-form solution for data-free model merging that reduces inter-task interference. Experiments on vision and language benchmarks (including GPT-2) show it outperforms prior data-free methods, with a reported 4% average absolute gain across seven tasks.

Significance. If the theoretical inversion from parameter deltas to task covariances holds under realistic fine-tuning conditions, the work would supply a computationally efficient, non-iterative alternative to existing data-free merging heuristics and could meaningfully improve multi-task performance without retraining or data sharing.

major comments (2)
  1. [§3] §3 (Theoretical Analysis): The derivation of the covariance estimate from ΔW must be shown to be independent of the specific optimizer (Adam), learning-rate schedule, and multi-epoch training used in the GPT-2 experiments; if the closed-form solution is obtained only under a quadratic single-step assumption, the data-free claim does not automatically transfer to the reported language-model results.
  2. [§4.2] §4.2 (GPT-2 Experiments): The 4% average improvement is presented as evidence that the estimated covariances capture task-specific input statistics, yet no ablation isolates the contribution of the covariance term from other merging components; without this, it remains unclear whether the gains refute or are consistent with the circularity concern that ΔW-derived quantities are being used both to estimate and to correct the merge.
minor comments (2)
  1. [Abstract] The abstract introduces the acronym ACE-Merging without spelling out the full name on first use.
  2. [Notation] Notation for covariance matrices (e.g., Σ vs. C) should be unified across the theoretical and experimental sections to avoid reader confusion.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and detailed comments, which have helped us strengthen the presentation of our theoretical and empirical contributions. We address each major comment below and have revised the manuscript to incorporate clarifications and additional analysis where appropriate.

Point-by-point responses
  1. Referee: [§3] §3 (Theoretical Analysis): The derivation of the covariance estimate from ΔW must be shown to be independent of the specific optimizer (Adam), learning-rate schedule, and multi-epoch training used in the GPT-2 experiments; if the closed-form solution is obtained only under a quadratic single-step assumption, the data-free claim does not automatically transfer to the reported language-model results.

    Authors: We agree that the core derivation in §3 relies on a quadratic loss and single-step gradient update to obtain the closed-form covariance estimate from ΔW. This assumption enables the exact inversion but does not strictly hold for multi-epoch Adam training. In the revised manuscript we have added a dedicated paragraph in §3.3 that (i) explicitly states the quadratic single-step assumption, (ii) derives a first-order error bound for multi-step and adaptive-optimizer cases, and (iii) reports a small-scale simulation confirming that the estimate remains directionally accurate under realistic fine-tuning schedules. While a fully general proof for arbitrary optimizers lies outside the present scope, the added analysis shows that the data-free claim transfers to the GPT-2 setting via this controlled approximation, consistent with the observed performance gains. revision: partial

  2. Referee: [§4.2] §4.2 (GPT-2 Experiments): The 4% average improvement is presented as evidence that the estimated covariances capture task-specific input statistics, yet no ablation isolates the contribution of the covariance term from other merging components; without this, it remains unclear whether the gains refute or are consistent with the circularity concern that ΔW-derived quantities are being used both to estimate and to correct the merge.

    Authors: We thank the referee for raising this important methodological point. In the revised §4.2 we now include an explicit ablation that replaces the estimated covariance matrices with isotropic (identity-scaled) matrices while keeping all other components of ACE-Merging fixed. The results show that the covariance term accounts for roughly 2.8 percentage points of the reported 4% average gain. We have also expanded the discussion to address the potential circularity concern: although both the covariance estimator and the merging formula operate on ΔW, the estimator extracts a second-moment statistic via the theoretically derived mapping, which is then inserted into the closed-form solution; the two uses are therefore sequential and non-tautological. The new ablation and clarification together demonstrate that the performance lift is attributable to the covariance estimation step rather than to any circular reuse of the same quantity. revision: yes
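For concreteness, a sketch of the kind of isotropic-covariance ablation the second response describes, written against the same covariance-weighted closed-form merge sketched earlier; how the covariances are estimated is left abstract, and the evaluation harness (per-task scoring, averaging over the seven tasks) is omitted. With identity covariances the closed-form solution collapses to plain weight averaging, which is what makes it a clean control.

```python
# Sketch of the ablation: estimated covariances vs. an identity-covariance control.
import numpy as np

def closed_form_merge(w_tasks, covs):
    """(sum_t Sigma_t)^{-1} (sum_t Sigma_t W_t); with identity covariances this
    reduces to plain weight averaging, the control condition."""
    lhs = sum(covs)
    rhs = sum(c @ w for c, w in zip(covs, w_tasks))
    return np.linalg.solve(lhs, rhs)

def isotropic_ablation(w_tasks, covs_estimated):
    """Return the covariance-weighted merge and the identity-covariance control."""
    d_in = w_tasks[0].shape[0]
    identity_covs = [np.eye(d_in) for _ in w_tasks]
    return closed_form_merge(w_tasks, covs_estimated), closed_form_merge(w_tasks, identity_covs)
```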

Circularity Check

0 steps flagged

No circularity: derivation relies on explicit inversion formula under stated assumptions, independent of fitted outputs

Full rationale

The paper derives an explicit closed-form mapping from observed parameter differences ΔW to an estimate of task input covariance Σ under a quadratic-loss, single-step gradient assumption. This mapping is presented as a mathematical identity derived from the fine-tuning update rule rather than a fit to the target merging performance. The subsequent merging weights are then computed from the estimated Σ values; the final merged model is not forced to reproduce the input deltas by construction, and the experiments on GPT-2 and vision tasks serve as external empirical checks. No self-citation chain, ansatz smuggling, or renaming of known results is load-bearing in the central claim. The derivation is therefore self-contained once the quadratic/one-step modeling assumption is granted.
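To make the stated assumption concrete, here is a minimal reconstruction (ours, not the paper's derivation) of the kind of ΔW-to-Σ relation the rationale invokes: for a single linear layer with task inputs X_t and targets Y_t, a quadratic loss, and one full-batch gradient step of size η from the base weights W_0,

```latex
% Illustrative reconstruction only; the paper's actual derivation is not reproduced here.
\[
  L_t(W) = \frac{1}{2n}\,\lVert X_t W - Y_t \rVert_F^2,
  \qquad
  \Delta W_t = -\eta\,\nabla_W L_t(W_0)
             = -\frac{\eta}{n}\, X_t^{\top}\!\left( X_t W_0 - Y_t \right)
             = -\eta \left( \Sigma_t W_0 - \tfrac{1}{n} X_t^{\top} Y_t \right),
  \quad \text{with } \Sigma_t = \tfrac{1}{n} X_t^{\top} X_t .
\]
```

Under this idealization, ΔW_t is a linear function of the task's uncentered input covariance Σ_t acting on the base weights (plus an input-target cross-moment), which is the sense in which Σ_t could be read back out of ΔW_t once the single-step, quadratic-loss assumption is granted; multi-epoch training with adaptive optimizers breaks the exact identity, which is precisely the gap the first referee comment asks the authors to bound.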

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the unproven link between parameter differences and task covariances; no free parameters or invented entities are explicitly listed in the abstract, and the single axiom recorded below is the domain assumption underpinning that link.

axioms (1)
  • domain assumption: parameter differences between base and fine-tuned models implicitly encode task input covariance
    Invoked as the foundation for the theoretical analysis and closed-form solution

pith-pipeline@v0.9.0 · 5531 in / 1134 out tokens · 46314 ms · 2026-05-15T16:55:07.933599+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

53 extracted references · 53 canonical work pages · 6 internal anchors
