Model Merging as Probabilistic Inference in Fine-Tuning Parameter Space

Jana Doppa; Long Minh Bui; Phi Le Nguyen; Trong Nghia Hoang; Tuan Anh Le Van; Tung Phi Duc

arxiv: 2607.01689 · v1 · pith:I6F2VRWPnew · submitted 2026-07-02 · 💻 cs.LG · cs.AI

Long Minh Bui, Tuan Anh Le Van, Tung Phi Duc, Phi Le Nguyen, Jana Doppa, Trong Nghia Hoang

Pith reviewed 2026-07-03 17:49 UTC · model grok-4.3

classification 💻 cs.LG cs.AI

keywords model merging · product of experts · energy-based models · Cauchy distribution · fine-tuning · multi-task learning · probabilistic inference · parameter space

The pith

Model merging is recast as probabilistic inference over fine-tuning parameters using a product-of-experts model with Cauchy experts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper treats model merging as inferring a single set of parameters that is compatible with multiple task-specific solutions, each treated as an energy-based expert. Existing geometric merging methods emerge as special cases when the experts are assumed Gaussian, but the observed directional residuals between merged and task-specific models are often heavy-tailed instead. Replacing the Gaussian experts with Cauchy experts produces a heavy-tailed product-of-experts formulation whose inference procedure is provably convergent and yields higher performance on several tasks and architectures.

Core claim

Model merging is formulated as maximum a posteriori inference under a product-of-experts energy model in which each task-specific fine-tuned solution defines an expert over the merged parameter vector. Several prior merging algorithms are recovered as special cases when the experts are given Gaussian energy functions. The paper reports that the empirical directional residuals are often heavy-tailed, which conflicts with this light-tailed assumption, and replaces the Gaussian experts with Cauchy experts, which match the observed tail behavior while admitting a provably convergent inference procedure.
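As a worked illustration of the Gaussian special case, consider the simplest isotropic setting. This is a generic sketch rather than the paper's exact energy design: the symbols θ (merged parameters), θ_k (the k-th task-specific solution), σ_k (a per-expert scale), and K (the number of tasks) are assumed notation, and the residual is taken to be the plain difference θ − θ_k rather than the paper's directional residual.

```latex
% Assumed notation: \theta merged parameters, \theta_k task-specific solution,
% \sigma_k > 0 isotropic per-expert scale, K number of experts.
\begin{align}
p(\theta) &\propto \prod_{k=1}^{K} \exp\bigl(-E_k(\theta)\bigr),
  \qquad E_k(\theta) = \frac{\lVert \theta - \theta_k \rVert^2}{2\sigma_k^2}, \\
\nabla_\theta \sum_{k=1}^{K} E_k(\theta)
  &= \sum_{k=1}^{K} \frac{\theta - \theta_k}{\sigma_k^2} = 0
  \;\Longrightarrow\;
  \theta^\star = \frac{\sum_{k=1}^{K} \sigma_k^{-2}\,\theta_k}{\sum_{k=1}^{K} \sigma_k^{-2}} .
\end{align}
```

With equal scales the mode reduces to the simple average of the task solutions, which is one way to see how an averaging-style merge can be read as the mode of a Gaussian product of experts.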

What carries the argument

Product-of-experts (PoE) energy-based model over merged parameters, with each task-specific solution supplying one expert; the switch from Gaussian to Cauchy experts is the central design choice.

If this is right

Any geometric merging rule that can be written as the mode of a Gaussian PoE is a special case of the new framework.
Inference under the Cauchy PoE can be performed with a provably convergent procedure without requiring additional task data.
Performance gains appear across multiple architectures and task combinations when the heavy-tailed design is used.
The framework supplies a statistical criterion for weighting or selecting which task updates to include during merging.
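The fixed-point style of inference mentioned above can be illustrated with a small sketch. It assumes a coordinate-wise Cauchy energy log(1 + r²/scale²) on the plain residual between the merged vector and each task solution. The function names `cauchy_poe_mode` and `cauchy_energy`, the shared `scale`, and the residual definition are all illustrative assumptions; this is not the paper's Algorithm 1 and does not use its directional residual.

```python
import numpy as np

def cauchy_energy(theta, task_params, scale=1.0):
    """Sum of coordinate-wise Cauchy energies log(1 + (r / scale)^2)."""
    resid = np.asarray(theta, dtype=float) - np.asarray(task_params, dtype=float)
    return np.log1p((resid / scale) ** 2).sum()

def cauchy_poe_mode(task_params, scale=1.0, n_iters=100, tol=1e-10):
    """Iteratively reweighted average seeking a stationary point of the energy.

    task_params: array of shape (K, D), one row per task-specific solution.
    scale: assumed Cauchy scale shared by all experts (hypothetical knob).
    """
    thetas = np.asarray(task_params, dtype=float)
    theta = thetas.mean(axis=0)  # start from the equal-weight (Gaussian) mode
    for _ in range(n_iters):
        resid = theta - thetas                  # (K, D) coordinate-wise residuals
        weights = 1.0 / (scale**2 + resid**2)   # large residuals get small weight
        new_theta = (weights * thetas).sum(axis=0) / weights.sum(axis=0)
        converged = np.max(np.abs(new_theta - theta)) < tol
        theta = new_theta
        if converged:
            break
    return theta

# Three nearby solutions and one outlier, in a single coordinate.
tasks = np.array([[0.0], [0.1], [-0.1], [10.0]])
merged = cauchy_poe_mode(tasks, scale=1.0)
print("mean:", tasks.mean(axis=0), "Cauchy fixed point:", merged)
```

Each update minimizes a quadratic upper bound on the energy, so the energy does not increase from one iteration to the next. In the toy example the plain average is pulled toward the outlying solution, while the reweighted estimate stays close to the cluster of three.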

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same PoE perspective could be applied to other parameter-space operations such as continual learning or federated averaging.
If residuals remain heavy-tailed after scaling or normalization, similar Cauchy designs might improve other inference tasks in high-dimensional parameter spaces.
The approach opens a route to hybrid merging that mixes Gaussian and Cauchy experts depending on observed residual statistics per layer or task.

Load-bearing premise

The directional residuals between a merged model and the individual task-specific models are heavy-tailed rather than light-tailed.

What would settle it

An experiment that measures the tail index of residuals on a new set of fine-tuned models and finds them consistently light-tailed, or that shows the Cauchy-based merging procedure failing to outperform Gaussian baselines on held-out tasks.
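One way to run such a check is the Hill estimator, a standard tail-index estimate built from the k largest absolute values of a sample. The sketch below is generic: `hill_tail_index`, the choice of `k`, and the synthetic samples are assumptions for illustration, and the paper's directional residuals would have to be substituted for the synthetic data.

```python
import numpy as np

def hill_tail_index(samples, k):
    """Hill estimate of the tail index from the k largest absolute values."""
    x = np.sort(np.abs(np.asarray(samples, dtype=float)))[::-1]  # descending order
    if not 0 < k < x.size:
        raise ValueError("k must satisfy 0 < k < number of samples")
    # Mean log-excess of the top-k values over the (k+1)-th largest value.
    log_excess = np.log(x[:k]) - np.log(x[k])
    return 1.0 / log_excess.mean()

rng = np.random.default_rng(0)
heavy = rng.standard_cauchy(200_000)   # heavy-tailed reference sample
light = rng.standard_normal(200_000)   # light-tailed reference sample
print("Cauchy sample:", hill_tail_index(heavy, k=2000))
print("Gaussian sample:", hill_tail_index(light, k=2000))
```

An estimate near 1 is what Cauchy-like tails would produce, while light-tailed samples give substantially larger values. The estimate depends on `k`, so in practice it is worth reporting it over a range of `k` rather than at a single cutoff.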

Figures

Figures reproduced from arXiv: 2607.01689 by Jana Doppa, Long Minh Bui, Phi Le Nguyen, Trong Nghia Hoang, Tuan Anh Le Van, Tung Phi Duc.

**Figure 2.** Plots of average directional residual over tasks (log-scale) at every layer weight in ViT-B-32 (left column) and …

**Figure 3.** Convergence of the fixed-point iteration method (see Algorithm 1) across layers of the merged ViT-L-14 model on …

**Figure 4.** Lipschitz constant values of F across multiple layers of our PoE-EBM when merging 7 fully fine-tuned ViT models. Top: ViT-L-14. Bottom: ViT-B-32. The Lipschitz constant is consistently less than 1, indicating that F is a contraction map and ensuring convergence of our Algorithm 1 (see …

**Figure 5.** Empirical distributions of directional residuals

**Figure 6.** Empirical distributions of directional residuals

**Figure 7.** Effect of γ on the accuracy of PoE-EBM when merging 7 fully fine-tuned ViT models. γ = 0.3 and gradually decreases as γ increases to 3. Even at γ = 2, PoE-EBM still outperforms all the compared baselines, with accuracy 83.36%. For ViT-L-14, the performance remains relatively stable as the accuracy gradually increases with smaller values of γ. However, at extremely…
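The contraction argument invoked in the Figure 4 caption is the standard Banach fixed-point result. Stated generically, with F the update map, L its Lipschitz constant, θ^(t) the iterates, and θ* the fixed point as assumed notation (the paper's exact conditions are not given in this summary):

```latex
\begin{align}
\lVert F(\theta) - F(\theta') \rVert &\le L\,\lVert \theta - \theta' \rVert
  \quad \text{for all } \theta, \theta', \qquad 0 \le L < 1, \\
\theta^{(t+1)} = F\bigl(\theta^{(t)}\bigr)
  &\;\Longrightarrow\;
  \lVert \theta^{(t)} - \theta^\star \rVert
  \le \frac{L^{t}}{1-L}\,\lVert \theta^{(1)} - \theta^{(0)} \rVert .
\end{align}
```

Under these conditions the fixed point is unique and the error shrinks geometrically, so a smaller measured L corresponds to fewer iterations for a given tolerance.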

Original abstract

Model merging aims to combine existing single-task solutions into a multi-task solution without additional data-driven fine-tuning. Most existing approaches achieve this using geometric properties of local solution spaces. However, such geometric views provide limited guidance for scoring how statistically useful each task-specific update direction is across tasks during merging. To address this, we formulate model merging from a new perspective of probabilistic inference under a product-of-experts (PoE) scenario where each single-task solution defines an energy-based expert model (EBM) over the merged parameters. We show that several existing model merging methods arise as special cases of our framework under energy designs that impose implicit Gaussian assumptions on directional residuals between merged and task-specific models. Empirically, we find that these residuals are often heavy-tailed which exposes a mismatch with the imposed light-tailed Gaussian structures. We address this with a heavy-tailed PoE design based on Cauchy experts, which better captures the observed residual behavior while admitting a provably convergent inference procedure. Experiments across multiple tasks and architectures show significant improvements over state-of-the-arts baselines. Our code is available at https://github.com/MinhLong210/PoE-EBM-Merging.git.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The PoE-EBM framing unifies some merging methods as Gaussian special cases and adds a Cauchy variant with reported gains, but the heavy-tailed residual premise is the load-bearing part that still needs robustness checks.

The paper treats model merging as product-of-experts inference, with each task model as an energy-based expert over the merged parameters. Existing geometric methods fall out as special cases under Gaussian energy on directional residuals. They then check those residuals on their data, see heavy tails, and switch to Cauchy experts that still allow convergent inference. Experiments across tasks and architectures beat the baselines they compare against.

This gives a single probabilistic story that organizes some prior work, and releasing the code is useful. The unification itself is the clearest contribution.

The soft spot is exactly the one in the stress-test note. The switch to Cauchy rests on the claim that residuals are typically heavy-tailed. If that pattern is sensitive to the tasks, architectures, or how residuals are measured, then the gains could trace to other implementation details rather than the distributional match. The abstract does not show extensive checks on that point.

This is for people already working on model merging or parameter-space multi-task methods. A reader looking for a new organizing lens could pick up ideas here even if they end up using something else.

I would send it to peer review. The framework is internally consistent and they have concrete results, so referees can pressure-test the empirical premise directly.

Referee Report

2 major / 2 minor

Summary. The paper formulates model merging as probabilistic inference under a product-of-experts (PoE) model in which each task-specific fine-tuned solution defines an energy-based expert over the merged parameters. It shows that existing merging methods arise as special cases under energy designs that implicitly assume Gaussian distributions on directional residuals. The authors report that these residuals are empirically heavy-tailed, motivating a Cauchy-expert PoE design that matches the observed behavior and admits a provably convergent inference procedure. Experiments across tasks and architectures demonstrate improvements over baselines, with code released.

Significance. If the central claims hold, the work supplies a unifying probabilistic lens on model merging and a concrete alternative to Gaussian-based methods. The explicit release of code at the provided GitHub repository is a clear strength that supports reproducibility and follow-up work. The framework could guide future designs that respect statistical properties of fine-tuning parameter spaces.

major comments (2)

[Empirical residual analysis] The heavy-tailed residual observation (abstract and the empirical analysis section) is load-bearing for the justification of the Cauchy design. The paper should demonstrate that this property is robust rather than an artifact of the chosen tasks, architectures, or residual definitions, for example by reporting tail indices or QQ-plot statistics across at least two additional model scales or fine-tuning regimes not used in the main experiments.
[Inference procedure] The claim of a 'provably convergent inference procedure' for the Cauchy PoE (inference section) is central to the method's practicality. The manuscript should state the precise convergence conditions or rate and include a direct comparison (e.g., iteration counts or stability metrics) against the Gaussian baseline in the experimental tables to show the practical benefit.

minor comments (2)

The phrase 'state-of-the-arts baselines' in the abstract should read 'state-of-the-art baselines'.
[Experiments] Table captions and axis labels in the experimental figures would benefit from explicit mention of the number of random seeds or runs used to compute reported means.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and positive assessment of the work. We address each major comment below and will incorporate the suggested revisions.

Referee: [Empirical residual analysis] The heavy-tailed residual observation (abstract and the empirical analysis section) is load-bearing for the justification of the Cauchy design. The paper should demonstrate that this property is robust rather than an artifact of the chosen tasks, architectures, or residual definitions, for example by reporting tail indices or QQ-plot statistics across at least two additional model scales or fine-tuning regimes not used in the main experiments.

Authors: We agree that additional checks would strengthen the justification. In the revised manuscript we will extend the residual analysis to at least two further model scales and fine-tuning regimes outside the current experimental set, reporting tail indices and QQ-plot statistics to confirm robustness of the heavy-tailed observation. revision: yes
Referee: [Inference procedure] The claim of a 'provably convergent inference procedure' for the Cauchy PoE (inference section) is central to the method's practicality. The manuscript should state the precise convergence conditions or rate and include a direct comparison (e.g., iteration counts or stability metrics) against the Gaussian baseline in the experimental tables to show the practical benefit.

Authors: We will revise the inference section to state the precise convergence conditions and rates for the Cauchy PoE procedure. We will also augment the experimental tables with iteration counts and stability metrics comparing the Cauchy and Gaussian baselines to illustrate the practical advantage. revision: yes

Circularity Check

0 steps flagged

No circularity; derivation follows from standard PoE inference and external empirical observation

The paper starts from the standard product-of-experts formulation over energy-based models on merged parameters, derives existing merging methods as special cases under Gaussian residual assumptions via explicit energy designs, and then introduces Cauchy experts after reporting an independent empirical finding that residuals are heavy-tailed. None of these steps reduce by construction to fitted parameters, self-definitions, or self-citation chains; the heavy-tailed premise is presented as data-driven evidence external to the formal derivation. The framework remains self-contained against external benchmarks with no load-bearing reductions to its own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Based on the abstract, no free parameters or invented entities are explicitly mentioned; the approach builds on existing EBM and PoE concepts.

axioms (1)

domain assumption: Model merging can be viewed as probabilistic inference in a product-of-experts scenario over fine-tuning parameter space.
This is the foundational new perspective introduced in the abstract.

[1] [1]

Langley , title =

P. Langley , title =. Proceedings of the 17th International Conference on Machine Learning (ICML 2000) , address =. 2000 , pages =

2000

[2] [2]

Neural Information Processing Systems , year=

Statistical Model Aggregation via Parameter Matching , author=. Neural Information Processing Systems , year=

[3] [3]

Proceedings of the 36th International Conference on Machine Learning , series =

Collective Model Fusion for Multiple Black-Box Experts , author =. Proceedings of the 36th International Conference on Machine Learning , series =

[4] [4]

Proceedings of the AAAI Conference on Artificial Intelligence (AAAI) , volume =

Few-Shot Learning via Repurposing Ensemble of Black-Box Models , author =. Proceedings of the AAAI Conference on Artificial Intelligence (AAAI) , volume =

[5] [5]

Proceedings of the 37th International Conference on Machine Learning , pages =

Learning Task-Agnostic Embedding of Multiple Black-Box Experts for Multi-Task Model Fusion , author =. Proceedings of the 37th International Conference on Machine Learning , pages =. 2020 , editor =

2020

[6] [6]

Proceedings of the 38th International Conference on Machine Learning , pages =

Model Fusion for Personalized Learning , author =. Proceedings of the 38th International Conference on Machine Learning , pages =. 2021 , editor =

2021

[7] [7]

and Deems, Stephen and Furlani, Thomas R

Boerner, Timothy J. and Deems, Stephen and Furlani, Thomas R. and Knuth, Shelley L. and Towns, John , title =. Practice and Experience in Advanced Research Computing 2023: Computing for the Common Good , pages =. 2023 , isbn =. doi:10.1145/3569951.3597559 , abstract =

work page doi:10.1145/3569951.3597559 2023

[8] [8]

International Conference on Artificial Intelligence and Statistics , year=

Communication-Efficient Learning of Deep Networks from Decentralized Data , author=. International Conference on Artificial Intelligence and Statistics , year=

[9] [9]

2018 IEEE Spoken Language Technology Workshop (SLT) , year=

A Re-Ranker Scheme For Integrating Large Scale NLU Models , author=. 2018 IEEE Spoken Language Technology Workshop (SLT) , year=

2018

[10] [10]

T. M. Mitchell. The Need for Biases in Learning Generalizations. 1980

1980

[11] [11]

M. J. Kearns , title =

[12] [12]

Machine Learning: An Artificial Intelligence Approach, Vol. I. 1983

1983

[13] [13]

R. O. Duda and P. E. Hart and D. G. Stork. Pattern Classification. 2000

2000

[14] [14]

Suppressed for Anonymity , author=

[15] [15]

Newell and P

A. Newell and P. S. Rosenbloom. Mechanisms of Skill Acquisition and the Law of Practice. Cognitive Skills and Their Acquisition. 1981

1981

[16] [16]

A. L. Samuel. Some Studies in Machine Learning Using the Game of Checkers. IBM Journal of Research and Development. 1959

1959

[17] [17]

arXiv preprint arXiv:2406.11385 , year=

Metagpt: Merging large language models using model exclusive task arithmetic , author=. arXiv preprint arXiv:2406.11385 , year=

work page arXiv

[18] [18]

Forty-first International Conference on Machine Learning , year=

Language models are super mario: Absorbing abilities from homologous models as a free lunch , author=. Forty-first International Conference on Machine Learning , year=

[19] [19]

http://yann

The MNIST database of handwritten digits , author=. http://yann. lecun. com/exdb/mnist/ , year=

[20] [20]

Proceedings of the IEEE international conference on computer vision workshops , pages=

3d object representations for fine-grained categorization , author=. Proceedings of the IEEE international conference on computer vision workshops , pages=

[21] [21]

Concrete subspace learning based interference elimination for multi-task model fusion.arXiv preprint arXiv:2312.06173,

Concrete subspace learning based interference elimination for multi-task model fusion , author=. arXiv preprint arXiv:2312.06173 , year=

work page arXiv

[22] [22]

Proceedings of the 2018 EMNLP workshop BlackboxNLP: Analyzing and interpreting neural networks for NLP , pages=

GLUE: A multi-task benchmark and analysis platform for natural language understanding , author=. Proceedings of the 2018 EMNLP workshop BlackboxNLP: Analyzing and interpreting neural networks for NLP , pages=

2018

[23] Fashion-MNIST: A novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747.

[24] Learning multiple layers of features from tiny images. 2009.

[25] Food-101 – Mining Discriminative Components with Random Forests. European Conference on Computer Vision.

[26] Automated flower classification over a large number of classes. 2008 Sixth Indian Conference on Computer Vision, Graphics & Image Processing, 2008.

[27] Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE, 2017.

[28] Describing textures in the wild. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.

[29] The German traffic sign recognition benchmark: A multi-class classification competition. The 2011 International Joint Conference on Neural Networks, 2011.

[30] EuroSAT: A novel dataset and deep learning benchmark for land use and land cover classification. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 2019.

[31] Reading digits in natural images with unsupervised feature learning. NIPS Workshop on Deep Learning and Unsupervised Feature Learning, 2011.

[32] Learning transferable visual models from natural language supervision. International Conference on Machine Learning, 2021.

[33] LoRA: Low-rank adaptation of large language models. ICLR.

[34] On the robustness and generalization of Cauchy regression. 2014 4th IEEE International Conference on Information Science and Technology, 2014.

[35] MAP: Low-compute model merging with amortized Pareto fronts via quadratic approximation. arXiv preprint arXiv:2406.07529.

[36] Model merging with SVD to tie the knots. arXiv preprint arXiv:2410.19735.

[37] Task singular vectors: Reducing task interference in model merging. Proceedings of the Computer Vision and Pattern Recognition Conference.

[38] TIES-Merging: Resolving interference when merging models. Advances in Neural Information Processing Systems.

[39] Task arithmetic in the tangent space: Improved editing of pre-trained models. Advances in Neural Information Processing Systems.

[40] Robust fine-tuning of zero-shot models. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[41] Editing models with task arithmetic. arXiv preprint arXiv:2212.04089.

[42] Generalized Cauchy distribution based robust estimation. 2008 IEEE International Conference on Acoustics, Speech and Signal Processing, 2008.

[43] Modeling multi-task model merging as adaptive projective gradient descent. arXiv preprint arXiv:2501.01230.

[44] Merging models with Fisher-weighted averaging. Advances in Neural Information Processing Systems.

[45] Cauchy loss function: Robustness under Gaussian and Cauchy noise. Southern African Conference for Artificial Intelligence Research, 2022.

[46] Optimizing neural networks with Kronecker-factored approximate curvature. International Conference on Machine Learning, 2015.

[47] On Task Vectors and Gradients. arXiv preprint arXiv:2508.16082.

[48] Model merging by uncertainty-based gradient matching. arXiv preprint arXiv:2310.12808.

[49] Introductory functional analysis with applications. 1991.

[50] A survey on federated learning systems: Vision, hype and reality for data privacy and protection. IEEE Transactions on Knowledge and Data Engineering, 2021.

[51] Dataless knowledge fusion by merging weights of language models. arXiv preprint arXiv:2212.09849.

[52] Mix data or merge models? Optimizing for diverse multi-task learning. arXiv preprint arXiv:2410.10801.

[53] GPT-4 technical report. arXiv preprint arXiv:2303.08774.

[54] How to train your energy-based models. arXiv preprint arXiv:2101.03288.

[55] A tutorial on energy-based learning. Predicting Structured Data.

[56] Energy-based models for sparse overcomplete representations. Journal of Machine Learning Research.

[57] Training products of experts by minimizing contrastive divergence. Neural Computation, 2002.

[58] Deep model fusion: A survey. arXiv preprint arXiv:2309.15698.

[59] Model merging in LLMs, MLLMs, and beyond: Methods, theories, applications, and opportunities. ACM Computing Surveys.

[60] Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.

[61] Securing distributed machine learning in high dimensions. arXiv preprint arXiv:1804.10140, 2018.

[62] Scaling wearable foundation models. arXiv preprint arXiv:2410.13638.

[63] Federated learning from vision-language foundation models: Theoretical analysis and method. Advances in Neural Information Processing Systems.

[64] On the challenges and perspectives of foundation models for medical image analysis. Medical Image Analysis, 2024.

[65] Vision foundation models in medical image analysis: Advances and challenges. International Conference on Blockchain and Trustworthy Systems, 2025.

[66] Deep learning at the edge. 2018 International Conference on Computational Science and Computational Intelligence (CSCI), 2018.