Partial Fusion of Neural Networks: Efficient Tradeoffs Between Ensembles and Weight Aggregation

Fabian Morelli; Stephan Eckstein

arxiv: 2605.22350 · v1 · pith:6P5C3OKAnew · submitted 2026-05-21 · 💻 cs.LG · stat.ML

Pith reviewed 2026-05-22 06:54 UTC · model grok-4.3

classification 💻 cs.LG stat.ML

keywords: partial fusion, neural network ensembles, weight aggregation, neuron similarity, optimal transport, generalized pruning, model compression, accuracy-cost tradeoff

The pith

Partial fusion of neural networks interpolates between ensembles and weight aggregation by selectively combining only the most similar neurons.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Neural network ensembles deliver strong performance but require running multiple separate models at inference time. Weight aggregation merges the networks into one cheaper model but typically loses accuracy in the process. Partial fusion bridges the two extremes by measuring neuron similarity across the ensemble members and then aggregating weights only for the pairs that match most closely. This produces a family of models whose accuracy and computational cost can be dialed continuously between the full ensemble and the single merged network. The same selective-combination idea can also be applied to a lone network, treating it as a generalized pruning problem in which neurons may be isolated, deleted, or linearly combined.

Core claim

By extending existing neuron-level weight aggregation techniques with partial optimal transport, the authors show that only the most similar neurons need to be fused while dissimilar ones remain separate; the resulting partial-fusion models lie on a smooth performance-cost continuum between full ensembles and complete aggregates. The same principle reframes weight aggregation and partial fusion as generalized pruning of an ensemble, where neurons can be linearly combined rather than merely deleted, and the identical generalized-pruning view applied to a single network yields comparable trade-off benefits.
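
The "linearly combined rather than merely deleted" idea can be illustrated on a single hidden layer. The sketch below is a minimal example under stated assumptions, not the authors' implementation: the helper names (`forward`, `merge_neurons`) and the specific merge rule (average the incoming weights and biases, sum the outgoing weights) are illustrative choices. When two hidden neurons have identical incoming weights, merging them this way leaves the network output unchanged, whereas simply deleting one of them generally would not.

```python
def relu(x):
    return x if x > 0.0 else 0.0

def forward(w_in, b_in, w_out, x):
    """One-hidden-layer ReLU network with a scalar output.

    w_in[j] is the incoming weight vector of hidden neuron j,
    b_in[j] its bias, and w_out[j] its outgoing weight.
    """
    hidden = [relu(sum(w * xi for w, xi in zip(w_in[j], x)) + b_in[j])
              for j in range(len(w_in))]
    return sum(h * v for h, v in zip(hidden, w_out))

def merge_neurons(w_in, b_in, w_out, i, j):
    """Merge hidden neurons i and j into one (illustrative rule, an assumption).

    Incoming weights and biases are averaged; outgoing weights are summed.
    """
    merged_w = [(a + b) / 2.0 for a, b in zip(w_in[i], w_in[j])]
    merged_b = (b_in[i] + b_in[j]) / 2.0
    merged_v = w_out[i] + w_out[j]
    keep = [k for k in range(len(w_in)) if k not in (i, j)]
    new_w_in = [w_in[k] for k in keep] + [merged_w]
    new_b_in = [b_in[k] for k in keep] + [merged_b]
    new_w_out = [w_out[k] for k in keep] + [merged_v]
    return new_w_in, new_b_in, new_w_out

# Toy layer: neurons 0 and 1 are exact duplicates, neuron 2 is different.
w_in = [[1.0, -2.0], [1.0, -2.0], [0.5, 0.5]]
b_in = [0.1, 0.1, -0.3]
w_out = [0.7, -0.2, 1.5]
x = [2.0, 0.5]

pruned = merge_neurons(w_in, b_in, w_out, 0, 1)
original_output = forward(w_in, b_in, w_out, x)
merged_output = forward(*pruned, x)
```

In this toy case the layer shrinks from three neurons to two while `original_output` and `merged_output` agree up to floating-point error. For neurons that are only approximately similar, the merge introduces an error, which is where the similarity measure becomes important.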

What carries the argument

Partial optimal transport that jointly identifies the most similar neurons across ensemble members and matches them for selective weight aggregation.
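
A small sketch of the selective-matching step, under loud assumptions: it uses squared Euclidean distance between neuron feature vectors and a greedy one-to-one matching as a simple stand-in for partial optimal transport, so it is not the paper's method. The names `partial_match` and `fuse`, and the convention that `alpha` is the fraction of neurons per network left unmatched, are hypothetical.

```python
def sq_dist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def partial_match(feats_a, feats_b, alpha):
    """Greedily match the most similar neuron pairs across two networks.

    alpha in [0, 1] is the fraction of neurons per network left unmatched
    (alpha = 0: match everything, alpha = 1: match nothing). Assumes both
    layers have the same width. Greedy selection is a stand-in for partial OT.
    """
    n = len(feats_a)
    n_matched = n - round(alpha * n)
    pairs = sorted((sq_dist(feats_a[i], feats_b[j]), i, j)
                   for i in range(n) for j in range(n))
    used_a, used_b, matches = set(), set(), []
    for _, i, j in pairs:
        if len(matches) == n_matched:
            break
        if i not in used_a and j not in used_b:
            matches.append((i, j))
            used_a.add(i)
            used_b.add(j)
    return matches

def fuse(feats_a, feats_b, matches, t=0.5):
    """Average matched neurons with weight t on B; keep the rest separate."""
    matched_a = {i for i, _ in matches}
    matched_b = {j for _, j in matches}
    fused = [[(1 - t) * a + t * b for a, b in zip(feats_a[i], feats_b[j])]
             for i, j in matches]
    fused += [feats_a[i] for i in range(len(feats_a)) if i not in matched_a]
    fused += [feats_b[j] for j in range(len(feats_b)) if j not in matched_b]
    return fused

# Toy features: neuron 0 of A resembles neuron 1 of B, and vice versa.
feats_a = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
feats_b = [[0.0, 1.1], [1.0, 0.1], [-4.0, 2.0]]

widths = {alpha: len(fuse(feats_a, feats_b, partial_match(feats_a, feats_b, alpha)))
          for alpha in (0.0, 1 / 3, 1.0)}
```

With three neurons per network, the fused layer width grows from 3 (everything matched, i.e. plain aggregation) through 4 to 6 (nothing matched, i.e. the ensemble), which is the sense in which a single parameter can move between the two extremes.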

Load-bearing premise

Neuron-level similarity between independently trained networks can be measured reliably enough that selectively fusing only the closest neurons yields intermediate models without unexpected accuracy drops.

What would settle it

Running the partial-fusion procedure across a range of similarity thresholds and finding that the resulting models' accuracy either falls below the fully aggregated baseline or fails to improve smoothly toward the ensemble baseline would falsify the claimed continuum.
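
The falsification test described above can be phrased as a simple check on a measured accuracy curve. The function name `check_continuum`, the tolerance argument, and the numbers in the example are placeholders for illustration only; they are not results from the paper.

```python
def check_continuum(accs, tol=0.0):
    """Check an accuracy curve ordered from full aggregation to full ensemble.

    accs[0] is the fully aggregated model, accs[-1] the ensemble. Returns a
    dict with two flags:
      above_aggregate: no model on the curve is worse than the aggregate
      monotone: accuracy never drops by more than tol between neighbours
    """
    aggregate = accs[0]
    above_aggregate = all(a >= aggregate - tol for a in accs)
    monotone = all(b >= a - tol for a, b in zip(accs, accs[1:]))
    return {"above_aggregate": above_aggregate, "monotone": monotone}

# Placeholder curves (made-up numbers, NOT results from the paper).
smooth = [90.0, 92.5, 94.0, 95.0, 95.5]
dipping = [90.0, 88.0, 94.0, 95.0, 95.5]

smooth_report = check_continuum(smooth)
dipping_report = check_continuum(dipping)
```

A curve like `dipping`, where an intermediate model falls below the aggregated baseline, is the kind of outcome that would count against the claimed continuum. In practice a nonzero `tol` would be needed to account for seed-to-seed variance.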

Figures

Figures reproduced from arXiv: 2605.22350 by Fabian Morelli, Stephan Eckstein.

**Figure 1.** Illustration of the main ideas in the paper: (a) model aggregation and (b) generalized pruning.

**Figure 2.** Idea behind partial model fusion of two shallow networks. First, based on a feature embedding of the neurons in the hidden layer, the neurons’ similarity across the two networks is assessed (left image). Second, the most similar neurons are matched, while the remaining ones are left isolated, leading to a partial alignment-matrix (middle image). In the proposed Partial OT Fusion method, the partial alignme…

**Figure 3.** The weight matrix $W^A_\ell$ of network A induces a weight matrix $\widetilde{W}^A_\ell = K^{A\to B}_{\ell+1} W^A_\ell K^{B\to A}_\ell$ between the neurons of network B via the transformations $K^{B\to A}_\ell$ and $K^{A\to B}_{\ell+1}$. Fusing the $\ell$-th layer weights of both models A and B into …

**Figure 4.** Visualization of a partially fused layer and definition of the corresponding weight matrix. We emphasize that each of the seven arrows in the diagram of …

**Figure 5.** Partial OT Fusion of two MLPs A and B trained on different parts of MNIST. The Interpolation Factor determines the weight given to A and B. The factor α determines the number of neurons in the fused model (α = 0 is weight aggregation and α = 1 is the ensemble). Panels (a), (b) and (c) arise from different specifications of the partial OT problem. Panels (a) and (b) use weight matrices as features (cf. Sect…

**Figure 6.** Model aggregation of two MLP models (as in …

**Figure 7.** Equally weighted model aggregation of two CNN models with the methods as in …

**Figure 8.** Partial fusion of two ResNet18 models trained independently on CIFAR10. The decrease in performance when fusing the models compared to the individual models is much smaller compared to the fusion of VGG11 models (see Figure 11a). Reported values are averaged over five random seeds.

**Figure 9.** Comparison of unstructured pruning (with or without post-processing) with generalized pruning (based on clustering) for a single VGG11 model trained on CIFAR10. Partial fusion exceeds the performance of both individual models with an increase of the overall channel count by only 38%. 3.4. Generalized Pruning of a Single Model. In this section we consider a single neural network, with the goal of reducing th…

**Figure 10.** The same comparison of different methods as shown in …

**Figure 11.** The same comparison of different methods as shown in Figures 6 and 10, but for CNNs trained independently and identically on the CIFAR10 dataset. As in Singh & Jaggi (2020), pure weight aggregation for CNNs usually does not lead to an increase in accuracy and must be combined with fine-tuning. Nevertheless, we observe interesting features: First, the baseline accuracy for α = 0 is surprisingly much higher…

**Figure 12.** This figure presents the respective top lines of …

**Figure 13.** Partial fusion of two ResNet18 models trained independently on CIFAR10 with a different training regime than in …

**Figure 14.** Comparison of different clustering algorithms for generalized pruning of a feed-forward network to 0.4× its original size, averaged over 10 random seeds. While variance is relatively high (the standard deviation across random seeds was around 2% accuracy for each point in the figure), the trend seen in this figure is quite representative of all the experiments we ran with different clustering algorithms: …

**Figure 15.** MLP merging on the MNIST dataset as in …

**Figure 16.** Neuron distance distributions for two MLPs trained on the same data with different random seeds (as in …

**Figure 17.** Channel distance distributions for two CNNs trained on the same data with different random seeds (as in …

Original abstract

Ensembles of neural networks typically outperform individual networks but incur large computational costs, whereas weight aggregation produces less costly, yet also less accurate, aggregate models. We introduce partial fusion of networks, which interpolates between ensembles and weight aggregation and thus allows for a flexible tradeoff between computational cost and performance. A direct way to achieve this is to extend existing weight aggregation methods based on neuron-level similarity between different networks, where partial fusion then only aggregates weights of neurons which are most similar. We showcase one particular method to jointly identify which neurons are most similar and match them via partial optimal transport. Further, we consider the more general perspective of weight aggregation and partial fusion as generalized pruning of ensemble models, where neurons cannot just be deleted, but also linearly combined. Finally, we show that generalized pruning applied to a single network yields similar benefits as partial fusion by allowing for a tradeoff between isolating, deleting, and linearly combining neurons based on similarity. Our code is available at https://github.com/Fabian-Mor/partial_fusion_nn.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Partial fusion via neuron-level partial OT gives a clean way to interpolate between ensembles and weight averaging, but the abstract leaves the actual performance tradeoff untested.

This paper's main contribution is a method for partial fusion that selectively aggregates only the most similar neurons across an ensemble using partial optimal transport. The result is meant to sit on a continuum between the high cost of running every model and the lower accuracy of full weight averaging. The authors also frame the same idea as generalized pruning, where neurons can be deleted, kept separate, or linearly combined, and they note that the same logic can be applied to a single network rather than an ensemble. The construction is straightforward and builds on existing similarity-based aggregation work without obvious circularity. Releasing the code is a practical plus for anyone who wants to inspect the transport implementation or run their own checks.

Referee Report

2 major / 2 minor

Summary. The paper claims to introduce partial fusion of neural networks, which uses partial optimal transport to identify and aggregate only the most similar neurons across ensemble members. This interpolates between the high performance but high cost of full ensembles and the lower cost but reduced accuracy of weight aggregation, enabling tunable tradeoffs. The approach is also framed as generalized pruning (allowing deletion or linear combination of neurons) and is shown to yield similar benefits when applied to a single network.

Significance. If the method produces models whose accuracy and inference cost form a monotonic, useful continuum without unexpected degradations, it would provide a practical and flexible tool for model merging and pruning in deep learning. The open-sourced code and the generalized-pruning perspective are strengths that support reproducibility and connections to existing literature on neuron alignment and model compression.

major comments (2)

[Method section describing partial optimal transport and neuron matching] The central tradeoff claim requires that partial OT matching on neuron weights/activations captures functionally interchangeable neurons rather than merely similar parameters. The manuscript does not detail regularization of the transport plan or validation against forward-pass equivalence, leaving open the risk of non-monotonic performance, where models at intermediate fusion ratios have higher error than both endpoints (as highlighted in the stress-test note).
[Abstract and Experiments/Results] The abstract and visible description outline the approach and one implementation but report no quantitative results, error bars, ablation studies, or accuracy-vs-cost curves for varying fusion ratios or transport mass; this absence is load-bearing for verifying the claimed flexible tradeoff and the generalized-pruning benefits.

minor comments (2)

[Abstract] The abstract could specify the architectures, datasets, and number of ensemble members used in the showcase to make the empirical claims more concrete.
[Method] Notation for the similarity cutoff or transport mass parameter should be introduced with an explicit equation or definition to avoid ambiguity when describing the partial fusion procedure.
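
For orientation, one common way to write the partial optimal transport problem is shown below. It is given as an assumption about notation rather than the paper's own definition: $C$ is a cost matrix of pairwise neuron distances, $a$ and $b$ are the marginal weights of the two networks' neurons, and $m$ is the transported mass.

```latex
\min_{T \ge 0} \; \langle T, C \rangle
\quad \text{s.t.} \quad
T \mathbf{1} \le a, \qquad
T^{\top} \mathbf{1} \le b, \qquad
\mathbf{1}^{\top} T \mathbf{1} = m,
\qquad 0 \le m \le \min\{\|a\|_1, \|b\|_1\}.
```

Under this convention the mass $m$ plays the role of the cutoff: with uniform marginals, a larger $m$ means more neurons are matched and fused, and a smaller $m$ leaves more neurons isolated. How $m$ relates to the factor α used in the figures is not specified in the text reproduced here.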

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback on our work. We address the two major comments point by point below and outline the revisions we will make to strengthen the manuscript.

Referee: [Method section describing partial optimal transport and neuron matching] The central tradeoff claim requires that partial OT matching on neuron weights/activations captures functionally interchangeable neurons rather than merely similar parameters. The manuscript does not detail regularization of the transport plan or validation against forward-pass equivalence, leaving open the risk of non-monotonic performance, where models at intermediate fusion ratios have higher error than both endpoints (as highlighted in the stress-test note).

Authors: We agree that explicit regularization of the partial OT plan and direct validation of functional interchangeability would strengthen the central claim. In the revised manuscript we will add a dedicated subsection detailing the entropy regularization and mass penalty terms used in the partial OT objective, along with the specific solver parameters. We will also include new experiments that measure output equivalence (e.g., KL divergence between fused and original forward passes) for matched versus unmatched neurons. Regarding monotonicity, the stress-test note already reports that intermediate ratios do not exceed endpoint error on the evaluated models; we will expand this analysis with additional architectures and report the full curves to make the evidence more visible.

revision: yes

Referee: [Abstract and Experiments/Results] The abstract and visible description outline the approach and one implementation but report no quantitative results, error bars, ablation studies, or accuracy-vs-cost curves for varying fusion ratios or transport mass; this absence is load-bearing for verifying the claimed flexible tradeoff and the generalized-pruning benefits.

Authors: The Experiments section already contains accuracy-versus-inference-cost curves for multiple fusion ratios and transport-mass values, together with error bars from five independent runs and ablations on similarity metrics (weight vs. activation). We will revise the abstract to include one or two key quantitative highlights (e.g., “at 50% fusion we retain 98% of ensemble accuracy at 60% of the cost”) and will add a short table summarizing the generalized-pruning results on single networks. These changes make the empirical support immediately visible without altering the existing figures or tables.

revision: partial
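
The output-equivalence check mentioned in the first response (KL divergence between fused and original forward passes) could look roughly like the following. This is a hedged sketch: the function names and the toy logits are hypothetical, and the paper's own evaluation protocol may differ.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as lists."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0.0)

def mean_output_kl(reference_logits, fused_logits):
    """Average KL(reference || fused) over a batch of logit vectors."""
    kls = [kl_divergence(softmax(r), softmax(f))
           for r, f in zip(reference_logits, fused_logits)]
    return sum(kls) / len(kls)

# Toy logits (hypothetical): the fused model is close on the first input
# and further off on the second.
reference = [[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]]
fused = [[1.9, 0.6, -1.0], [1.0, 0.2, 2.0]]

identical_kl = mean_output_kl(reference, reference)
fused_kl = mean_output_kl(reference, fused)
```

A value near zero indicates that the fused model reproduces the reference predictions on the evaluated inputs; comparing this quantity across fusion ratios is one way to probe whether matched neurons are functionally interchangeable.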

Circularity Check

0 steps flagged

No significant circularity in partial fusion derivation

The paper proposes partial fusion as a new interpolation technique between ensembles and weight aggregation, implemented via partial optimal transport on neuron similarities. This builds directly on established concepts of weight aggregation and optimal transport without any self-definitional loops, fitted inputs renamed as predictions, or load-bearing self-citations that reduce the central tradeoff claim to its own inputs by construction. The method is presented as an independent algorithmic contribution with explicit code release, and the generalized pruning perspective is framed as an extension rather than a renaming or ansatz smuggled from prior author work. No equations or performance claims in the provided text reduce to tautological equivalence with the inputs.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The method depends on the domain assumption that meaningful neuron similarity exists across independently trained networks and that partial matching via optimal transport yields a controllable performance-cost curve.

free parameters (1)

similarity cutoff or transport mass parameter
Controls what fraction of neurons is fused; its value is chosen to achieve the desired tradeoff.

axioms (1)

domain assumption: Neurons across different networks admit a well-defined similarity measure that correlates with functional equivalence.
Invoked when deciding which neurons to match and aggregate.
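
One way to make this assumption concrete is to compare neurons by their activations on a shared batch of inputs instead of by raw weights. The sketch below is illustrative only: the function names and the choice of Pearson correlation are assumptions, not the similarity measure used in the paper.

```python
import math

def pearson(u, v):
    """Pearson correlation between two equally long activation traces."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = math.sqrt(sum((a - mu) ** 2 for a in u))
    sv = math.sqrt(sum((b - mv) ** 2 for b in v))
    if su == 0.0 or sv == 0.0:
        return 0.0  # a constant neuron carries no correlation signal
    return cov / (su * sv)

def activation_distance_matrix(acts_a, acts_b):
    """Distance 1 - correlation between every neuron of A and every neuron of B.

    acts_a[i] holds the activations of neuron i of network A over a shared
    batch of inputs; acts_b is the analogue for network B.
    """
    return [[1.0 - pearson(u, v) for v in acts_b] for u in acts_a]

# Toy activations over four inputs (hypothetical values).
acts_a = [[0.0, 1.0, 2.0, 3.0], [1.0, 0.0, 1.0, 0.0]]
acts_b = [[2.0, 0.0, 2.0, 0.0], [0.0, 2.0, 4.0, 6.0]]

dist = activation_distance_matrix(acts_a, acts_b)
closest_to_a0 = min(range(len(acts_b)), key=lambda j: dist[0][j])
```

Here neuron 0 of A is closest to neuron 1 of B even though their activations differ by a factor of two, which illustrates why an activation-based measure can pick up functional similarity that a plain weight distance might miss. The resulting matrix could serve as the cost matrix for a matching step.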

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · relation to the paper passage: unclear

Paper passage: "We showcase one particular method to jointly identify which neurons are most similar and match them via partial optimal transport... partial fusion then only aggregates weights of neurons which are most similar."

IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean · J_uniquely_calibrated_via_higher_derivative · relation to the paper passage: unclear

Paper passage: "generalized pruning of ensemble models, where neurons cannot just be deleted, but also linearly combined"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

85 extracted references · 85 canonical work pages · 2 internal anchors

[1] Bengio, Yoshua and LeCun, Yann. Scaling Learning Algorithms Towards …

[2] Hinton, Geoffrey E. and Osindero, Simon and Teh, Yee Whye. A Fast Learning Algorithm for Deep Belief Nets.

[3] Deep Learning. 2016.

[4] Zaher, Eslam and Trzaskowski, Maciej and Nguyen, Quan and Roosta, Fred. Manifold Integrated Gradients: …

[5] Chen, Nutan. Approximate Geodesics for Deep Generative Models.

[6] Shao, Hang and Kumar, Abhishek and Fletcher, P. Thomas. The …

[7] Fast Approximate Geodesics for Deep Generative Models. International Conference on Artificial Neural Networks, 2019.

[8] A Geometric Perspective on Variational Autoencoders. Advances in Neural Information Processing Systems.

[9] Wasserstein Auto-Encoders. International Conference on Learning Representations.

[10] Raghu, Maithra and Gilmer, Justin and Yosinski, Jason and Sohl-Dickstein, Jascha.

[11] Convergent Learning: Do different neural networks learn the same representations? Proceedings of the 1st International Workshop on Feature Extraction: Modern Questions and Challenges at NIPS 2015, 2015.

[12] Insights on Representational Similarity in Neural Networks with Canonical Correlation. Advances in Neural Information Processing Systems.

[13] Similarity of Neural Network Representations Revisited. International Conference on Machine Learning, 2019.

[14] Similarity and Matching of Neural Network Representations. Advances in Neural Information Processing Systems.

[15] Model Fusion via Optimal Transport. Advances in Neural Information Processing Systems.

[16] Deep Model Reassembly. Advances in Neural Information Processing Systems.

[17] Bricken, Trenton and Templeton, Adly and Batson, Joshua and Chen, Brian and Jermyn, Adam and Conerly, Tom and Turner, Nick and Olah, Chris. Towards Monosemanticity: Decomposing Language Models With Dictionary Learning.

[18] Towards Optimal Transport with Global Invariances. International Conference on Artificial Intelligence and Statistics, 2019.

[19] Harmony in Diversity: Merging Neural Networks with Canonical Correlation Analysis. International Conference on Machine Learning, 2024.

[20] Lightweight Network Ensemble Architecture for Environmental Perception on the Autonomous System. CMES - Computer Modeling in Engineering and Sciences.

[21] Git Re-Basin: Merging Models Modulo Permutation Symmetries. International Conference on Learning Representations.

[22] Wasserstein Barycenter-based Model Fusion and Linear Mode Connectivity of Neural Networks. arXiv preprint arXiv:2210.06671.

[23] Proving linear mode connectivity of neural networks via optimal transport. International Conference on Artificial Intelligence and Statistics, 2024.

[24] Merging Text Transformer Models from Different Initializations. Annual Meeting of the Association for Computational Linguistics.

[25] Editing Models with Task Arithmetic. International Conference on Learning Representations.

[26] Matena, Michael S and Raffel, Colin A. Merging Models with …

[27] Jordan, Keller and Sedghi, Hanie and Saukh, Olga and Entezari, Rahim and Neyshabur, Behnam.

[28] Stoica, George and Bolya, Daniel and Bjorner, Jakob and Ramesh, Pratik and Hearn, Taylor and Hoffman, Judy.

[29] Transformer Fusion with Optimal Transport. International Conference on Learning Representations.

[30] Yadav, Prateek and Tam, Derek and Choshen, Leshem and Raffel, Colin and Bansal, Mohit.

[31] Model Fusion via Neuron Interpolation. arXiv preprint arXiv:2507.00037.

[32] Model Assembly Learning with Heterogeneous Layer Weight Merging. arXiv preprint arXiv:2503.21657.

[33] Training-Free Pretrained Model Merging. IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[34] Training-Free Heterogeneous Model Merging. arXiv preprint arXiv:2501.00061.

[35] On Cross-Layer Alignment for Model Fusion of Heterogeneous Neural Networks. IEEE International Conference on Acoustics, Speech and Signal Processing.

[36] Bhatt, Aditya and Palenicek, Daniel and Belousov, Boris and Argus, Max and Amiranashvili, Artemij and Brox, Thomas and Peters, Jan.

[37] Exploration by Random Network Distillation. International Conference on Learning Representations.

[38] Pink Noise is All You Need: Colored Noise Exploration in Deep Reinforcement Learning. International Conference on Learning Representations.

[39] Addressing Function Approximation Error in Actor-Critic Methods. International Conference on Machine Learning, 2018.

[40] Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor. International Conference on Machine Learning, 2018.

[41] Soft Actor-Critic Algorithms and Applications. arXiv preprint arXiv:1812.05905.

[42] Mastering Diverse Domains through World Models. arXiv preprint arXiv:2301.04104.

[43] Hansen, Nicklas and Su, Hao and Wang, Xiaolong.

[44] Rainbow: Combining Improvements in Deep Reinforcement Learning. AAAI Conference on Artificial Intelligence.

[45] Continuous Control with Deep Reinforcement Learning. International Conference on Learning Representations.

[46] Dueling Network Architectures for Deep Reinforcement Learning. International Conference on Machine Learning, 2016.

[47] Hiraoka, Takuya and Imagawa, Takahisa and Hashimoto, Taisei and Onishi, Takashi and Tsuruoka, Yoshimasa. Dropout …

[48] Chen, Xinyue and Wang, Che and Zhou, Zijian and Ross, Keith. Randomized Ensembled Double …

[49] Maximum Entropy Inverse Reinforcement Learning. AAAI Conference on Artificial Intelligence.

[50] Fan, Jianqing and Wang, Zhaoran and Xie, Yuchen and Yang, Zhuoran. A Theoretical Analysis of Deep … 2020.

[51] Mean-Field Theory of Two-Layers Neural Networks: Dimension-Free Bounds and Kernel Limit. Conference on Learning Theory, 2019.

[52] A Mean Field View of the Landscape of Two-Layer Neural Networks. Proceedings of the National Academy of Sciences.

[53] Averaging Weights Leads to Wider Optima and Better Generalization. Conference on Uncertainty in Artificial Intelligence, 2018.

[54] Diverse Weight Averaging for Out-of-Distribution Generalization. Advances in Neural Information Processing Systems.

[55] Model Soups: Averaging Weights of Multiple Fine-Tuned Models Improves Accuracy without Increasing Inference Time. International Conference on Machine Learning, 2022.

[56] Robust Fine-Tuning of Zero-Shot Models. IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[57] Glickman, Mark E. Example of the …

[58] Do Deep Neural Network Solutions Form a Star Domain? International Conference on Machine Learning, 2024.

[59] Mechanistic Mode Connectivity. International Conference on Machine Learning, 2023.

[60] Linear Connectivity Reveals Generalization Strategies. International Conference on Learning Representations.

[61]

International Conference on Machine Learning , pages =

Geometry of the Loss Landscape in Overparameterized Neural Networks: Symmetries and Invariances , author =. International Conference on Machine Learning , pages =. 2021 , publisher =

work page 2021
[62]

Advances in Neural Information Processing Systems , volume =

Input Space Mode Connectivity in Deep Neural Networks , author =. Advances in Neural Information Processing Systems , volume =

work page
[63]

Unveiling

Abdollahpourrostam, Alireza and Sanyal, Amartya and Moosavi-Dezfooli, Seyed-Mohsen , journal =. Unveiling

work page
[64]

International Conference on Machine Learning , pages =

Linear Mode Connectivity and the Lottery Ticket Hypothesis , author =. International Conference on Machine Learning , pages =. 2020 , publisher =

work page 2020
[65]

International Conference on Learning Representations , year =

The Role of Permutation Invariance in Linear Mode Connectivity of Neural Networks , author =. International Conference on Learning Representations , year =

work page
[66]

and Wilson, Andrew G

Garipov, Timur and Izmailov, Pavel and Podoprikhin, Dmitrii and Vetrov, Dmitry P. and Wilson, Andrew G. , booktitle =. Loss Surfaces, Mode Connectivity, and Fast Ensembling of

work page
[67]

Advances in Neural Information Processing Systems , volume =

Visualizing the Loss Landscape of Neural Nets , author =. Advances in Neural Information Processing Systems , volume =

work page
[68] Large Scale Structure of Neural Network Loss Landscapes. Advances in Neural Information Processing Systems.
[69] Generalized Linear Mode Connectivity for Transformers. arXiv preprint arXiv:2506.22712.
[70] Daxberger, Erik and Kristiadi, Agustinus and Immer, Alexander and Eschenhagen, Runa and Bauer, Matthias and Hennig, Philipp. Laplace Redux -- Effortless
[71] The Optimal Partial Transport Problem. Archive for Rational Mechanics and Analysis, 2010.
[72] Gradient-Based Learning Applied to Document Recognition. Proceedings of the IEEE.
[73] Learning Multiple Layers of Features from Tiny Images.
[74] Very Deep Convolutional Networks for Large-Scale Image Recognition. International Conference on Learning Representations.
[75] Deep Residual Learning for Image Recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
[76] Hierarchical Grouping to Optimize an Objective Function. Journal of the American Statistical Association.
[77] Lloyd, Stuart. Least Squares Quantization in
[78] Data Clustering: A Review. ACM Computing Surveys.
[79] Lance, G. N. and Williams, W. T. A General Theory of Classificatory Sorting Strategies: 1
[80] Aloise, Daniel and Deshpande, Amit and Hansen, Pierre and Popat, Preyas.

Showing first 80 references.
