Robustness and Regularization in Hierarchical Re-Basin

Arne Raulf; Benedikt Franke; Florian Heinrich; Markus Lange

arxiv: 2510.09174 · v3 · pith:VMNUPXBJnew · submitted 2025-10-10 · 💻 cs.LG

Robustness and Regularization in Hierarchical Re-Basin

Benedikt Franke , Florian Heinrich , Markus Lange , Arne Raulf This is my paper

Pith reviewed 2026-05-21 20:29 UTC · model grok-4.3

classification 💻 cs.LG

keywords Re-Basinmodel merginghierarchical mergingadversarial robustnessperturbation robustnessneural network ensemblesregularization

0 comments

The pith

Hierarchical Re-Basin merges models while building resistance to adversarial attacks and input perturbations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a hierarchical scheme for merging trained neural networks with the Re-Basin method and shows it beats the standard flat MergeMany algorithm. The merged models gain resistance to adversarial examples and small perturbations, and this resistance grows stronger when more base models participate in the hierarchy. The same procedure produces a larger drop in clean-data performance than earlier Re-Basin work reported.

Core claim

Re-Basin applied through a hierarchical merging procedure induces adversarial and perturbation robustness into the resulting models, with the robustness effect becoming stronger the more models participate in the hierarchy. The hierarchical algorithm also delivers better merged-model performance than the flat MergeMany baseline, although the accuracy cost on standard tasks is larger than previously observed.

What carries the argument

The hierarchical merging scheme, which applies Re-Basin recursively to successive groups of models in a tree structure instead of merging all models at once.

If this is right

Merged models gain resistance to adversarial perturbations that scales with the number of base models.
The hierarchical scheme produces stronger overall merged performance than flat merging.
Clean-data accuracy falls more than earlier Re-Basin reports indicated.
Robustness benefits appear consistently across the tested merging depths.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The hierarchy may act as a built-in regularizer that trades some accuracy for robustness.
Similar robustness patterns could be tested by applying hierarchy to other merging algorithms.
Deployment settings that value robustness over peak accuracy might benefit from deeper hierarchies.

Load-bearing premise

The observed robustness gains and performance drop arise from the hierarchical merging procedure itself rather than from the particular models, training runs, or evaluation protocols used.

What would settle it

Run the same set of base models through both flat Re-Basin and the hierarchical version, then measure whether the hierarchical version still shows higher adversarial and perturbation accuracy.

Figures

Figures reproduced from arXiv: 2510.09174 by Arne Raulf, Benedikt Franke, Florian Heinrich, Markus Lange.

**Figure 2.** Figure 2: Our proposed hierarchical Merging Scheme, exemplified for merging eight models. While [5] provides the MergeMany algorithm to apply Git Re-Basin to more than 2 models, we found the algorithm to have an important theoretical weakness: In each round of the algorithm, one of the n input models Θi is permuted towards the mean Θ of ¯ the other n − 1 models with Θ = ¯ 1 n P j∈{1,...n}\i Θj [5]. However, the m… view at source ↗

**Figure 3.** Figure 3: Distribution of test set accuracies on CIFAR-10 by merging algorithm [PITH_FULL_IMAGE:figures/full_fig_p003_3.png] view at source ↗

**Figure 4.** Figure 4: Mean Accuracy of different Re-Basin stages over attack strength [PITH_FULL_IMAGE:figures/full_fig_p004_4.png] view at source ↗

**Figure 5.** Figure 5: Impact of Different Re-Basin Stages on Weight Norm and Lipschitz [PITH_FULL_IMAGE:figures/full_fig_p005_5.png] view at source ↗

read the original abstract

This paper takes a closer look at Git Re-Basin, an interesting new approach to merge trained models. We propose a hierarchical model merging scheme that significantly outperforms the standard MergeMany algorithm. With our new algorithm, we find that Re-Basin induces adversarial and perturbation robustness into the merged models, with the effect becoming stronger the more models participate in the hierarchical merging scheme. However, in our experiments Re-Basin induces a much bigger performance drop than reported by the original authors.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Hierarchical Re-Basin produces scaling robustness but the larger performance drop and missing controls leave open whether the gains are truly from the merging procedure.

read the letter

The main thing here is that the hierarchical merging scheme beats standard MergeMany and appears to build adversarial and perturbation robustness that grows with more models, though it also triggers a bigger accuracy drop than the original Re-Basin paper reported. The new element is the hierarchical structure itself, which they position as a practical way to get robustness as a side effect of merging. They are straightforward about the performance cost, which is useful to see. The scaling observation is the part that could matter for people trying to make merged models more reliable without extra training. On the downside, the abstract gives almost no experimental details, so it is hard to judge whether the robustness really traces to the hierarchical steps or to how the base models were chosen and trained. Without ablations that hold the exact set of models fixed while changing only the merging depth, the scaling effect could be driven by model selection rather than the procedure. The stress-test concern about that point still looks live based on what is shown. This paper is aimed at the small group already working on Re-Basin and model merging. A reader in that niche might pick up the hierarchical idea and the robustness observation as something to test further, but it is not yet strong enough to shift how most people merge models. I would send it to peer review so the authors can supply the controls, baselines, and statistical checks that are currently absent.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes a hierarchical variant of the Git Re-Basin model-merging procedure that is claimed to outperform the standard MergeMany baseline. It reports that Re-Basin merging induces adversarial and perturbation robustness, with the magnitude of this robustness increasing as more models participate in the hierarchy, while also documenting a substantially larger clean-data performance drop than was reported in the original Re-Basin work.

Significance. If the robustness scaling is shown to be causally attributable to the hierarchical alignment-and-merge steps rather than to model-selection or training artifacts, the result would be of moderate interest to the model-merging community: it would identify a new, training-free regularization pathway whose strength can be tuned by hierarchy depth. The larger performance drop observation is also potentially useful for understanding the robustness-accuracy trade-off in merging, provided it is placed in context with prior baselines.

major comments (2)

[§4 Experiments] §4 Experiments (and associated tables): the central claim that robustness strengthens with hierarchy depth requires an ablation in which the exact set of base models and training seeds is held fixed while only the merging depth is varied. No such controlled comparison is described; without it the scaling effect cannot be confidently attributed to the hierarchical Re-Basin procedure itself rather than to incidental differences in the participating models.
[§4.2] §4.2 and Table 3: the reported adversarial and perturbation robustness numbers are presented without error bars across independent training runs or statistical significance tests. Given that the abstract already notes a larger performance drop than prior work, the absence of these controls makes it impossible to judge whether the robustness gains are reliable or simply correlated with the larger accuracy degradation.

minor comments (2)

The abstract states empirical findings without any reference to experimental setup, number of models, datasets, or evaluation protocols; a one-sentence summary of the experimental regime would improve readability.
[§3] Notation for the hierarchical merging levels is introduced without a clear diagram or pseudocode; a small figure illustrating the tree structure would clarify the algorithm.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help clarify the strength of our claims. We address each major point below and indicate the revisions we will incorporate.

read point-by-point responses

Referee: [§4 Experiments] §4 Experiments (and associated tables): the central claim that robustness strengthens with hierarchy depth requires an ablation in which the exact set of base models and training seeds is held fixed while only the merging depth is varied. No such controlled comparison is described; without it the scaling effect cannot be confidently attributed to the hierarchical Re-Basin procedure itself rather than to incidental differences in the participating models.

Authors: We agree that the current experimental design does not fully isolate the effect of hierarchy depth from the number of participating models. In our reported results, deeper hierarchies incorporate additional models by construction, which could introduce confounding factors. We will add a controlled ablation in the revised manuscript that fixes the exact set of base models and training seeds while varying only the merging depth (e.g., by constructing hierarchies of different depths from the same pool of models). This will allow a direct attribution of any robustness scaling to the hierarchical alignment-and-merge steps. revision: yes
Referee: [§4.2] §4.2 and Table 3: the reported adversarial and perturbation robustness numbers are presented without error bars across independent training runs or statistical significance tests. Given that the abstract already notes a larger performance drop than prior work, the absence of these controls makes it impossible to judge whether the robustness gains are reliable or simply correlated with the larger accuracy degradation.

Authors: We acknowledge that the absence of error bars and statistical tests limits the ability to assess reliability, particularly in light of the larger clean accuracy drop we report. We will rerun the primary experiments across multiple independent training seeds, report standard deviations or confidence intervals for both clean accuracy and robustness metrics, and include statistical significance tests (e.g., paired t-tests) comparing hierarchical Re-Basin against baselines. These additions will be incorporated into the revised §4.2 and Table 3. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical claims rest on experiments, not derivations

full rationale

The paper introduces a hierarchical variant of Re-Basin merging and reports experimental observations of induced robustness that scales with the number of models. No equations, parameter-fitting steps, or mathematical derivations are described that could reduce a claimed prediction to an input by construction. Claims are supported by direct comparisons to MergeMany and ablation-style scaling experiments, which remain externally falsifiable. This is a standard empirical ML paper whose central results do not rely on self-referential definitions or load-bearing self-citations.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no explicit free parameters, axioms, or invented entities; all such elements remain unknown.

pith-pipeline@v0.9.0 · 5598 in / 1072 out tokens · 36002 ms · 2026-05-21T20:29:50.437867+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

We propose a hierarchical model merging scheme that significantly outperforms the standard MergeMany algorithm... Re-Basin induces adversarial and perturbation robustness... stronger the more models participate
IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean J_uniquely_calibrated_via_higher_derivative unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Re-Basin seems to act as a sort of regularization, positively impacting adversarial and perturbation robustness

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

16 extracted references · 16 canonical work pages

[1]

Linear mode connectivity and the lottery ticket hypothesis

Jonathan Frankle et al. Linear mode connectivity and the lottery ticket hypothesis. InProceedings of the 37th International Conference on Machine Learning, ICML 2020, volume 119 ofProceedings of Machine Learning Research, pages 3259–3269. PMLR, 2020

work page 2020
[2]

The lottery ticket hypothesis: Finding sparse, trainable neural networks

Jonathan Frankle and Michael Carbin. The lottery ticket hypothesis: Finding sparse, trainable neural networks. In7th International Conference on Learning Representations, ICLR 2019. OpenReview.net, 2019

work page 2019
[3]

Linear mode connectivity in multitask and continual learning

Seyed-Iman Mirzadeh, Mehrdad Farajtabar, Dilan G¨ or¨ ur, Razvan Pascanu, and Hassan Ghasemzadeh. Linear mode connectivity in multitask and continual learning. In9th International Conference on Learning Representations. OpenReview.net, 2021

work page 2021
[4]

arXiv preprint arXiv:2210.06671 , year =

Aditya Kumar Akash et al. Wasserstein barycenter-based model fusion and linear mode connectivity of neural networks.arXiv preprint arXiv:2210.06671, 2022

work page arXiv 2022
[5]

Ainsworth et al

Samuel K. Ainsworth et al. Git re-basin: Merging models modulo permutation sym- metries. InThe Eleventh International Conference on Learning Representations, ICLR

work page
[6]

OpenReview.net, 2023

work page 2023
[7]

The role of permutation invariance in linear mode connectivity of neural networks

Rahim Entezari et al. The role of permutation invariance in linear mode connectivity of neural networks. InThe Tenth International Conference on Learning Representations, ICLR 2022, 2022

work page 2022
[8]

Exploring mode connectivity for pre-trained language models

Yujia Qin et al. Exploring mode connectivity for pre-trained language models. InPro- ceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, EMNLP 2022, pages 6726–6746. Association for Computational Linguistics, 2022

work page 2022
[9]

Going beyond linear mode connectivity: The layerwise linear feature connectivity

Zhanpeng Zhou et al. Going beyond linear mode connectivity: The layerwise linear feature connectivity. InAdvances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, 2023

work page 2023
[10]

Federated learning with matched averaging

Hongyi Wang et al. Federated learning with matched averaging. In8th International Conference on Learning Representations, ICLR 2020. OpenReview.net, 2020

work page 2020
[11]

Property inference attacks on fully connected neural networks us- ing permutation invariant representations

Karan Ganju et al. Property inference attacks on fully connected neural networks us- ing permutation invariant representations. InProceedings of the 2018 ACM SIGSAC conference on computer and communications security, pages 619–633, 2018

work page 2018
[12]

Deepfool: A simple and accurate method to fool deep neural networks

Seyed-Mohsen Moosavi-Dezfooli et al. Deepfool: A simple and accurate method to fool deep neural networks. In2016 IEEE Conference on Computer Vision and Pattern Recog- nition (CVPR). IEEE, June 2016

work page 2016
[13]

Huang et al

Sandy H. Huang et al. Adversarial attacks on neural network policies. In5th Interna- tional Conference on Learning Representations, ICLR 2017, Workshop Track Proceed- ings. OpenReview.net, 2017

work page 2017
[14]

The effects of adding noise during backpropagation training on a general- ization performance.Neural Computation, 8(3):643–674, 1996

Guozhong An. The effects of adding noise during backpropagation training on a general- ization performance.Neural Computation, 8(3):643–674, 1996

work page 1996
[15]

Feature selection, l1 vs

Andrew Y Ng. Feature selection, l1 vs. l2 regularization, and rotational invariance. In Proceedings of the twenty-first international conference on Machine learning, page 78, 2004

work page 2004
[16]

Analytical bounds on the local lipschitz constants of relu networks.IEEE Transactions on Neural Networks and Learning Systems, 2023

Trevor Avant and Kristi A Morgansen. Analytical bounds on the local lipschitz constants of relu networks.IEEE Transactions on Neural Networks and Learning Systems, 2023

work page 2023

[1] [1]

Linear mode connectivity and the lottery ticket hypothesis

Jonathan Frankle et al. Linear mode connectivity and the lottery ticket hypothesis. InProceedings of the 37th International Conference on Machine Learning, ICML 2020, volume 119 ofProceedings of Machine Learning Research, pages 3259–3269. PMLR, 2020

work page 2020

[2] [2]

The lottery ticket hypothesis: Finding sparse, trainable neural networks

Jonathan Frankle and Michael Carbin. The lottery ticket hypothesis: Finding sparse, trainable neural networks. In7th International Conference on Learning Representations, ICLR 2019. OpenReview.net, 2019

work page 2019

[3] [3]

Linear mode connectivity in multitask and continual learning

Seyed-Iman Mirzadeh, Mehrdad Farajtabar, Dilan G¨ or¨ ur, Razvan Pascanu, and Hassan Ghasemzadeh. Linear mode connectivity in multitask and continual learning. In9th International Conference on Learning Representations. OpenReview.net, 2021

work page 2021

[4] [4]

arXiv preprint arXiv:2210.06671 , year =

Aditya Kumar Akash et al. Wasserstein barycenter-based model fusion and linear mode connectivity of neural networks.arXiv preprint arXiv:2210.06671, 2022

work page arXiv 2022

[5] [5]

Ainsworth et al

Samuel K. Ainsworth et al. Git re-basin: Merging models modulo permutation sym- metries. InThe Eleventh International Conference on Learning Representations, ICLR

work page

[6] [6]

OpenReview.net, 2023

work page 2023

[7] [7]

The role of permutation invariance in linear mode connectivity of neural networks

Rahim Entezari et al. The role of permutation invariance in linear mode connectivity of neural networks. InThe Tenth International Conference on Learning Representations, ICLR 2022, 2022

work page 2022

[8] [8]

Exploring mode connectivity for pre-trained language models

Yujia Qin et al. Exploring mode connectivity for pre-trained language models. InPro- ceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, EMNLP 2022, pages 6726–6746. Association for Computational Linguistics, 2022

work page 2022

[9] [9]

Going beyond linear mode connectivity: The layerwise linear feature connectivity

Zhanpeng Zhou et al. Going beyond linear mode connectivity: The layerwise linear feature connectivity. InAdvances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, 2023

work page 2023

[10] [10]

Federated learning with matched averaging

Hongyi Wang et al. Federated learning with matched averaging. In8th International Conference on Learning Representations, ICLR 2020. OpenReview.net, 2020

work page 2020

[11] [11]

Property inference attacks on fully connected neural networks us- ing permutation invariant representations

Karan Ganju et al. Property inference attacks on fully connected neural networks us- ing permutation invariant representations. InProceedings of the 2018 ACM SIGSAC conference on computer and communications security, pages 619–633, 2018

work page 2018

[12] [12]

Deepfool: A simple and accurate method to fool deep neural networks

Seyed-Mohsen Moosavi-Dezfooli et al. Deepfool: A simple and accurate method to fool deep neural networks. In2016 IEEE Conference on Computer Vision and Pattern Recog- nition (CVPR). IEEE, June 2016

work page 2016

[13] [13]

Huang et al

Sandy H. Huang et al. Adversarial attacks on neural network policies. In5th Interna- tional Conference on Learning Representations, ICLR 2017, Workshop Track Proceed- ings. OpenReview.net, 2017

work page 2017

[14] [14]

The effects of adding noise during backpropagation training on a general- ization performance.Neural Computation, 8(3):643–674, 1996

Guozhong An. The effects of adding noise during backpropagation training on a general- ization performance.Neural Computation, 8(3):643–674, 1996

work page 1996

[15] [15]

Feature selection, l1 vs

Andrew Y Ng. Feature selection, l1 vs. l2 regularization, and rotational invariance. In Proceedings of the twenty-first international conference on Machine learning, page 78, 2004

work page 2004

[16] [16]

Analytical bounds on the local lipschitz constants of relu networks.IEEE Transactions on Neural Networks and Learning Systems, 2023

Trevor Avant and Kristi A Morgansen. Analytical bounds on the local lipschitz constants of relu networks.IEEE Transactions on Neural Networks and Learning Systems, 2023

work page 2023