How does the optimizer implicitly bias the model merging loss landscape?
Pith reviewed 2026-05-18 09:45 UTC · model grok-4.3
The pith
The effective noise scale unifies how optimizer choices affect success in merging neural network models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors demonstrate that a single quantity, the effective noise scale, unifies the impact of different optimizer components on the model merging loss landscape. Across architectures and datasets, merging success is a non-monotonic function of the effective noise scale, with a distinct optimum. Decomposing this quantity shows that larger learning rates, stronger weight decay, smaller batch sizes, and data augmentation all independently modulate the effective noise scale and exhibit the same qualitative trend. Unlike prior work that links optimizer noise only to flatness or generalization of individual minima, this scale also affects the global loss landscape and thereby predicts when two,
What carries the argument
The effective noise scale, a quantity derived from optimizer hyperparameters that controls the geometry of the global loss landscape and thereby determines how readily independently trained models can be merged.
If this is right
- Larger learning rates, stronger weight decay, smaller batch sizes, and data augmentation each increase the effective noise scale and improve merging performance up to the identified optimum.
- Models trained with different combinations of hyperparameters that produce the same effective noise scale exhibit comparable merging success.
- The noise scale influences relationships between separate minima in the loss landscape, allowing merge outcomes to be predicted from training settings alone.
- Both linear interpolation and task arithmetic merging follow the same non-monotonic dependence on the effective noise scale.
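The second bullet can be made concrete with a rough proxy. This page does not give the paper's formula for the effective noise scale, so the sketch below assumes the learning-rate-to-batch-size ratio familiar from earlier SGD work (for example Jastrzebski et al., "Three Factors Influencing Minima in SGD") as a stand-in. Weight decay and augmentation are deliberately left out. The function name `noise_scale_proxy` and the listed configurations are hypothetical.

```python
from collections import defaultdict

def noise_scale_proxy(lr: float, batch_size: int) -> float:
    """Assumed stand-in: learning rate divided by batch size (not the paper's formula)."""
    if lr <= 0 or batch_size <= 0:
        raise ValueError("lr and batch_size must be positive")
    return lr / batch_size

# Hypothetical training configurations: (learning rate, batch size).
configs = [(0.1, 128), (0.2, 256), (0.4, 512), (0.1, 32), (0.4, 128)]

# Group configurations whose proxy values coincide (rounded to absorb float error).
groups = defaultdict(list)
for lr, bs in configs:
    groups[round(noise_scale_proxy(lr, bs), 12)].append((lr, bs))

for scale, members in sorted(groups.items()):
    print(f"proxy={scale:.6f}: {members}")
```

Under this assumed proxy, the first three configurations fall into one group and the last two into another. If the claim holds, models trained within a group would be expected to merge about equally well.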
Where Pith is reading between the lines
- Training pipelines could be tuned explicitly to reach the optimal noise scale for downstream merging tasks rather than optimizing only for single-model accuracy.
- The same scale may help explain differences in mergeability between models trained with different optimizers or schedules.
- Controlling noise dynamically during training could steer solutions toward regions of the landscape that are easier to merge.
Load-bearing premise
That the effective noise scale computed from standard optimizer hyperparameters is the dominant and generalizable driver of merging success rather than an artifact of the specific architectures, datasets, or merging methods tested.
What would settle it
A new experiment that varies only the effective noise scale while holding architecture, data, and merging method fixed and finds that merging success does not follow the predicted non-monotonic curve with a clear peak.
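One way to operationalize "a non-monotonic curve with a clear peak" is a simple check on a sweep of merging scores ordered by increasing effective noise scale. The sketch below is an assumption about how such a check might look, not a procedure from the paper. The helper `has_interior_peak`, the `min_drop` margin, and the example scores are all hypothetical placeholders rather than measurements.

```python
from typing import Sequence

def has_interior_peak(scores: Sequence[float], min_drop: float = 0.0) -> bool:
    """True if the best score lies strictly inside the sweep and exceeds
    both endpoint scores by more than min_drop."""
    if len(scores) < 3:
        return False
    best = max(range(len(scores)), key=lambda i: scores[i])
    if best == 0 or best == len(scores) - 1:
        return False  # the maximum sits at an end: consistent with a monotonic trend
    peak = scores[best]
    return peak - scores[0] > min_drop and peak - scores[-1] > min_drop

# Placeholder scores, ordered by increasing noise scale (not real results).
peaked = [0.61, 0.70, 0.74, 0.69, 0.55]
monotonic = [0.50, 0.60, 0.70, 0.80]

print(has_interior_peak(peaked, min_drop=0.05))     # True
print(has_interior_peak(monotonic, min_drop=0.05))  # False
```

In practice `min_drop` would be chosen relative to seed-to-seed variation, so that a peak counts only when it exceeds the noise in the measurements.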
Original abstract
Model merging combines independent solutions with different capabilities into a single one while maintaining the same inference cost. Two popular approaches are linear interpolation, which simply averages multiple model weights, and task arithmetic, which combines task vectors obtained by the difference between finetuned and base models. While useful in practice, what properties make merging effective are poorly understood. This paper explores how the optimization dynamics affect the loss landscape geometry and its impact on merging success. We show that a single quantity -- the effective noise scale -- unifies the impact of different optimizer components on model merging. Across architectures and datasets, merging success is a non-monotonic function of the effective noise scale, with a distinct optimum. Decomposing this quantity, we find that larger learning rates, stronger weight decay, smaller batch sizes, and data augmentation all independently modulate the effective noise scale and exhibit the same qualitative trend. Unlike prior work connecting optimizer noise to the flatness or generalization of individual minima, we show that it also affects the global loss landscape, predicting when independently trained solutions can be successfully merged. Our findings broaden the understanding of how optimization shapes the loss landscape geometry and its consequences for model merging, suggesting that training dynamics could be further manipulated to improve model merging.
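The two merging approaches named in the abstract can be sketched in a few lines. This is a minimal illustration on plain Python lists, assuming every model stores the same parameter names with the same shapes. The `Weights` alias, the function names, and the `scale` coefficient on the task vectors are assumptions made for the example; the abstract only says that task vectors are "combined".

```python
from typing import Dict, List, Sequence

Weights = Dict[str, List[float]]  # parameter name -> flattened values

def linear_interpolation(models: Sequence[Weights]) -> Weights:
    """Uniformly average the weights of several models with identical shapes."""
    n = len(models)
    return {
        name: [sum(m[name][i] for m in models) / n for i in range(len(values))]
        for name, values in models[0].items()
    }

def task_arithmetic(base: Weights, finetuned: Sequence[Weights],
                    scale: float = 1.0) -> Weights:
    """Add the scaled sum of task vectors (finetuned minus base) to the base."""
    merged = {}
    for name, base_values in base.items():
        merged[name] = [
            b + scale * sum(ft[name][i] - b for ft in finetuned)
            for i, b in enumerate(base_values)
        ]
    return merged

# Tiny illustrative weights (not from any real model).
base = {"w": [0.0, 1.0]}
model_a = {"w": [1.0, 1.0]}
model_b = {"w": [0.0, 3.0]}

print(linear_interpolation([model_a, model_b]))        # {'w': [0.5, 2.0]}
print(task_arithmetic(base, [model_a, model_b], 0.5))  # {'w': [0.5, 2.0]}
```

The two outputs agree because, with `scale` set to one over the number of models, adding the averaged task vectors back to the base is algebraically the same as averaging the finetuned weights.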
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript claims that a single quantity—the effective noise scale, assembled from standard optimizer terms including learning rate, batch size, weight decay, and data augmentation—unifies the effects of different optimizer components on model merging success. Across architectures and datasets, merging performance via linear interpolation and task arithmetic is reported as a non-monotonic function of this scale, with a distinct optimum. The authors argue that this scale shapes the global loss landscape geometry between independently trained solutions (distinct from its known influence on the flatness of individual minima), thereby predicting when merging will succeed.
Significance. If the central empirical trends and the global-landscape attribution hold after isolating local curvature effects, the result would meaningfully extend the literature on optimizer noise by linking it to inter-model compatibility in merging. The decomposition showing consistent qualitative trends for each component (LR, batch size, etc.) and the cross-architecture/dataset consistency are strengths. However, the current evidence does not yet firmly separate global barrier effects from local flatness, which limits the strength of the unification claim.
major comments (2)
- [Abstract] Abstract: the claim that optimizer noise 'also affects the global loss landscape, predicting when independently trained solutions can be successfully merged' (distinct from prior flatness work) is load-bearing. Because linear merging success is already known to correlate with flatter individual minima and noise modulates sharpness, the non-monotonic merging trend must be shown to arise from changes in inter-minima barrier heights rather than per-minimum curvature (e.g., Hessian trace or sharpness metrics). Explicit measurements of loss along merging paths or barrier heights as a function of effective noise scale, with controls for local flatness, are required to support the global attribution.
- [Results] Empirical sections (results on non-monotonic trends): the reported consistency of the optimum across architectures and datasets is promising, but the manuscript must include quantitative controls demonstrating that the effective noise scale explains variance in merging success beyond what is captured by standard flatness measures of the individual solutions. Without such isolation, the unification interpretation remains at risk of being an artifact of local geometry.
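The barrier measurement requested in the first major comment could look roughly like the sketch below. It assumes one common definition of the barrier: the largest amount by which the loss along the straight path between two solutions exceeds the straight line joining the two endpoint losses. The toy losses `double_well` and `bowl` are hypothetical stand-ins for a real network's loss, and the function names are illustrative.

```python
from typing import Callable, List

def interpolate(theta_a: List[float], theta_b: List[float], alpha: float) -> List[float]:
    """Point on the straight line between two weight vectors."""
    return [(1 - alpha) * a + alpha * b for a, b in zip(theta_a, theta_b)]

def barrier_height(loss: Callable[[List[float]], float],
                   theta_a: List[float], theta_b: List[float],
                   steps: int = 21) -> float:
    """Largest excess of the path loss over the line joining the endpoint losses."""
    loss_a, loss_b = loss(theta_a), loss(theta_b)
    excess = []
    for k in range(steps):
        alpha = k / (steps - 1)
        baseline = (1 - alpha) * loss_a + alpha * loss_b
        excess.append(loss(interpolate(theta_a, theta_b, alpha)) - baseline)
    return max(excess)

# Toy 1-D losses (illustrative only).
def double_well(theta: List[float]) -> float:
    """Two minima at -1 and +1 separated by a bump at 0."""
    return (theta[0] ** 2 - 1) ** 2

def bowl(theta: List[float]) -> float:
    """Single convex basin: no barrier between any two points."""
    return theta[0] ** 2

print(barrier_height(double_well, [-1.0], [1.0]))  # 1.0
print(barrier_height(bowl, [-1.0], [1.0]))         # 0.0
```

Reporting this quantity next to a per-solution sharpness metric for each training configuration is one way to check whether the barrier changes with noise scale even when local curvature does not.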
minor comments (3)
- [Methods] Methods: provide the precise formula and hyperparameter values used to compute the effective noise scale so that the quantity can be reproduced exactly from the listed optimizer settings.
- [Figures] Figures/tables: ensure error bars or statistical tests accompany the merging-success curves to allow assessment of the reliability of the reported optimum.
- [Introduction] Introduction: add a brief comparison to prior work that already links flatness to merging success, to clarify the incremental contribution of the global-landscape argument.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. The points raised about isolating global barrier effects from local flatness are well-taken and will strengthen the manuscript. We address each major comment below and will incorporate the suggested analyses in the revision.
- Referee: [Abstract] Abstract: the claim that optimizer noise 'also affects the global loss landscape, predicting when independently trained solutions can be successfully merged' (distinct from prior flatness work) is load-bearing. Because linear merging success is already known to correlate with flatter individual minima and noise modulates sharpness, the non-monotonic merging trend must be shown to arise from changes in inter-minima barrier heights rather than per-minimum curvature (e.g., Hessian trace or sharpness metrics). Explicit measurements of loss along merging paths or barrier heights as a function of effective noise scale, with controls for local flatness, are required to support the global attribution.
  Authors: We agree that direct evidence separating inter-minima barrier heights from per-minimum curvature is important to substantiate the global-landscape claim. The non-monotonic merging trend we report is difficult to explain solely via local flatness, as prior work typically links higher noise to monotonically flatter minima without the observed optimum in merging performance. Nevertheless, to address the concern rigorously, we will add in the revision: (i) loss curves along linear interpolation and task-arithmetic paths for models trained at different effective noise scales, and (ii) controls that report Hessian-trace / sharpness of the individual solutions alongside merging success. These additions will quantify barrier heights while controlling for local curvature.
  Revision: yes
- Referee: [Results] Empirical sections (results on non-monotonic trends): the reported consistency of the optimum across architectures and datasets is promising, but the manuscript must include quantitative controls demonstrating that the effective noise scale explains variance in merging success beyond what is captured by standard flatness measures of the individual solutions. Without such isolation, the unification interpretation remains at risk of being an artifact of local geometry.
  Authors: We concur that quantitative isolation is needed to rule out local-geometry artifacts. Our current decomposition already shows that each optimizer component (LR, batch size, weight decay, augmentation) produces the same non-monotonic merging pattern despite their differing effects on local sharpness; this consistency across components is hard to attribute only to flatness. To make the argument tighter, we will add in the revision regression or partial-correlation analyses demonstrating that effective noise scale retains significant predictive power for merging success after conditioning on standard flatness metrics (Hessian trace, sharpness). These controls will be reported in the main results section.
  Revision: yes
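The partial-correlation control mentioned in this response can be sketched as follows: regress both variables on the flatness metric, then correlate the residuals. The data below are synthetic and generated only to show the mechanics; they are not results from the paper. The variable names (`flatness`, `noise_scale`, `merge_score`) and the single-covariate linear fit are assumptions made for this illustration.

```python
import math
import random
from typing import List, Sequence

def residuals(y: Sequence[float], z: Sequence[float]) -> List[float]:
    """Residuals of a least-squares fit of y on z with an intercept."""
    n = len(y)
    mean_y, mean_z = sum(y) / n, sum(z) / n
    var_z = sum((zi - mean_z) ** 2 for zi in z)
    slope = sum((zi - mean_z) * (yi - mean_y) for zi, yi in zip(z, y)) / var_z
    return [yi - mean_y - slope * (zi - mean_z) for yi, zi in zip(y, z)]

def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def partial_correlation(x: Sequence[float], y: Sequence[float],
                        z: Sequence[float]) -> float:
    """Correlation between x and y after removing the linear effect of z from both."""
    return pearson(residuals(x, z), residuals(y, z))

# Synthetic scenario: flatness drives both quantities, which share nothing else.
random.seed(0)
flatness = [random.gauss(0, 1) for _ in range(2000)]
noise_scale = [f + random.gauss(0, 1) for f in flatness]
merge_score = [f + random.gauss(0, 1) for f in flatness]

raw = pearson(noise_scale, merge_score)
partial = partial_correlation(noise_scale, merge_score, flatness)
print(f"raw correlation: {raw:.3f}, partial given flatness: {partial:.3f}")
```

In this constructed scenario the raw correlation is sizeable while the partial correlation is close to zero, which is the "artifact of local geometry" outcome the referee warns about. Because the reported trend is non-monotonic, a real analysis would probably need a nonlinear term (for example a quadratic in the log noise scale) rather than a purely linear fit.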
Circularity Check
Effective noise scale assembled from standard hyperparameters; merging trends presented as empirical observations
The paper assembles the effective noise scale directly from conventional optimizer terms (learning rate, batch size, weight decay, augmentation) and reports that merging success follows a non-monotonic empirical trend with this quantity across tested architectures and datasets. No derivation step reduces a claimed prediction or first-principles result back to the same fitted quantity by construction, nor does any load-bearing premise rest on a self-citation chain that itself lacks independent verification. The central unification is therefore an observational pattern rather than a self-referential loop, keeping the analysis self-contained against external benchmarks of optimizer noise.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Effective noise scale is the appropriate single quantity that captures the combined influence of learning rate, weight decay, batch size, and data augmentation on merging.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel
  Tag: unclear. Relation between the paper passage and the cited Recognition theorem.
  Passage: "a single quantity – the effective noise scale – unifies the impact of optimizer and data choices on model merging... merging success is a non-monotonic function of effective noise, with a distinct optimum"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.