pith. machine review for the scientific record.

arxiv: 2510.04686 · v2 · submitted 2025-10-06 · 💻 cs.LG · cs.AI

How does the optimizer implicitly bias the model merging loss landscape?

Pith reviewed 2026-05-18 09:45 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords model merging · effective noise scale · loss landscape · optimizer hyperparameters · task arithmetic · weight averaging · training dynamics · hyperparameter effects

The pith

The effective noise scale unifies how optimizer choices affect success in merging neural network models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that optimizer hyperparameters shape model merging success through one underlying quantity called the effective noise scale. This scale determines the geometry of the loss landscape between independently trained solutions, rather than only the properties of each solution on its own. Merging performance, using either weight averaging or task vectors, rises and then declines as the scale increases, reaching a maximum at an intermediate value. The same pattern appears when varying learning rate, weight decay, batch size, or data augmentation, each of which modulates the scale in a consistent direction. This framing explains why some training setups produce models that combine cleanly while others do not.

Core claim

The authors demonstrate that a single quantity, the effective noise scale, unifies the impact of different optimizer components on the model merging loss landscape. Across architectures and datasets, merging success is a non-monotonic function of the effective noise scale, with a distinct optimum. Decomposing this quantity shows that larger learning rates, stronger weight decay, smaller batch sizes, and data augmentation all independently modulate the effective noise scale and exhibit the same qualitative trend. Unlike prior work that links optimizer noise only to flatness or generalization of individual minima, this scale also affects the global loss landscape and thereby predicts when two,

What carries the argument

The effective noise scale, a quantity derived from optimizer hyperparameters that controls the geometry of the global loss landscape and thereby determines how readily independently trained models can be merged.

If this is right

  • Larger learning rates, stronger weight decay, smaller batch sizes, and data augmentation each increase the effective noise scale and improve merging performance up to the identified optimum.
  • Models trained with different combinations of hyperparameters that produce the same effective noise scale exhibit comparable merging success.
  • The noise scale influences relationships between separate minima in the loss landscape, allowing merge outcomes to be predicted from training settings alone.
  • Both linear interpolation and task arithmetic merging follow the same non-monotonic dependence on the effective noise scale.
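
The second bullet above (different hyperparameter combinations, same noise scale) can be illustrated with a minimal sketch. This page does not give the paper's formula for the effective noise scale, so the proxy used here, learning rate divided by batch size, is an assumption borrowed from the general SGD-noise literature; the function name `noise_scale_proxy` is hypothetical, and weight decay and augmentation are deliberately left out.

```python
# Sketch only: the paper's exact effective-noise-scale formula is not shown on
# this page. The proxy lr / batch_size is an ASSUMPTION, not the authors'
# definition; weight decay and data augmentation are not modelled here.

def noise_scale_proxy(learning_rate: float, batch_size: int) -> float:
    """Hypothetical proxy: grows with learning rate, shrinks with batch size."""
    if learning_rate <= 0 or batch_size <= 0:
        raise ValueError("learning_rate and batch_size must be positive")
    return learning_rate / batch_size


# Two different training configurations that land on the same proxy value.
config_a = {"learning_rate": 0.1, "batch_size": 128}
config_b = {"learning_rate": 0.2, "batch_size": 256}

scale_a = noise_scale_proxy(**config_a)
scale_b = noise_scale_proxy(**config_b)

print(f"config A: {scale_a:.6f}")
print(f"config B: {scale_b:.6f}")
```

Under this assumed proxy, doubling both the learning rate and the batch size leaves the value unchanged, which is the kind of equivalence the bullet describes.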

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Training pipelines could be tuned explicitly to reach the optimal noise scale for downstream merging tasks rather than optimizing only for single-model accuracy.
  • The same scale may help explain differences in mergeability between models trained with different optimizers or schedules.
  • Controlling noise dynamically during training could steer solutions toward regions of the landscape that are easier to merge.

Load-bearing premise

That the effective noise scale computed from standard optimizer hyperparameters is the dominant and generalizable driver of merging success rather than an artifact of the specific architectures, datasets, or merging methods tested.

What would settle it

A new experiment that varies only the effective noise scale while holding architecture, data, and merging method fixed and finds that merging success does not follow the predicted non-monotonic curve with a clear peak.
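
As a rough sketch of how the outcome of such an experiment might be checked, the helper below tests whether the best merging gain sits strictly inside the swept range of noise scales. The function name `has_interior_peak` and the numbers are hypothetical and made up for illustration; a real check would also need to account for seed-to-seed variance.

```python
from typing import List, Tuple


def has_interior_peak(points: List[Tuple[float, float]]) -> bool:
    """Return True if the largest gain is at neither end of the sweep.

    `points` holds (noise_scale, merging_gain) pairs. This only checks where
    the maximum sits; it does not test that the curve is strictly unimodal.
    """
    if len(points) < 3:
        return False  # an interior point needs at least three samples
    ordered = sorted(points)  # sort by noise scale
    gains = [gain for _, gain in ordered]
    best = max(range(len(gains)), key=gains.__getitem__)
    return 0 < best < len(gains) - 1


# Made-up sweep, not data from the paper.
curve = [(0.1, 0.2), (1.0, 0.9), (10.0, 0.3)]
print(has_interior_peak(curve))
```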

Figures

Figures reproduced from arXiv: 2510.04686 by Alexander Theus, Antonio Orvieto, Chenxiang Zhang, Damien Teney, Jun Pang, Sjouke Mauw.

Figure 1: Effective noise scale controls the effectiveness of merging. The y-axis reports the test …
Figure 2: Larger learning rate leads to more effective merging. (top) The test accuracy gain of all the …
Figure 3: Weight decay has a similar effect as the learning rate. For CIFAR100 and TinyImagenet, …
Figure 4: Batch size and data augmentation control the noise during the optimization dynamics.
Figure 5: Larger learning rate and weight decay enable …
Figure 6: Merging effectiveness in transfer learning for ID and OOD data. (top) Accuracy gain linearly correlates with learning rate measured via Pearson correlation r = 0.981. (bottom) However, a larger learning rate leads to a suboptimal merged model, despite having the largest accuracy gain. In the previous sections, we analyzed settings where models are trained from scratch. Now we consider setups involving …
Figure 7: Task arithmetic loss landscape drastically changes depending on the initialization model.
Figure 8: Task arithmetic merging of models trained on two different tasks. (left) The merged models …
Figure 9: Larger learning rate / larger weight decay / smaller batch size all lead to a larger perfor…
Figure 10: Accuracy gain and data augmentation. The merging fails w/o augmentation. However, a …
Figure 11: Larger learning rate enables easier merging under transfer learning for both ID and OOD …
Figure 12: Larger learning rate enables easier merging under transfer learning for both ID and OOD …
Figure 13: Larger learning rate and weight decay enable more effective merging in language mod…
Figure 14: Training loss of decayed models from Section …
Figure 15: Too large learning rate causes instability/failure in merging.
Figure 16: Too large weight decay causes instability/failure in merging.
Figure 17: Task arithmetic interpolation robustness of models w/o Pretrained weight from the Sec…
Figure 18: Task arithmetic interpolation robustness of models w/o Pretrained weight from the Sec…
Figure 19: Task arithmetic loss gain in language modeling for a small GPT on the TinyStories dataset
Figure 20: Task arithmetic robustness and gain for CLIP ViT-B/16 finetuned on FMoW.
Figure 21: Task arithmetic robustness and gain for ViT-S/16 pretrained on IN1k finetuned on FMoW.
Figure 22: Task arithmetic robustness and gain for CLIP ViT-B/16 finetuned on RESISC45.
Figure 23: Task arithmetic merging of two different tasks across …
Figure 24: The loss geometry of the linear interpolation between two endpoints changes from a …
Figure 25: Identifying the transition phase from hill to valley.
Figure 26: The flatness measured using the top-8 eigenvalues of the hessian. The larger learning …
Figure 27: 2D loss slices in the plane spanned by the base model and two fine-tuned models under …

Original abstract

Model merging combines independent solutions with different capabilities into a single one while maintaining the same inference cost. Two popular approaches are linear interpolation, which simply averages multiple model weights, and task arithmetic, which combines task vectors obtained by the difference between finetuned and base models. While useful in practice, what properties make merging effective are poorly understood. This paper explores how the optimization dynamics affect the loss landscape geometry and its impact on merging success. We show that a single quantity -- the effective noise scale -- unifies the impact of different optimizer components on model merging. Across architectures and datasets, merging success is a non-monotonic function of the effective noise scale, with a distinct optimum. Decomposing this quantity, we find that larger learning rates, stronger weight decay, smaller batch sizes, and data augmentation all independently modulate the effective noise scale and exhibit the same qualitative trend. Unlike prior work connecting optimizer noise to the flatness or generalization of individual minima, we show that it also affects the global loss landscape, predicting when independently trained solutions can be successfully merged. Our findings broaden the understanding of how optimization shapes the loss landscape geometry and its consequences for model merging, suggesting that training dynamics could be further manipulated to improve model merging.
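
The two merging approaches named in the abstract can be sketched in a few lines. This is a toy illustration on plain Python lists, not the authors' code; the `Weights` alias, the function names, and the `scale` coefficient are hypothetical choices made for the example.

```python
from typing import Dict, List

Weights = Dict[str, List[float]]  # parameter name -> flat list of values


def average_weights(models: List[Weights]) -> Weights:
    """Linear interpolation with equal coefficients: element-wise mean."""
    n = len(models)
    return {
        name: [sum(vals) / n for vals in zip(*(m[name] for m in models))]
        for name in models[0]
    }


def task_vector(finetuned: Weights, base: Weights) -> Weights:
    """Task vector: finetuned weights minus base weights."""
    return {k: [f - b for f, b in zip(finetuned[k], base[k])] for k in base}


def task_arithmetic(base: Weights, vectors: List[Weights],
                    scale: float = 1.0) -> Weights:
    """Add the scaled sum of task vectors back onto the base weights."""
    merged = {}
    for k in base:
        summed = [sum(vals) for vals in zip(*(v[k] for v in vectors))]
        merged[k] = [b + scale * s for b, s in zip(base[k], summed)]
    return merged


# Toy weights: each finetuned model moves a different coordinate of the base.
base = {"w": [0.0, 1.0]}
model_a = {"w": [1.0, 1.0]}
model_b = {"w": [0.0, 3.0]}

averaged = average_weights([model_a, model_b])
merged = task_arithmetic(
    base, [task_vector(model_a, base), task_vector(model_b, base)]
)
print(averaged, merged)
```

In this toy case averaging halves each model's change, while task arithmetic with `scale=1.0` keeps both changes in full; which behaves better on real networks is the empirical question the paper studies.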

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The manuscript claims that a single quantity—the effective noise scale, assembled from standard optimizer terms including learning rate, batch size, weight decay, and data augmentation—unifies the effects of different optimizer components on model merging success. Across architectures and datasets, merging performance via linear interpolation and task arithmetic is reported as a non-monotonic function of this scale, with a distinct optimum. The authors argue that this scale shapes the global loss landscape geometry between independently trained solutions (distinct from its known influence on the flatness of individual minima), thereby predicting when merging will succeed.

Significance. If the central empirical trends and the global-landscape attribution hold after isolating local curvature effects, the result would meaningfully extend the literature on optimizer noise by linking it to inter-model compatibility in merging. The decomposition showing consistent qualitative trends for each component (LR, batch size, etc.) and the cross-architecture/dataset consistency are strengths. However, the current evidence does not yet firmly separate global barrier effects from local flatness, which limits the strength of the unification claim.

major comments (2)
  1. [Abstract] Abstract: the claim that optimizer noise 'also affects the global loss landscape, predicting when independently trained solutions can be successfully merged' (distinct from prior flatness work) is load-bearing. Because linear merging success is already known to correlate with flatter individual minima and noise modulates sharpness, the non-monotonic merging trend must be shown to arise from changes in inter-minima barrier heights rather than per-minimum curvature (e.g., Hessian trace or sharpness metrics). Explicit measurements of loss along merging paths or barrier heights as a function of effective noise scale, with controls for local flatness, are required to support the global attribution.
  2. [Results] Empirical sections (results on non-monotonic trends): the reported consistency of the optimum across architectures and datasets is promising, but the manuscript must include quantitative controls demonstrating that the effective noise scale explains variance in merging success beyond what is captured by standard flatness measures of the individual solutions. Without such isolation, the unification interpretation remains at risk of being an artifact of local geometry.
minor comments (3)
  1. [Methods] Methods: provide the precise formula and hyperparameter values used to compute the effective noise scale so that the quantity can be reproduced exactly from the listed optimizer settings.
  2. [Figures] Figures/tables: ensure error bars or statistical tests accompany the merging-success curves to allow assessment of the reliability of the reported optimum.
  3. [Introduction] Introduction: add a brief comparison to prior work that already links flatness to merging success, to clarify the incremental contribution of the global-landscape argument.
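
The barrier measurement requested in major comment 1 could look roughly like the sketch below. It assumes one common definition of barrier height (the largest excess of the loss on the straight path over the linearly interpolated endpoint losses) and uses two one-dimensional toy losses in place of a real network; all names are hypothetical.

```python
from typing import Callable, List, Sequence


def interpolate(theta_a: Sequence[float], theta_b: Sequence[float],
                alpha: float) -> List[float]:
    """Point on the straight line between two weight vectors."""
    return [(1 - alpha) * a + alpha * b for a, b in zip(theta_a, theta_b)]


def barrier_height(loss: Callable[[Sequence[float]], float],
                   theta_a: Sequence[float],
                   theta_b: Sequence[float],
                   steps: int = 21) -> float:
    """Largest excess of the path loss over the interpolated endpoint losses.

    ASSUMPTION: this is one common barrier definition, not necessarily the
    one used in the paper. Returns 0.0 when the path never rises above it.
    """
    loss_a, loss_b = loss(theta_a), loss(theta_b)
    worst = 0.0
    for i in range(steps):
        alpha = i / (steps - 1)
        path_loss = loss(interpolate(theta_a, theta_b, alpha))
        baseline = (1 - alpha) * loss_a + alpha * loss_b
        worst = max(worst, path_loss - baseline)
    return worst


def double_well(theta: Sequence[float]) -> float:
    """Toy loss with two minima at x = -1 and x = +1 and a bump between."""
    x = theta[0]
    return (x * x - 1.0) ** 2


def bowl(theta: Sequence[float]) -> float:
    """Toy convex loss with a single minimum at x = 0."""
    x = theta[0]
    return x * x


hill = barrier_height(double_well, [-1.0], [1.0])
valley = barrier_height(bowl, [-1.0], [1.0])
print(hill, valley)
```

Reporting this quantity next to a per-solution sharpness metric for each noise scale is one way to separate the inter-minima effect from local curvature, as the comment asks.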

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. The points raised about isolating global barrier effects from local flatness are well-taken and will strengthen the manuscript. We address each major comment below and will incorporate the suggested analyses in the revision.

Point-by-point responses
  1. Referee: [Abstract] Abstract: the claim that optimizer noise 'also affects the global loss landscape, predicting when independently trained solutions can be successfully merged' (distinct from prior flatness work) is load-bearing. Because linear merging success is already known to correlate with flatter individual minima and noise modulates sharpness, the non-monotonic merging trend must be shown to arise from changes in inter-minima barrier heights rather than per-minimum curvature (e.g., Hessian trace or sharpness metrics). Explicit measurements of loss along merging paths or barrier heights as a function of effective noise scale, with controls for local flatness, are required to support the global attribution.

    Authors: We agree that direct evidence separating inter-minima barrier heights from per-minimum curvature is important to substantiate the global-landscape claim. The non-monotonic merging trend we report is difficult to explain solely via local flatness, as prior work typically links higher noise to monotonically flatter minima without the observed optimum in merging performance. Nevertheless, to address the concern rigorously, we will add in the revision: (i) loss curves along linear interpolation and task-arithmetic paths for models trained at different effective noise scales, and (ii) controls that report Hessian-trace / sharpness of the individual solutions alongside merging success. These additions will quantify barrier heights while holding local curvature fixed.
    revision: yes

  2. Referee: [Results] Empirical sections (results on non-monotonic trends): the reported consistency of the optimum across architectures and datasets is promising, but the manuscript must include quantitative controls demonstrating that the effective noise scale explains variance in merging success beyond what is captured by standard flatness measures of the individual solutions. Without such isolation, the unification interpretation remains at risk of being an artifact of local geometry.

    Authors: We concur that quantitative isolation is needed to rule out local-geometry artifacts. Our current decomposition already shows that each optimizer component (LR, batch size, weight decay, augmentation) produces the same non-monotonic merging pattern despite their differing effects on local sharpness; this consistency across components is hard to attribute only to flatness. To make the argument tighter, we will add in the revision regression or partial-correlation analyses demonstrating that effective noise scale retains significant predictive power for merging success after conditioning on standard flatness metrics (Hessian trace, sharpness). These controls will be reported in the main results section.
    revision: yes
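
The partial-correlation control promised in the second response can be sketched with the standard first-order formula. The numbers below are made up purely to exercise the code and are not results from the paper; the variable names are hypothetical.

```python
import math
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Plain Pearson correlation coefficient."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def partial_corr(x: Sequence[float], y: Sequence[float],
                 z: Sequence[float]) -> float:
    """Correlation of x and y after removing the linear effect of z."""
    r_xy, r_xz, r_yz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


# Made-up values for illustration only (NOT measurements from the paper).
noise_scale = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
sharpness = [9.0, 6.0, 8.0, 4.0, 5.0, 3.0]
merge_gain = [0.5, 1.4, 1.6, 2.9, 3.1, 4.2]

r_partial = partial_corr(noise_scale, merge_gain, sharpness)
print(f"partial correlation given sharpness: {r_partial:.3f}")
```

A value near zero would suggest that sharpness already accounts for the link between noise scale and merging gain; a value far from zero would support the claim that the noise scale carries additional information. A linear partial correlation is a coarse tool for a non-monotonic trend, so in practice it would likely be applied separately on each side of the optimum.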

Circularity Check

0 steps flagged

Effective noise scale assembled from standard hyperparameters; merging trends presented as empirical observations

The paper assembles the effective noise scale directly from conventional optimizer terms (learning rate, batch size, weight decay, augmentation) and reports that merging success follows a non-monotonic empirical trend with this quantity across tested architectures and datasets. No derivation step reduces a claimed prediction or first-principles result back to the same fitted quantity by construction, nor does any load-bearing premise rest on a self-citation chain that itself lacks independent verification. The central unification is therefore an observational pattern rather than a self-referential loop, keeping the analysis self-contained against external benchmarks of optimizer noise.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the empirical observation that effective noise scale unifies optimizer effects; no new free parameters or invented entities are introduced in the abstract.

axioms (1)
  • domain assumption: Effective noise scale is the appropriate single quantity that captures the combined influence of learning rate, weight decay, batch size, and data augmentation on merging.
    Invoked to explain why all four factors produce the same qualitative merging trend.

pith-pipeline@v0.9.0 · 5758 in / 1212 out tokens · 31373 ms · 2026-05-18T09:45:51.120821+00:00 · methodology
