arxiv: 2511.17378 · v2 · submitted 2025-11-21 · 💻 cs.LG

A Unified Stability Analysis of SAM vs SGD: Role of Data Coherence and Emergence of Simplicity Bias

Wei-Kai Chang , Rajiv Khanna This is my paper

Pith reviewed 2026-05-17 20:11 UTC · model grok-4.3

classification 💻 cs.LG

keywords linear stabilitySAMSGDsimplicity biasdata coherenceReLU networksoptimization dynamicsgeneralization

0 comments

The pith

A coherence measure of aligned gradient curvatures across data points explains why SAM and SGD select simpler minima over complex ones in two-layer ReLU networks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper builds a linear stability framework to compare how SGD, random perturbations, and SAM behave during training. It centers on a coherence measure that checks whether the curvature of gradients points in similar directions from one training example to the next. When coherence is high, a minimum resists small changes in the update rule and therefore survives longer in the optimization trajectory. This stability difference accounts for the observed preference for flatter and simpler solutions. The same lens also accounts for why SAM, which explicitly searches for worst-case points inside a small ball, further tilts the selection toward those coherent minima.

Core claim

In two-layer ReLU networks the linear stability of a candidate minimum under SGD or SAM updates is governed by a coherence quantity that measures how consistently the gradient curvature aligns from one data point to another. Minima with higher coherence remain stable against the stochastic or adversarial perturbations introduced by the optimizer and are therefore the ones that training converges to, producing the simplicity bias.

What carries the argument

The coherence measure, which quantifies the alignment of gradient curvature across data points and thereby sets the linear stability threshold for each candidate minimum.

If this is right

Minima whose gradient curvatures point in similar directions across examples resist the noise in SGD steps and therefore persist.
SAM enlarges the set of directions considered at each step and therefore selects even higher-coherence minima than plain SGD.
The simplicity bias observed in practice follows directly because simpler functions tend to produce more coherent curvature patterns on typical data.
Random perturbations without the SAM neighborhood produce intermediate stability behavior between plain SGD and full SAM.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the coherence measure can be computed cheaply during training, it could serve as a diagnostic for when an optimizer is about to favor overly complex solutions.
The same stability calculation might be adapted to other first-order methods such as Adam by replacing the perturbation model with the appropriate noise structure.
Empirical checks on deeper networks could test whether coherence still predicts which basins are reached once the two-layer assumption is relaxed.

Load-bearing premise

The linear stability analysis performed on two-layer ReLU networks is assumed to capture the essential mechanisms that drive generalization and simplicity bias in deeper networks and realistic data distributions.

What would settle it

Training a two-layer ReLU network and finding that a minimum with clearly misaligned gradient curvatures is nevertheless reached more often than a high-coherence minimum under either SGD or SAM would falsify the central stability claim.

Figures

Figures reproduced from arXiv: 2511.17378 by Rajiv Khanna, Wei-Kai Chang.

**Figure 1.** Figure 1: Comparison of optimization dynamics across different methods and configurations. (a) SAM’s dyanamics over different hyper-parameter settings (Red:diverging Blue:converging). (b) Boundary comparison: SGD, random perturbation, SAM. SGD and random perturbation boundaries largely overlap, while SAM diverges in more combination of batch size and σ. (c) SAM boundaries at different ρ α : Higher ρ α further tight… view at source ↗

**Figure 2.** Figure 2: 2-layer ReLU network. SAM imposes strong regularization on the maximum elementwise Hessian eigenvalue, and this also reduces the largest eigenvalue of the coherence matrix, which implies the stability condition is satisfied with smaller σ. Conclusion. Our analysis reveals that the stability properties of optimization algorithms—especially in the presence of data coherence—are central to the emergence of g… view at source ↗

**Figure 3.** Figure 3: 2-layer ReLU network. We found that the SAM method can impose strong regulation on the maximum eigenvalue elementwise, and this also reduce the strengthen of the largest eigenvalue of the coherence matrix. It means that the stability condition can be satisfied with smaller σ. From our experiments, we find that the sharpness of the solution impose strong regulation of the eigenvalue of the coherence matrix … view at source ↗

**Figure 4.** Figure 4: 2-layer ReLU network. (Left) Comparison of SGD and SAM with different ρ. (Middle) We perform the same set of experiment with increased learning rate from 0.1 to 0.3. (Black to Red) (Right) SGD with different contrast loss strengthen (0.0, 0.1, 0.01). Through out the experiments, we find uniform shifting behavior for different algorithm with different strength but the relationship between maxi λmax(Hi) and … view at source ↗

read the original abstract

Understanding the dynamics of optimization in deep learning is increasingly important as models scale. While stochastic gradient descent (SGD) and its variants reliably find solutions that generalize well, the mechanisms driving this generalization remain unclear. Notably, these algorithms often prefer flatter or simpler minima, particularly in overparameterized settings. Prior work has linked flatness to generalization, and methods like Sharpness-Aware Minimization (SAM) explicitly encourage flatness, but a unified theory connecting data structure, optimization dynamics, and the nature of learned solutions is still lacking. In this work, we develop a linear stability framework that analyzes the behavior of SGD, random perturbations, and SAM, particularly in two layer ReLU networks. Central to our analysis is a coherence measure that quantifies how gradient curvature aligns across data points, revealing why certain minima are stable and favored during training.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper introduces a coherence measure for gradient curvature alignment and applies it in a linear stability framework to compare SGD and SAM in two-layer ReLU networks.

read the letter

The main takeaway is that they define a coherence measure tracking how gradient curvatures align across data points and use it inside a linear stability analysis to explain why SGD, random perturbations, and SAM settle on certain minima in two-layer ReLU networks. This gives a direct link from data structure to which solutions are stable and why simplicity bias appears. The scope stays limited to the two-layer case, which keeps the claims grounded and avoids the usual overreach to arbitrary depth. That choice lets them derive concrete stability comparisons without hand-waving. The attempt to unify the three algorithms under one coherence lens is the clearest new piece. It builds on existing flatness ideas but tries to make the role of data explicit rather than treating flatness as an isolated property. If the derivations hold, this could clarify why SAM sometimes outperforms plain SGD on generalization without needing separate explanations for each. The soft spots are straightforward. The abstract supplies no equations, proof outlines, or experimental checks, so it is impossible to confirm whether the stability results follow rigorously or whether coherence reduces to something already known. That leaves soundness hard to judge until the full math is examined. No over-extrapolation to deeper nets is visible, which is good, but readers will still want to see how sensitive the conclusions are to the ReLU and two-layer assumptions. This work is aimed at theorists who study optimization dynamics and generalization through stability arguments. Someone already working on mechanistic accounts of SGD or SAM would get the most out of it. It deserves a serious referee because the question is central and the framing is distinct enough to merit detailed checking of the derivations and any supporting calculations.

Referee Report

0 major / 3 minor

Summary. The manuscript develops a linear stability framework to analyze the dynamics of SGD, random perturbations, and Sharpness-Aware Minimization (SAM) in two-layer ReLU networks. Central to the analysis is a coherence measure that quantifies the alignment of gradient curvatures across data points, which is used to characterize the stability of minima and the emergence of simplicity bias during training.

Significance. If the derivations hold, the work supplies a tractable, data-dependent explanation for why SGD and SAM preferentially converge to stable, simple solutions in overparameterized regimes. By explicitly restricting the setting to two-layer ReLU networks, the authors enable concrete calculations of the coherence quantity that link data structure directly to optimization stability, providing a foundation that could be extended to deeper architectures.

minor comments (3)

The abstract and introduction state that the coherence measure 'reveals why certain minima are stable,' but the precise mapping from the coherence value to the stability threshold (e.g., the critical eigenvalue condition) is only sketched; a short paragraph in §2 or §3 clarifying this mapping would improve readability.
Figure 2 compares stability regions for SGD and SAM but does not report the number of random seeds or the variance across runs; adding error bars or a table of standard deviations would strengthen the empirical support for the theoretical predictions.
Notation for the coherence matrix C in Eq. (7) is introduced without an explicit statement of its symmetry properties; confirming that C is symmetric (or stating the conditions under which it is) would prevent potential confusion in later derivations.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their positive assessment of the manuscript and for recommending minor revision. We are encouraged that the linear stability framework and the role of data coherence in explaining SGD and SAM dynamics in two-layer ReLU networks have been recognized as providing a tractable, data-dependent account of stability and simplicity bias.

read point-by-point responses

Referee: The manuscript develops a linear stability framework to analyze the dynamics of SGD, random perturbations, and Sharpness-Aware Minimization (SAM) in two-layer ReLU networks. Central to the analysis is a coherence measure that quantifies the alignment of gradient curvatures across data points, which is used to characterize the stability of minima and the emergence of simplicity bias during training.

Authors: We appreciate the referee's concise and accurate summary of the central contributions. The coherence measure is defined to capture the alignment of per-sample gradient curvatures, which directly enters the linear stability conditions derived for SGD, random perturbations, and SAM. All derivations are carried out explicitly for two-layer ReLU networks so that the coherence quantity can be computed from the data and the network weights at a candidate minimum. revision: no
Referee: If the derivations hold, the work supplies a tractable, data-dependent explanation for why SGD and SAM preferentially converge to stable, simple solutions in overparameterized regimes. By explicitly restricting the setting to two-layer ReLU networks, the authors enable concrete calculations of the coherence quantity that link data structure directly to optimization stability, providing a foundation that could be extended to deeper architectures.

Authors: We agree that the two-layer restriction is what makes the coherence calculations concrete and verifiable. The derivations rely on the piecewise-linear structure of ReLU activations and the resulting block structure of the Hessian; these steps are fully detailed in Sections 3 and 4. We believe the derivations are correct as presented and do not require modification. revision: no

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper scopes its linear stability framework and coherence measure explicitly to two-layer ReLU networks. The central claim defines and applies the coherence quantity within this tractable setting to compare stability of minima under SGD, perturbations, and SAM. No equations or derivations are shown that reduce a derived quantity to a fitted coherence parameter or to prior self-citations by construction. The analysis remains self-contained against external benchmarks with no load-bearing self-citation chains or self-definitional steps.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract introduces a coherence measure and linear stability framework but provides no explicit free parameters, axioms, or invented entities. All technical definitions and assumptions remain unspecified.

pith-pipeline@v0.9.0 · 5445 in / 1138 out tokens · 37093 ms · 2026-05-17T20:11:01.767256+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Central to our analysis is a coherence measure that quantifies how gradient curvature aligns across data points... σ = λmax(S) / maxi λmax(Hi) with Sij = ||H_i^{1/2} H_j^{1/2}||_F = sqrt(Tr(Hi Hj))
IndisputableMonolith/Foundation/BranchSelection.lean branch_selection echoes

?

echoes
ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

We prove that the emergence of an implicit simplicity bias... highly coherent solutions tend to be flatter... SAM amplifies the simplicity bias

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

15 extracted references · 15 canonical work pages · 2 internal anchors

[1]

URLhttps://arxiv.org/abs/2206.06232. M. Andriushchenko, D. Bahri, H. Mobahi, and N. Flammarion. Sharpness-aware minimization leads to low-rank features. InAdvances in Neural Information Processing Systems (NeurIPS) 36, 2023. D. Arpit, S. Jastrzebski, N. Ballas, D. Krueger, E. Bengio, M. Kanwal, T. Maharaj, A. Fischer, A. Courville, Y. Bengio, et al. A clo...

work page arXiv 2023
[2]

URLhttps://arxiv.org/abs/1810.10118. J. Kwon, S. Yoon, C. Kim, and S. J. Hwang. Asam: Adaptive sharpness-aware minimization for scale-invariant learning of deep neural networks. In38th International Conference on Machine Learning (ICML), pages 5905–5914, 2021. H. Li, Z. Xu, G. Taylor, C. Studer, and T. Goldstein. Visualizing the loss landscape of neural nets,

work page internal anchor Pith review Pith/arXiv arXiv 2021
[3]

URLhttps://arxiv.org/abs/1712.09913. K. Lyu and J. Li. Gradient descent maximizes the margin of homogeneous neural networks. In7th International Conference on Learning Representations (ICLR), 2019. C. Ma and L. Ying. On linear stability of sgd and input-smoothness of neural networks, 2021. URL https://arxiv.org/abs/2105.13462. D. Morwani, J. Batra, P. Jai...

work page internal anchor Pith review Pith/arXiv arXiv 2019
[4]

NeurIPS Paper Checklist

URLhttps://arxiv.org/abs/2305.17490. L. Wu, C. Ma, et al. How sgd selects the global minima in over-parameterized learning: A dynamical stability perspective.Advances in Neural Information Processing Systems, 31, 2018. L. Wu, M. Wang, and W. Su. The alignment property of sgd noise and how it helps select flat minima: A stability analysis, 2022. URLhttps:/...

work page arXiv 2018
[5]

sharpness

pointed out that the sharpness of a minimum is not an invariant property (reparameterizations of the model can change the Hessian spectrum without affecting generalization), cautioning that one must carefully define “sharpness” (e.g., by normalizing for scale or using local subspace measures). Our work incorporates this perspective by focusing on arelativ...

work page 2021
[6]

stability

evaluate a variety of complexity measures (including some Hessian-based) to see which best predict generalization; they found that no single measure works universally, but a combination can. Our introduction of coherence could add a new dimension to such measures, since it incorporates data-dependent interactions. Linear stability.Linear stability has gai...

work page 2018
[7]

This can be seen as a form of simplicity bias (since a max-margin separator in linear space is a simpler decision boundary than a complex wiggle that also separates the data)

show that for linearly separable data and logistic loss, SGD converges to the max-L2-margin classifier. This can be seen as a form of simplicity bias (since a max-margin separator in linear space is a simpler decision boundary than a complex wiggle that also separates the data). In deep networks, Lyu and Li [2019] extended this to deep homogeneous network...

work page 2019
[8]

The condition for divergence is the same as that for SGD [Dexter et al., 2024] as follows: η≥ σ λ1 ( n b −1) − 1 2

work page 2024
[9]

(Comparative Divergence Speed) SupposeTr[J 2k]≤C 0αk for some constantsC 0 and αk, then the divergence rate of the random perturbation method is asymptotically within a constant factor of that of standard SGD: lim k→∞ E[∥wk∥2]Random, lower bound E[∥wk∥2]SGD, lower bound =O(1)

work page
[10]

Suppose the step size satisfies the convergence criterion established in prior stability anal- yses (e.g., Dexter et al. [2024]). Then, under the random perturbation update(3), the expected squared norm of the iterates remains bounded ask→ ∞: lim k→∞ E[w T k wk]upper bound =O(1) Proof.DefineH= 1 n Pn i=1 Hi. Now consider k steps after, we can have express...

work page 2024
[11]

The Trace ofM k is lower bounded throughσ: Tr[Mk]≥η 2k( n B −1) k(1 + ρ α λmin(H)) 2k 1 σ2k 1 nd5 λmax(H) 2k (30)

work page
[12]

The diverging criterion for SAM under linear stability is: λmax(H)≥ σ η ( n B −1) − 1 2 (1 + ρ α λmin(H)) −1 (31) Proof.We prove by induction as follows. Base case: k=1 E[ ˆJ1 T ˆJ1] =E[(I−ηH t(I+ ρ α )H) T (I−ηH t(I+ ρ α )H)] =E[I−2ηH t − ηρ α (HtH+HH t) +η 2H 2 t + η2ρ α (H 2 t H+HH 2 t ) + η2ρ2 α2 HH 2 t H] =I−2ηH−2 ηρ α H 2 +η 2E[H 2 t ] + η2ρ α (E[H ...

work page 2024
[13]

ˆJ T 1 ˆJ1

There existN r such that E[ ˆJ T k ... ˆJ T 1 ˆJ1... ˆJk]⪯ kX r=0 (1−ϵ) 2(k−r) k r Nr (46) and Nk =η 2k( 1 nB − 1 n2 )k nX y1,...yr=1 (I+ ρ α H)H yk ...(I+ ρ α H)H 2 y1 (I+ ρ α H)...H yk (I+ ρ α H) (47)

work page
[14]

TheN r can be upper bounded as following Tr[Nr]≤η 2k( 1 B − 1 n )kd3k+ 1 2 n4k λmax(HSAM)4k σ2k SAM (48)

work page
[15]

ˆJ T 1 ˆJ1

Suppose there existϵ∈(0,1)and we will have converging criterion such that ϵ η ≤λ i + ρ α λ2 i ≤ 2−ϵ η ∀i∈[d]and lim k→∞ 1 ϵk η2k( 1 nB − 1 n2 )k nX y1,y2...yk=1 (I+ ρ α H)H yk ...(I+ ρ α H)H 2 y1 (I+ ρ α H)...H yk (I+ ρ α H) = 0 (49) then we will have thatlim k→∞ E[ ˆJ T k ... ˆJ T 1 ˆJ1... ˆJk] = 0 Proof.We first defineN r as follows: Nk =η 2k( 1 nB − 1 ...

work page 2024