A Unified Stability Analysis of SAM vs SGD: Role of Data Coherence and Emergence of Simplicity Bias
Pith reviewed 2026-05-17 20:11 UTC · model grok-4.3
The pith
A coherence measure of aligned gradient curvatures across data points explains why SAM and SGD select simpler minima over complex ones in two-layer ReLU networks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
In two-layer ReLU networks the linear stability of a candidate minimum under SGD or SAM updates is governed by a coherence quantity that measures how consistently the gradient curvature aligns from one data point to another. Minima with higher coherence remain stable against the stochastic or adversarial perturbations introduced by the optimizer and are therefore the ones that training converges to, producing the simplicity bias.
What carries the argument
The coherence measure, which quantifies the alignment of gradient curvature across data points and thereby sets the linear stability threshold for each candidate minimum.
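As a rough illustration, the coherence quantity can be computed directly from per-sample Hessians. The sketch below assumes the definition quoted further down this page, σ = λmax(S) / max_i λmax(H_i) with S_ij = sqrt(Tr(H_i H_j)); the function name `coherence` and the two toy Hessian families are hypothetical and are not taken from the paper.

```python
import numpy as np

def coherence(hessians):
    """sigma = lambda_max(S) / max_i lambda_max(H_i), with S_ij = sqrt(Tr(H_i H_j)).

    `hessians` has shape (n, d, d); each H_i is assumed symmetric PSD.
    """
    H = np.asarray(hessians, dtype=float)
    # Tr(H_i H_j) = sum_{a,b} H_i[a, b] * H_j[b, a]
    gram = np.einsum("iab,jba->ij", H, H)
    S = np.sqrt(np.clip(gram, 0.0, None))   # clip tiny negative round-off
    S = 0.5 * (S + S.T)                     # enforce exact symmetry for eigvalsh
    lam_S = np.linalg.eigvalsh(S)[-1]       # largest eigenvalue of S
    lam_H = max(np.linalg.eigvalsh(Hi)[-1] for Hi in H)
    return lam_S / lam_H

n, d = 4, 3
# Toy case 1: every sample has the same curvature (fully aligned).
aligned = np.stack([np.eye(d)] * n)
# Toy case 2: rank-one curvatures along mutually orthogonal directions.
orthogonal = np.stack([np.outer(e, e) for e in np.eye(d)])

sigma_aligned = coherence(aligned)        # n * sqrt(d), about 6.93
sigma_orthogonal = coherence(orthogonal)  # 1.0
```

In these toy cases, identical per-sample curvature gives a coherence that grows with the number of samples, while mutually orthogonal curvature gives the smallest value, 1.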
If this is right
- Minima whose gradient curvatures point in similar directions across examples resist the noise in SGD steps and therefore persist.
- SAM enlarges the set of directions considered at each step and therefore selects even higher-coherence minima than plain SGD.
- The simplicity bias observed in practice follows directly because simpler functions tend to produce more coherent curvature patterns on typical data.
- Random perturbations without the SAM neighborhood produce intermediate stability behavior between plain SGD and full SAM.
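The comparison between SGD and SAM in these bullets can be made concrete with the divergence criteria quoted in the reference snippets on this page. The sketch below assumes those criteria as transcribed: SGD is predicted to diverge once λmax(H) ≥ (σ/η)(n/B − 1)^(-1/2), and SAM once λmax(H) ≥ (σ/η)(n/B − 1)^(-1/2)(1 + (ρ/α)λmin(H))^(-1). The function names and numeric values are hypothetical, and α is treated as an unspecified normalization constant because this page does not define it.

```python
def sgd_critical_sharpness(sigma, eta, n, batch):
    """Sharpness at or above which the quoted SGD criterion predicts divergence.

    Requires batch < n so that n / batch - 1 is positive.
    """
    return (sigma / eta) * (n / batch - 1.0) ** -0.5

def sam_critical_sharpness(sigma, eta, n, batch, rho, alpha, lam_min):
    """Same threshold for SAM: the SGD value divided by 1 + (rho / alpha) * lam_min."""
    return sgd_critical_sharpness(sigma, eta, n, batch) / (1.0 + (rho / alpha) * lam_min)

# Illustrative numbers only (not from the paper).
sgd_thr = sgd_critical_sharpness(sigma=2.0, eta=0.1, n=100, batch=20)
sam_thr = sam_critical_sharpness(sigma=2.0, eta=0.1, n=100, batch=20,
                                 rho=0.1, alpha=1.0, lam_min=1.0)
```

Under these assumptions the SAM threshold is smaller than the SGD one by the factor 1 + (ρ/α)λmin(H), so at a fixed step size a minimum needs a larger σ to stay below it. Both criteria are sufficient conditions for divergence, so staying below the threshold does not by itself guarantee stability.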
Where Pith is reading between the lines
- If the coherence measure can be computed cheaply during training, it could serve as a diagnostic for when an optimizer is about to favor overly complex solutions.
- The same stability calculation might be adapted to other first-order methods such as Adam by replacing the perturbation model with the appropriate noise structure.
- Empirical checks on deeper networks could test whether coherence still predicts which basins are reached once the two-layer assumption is relaxed.
Load-bearing premise
The linear stability analysis performed on two-layer ReLU networks is assumed to capture the essential mechanisms that drive generalization and simplicity bias in deeper networks and realistic data distributions.
What would settle it
Training a two-layer ReLU network and finding that a minimum with clearly misaligned gradient curvatures is nevertheless reached more often than a high-coherence minimum under either SGD or SAM would falsify the central stability claim.
Original abstract
Understanding the dynamics of optimization in deep learning is increasingly important as models scale. While stochastic gradient descent (SGD) and its variants reliably find solutions that generalize well, the mechanisms driving this generalization remain unclear. Notably, these algorithms often prefer flatter or simpler minima, particularly in overparameterized settings. Prior work has linked flatness to generalization, and methods like Sharpness-Aware Minimization (SAM) explicitly encourage flatness, but a unified theory connecting data structure, optimization dynamics, and the nature of learned solutions is still lacking. In this work, we develop a linear stability framework that analyzes the behavior of SGD, random perturbations, and SAM, particularly in two-layer ReLU networks. Central to our analysis is a coherence measure that quantifies how gradient curvature aligns across data points, revealing why certain minima are stable and favored during training.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript develops a linear stability framework to analyze the dynamics of SGD, random perturbations, and Sharpness-Aware Minimization (SAM) in two-layer ReLU networks. Central to the analysis is a coherence measure that quantifies the alignment of gradient curvatures across data points, which is used to characterize the stability of minima and the emergence of simplicity bias during training.
Significance. If the derivations hold, the work supplies a tractable, data-dependent explanation for why SGD and SAM preferentially converge to stable, simple solutions in overparameterized regimes. By explicitly restricting the setting to two-layer ReLU networks, the authors enable concrete calculations of the coherence quantity that link data structure directly to optimization stability, providing a foundation that could be extended to deeper architectures.
Minor comments (3)
- The abstract and introduction state that the coherence measure 'reveals why certain minima are stable,' but the precise mapping from the coherence value to the stability threshold (e.g., the critical eigenvalue condition) is only sketched; a short paragraph in §2 or §3 clarifying this mapping would improve readability.
- Figure 2 compares stability regions for SGD and SAM but does not report the number of random seeds or the variance across runs; adding error bars or a table of standard deviations would strengthen the empirical support for the theoretical predictions.
- Notation for the coherence matrix C in Eq. (7) is introduced without an explicit statement of its symmetry properties; confirming that C is symmetric (or stating the conditions under which it is) would prevent potential confusion in later derivations.
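On the symmetry question in the last comment, a short worked step may help. It assumes that the matrix C of Eq. (7) is the matrix S with entries sqrt(Tr(H_i H_j)) quoted elsewhere on this page, and that every per-sample Hessian H_i is symmetric positive semidefinite; neither assumption is confirmed by the text here.

```latex
% Assumption: C = S with S_{ij} = \sqrt{\operatorname{Tr}(H_i H_j)} and H_i \succeq 0.
\begin{align*}
\operatorname{Tr}(H_i H_j) &= \operatorname{Tr}(H_j H_i)
  && \text{(cyclic property of the trace)} \\
\operatorname{Tr}(H_i H_j) &= \bigl\| H_i^{1/2} H_j^{1/2} \bigr\|_F^2 \ge 0
  && \text{(so the square root is real)} \\
\Rightarrow \quad S_{ij} &= \sqrt{\operatorname{Tr}(H_i H_j)} = S_{ji}.
\end{align*}
```

Under those assumptions S is symmetric with nonnegative entries, so λmax(S) is real and well defined.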
Simulated Author's Rebuttal
We thank the referee for their positive assessment of the manuscript and for recommending minor revision. We are encouraged that the linear stability framework and the role of data coherence in explaining SGD and SAM dynamics in two-layer ReLU networks have been recognized as providing a tractable, data-dependent account of stability and simplicity bias.
Point-by-point responses
- Referee: The manuscript develops a linear stability framework to analyze the dynamics of SGD, random perturbations, and Sharpness-Aware Minimization (SAM) in two-layer ReLU networks. Central to the analysis is a coherence measure that quantifies the alignment of gradient curvatures across data points, which is used to characterize the stability of minima and the emergence of simplicity bias during training.
  Authors: We appreciate the referee's concise and accurate summary of the central contributions. The coherence measure is defined to capture the alignment of per-sample gradient curvatures, which directly enters the linear stability conditions derived for SGD, random perturbations, and SAM. All derivations are carried out explicitly for two-layer ReLU networks so that the coherence quantity can be computed from the data and the network weights at a candidate minimum. Revision: no.
- Referee: If the derivations hold, the work supplies a tractable, data-dependent explanation for why SGD and SAM preferentially converge to stable, simple solutions in overparameterized regimes. By explicitly restricting the setting to two-layer ReLU networks, the authors enable concrete calculations of the coherence quantity that link data structure directly to optimization stability, providing a foundation that could be extended to deeper architectures.
  Authors: We agree that the two-layer restriction is what makes the coherence calculations concrete and verifiable. The derivations rely on the piecewise-linear structure of ReLU activations and the resulting block structure of the Hessian; these steps are fully detailed in Sections 3 and 4. We believe the derivations are correct as presented and do not require modification. Revision: no.
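To show how the piecewise-linear structure can make the coherence computable from data and weights, here is a minimal sketch under strong assumptions that this page does not confirm: a bias-free network f(x) = Σ_k a_k relu(w_k · x), a squared loss, and a minimum that fits every sample exactly, so that each per-sample Hessian reduces to the rank-one matrix g_i g_iᵀ with g_i the parameter gradient of f at x_i. All names and sizes below are hypothetical.

```python
import numpy as np

def param_gradient(W, a, x):
    """Gradient of f(x) = sum_k a_k * relu(w_k . x) w.r.t. (W, a), flattened."""
    pre = W @ x                                   # pre-activations, shape (m,)
    active = (pre > 0).astype(float)              # ReLU activation pattern
    grad_W = (a * active)[:, None] * x[None, :]   # d f / d W, shape (m, d)
    grad_a = np.maximum(pre, 0.0)                 # d f / d a, shape (m,)
    return np.concatenate([grad_W.ravel(), grad_a])

rng = np.random.default_rng(0)
m, d, n = 5, 3, 6                                 # hidden units, input dim, samples
W, a = rng.normal(size=(m, d)), rng.normal(size=m)
X = rng.normal(size=(n, d))
G = np.stack([param_gradient(W, a, x) for x in X])  # one gradient per sample

# Assumed rank-one Hessians H_i = g_i g_i^T give
#   Tr(H_i H_j) = (g_i . g_j)^2  and  lambda_max(H_i) = ||g_i||^2.
S = np.abs(G @ G.T)
sigma = np.linalg.eigvalsh(S)[-1] / np.max(np.sum(G ** 2, axis=1))

# Cross-check against the dense definition S_ij = sqrt(Tr(H_i H_j)).
H = np.einsum("ip,iq->ipq", G, G)
S_dense = np.sqrt(np.clip(np.einsum("iab,jba->ij", H, H), 0.0, None))
```

In this rank-one setting the coherence depends only on the Gram matrix of per-sample gradients, so it never requires forming a full Hessian.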
Circularity Check
No significant circularity
Full rationale
The paper scopes its linear stability framework and coherence measure explicitly to two-layer ReLU networks. The central claim defines and applies the coherence quantity within this tractable setting to compare stability of minima under SGD, perturbations, and SAM. No equations or derivations are shown that reduce a derived quantity to a fitted coherence parameter or to prior self-citations by construction. The analysis remains self-contained against external benchmarks with no load-bearing self-citation chains or self-definitional steps.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel (unclear)
  Unclear: relation between the paper passage and the cited Recognition theorem.
  Central to our analysis is a coherence measure that quantifies how gradient curvature aligns across data points... $\sigma = \lambda_{\max}(S) / \max_i \lambda_{\max}(H_i)$ with $S_{ij} = \|H_i^{1/2} H_j^{1/2}\|_F = \sqrt{\mathrm{Tr}(H_i H_j)}$
- IndisputableMonolith/Foundation/BranchSelection.lean: branch_selection (echoes)
  Echoes: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
  We prove that the emergence of an implicit simplicity bias... highly coherent solutions tend to be flatter... SAM amplifies the simplicity bias
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] URL https://arxiv.org/abs/2206.06232. M. Andriushchenko, D. Bahri, H. Mobahi, and N. Flammarion. Sharpness-aware minimization leads to low-rank features. In Advances in Neural Information Processing Systems (NeurIPS) 36, 2023. D. Arpit, S. Jastrzebski, N. Ballas, D. Krueger, E. Bengio, M. Kanwal, T. Maharaj, A. Fischer, A. Courville, Y. Bengio, et al. A clo...
- [2] URL https://arxiv.org/abs/1810.10118. J. Kwon, S. Yoon, C. Kim, and S. J. Hwang. ASAM: Adaptive sharpness-aware minimization for scale-invariant learning of deep neural networks. In 38th International Conference on Machine Learning (ICML), pages 5905–5914, 2021. H. Li, Z. Xu, G. Taylor, C. Studer, and T. Goldstein. Visualizing the loss landscape of neural nets,
- [3] URL https://arxiv.org/abs/1712.09913. K. Lyu and J. Li. Gradient descent maximizes the margin of homogeneous neural networks. In 7th International Conference on Learning Representations (ICLR), 2019. C. Ma and L. Ying. On linear stability of SGD and input-smoothness of neural networks, 2021. URL https://arxiv.org/abs/2105.13462. D. Morwani, J. Batra, P. Jai...
- [4] URL https://arxiv.org/abs/2305.17490. L. Wu, C. Ma, et al. How SGD selects the global minima in over-parameterized learning: A dynamical stability perspective. Advances in Neural Information Processing Systems, 31, 2018. L. Wu, M. Wang, and W. Su. The alignment property of SGD noise and how it helps select flat minima: A stability analysis, 2022. URL https:/...
- [5] pointed out that the sharpness of a minimum is not an invariant property (reparameterizations of the model can change the Hessian spectrum without affecting generalization), cautioning that one must carefully define “sharpness” (e.g., by normalizing for scale or using local subspace measures). Our work incorporates this perspective by focusing on a relativ...
- [6] evaluate a variety of complexity measures (including some Hessian-based) to see which best predict generalization; they found that no single measure works universally, but a combination can. Our introduction of coherence could add a new dimension to such measures, since it incorporates data-dependent interactions. Linear stability. Linear stability has gai...
- [7] show that for linearly separable data and logistic loss, SGD converges to the max-L2-margin classifier. This can be seen as a form of simplicity bias (since a max-margin separator in linear space is a simpler decision boundary than a complex wiggle that also separates the data). In deep networks, Lyu and Li [2019] extended this to deep homogeneous network...
- [8] The condition for divergence is the same as that for SGD [Dexter et al., 2024], as follows: $\eta \ge \frac{\sigma}{\lambda_1}\left(\frac{n}{b}-1\right)^{-1/2}$
- [9] (Comparative Divergence Speed) Suppose $\mathrm{Tr}[J^{2k}] \le C_0 \alpha_k$ for some constants $C_0$ and $\alpha_k$; then the divergence rate of the random perturbation method is asymptotically within a constant factor of that of standard SGD: $\lim_{k\to\infty} \frac{\mathbb{E}[\|w_k\|^2]_{\text{Random, lower bound}}}{\mathbb{E}[\|w_k\|^2]_{\text{SGD, lower bound}}} = O(1)$
- [10] Suppose the step size satisfies the convergence criterion established in prior stability analyses (e.g., Dexter et al. [2024]). Then, under the random perturbation update (3), the expected squared norm of the iterates remains bounded as $k \to \infty$: $\lim_{k\to\infty} \mathbb{E}[w_k^\top w_k]_{\text{upper bound}} = O(1)$. Proof. Define $H = \frac{1}{n}\sum_{i=1}^{n} H_i$. Now consider k steps after, we can have express...
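- [11] The trace of $M_k$ is lower bounded through $\sigma$: $\mathrm{Tr}[M_k] \ge \eta^{2k}\left(\frac{n}{B}-1\right)^{k}\left(1+\frac{\rho}{\alpha}\lambda_{\min}(H)\right)^{2k}\frac{1}{\sigma^{2k}}\,\frac{1}{n d^{5}}\,\lambda_{\max}(H)^{2k}$ (30)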
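- [12] The diverging criterion for SAM under linear stability is: $\lambda_{\max}(H) \ge \frac{\sigma}{\eta}\left(\frac{n}{B}-1\right)^{-1/2}\left(1+\frac{\rho}{\alpha}\lambda_{\min}(H)\right)^{-1}$ (31). Proof. We prove by induction as follows. Base case: $k=1$. $\mathbb{E}[\hat{J}_1^\top \hat{J}_1] = \mathbb{E}\big[(I - \eta H_t(I + \frac{\rho}{\alpha}H))^\top (I - \eta H_t(I + \frac{\rho}{\alpha}H))\big] = \mathbb{E}\big[I - 2\eta H_t - \frac{\eta\rho}{\alpha}(H_t H + H H_t) + \eta^2 H_t^2 + \frac{\eta^2\rho}{\alpha}(H_t^2 H + H H_t^2) + \frac{\eta^2\rho^2}{\alpha^2} H H_t^2 H\big] = I - 2\eta H - 2\frac{\eta\rho}{\alpha}H^2 + \eta^2\mathbb{E}[H_t^2] + \frac{\eta^2\rho}{\alpha}(\mathbb{E}[H \ldots$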
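- [13] There exist $N_r$ such that $\mathbb{E}[\hat{J}_k^\top \cdots \hat{J}_1^\top \hat{J}_1 \cdots \hat{J}_k] \preceq \sum_{r=0}^{k} (1-\epsilon)^{2(k-r)} \binom{k}{r} N_r$ (46) and $N_k = \eta^{2k}\left(\frac{1}{nB} - \frac{1}{n^2}\right)^{k} \sum_{y_1,\ldots,y_k=1}^{n} (I + \frac{\rho}{\alpha}H) H_{y_k} \cdots (I + \frac{\rho}{\alpha}H) H_{y_1}^{2} (I + \frac{\rho}{\alpha}H) \cdots H_{y_k} (I + \frac{\rho}{\alpha}H)$ (47)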
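- [14] The $N_r$ can be upper bounded as follows: $\mathrm{Tr}[N_r] \le \eta^{2k}\left(\frac{1}{B} - \frac{1}{n}\right)^{k} d^{3k+\frac{1}{2}}\, n^{4k}\, \frac{\lambda_{\max}(H_{\mathrm{SAM}})^{4k}}{\sigma_{\mathrm{SAM}}^{2k}}$ (48)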
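- [15] Suppose there exists $\epsilon \in (0,1)$ such that the convergence criterion $\frac{\epsilon}{\eta} \le \lambda_i + \frac{\rho}{\alpha}\lambda_i^2 \le \frac{2-\epsilon}{\eta}$ holds for all $i \in [d]$, and $\lim_{k\to\infty} \frac{1}{\epsilon^{k}} \eta^{2k}\left(\frac{1}{nB} - \frac{1}{n^2}\right)^{k} \sum_{y_1,y_2,\ldots,y_k=1}^{n} (I + \frac{\rho}{\alpha}H) H_{y_k} \cdots (I + \frac{\rho}{\alpha}H) H_{y_1}^{2} (I + \frac{\rho}{\alpha}H) \cdots H_{y_k} (I + \frac{\rho}{\alpha}H) = 0$ (49). Then $\lim_{k\to\infty} \mathbb{E}[\hat{J}_k^\top \cdots \hat{J}_1^\top \hat{J}_1 \cdots \hat{J}_k] = 0$. Proof. We first define $N_r$ as follows: $N_k = \eta^{2k}\big(\frac{1}{nB} - \frac{1}{\ldots}$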
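The base-case expansion quoted in entry [12] can be checked numerically. The sketch below assumes the one-step linearized SAM map reads Ĵ = I − η H_t (I + (ρ/α) H), with H_t a mini-batch Hessian and H the full-batch Hessian, as that snippet suggests; the random matrices and the constant `c` (standing in for ρ/α) are hypothetical test inputs.

```python
import numpy as np

rng = np.random.default_rng(0)
d, eta, c = 4, 0.05, 0.3          # c stands in for rho / alpha (illustrative value)

def random_psd(rng, d):
    """Random symmetric positive semidefinite matrix, used only as test input."""
    A = rng.normal(size=(d, d))
    return A @ A.T / d

H = random_psd(rng, d)            # plays the role of the full-batch Hessian
Ht = random_psd(rng, d)           # plays the role of a mini-batch Hessian
I = np.eye(d)

# Assumed one-step linearized SAM map.
J = I - eta * Ht @ (I + c * H)
lhs = J.T @ J

# Term-by-term expansion as it appears before taking expectations.
rhs = (I
       - 2 * eta * Ht
       - eta * c * (Ht @ H + H @ Ht)
       + eta ** 2 * Ht @ Ht
       + eta ** 2 * c * (Ht @ Ht @ H + H @ Ht @ Ht)
       + eta ** 2 * c ** 2 * H @ Ht @ Ht @ H)

max_abs_diff = np.max(np.abs(lhs - rhs))
```

The two sides agree up to floating-point round-off, which only confirms the algebra of the expansion; it says nothing about the expectation step or the induction that follows.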