Minor First, Major Last: A Depth-Induced Implicit Bias of Sharpness-Aware Minimization

Chaewon Moon; Chulhee Yun; Dongkuk Si

arxiv: 2603.08290 · v2 · pith:HDDDDXTOnew · submitted 2026-03-09 · 💻 cs.LG · cs.AI

Pith reviewed 2026-05-21 11:49 UTC · model grok-4.3

classification 💻 cs.LG cs.AI

keywords sharpness-aware minimization · implicit bias · depth · linear networks · sequential feature amplification · max-margin · gradient descent · finite-time dynamics

The pith

For depth-two linear diagonal networks, ℓ∞-SAM produces initialization-dependent limit directions and ℓ2-SAM amplifies minor features before major ones, unlike gradient descent.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper analyzes the implicit bias of Sharpness-Aware Minimization on L-layer linear diagonal networks for linearly separable binary classification. At depth one both ℓ∞-SAM and ℓ2-SAM recover the same ℓ2 max-margin direction as gradient descent. At depth two the picture changes: ℓ∞-SAM’s convergence direction can depend on initialization and collapse to zero or any coordinate axis, while ℓ2-SAM’s infinite-time limit still matches the ℓ1 max-margin solution yet its finite-time path first amplifies the smallest data coordinates before shifting to the largest ones. The authors trace the early amplification to the gradient-normalization term inside the ℓ2-SAM perturbation step and conclude that limit-only analyses miss essential dynamics.
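The two max-margin directions named in this summary can be written out explicitly. The following is a sketch in notation chosen here for illustration ($\beta$ for the linear predictor, $(x_i, y_i)$ for the data, $j^\star$ for the dominant coordinate); the paper's own symbols may differ.

```latex
% l_p max-margin predictor on separable data, for p = 1 or p = 2
\[
\beta^{\star}_{p} \in \arg\min_{\beta \in \mathbb{R}^{d}} \|\beta\|_{p}
\quad \text{s.t.} \quad y_i \langle \beta, x_i \rangle \ge 1 \ \ \forall i,
\qquad p \in \{1, 2\}.
\]

% Single example (x, y) with y = +1 and a unique largest coordinate |x_{j^\star}|
\[
\frac{\beta^{\star}_{2}}{\|\beta^{\star}_{2}\|_{2}} = \frac{x}{\|x\|_{2}},
\qquad
\frac{\beta^{\star}_{1}}{\|\beta^{\star}_{1}\|_{1}} = \operatorname{sign}(x_{j^\star})\, e_{j^\star},
\qquad
j^\star = \arg\max_{j} |x_j|.
\]

% Why the l_1 solution is a basis vector: Hoelder's inequality gives
\[
1 \le \langle \beta, x \rangle \le \|\beta\|_{1} \|x\|_{\infty}
\;\Longrightarrow\;
\|\beta\|_{1} \ge \frac{1}{\|x\|_{\infty}},
\]
% with equality when all mass sits on coordinate j^\star.
```

On a single example, the ℓ2 max-margin direction therefore spreads weight across all coordinates in proportion to the data, while the ℓ1 max-margin direction concentrates on the dominant coordinate. This is the sense in which gradient descent at depth two "aligns with the dominant coordinate".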

Core claim

In two-layer linear diagonal networks trained on linearly separable data, ℓ∞-SAM converges to a direction that depends critically on initialization and may reach the zero vector or any standard basis vector, whereas gradient descent always aligns with the dominant coordinate; ℓ2-SAM reaches the same ℓ1 max-margin limit as gradient descent but exhibits sequential feature amplification in which minor coordinates are boosted early because the perturbation normalizes by the gradient norm, allowing major coordinates to dominate later.

What carries the argument

The gradient normalization factor inside the ℓ2-SAM perturbation, which scales the update to amplify smaller coordinates at early training stages.
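For reference, the standard first-order SAM update and its two perturbation choices can be sketched as below. The symbols $\theta_t$, $\eta$, $\mathcal{L}$ and the discrete-time form are assumptions made for illustration; the paper may analyze a different parametrization or a continuous-time version.

```latex
% SAM step: evaluate the gradient at a perturbed point
\[
\theta_{t+1} = \theta_t - \eta \, \nabla \mathcal{L}(\theta_t + \varepsilon_t)
\]

% Perturbation maximizing the linearized loss over a norm ball of radius rho
\[
\varepsilon_t^{\ell_2} = \rho \, \frac{\nabla \mathcal{L}(\theta_t)}{\|\nabla \mathcal{L}(\theta_t)\|_{2}},
\qquad
\varepsilon_t^{\ell_\infty} = \rho \, \operatorname{sign}\!\big(\nabla \mathcal{L}(\theta_t)\big).
\]
```

The division by $\|\nabla \mathcal{L}(\theta_t)\|_{2}$ in the ℓ2 case is the normalization factor referred to above: the perturbation has fixed length $\rho$ regardless of how small the gradient is, so it does not shrink as the loss decreases.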

If this is right

Infinite-time implicit-bias results are insufficient to describe SAM behavior once depth exceeds one.
ℓ2-SAM can produce an ordering of feature reliance that begins with the weakest coordinates and only later favors the strongest ones.
ℓ∞-SAM’s convergence direction on depth-two models is sensitive to initialization scale in ways gradient descent is not.
The normalization inside the SAM perturbation directly controls the timing of coordinate amplification.
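The finite-time claims in this list can be probed numerically. The sketch below is a minimal illustration, not the authors' code: it assumes the parametrization β = u ⊙ v for the depth-two diagonal network, logistic loss, a single example with label +1, and discrete-time ℓ2-SAM. The function names, step size, and recording schedule are all hypothetical choices.

```python
import numpy as np


def loss_grad(u, v, x):
    """Gradient of log(1 + exp(-<u*v, x>)) with respect to u and v (label +1)."""
    margin = np.dot(u * v, x)
    # d/dz log(1 + exp(-z)) = -1 / (1 + exp(z)), computed stably
    dz = -np.exp(-np.logaddexp(0.0, margin))
    return dz * v * x, dz * u * x


def l2_sam_step(u, v, x, lr, rho):
    """One l2-SAM step: perturb along the normalized gradient, then descend."""
    gu, gv = loss_grad(u, v, x)
    norm = np.sqrt(np.sum(gu ** 2) + np.sum(gv ** 2)) + 1e-12
    gu_p, gv_p = loss_grad(u + rho * gu / norm, v + rho * gv / norm, x)
    return u - lr * gu_p, v - lr * gv_p


def l2_sam_shares(x, alpha, rho, lr=0.1, steps=2000, record_every=100):
    """Record each coordinate's share |beta_j| / ||beta||_1 along training."""
    u = np.full_like(x, alpha, dtype=float)  # assumed uniform initialization
    v = np.full_like(x, alpha, dtype=float)
    history = []
    for t in range(steps + 1):
        if t % record_every == 0:
            beta = np.abs(u * v)
            history.append(beta / (beta.sum() + 1e-12))
        u, v = l2_sam_step(u, v, x, lr, rho)
    return np.array(history)


if __name__ == "__main__":
    x = np.array([1.0, 2.0, 3.0])  # coordinate 0 is "minor", coordinate 2 is "major"
    shares = l2_sam_shares(x, alpha=0.5, rho=0.1)
    for row in shares[::5]:
        print(np.round(row, 3))
```

Each printed row shows how the predictor's weight is distributed over coordinates at one checkpoint. Whether a minor-first ordering is visible in this sketch depends on the initialization scale, ρ, and the step size, so those should be swept rather than read off a single run.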

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The observed sequential amplification may appear in non-linear networks and could affect which features are learned first in practice.
Initialization strategies for ℓ∞-SAM may need to be chosen more carefully in deeper models to avoid collapse to zero or uninformative directions.
Similar depth-dependent shifts might occur in other sharpness-aware or normalized optimizers, suggesting a broader class of finite-time biases.

Load-bearing premise

The networks are linear and diagonal and the data is linearly separable, permitting exact closed-form tracking of both limits and finite-time trajectories.

What would settle it

Train a two-layer linear diagonal network on a single-example separable dataset with ℓ∞-SAM from several different initializations and check whether the converged direction is always zero or a basis vector determined by the starting scale rather than the data coordinate with largest magnitude.
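The experiment described above could be set up roughly as follows. This is a hedged sketch rather than the paper's protocol: it assumes β = u ⊙ v, logistic loss, one example with label +1, uniform initialization at scale `alpha`, and the sign-based ℓ∞ perturbation. All names and hyperparameters are hypothetical.

```python
import numpy as np


def loss_grad(u, v, x):
    """Gradient of log(1 + exp(-<u*v, x>)) with respect to u and v (label +1)."""
    margin = np.dot(u * v, x)
    dz = -np.exp(-np.logaddexp(0.0, margin))  # stable -1 / (1 + exp(margin))
    return dz * v * x, dz * u * x


def linf_sam_run(x, alpha, rho, lr=0.05, steps=5000):
    """Train with l-infinity SAM and return the final predictor beta = u * v."""
    u = np.full_like(x, alpha, dtype=float)
    v = np.full_like(x, alpha, dtype=float)
    for _ in range(steps):
        gu, gv = loss_grad(u, v, x)
        # l-infinity perturbation: move every parameter by rho in the gradient's sign
        gu_p, gv_p = loss_grad(u + rho * np.sign(gu), v + rho * np.sign(gv), x)
        u, v = u - lr * gu_p, v - lr * gv_p
    return u * v


def direction(beta, tol=1e-8):
    """Unit-norm direction of beta, or the zero vector if beta has collapsed."""
    n = np.linalg.norm(beta)
    return beta / n if n > tol else np.zeros_like(beta)


if __name__ == "__main__":
    x = np.array([1.0, 2.0, 3.0])  # largest-magnitude coordinate is index 2
    for alpha in (0.05, 0.5, 2.0):  # initialization scales below and above rho
        beta = linf_sam_run(x, alpha=alpha, rho=0.1)
        print(alpha, np.linalg.norm(beta), np.round(direction(beta), 3))
```

Comparing the printed directions across `alpha` against the basis vector of the largest data coordinate is the check the paragraph proposes. A finite run only approximates the limit direction, so longer horizons and non-uniform initializations would be needed before drawing conclusions.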

Original abstract

We study the implicit bias of Sharpness-Aware Minimization (SAM) when training $L$-layer linear diagonal networks on linearly separable binary classification. For linear models ($L=1$), both $\ell_\infty$- and $\ell_2$-SAM recover the $\ell_2$ max-margin classifier, matching gradient descent (GD). However, for depth $L = 2$, the behavior changes drastically -- even on a single-example dataset. For $\ell_\infty$-SAM, the limit direction depends critically on initialization and can converge to $\mathbf{0}$ or to any standard basis vector, in stark contrast to GD, whose limit aligns with the basis vector of the dominant data coordinate. For $\ell_2$-SAM, we show that although its limit direction matches the $\ell_1$ max-margin solution as in the case of GD, its finite-time dynamics exhibit a phenomenon we call "sequential feature amplification", in which the predictor initially relies on minor coordinates and gradually shifts to larger ones as training proceeds or initialization increases. Our theoretical analysis attributes this phenomenon to $\ell_2$-SAM's gradient normalization factor applied in its perturbation, which amplifies minor coordinates early and allows major ones to dominate later, giving a concrete example where infinite-time implicit-bias analyses are insufficient. Synthetic and real-data experiments corroborate our findings.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

SAM changes its implicit bias with depth even in this simple linear diagonal case, with ℓ∞-SAM turning initialization-dependent and ℓ2-SAM showing a finite-time shift from minor to major coordinates.

The core observation is that depth L=2 flips the picture for both ℓ∞-SAM and ℓ2-SAM relative to gradient descent on these separable linear-diagonal problems. ℓ∞-SAM can head to the zero vector or any coordinate axis depending on the starting point, while ℓ2-SAM reaches the same ℓ1 max-margin limit as GD but gets there by first amplifying the smaller coordinates before the larger ones take over. That finite-time path is the main new piece; prior SAM bias results focused on the endpoint, so the sequential amplification is a genuine addition inside the model class they chose. The derivation ties the early boost directly to the normalization factor inside the SAM perturbation, which is a clean and traceable step given the per-coordinate decoupling. Synthetic runs and a couple of real-data checks line up with the predicted ordering, at least qualitatively. The modeling choice of linear diagonal networks on single or few examples keeps everything closed-form, which is why they can track the exact limits and the time-dependent shift without extra assumptions. That same choice is also the main limitation: once coordinates interact or the network becomes nonlinear, the per-feature tracking breaks and it is not obvious whether the same ordering or initialization sensitivity survives. The paper is careful to stay inside its stated scope, so the internal consistency holds, but the practical reach is narrower than the title might suggest. Readers working on optimizer bias for stylized linear models will find the explicit dynamics useful for thinking about when infinite-time analyses miss the story. It is solid enough on its own terms to warrant referee time rather than a desk reject; the finite-time versus endpoint distinction is worth checking in the full proofs and seeing how far the experiments stretch the claim.

Referee Report

1 major / 3 minor

Summary. The paper analyzes the implicit bias of Sharpness-Aware Minimization (SAM) when training L-layer linear diagonal networks on linearly separable binary classification data. For L=1, both ℓ∞-SAM and ℓ2-SAM recover the ℓ2 max-margin classifier, matching gradient descent (GD). For L=2, even on a single-example dataset, ℓ∞-SAM exhibits initialization-dependent limits that can converge to the zero vector or any standard basis vector, unlike GD which aligns with the dominant coordinate's basis vector. For ℓ2-SAM, the infinite-time limit matches the ℓ1 max-margin solution, but finite-time dynamics display sequential feature amplification (initial reliance on minor coordinates shifting to major ones), attributed to the gradient normalization factor in the perturbation step. Synthetic and real-data experiments support the findings.

Significance. If the derivations hold, the work provides a concrete example of depth-induced differences in SAM's implicit bias relative to GD and demonstrates that infinite-time analyses can miss key finite-time phenomena driven by the perturbation mechanism. The closed-form tracking enabled by the linear diagonal model on separable data strengthens the internal consistency of the claims within the scoped setting and offers a falsifiable prediction about coordinate amplification order.

major comments (1)

[§4.2] The finite-time analysis of ℓ2-SAM attributes sequential feature amplification to the normalization factor in the perturbation, but the derivation appears to require a strict ordering of coordinate magnitudes at initialization. It is unclear whether the phenomenon persists or reverses when coordinates have comparable magnitudes, which would affect the central claim that minor coordinates are amplified first.

minor comments (3)

[§3.1] The single-example dataset is used to illustrate the drastic change for L=2, but the multi-example extension in the general setup could be stated more explicitly to clarify which results carry over directly.
[§2] Notation for the perturbation radius ρ and the choice of ℓ∞ vs ℓ2 norm in the SAM update could be introduced earlier in the preliminaries to improve readability for readers unfamiliar with SAM variants.
[§5] Figure captions for the synthetic experiments should explicitly label the initialization scales used to demonstrate the sequential amplification.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their careful reading of the manuscript and for highlighting this important point regarding the assumptions in our finite-time analysis. We address the comment below.

Referee: [§4.2] The finite-time analysis of ℓ2-SAM attributes sequential feature amplification to the normalization factor in the perturbation, but the derivation appears to require a strict ordering of coordinate magnitudes at initialization. It is unclear whether the phenomenon persists or reverses when coordinates have comparable magnitudes, which would affect the central claim that minor coordinates are amplified first.

Authors: We agree that the closed-form derivation in §4.2 relies on a strict ordering of initial coordinate magnitudes to obtain explicit per-coordinate dynamics. This ordering simplifies the tracking of the amplification process but is not essential to the underlying mechanism. The ℓ2 normalization in the perturbation step scales each coordinate's update by the inverse of the overall gradient norm, which inherently gives a relative boost to coordinates with smaller magnitudes early in training. When initial magnitudes are comparable, the strict sequential ordering is indeed less pronounced and the transition phase is shorter; however, the initial bias toward minor coordinates remains due to the same normalization effect, and the dynamics still shift toward major coordinates as training proceeds. We have added a clarifying remark in the revised §4.2 explicitly stating the assumption and included additional numerical experiments for the comparable-magnitude regime. These experiments confirm that minor-first amplification persists (without reversal), supporting the central claim while qualifying its quantitative strength under relaxed assumptions.

revision: partial

Circularity Check

0 steps flagged

No significant circularity; derivation self-contained

The paper derives the limit directions for ℓ∞-SAM and the sequential amplification for ℓ2-SAM directly from the explicit SAM perturbation and update rules applied to the L-layer linear diagonal network on separable data. Closed-form per-coordinate tracking follows from the diagonal structure and linear separability without any fitted parameters renamed as predictions, self-definitional steps, or load-bearing self-citations. The contrast to GD is obtained by solving the same dynamics under the identical model, keeping the central claims independent of external unverified results.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claims rest on the mathematical tractability of linear diagonal networks and the assumption of linear separability, which allow explicit limit and dynamics calculations.

axioms (1)

domain assumption: Training is performed on L-layer linear diagonal networks with linearly separable binary classification data.
This modeling choice enables closed-form analysis of limit directions and finite-time trajectories.

pith-pipeline@v0.9.0 · 5779 in / 1353 out tokens · 61348 ms · 2026-05-21T11:49:48.287120+00:00 · methodology
