Suspicious Alignment of SGD: A Fine-Grained Step Size Condition Analysis
Pith reviewed 2026-05-16 13:07 UTC · model grok-4.3
The pith
A critical adaptive step size separates SGD alignment-decreasing from alignment-increasing regimes in low-alignment phases under ill-conditioned Hessians.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
In a high-dimensional quadratic setup with a sufficiently ill-conditioned Hessian that separates into dominant and bulk subspaces, an adaptive critical step size η_t^* exists such that, for low alignment, steps smaller than η_t^* decrease alignment while larger steps increase it; for high alignment the alignment always decreases regardless of step size. Under the same ill-conditioning there is an interval of step sizes for which bulk-space projections reduce loss while dominant-space projections increase loss. For any fixed step size and large random initialization these rules imply an initial alignment-decreasing phase followed by eventual stabilization at high alignment.
What carries the argument
The adaptive critical step size η_t^* that marks the transition between alignment-decreasing and alignment-increasing behavior in low-alignment regimes.
If this is right
- Below the critical step size η_t^* alignment decreases in low-alignment regimes.
- Above the critical step size η_t^* alignment increases in low-alignment regimes.
- High-alignment regimes exhibit decreasing alignment for every step size.
- There exists an interval of step sizes in which dominant-subspace projections raise the loss while bulk-subspace projections lower it.
- Constant-step-size SGD with large initialization exhibits an initial alignment-decreasing phase before stabilizing at high alignment.
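The step-size interval in the fourth item can be made concrete in a simplified setting. The sketch below assumes a noiseless full-gradient step on a diagonal quadratic L(θ) = ½ Σ λ_i θ_i² with two eigenvalue levels. Under that assumption, a step projected onto a coordinate with eigenvalue λ changes the loss by -η λ² θ² (1 - ηλ/2), so the dominant projection raises the loss once η > 2/λ_dom, while the bulk projection lowers it as long as η < 2/λ_bulk. The function name, eigenvalues, and starting point are illustrative and not taken from the paper, whose interval for stochastic gradients may differ.

```python
import numpy as np

def projected_loss_changes(eigs, k, theta, eta):
    """Loss change after one projected gradient step on L = 0.5 * sum(eigs * theta**2).

    Assumption (illustrative): the Hessian is diagonal and its first k
    eigenvalues are the dominant ones. Returns (delta_dominant, delta_bulk).
    """
    def loss(th):
        return 0.5 * np.sum(eigs * th ** 2)

    grad = eigs * theta                      # exact (noiseless) gradient
    is_dom = np.arange(len(eigs)) < k
    g_dom = np.where(is_dom, grad, 0.0)      # projection onto dominant coordinates
    g_bulk = np.where(is_dom, 0.0, grad)     # projection onto bulk coordinates
    base = loss(theta)
    return loss(theta - eta * g_dom) - base, loss(theta - eta * g_bulk) - base

# Hypothetical spectrum: two dominant eigenvalues (100) and four bulk eigenvalues (1).
eigs = np.array([100.0, 100.0, 1.0, 1.0, 1.0, 1.0])
theta0 = np.ones(6)

eta = 0.05                                   # lies inside (2/100, 2/1)
d_dom, d_bulk = projected_loss_changes(eigs, 2, theta0, eta)
print(f"dominant projection: {d_dom:+.3f}, bulk projection: {d_bulk:+.3f}")
```

With these numbers the dominant-projected step overshoots and raises the loss, while the bulk-projected step lowers it. Choosing η below 2/100 makes both projections reduce the loss.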
Where Pith is reading between the lines
- Step-size schedules could be designed to keep the optimizer in the low-alignment regime if that proves advantageous for exploration.
- If real loss surfaces retain a split Hessian spectrum during training, the same critical-step-size logic may govern alignment dynamics beyond quadratics.
- An online estimate of the dominant-to-bulk gradient ratio could be used to approximate η_t^* on the fly for practical step-size adaptation.
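The last suggestion can be sketched as follows, under stated assumptions. The code estimates a basis of the dominant subspace by orthogonal (subspace) iteration that uses only Hessian-vector products, then measures the dominant-to-bulk gradient ratio. The names `estimate_dominant_basis`, `dominant_to_bulk_ratio`, and `hvp` are hypothetical, and the mapping from this ratio to η_t^* is not given in the text above, so it is deliberately left out.

```python
import numpy as np

def estimate_dominant_basis(hvp, dim, k, iters=20, seed=0):
    """Orthogonal iteration: returns a (dim, k) orthonormal basis estimate.

    hvp(v) must return the Hessian-vector product H @ v.
    """
    rng = np.random.default_rng(seed)
    Q, _ = np.linalg.qr(rng.standard_normal((dim, k)))
    for _ in range(iters):
        Z = np.column_stack([hvp(Q[:, j]) for j in range(k)])
        Q, _ = np.linalg.qr(Z)               # re-orthonormalize after each multiply
    return Q

def dominant_to_bulk_ratio(grad, Q):
    """||P_dom grad|| / ||P_bulk grad|| for the estimated basis Q."""
    g_dom = Q @ (Q.T @ grad)
    return np.linalg.norm(g_dom) / np.linalg.norm(grad - g_dom)

# Hypothetical diagonal Hessian with a clear gap after the second eigenvalue.
eigs = np.array([100.0, 80.0, 1.0, 0.5, 0.2, 0.1])
hvp = lambda v: eigs * v                     # stand-in for a real Hessian-vector product
Q = estimate_dominant_basis(hvp, dim=6, k=2)
grad = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
ratio = dominant_to_bulk_ratio(grad, Q)
print(f"estimated dominant-to-bulk ratio: {ratio:.6f}")
```

The iteration converges quickly here because the eigenvalue gap is large. With a small gap, more iterations would be needed and the estimate would be noisier.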
Load-bearing premise
The loss is exactly quadratic and its Hessian has a clear gap separating a few large dominant eigenvalues from the bulk of smaller ones.
What would settle it
On a synthetic quadratic problem whose Hessian spectrum is known and split, compute η_t^* at each step and compare the predicted and observed sign of the alignment change on either side of η_t^*. The claim is refuted if the observed change has the opposite sign from the prediction when the chosen step size crosses η_t^*.
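A minimal version of such a check is sketched below for a noiseless toy model, which is an assumption and not the paper's stochastic setting. With exact gradients on a diagonal quadratic with two eigenvalue levels, each gradient coordinate is multiplied by (1 - ηλ) per step, so alignment falls when |1 - ηλ_dom| < |1 - ηλ_bulk| and rises otherwise. This puts the sign flip at η = 2/(λ_dom + λ_bulk) for η < 1/λ_bulk. That threshold belongs to this toy model only; it is not the adaptive η_t^* of the paper, and without gradient noise the model does not reproduce the high-alignment self-correction.

```python
import numpy as np

def alignment_diag(grad, k):
    """Fraction of squared gradient norm in the first k (dominant) coordinates."""
    return np.sum(grad[:k] ** 2) / np.sum(grad ** 2)

def alignment_change(eigs, k, theta, eta):
    """Alignment after one noiseless gradient step minus alignment before it."""
    grad = eigs * theta
    new_grad = eigs * (theta - eta * grad)
    return alignment_diag(new_grad, k) - alignment_diag(grad, k)

# Hypothetical two-level spectrum: k dominant eigenvalues, the rest bulk.
lam_dom, lam_bulk, k = 100.0, 1.0, 2
eigs = np.array([lam_dom] * k + [lam_bulk] * 8)
theta = np.ones(10)
eta_flip = 2.0 / (lam_dom + lam_bulk)        # sign-flip point of this toy model only

for eta in (0.5 * eta_flip, 0.9 * eta_flip, 1.1 * eta_flip, 2.0 * eta_flip):
    change = alignment_change(eigs, k, theta, eta)
    print(f"eta={eta:.5f}  alignment change={change:+.3e}")
```

Step sizes below `eta_flip` give a negative change and step sizes above it give a positive one. Replacing the exact gradient with a stochastic one and `eta_flip` with the paper's η_t^* would turn this into the test described above.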
Original abstract
This paper explores the suspicious alignment phenomenon in stochastic gradient descent (SGD) under ill-conditioned optimization, where the Hessian spectrum splits into dominant and bulk subspaces. This phenomenon describes the behavior of gradient alignment in SGD updates. Specifically, during the initial phase of SGD updates, the alignment between the gradient and the dominant subspace tends to decrease. Subsequently, it enters a rising phase and eventually stabilizes in a high-alignment phase. The alignment is considered ``suspicious'' because, paradoxically, the projected gradient update along this highly-aligned dominant subspace proves ineffective at reducing the loss. The focus of this work is to give a fine-grained analysis in a high-dimensional quadratic setup about how step size selection produces this phenomenon. Our main contribution can be summarized as follows: We propose a step-size condition revealing that in low-alignment regimes, an adaptive critical step size $\eta_t^*$ separates alignment-decreasing ($\eta_t < \eta_t^*$) from alignment-increasing ($\eta_t > \eta_t^*$) regimes, whereas in high-alignment regimes, the alignment is self-correcting and decreases regardless of the step size. We further show that under sufficient ill-conditioning, a step size interval exists where projecting the SGD updates to the bulk space decreases the loss while projecting them to the dominant space increases the loss, which explains a recent empirical observation that projecting gradient updates to the dominant subspace is ineffective. Finally, based on this adaptive step-size theory, we prove that for a constant step size and large initialization, SGD exhibits this distinct two-phase behavior: an initial alignment-decreasing phase, followed by stabilization at high alignment.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. This paper analyzes the suspicious alignment phenomenon in SGD on high-dimensional quadratic losses whose Hessians split into dominant and bulk eigenspaces under sufficient ill-conditioning. It derives an adaptive critical step size η_t^* that partitions alignment-decreasing (η_t < η_t^*) from alignment-increasing (η_t > η_t^*) regimes in low-alignment phases, shows that alignment is self-correcting and decreases for any step size in high-alignment phases, proves that, for an interval of step sizes, bulk-subspace projections reduce loss while dominant-subspace projections increase it, and establishes that constant-step-size SGD with large initialization exhibits an initial alignment-decreasing phase followed by stabilization at high alignment.
Significance. If the derivations hold, the work supplies a precise, step-size-dependent explanation for observed alignment dynamics and the ineffectiveness of dominant-subspace updates, offering falsifiable predictions for the critical threshold η_t^* within the quadratic model. The closed-form analysis under a fixed spectral split is a strength that enables exact regime separation without post-hoc fitting.
major comments (2)
- [§3.1] §3.1 (quadratic dynamics) and the derivation of η_t^*: the sign of the alignment update and the loss-reduction property of the bulk projection are shown to flip at η_t^* only when the Hessian spectrum maintains a fixed gap between dominant and bulk eigenvalues throughout training. Because the model is exactly quadratic the gap is constant by construction, but the paper must state the minimal eigenvalue ratio (e.g., λ_dom / λ_bulk > 1 + δ) required for the threshold to remain valid; without this the regime-separation claim is not fully quantified.
- [§4] §4 (two-phase behavior): the proof that constant step size and large initialization produce an initial alignment-decreasing phase followed by high-alignment stabilization relies on the adaptive η_t^* condition persisting across phases. It is unclear whether the transition time is bounded independently of dimension or whether additional restrictions on the step size relative to the bulk eigenvalues are needed; this is load-bearing for the final claim.
minor comments (1)
- [Notation] The alignment measure (inner product between gradient and dominant subspace) and the orthogonal projections onto dominant/bulk subspaces should be defined in the notation section before the main theorems to improve readability.
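One way to write down the definitions requested in the notation comment is sketched below. It assumes the dominant subspace is given by an orthonormal basis `U_dom` and takes alignment to be the squared-norm fraction ||P_dom g||² / ||g||². This normalization is an assumption made for illustration, and the paper's exact definition may differ.

```python
import numpy as np

def split_gradient(grad, U_dom):
    """Split grad into its parts in span(U_dom) and in the orthogonal complement.

    U_dom: (d, k) matrix with orthonormal columns spanning the dominant subspace.
    """
    g_dom = U_dom @ (U_dom.T @ grad)         # P_dom g, with P_dom = U U^T
    return g_dom, grad - g_dom               # P_bulk g = (I - P_dom) g

def alignment(grad, U_dom):
    """Assumed alignment measure: ||P_dom grad||^2 / ||grad||^2, a value in [0, 1]."""
    g_dom, _ = split_gradient(grad, U_dom)
    return float(g_dom @ g_dom / (grad @ grad))

# Hypothetical 5-dimensional example with a 2-dimensional dominant subspace.
U = np.eye(5)[:, :2]
g = np.array([3.0, 0.0, 4.0, 0.0, 0.0])
g_dom, g_bulk = split_gradient(g, U)
print(alignment(g, U))                       # 9 / 25 = 0.36
```

Defining the bulk projection as the complement of the dominant one keeps the two parts orthogonal and guarantees they sum back to the full gradient.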
Simulated Author's Rebuttal
We thank the referee for the careful and constructive review. We address each major comment below and indicate the planned revisions.
Point-by-point responses
- Referee: [§3.1] §3.1 (quadratic dynamics) and the derivation of η_t^*: the sign of the alignment update and the loss-reduction property of the bulk projection are shown to flip at η_t^* only when the Hessian spectrum maintains a fixed gap between dominant and bulk eigenvalues throughout training. Because the model is exactly quadratic the gap is constant by construction, but the paper must state the minimal eigenvalue ratio (e.g., λ_dom / λ_bulk > 1 + δ) required for the threshold to remain valid; without this the regime-separation claim is not fully quantified.
Authors: We agree that the regime-separation claim benefits from an explicit minimal eigenvalue ratio. In the revised manuscript we will add in §3.1 the precise condition λ_dom / λ_bulk > 1 + δ (with δ > 0 derived from the requirement that η_t^* > 0 and the sign-flip inequalities hold) to quantify the sufficient ill-conditioning assumption already used in the derivations. revision: yes
- Referee: [§4] §4 (two-phase behavior): the proof that constant step size and large initialization produce an initial alignment-decreasing phase followed by high-alignment stabilization relies on the adaptive η_t^* condition persisting across phases. It is unclear whether the transition time is bounded independently of dimension or whether additional restrictions on the step size relative to the bulk eigenvalues are needed; this is load-bearing for the final claim.
Authors: The proof in §4 already establishes persistence of the adaptive condition under the fixed spectral split of the quadratic model. To make the dimension independence explicit, we will revise §4 to state the transition-time bound (which depends only on initial alignment, step size, and the fixed eigenvalue ratio) and the additional restriction η < 2/λ_bulk required for bulk-subspace stability. These clarifications will be added without changing the existing proof structure. revision: yes
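The restriction η < 2/λ_bulk mentioned in this response matches the standard stability condition for noiseless gradient descent on a quadratic, which the sketch below illustrates. It assumes exact gradients, so each eigen-coordinate is multiplied by (1 - ηλ) per step and contracts only when |1 - ηλ| < 1. The value of `lam_bulk_max` is hypothetical, and the condition the authors need for stochastic gradients may be stricter.

```python
def noiseless_mode_factor(eta, lam):
    """Per-step multiplier of one eigen-coordinate under exact gradient descent."""
    return 1.0 - eta * lam

lam_bulk_max = 1.0                           # hypothetical largest bulk eigenvalue
for eta in (0.5, 1.9, 2.1):                  # threshold is 2 / lam_bulk_max = 2.0
    factor = noiseless_mode_factor(eta, lam_bulk_max)
    coord = factor ** 50                     # coordinate after 50 steps, starting at 1.0
    print(f"eta={eta}: |factor|={abs(factor):.2f}, coordinate after 50 steps={coord:.3e}")
```

The first two step sizes shrink the bulk coordinate toward zero, while the third makes it grow in magnitude at every step.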
Circularity Check
No circularity: critical step-size condition derived from quadratic dynamics without reduction to inputs
Full rationale
The paper derives the adaptive critical step size η_t^* and the alignment-decreasing/increasing regime separation directly from the projected SGD update equations in a fixed-Hessian quadratic model. The dominant/bulk split is an explicit modeling assumption, not a fitted quantity renamed as a prediction, and no load-bearing step reduces to a self-citation or self-definition. The two-phase behavior for constant step size follows from the same closed-form dynamics without circularity.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: The optimization problem is modeled as a high-dimensional quadratic whose Hessian spectrum splits into dominant and bulk subspaces under sufficient ill-conditioning.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · tag: unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: We propose a step-size condition revealing that in low-alignment regimes, an adaptive critical step size η_t^* separates alignment-decreasing (η_t < η_t^*) from alignment-increasing (η_t > η_t^*) regimes
- IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking · tag: unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: Hessian spectrum splits into dominant and bulk subspaces... gap1 := λ_k - λ_{k+1} > 0
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.