Suspicious Alignment of SGD: A Fine-Grained Step Size Condition Analysis

Boyao Liao; Minhak Song; Shenyang Deng; Tianyu Pang; Yaoqing Yang; Zhuoli Ouyang

arxiv: 2601.11789 · v2 · submitted 2026-01-16 · 💻 cs.LG

Pith reviewed 2026-05-16 13:07 UTC · model grok-4.3

classification 💻 cs.LG

keywords stochastic gradient descent · gradient alignment · step size condition · ill-conditioned optimization · Hessian spectrum · quadratic loss · alignment phases · loss reduction

The pith

A critical adaptive step size separates SGD alignment-decreasing from alignment-increasing regimes in low-alignment phases under ill-conditioned Hessians.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes a precise step-size condition that governs the suspicious alignment phenomenon in SGD on high-dimensional quadratic losses whose Hessian spectrum splits into dominant and bulk parts. In low-alignment regimes an adaptive threshold η_t^* divides step sizes that shrink alignment from those that grow it; once alignment becomes high the dynamics self-correct and alignment falls for any step size. If the condition holds it accounts for the observation that projecting updates onto the dominant subspace can raise the loss while bulk projections lower it, and it predicts that constant-step SGD begun from large initialization will first lose alignment before locking into a high-alignment state. Readers would care because the result supplies concrete, checkable rules for choosing step sizes that keep gradient updates effective at reducing loss rather than merely aligning them with unhelpful directions.

Core claim

In a high-dimensional quadratic setup with a sufficiently ill-conditioned Hessian that separates into dominant and bulk subspaces, an adaptive critical step size η_t^* exists such that, for low alignment, steps smaller than η_t^* decrease alignment while larger steps increase it; for high alignment the alignment always decreases regardless of step size. Under the same ill-conditioning there is an interval of step sizes for which bulk-space projections reduce loss while dominant-space projections increase loss. For any fixed step size and large random initialization these rules imply an initial alignment-decreasing phase followed by eventual stabilization at high alignment.

What carries the argument

The adaptive critical step size η_t^* that marks the transition between alignment-decreasing and alignment-increasing behavior in low-alignment regimes.
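
A noiseless two-level toy gives a feel for why a threshold of this kind can exist. The derivation below is a sketch under assumptions that are not taken from the paper: full-batch gradient descent on $L(x) = \tfrac12 x^\top H x$, dominant eigenvalues all equal to $\lambda_{\mathrm{dom}}$, bulk eigenvalues all equal to $\lambda_{\mathrm{bulk}} < \lambda_{\mathrm{dom}}$, and alignment $a_t$ measured as the share of squared gradient norm lying in the dominant subspace. The constant threshold it produces is a stand-in for intuition only; the paper's η_t^* is adaptive and derived for SGD.

```latex
% Toy setting (assumed): g_t = H x_t, x_{t+1} = (I - \eta H) x_t, hence g_{t+1} = (I - \eta H) g_t.
\[
  a_t = \frac{\|P_{\mathrm{dom}} g_t\|^2}{\|g_t\|^2} = \frac{r_t}{1 + r_t},
  \qquad
  r_t = \frac{\|P_{\mathrm{dom}} g_t\|^2}{\|P_{\mathrm{bulk}} g_t\|^2}.
\]
% Each eigen-direction contracts by its own factor (assuming \eta \neq 1/\lambda_{\mathrm{bulk}}):
\[
  r_{t+1} = \left( \frac{1 - \eta \lambda_{\mathrm{dom}}}{1 - \eta \lambda_{\mathrm{bulk}}} \right)^{2} r_t .
\]
% a_t is increasing in r_t, so alignment rises iff
% |1 - \eta \lambda_{\mathrm{dom}}| > |1 - \eta \lambda_{\mathrm{bulk}}|, which for \eta > 0 is equivalent to
\[
  a_{t+1} > a_t
  \iff
  \eta > \frac{2}{\lambda_{\mathrm{dom}} + \lambda_{\mathrm{bulk}}} .
\]
```

In this toy, steps below $2/(\lambda_{\mathrm{dom}} + \lambda_{\mathrm{bulk}})$ shrink alignment and steps above it grow alignment, mirroring the low-alignment dichotomy. The self-correcting high-alignment regime does not appear here, since the toy has no gradient noise.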

If this is right

Below the critical step size η_t^* alignment decreases in low-alignment regimes.
Above the critical step size η_t^* alignment increases in low-alignment regimes.
High-alignment regimes exhibit decreasing alignment for every step size.
There exists an interval of step sizes in which dominant-subspace projections raise the loss while bulk-subspace projections lower it.
Constant-step-size SGD with large initialization exhibits an initial alignment-decreasing phase before stabilizing at high alignment.
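
The fourth consequence can be checked numerically in a noiseless toy. The sketch below assumes a diagonal Hessian with two eigenvalue levels, full-batch gradients, and loss ½xᵀHx; the function name `projected_loss_change` and all numeric values are illustrative and not taken from the paper. In this toy a projected step changes the loss by −η‖Pg‖²(1 − ηλ/2), so dominant projections raise the loss and bulk projections lower it when 2/λ_dom < η < 2/λ_bulk. The interval the paper proves for SGD may differ.

```python
import numpy as np

def loss(x, eigs):
    """Quadratic loss 0.5 * x^T H x for a diagonal Hessian H = diag(eigs)."""
    return 0.5 * np.sum(eigs * x**2)

def projected_loss_change(x, eigs, eta, mask):
    """Loss change after one gradient step restricted to the coordinates in `mask`."""
    grad = eigs * x                    # gradient of the quadratic
    step = np.where(mask, grad, 0.0)   # orthogonal projection onto the chosen subspace
    return loss(x - eta * step, eigs) - loss(x, eigs)

# Toy spectrum (assumed values): 2 dominant directions, 8 bulk directions.
lam_dom, lam_bulk, k, d = 10.0, 1.0, 2, 10
eigs = np.concatenate([np.full(k, lam_dom), np.full(d - k, lam_bulk)])
dom_mask = np.arange(d) < k

rng = np.random.default_rng(0)
x = rng.normal(size=d)

eta = 0.5  # inside the toy interval (2 / lam_dom, 2 / lam_bulk) = (0.2, 2.0)
delta_dom = projected_loss_change(x, eigs, eta, dom_mask)
delta_bulk = projected_loss_change(x, eigs, eta, ~dom_mask)
print(f"dominant projection: {delta_dom:+.4f}, bulk projection: {delta_bulk:+.4f}")
```

With `eta` moved below 2/λ_dom (for example 0.1), both projected steps lower the loss in this toy, which is why the effect needs a sufficiently large step relative to the dominant curvature.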

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Step-size schedules could be designed to keep the optimizer in the low-alignment regime if that proves advantageous for exploration.
If real loss surfaces retain a split Hessian spectrum during training, the same critical-step-size logic may govern alignment dynamics beyond quadratics.
An online estimate of the dominant-to-bulk gradient ratio could be used to approximate η_t^* on the fly for practical step-size adaptation.
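
One way such an online estimate could be sketched is with orthogonal (subspace) iteration, which needs only Hessian-vector products. Everything here is an assumption for illustration: the helper names `estimate_dominant_basis` and `dominant_to_bulk_ratio`, the number of dominant directions `k` being known in advance, and the toy diagonal Hessian. The paper does not describe this procedure, and mapping the ratio to η_t^* would require the paper's formula, which is not reproduced here.

```python
import numpy as np

def estimate_dominant_basis(hvp, dim, k, iters=50, seed=0):
    """Orthogonal (subspace) iteration: approximate the top-k eigenspace using only
    Hessian-vector products. `hvp` maps a (dim, k) matrix V to H @ V."""
    rng = np.random.default_rng(seed)
    basis, _ = np.linalg.qr(rng.normal(size=(dim, k)))
    for _ in range(iters):
        basis, _ = np.linalg.qr(hvp(basis))  # multiply by H, then re-orthonormalize
    return basis

def dominant_to_bulk_ratio(grad, basis):
    """Ratio ||P_dom g||^2 / ||P_bulk g||^2 for an orthonormal dominant basis."""
    dom_sq = np.sum((basis.T @ grad) ** 2)
    bulk_sq = np.sum(grad**2) - dom_sq
    return dom_sq / bulk_sq

# Toy check with an assumed diagonal Hessian: 2 large eigenvalues, 8 small ones.
eigs = np.array([10.0, 9.0] + [1.0] * 8)
H = np.diag(eigs)
basis = estimate_dominant_basis(lambda V: H @ V, dim=10, k=2)

grad = H @ np.random.default_rng(1).normal(size=10)
exact = np.sum(grad[:2] ** 2) / np.sum(grad[2:] ** 2)
print(dominant_to_bulk_ratio(grad, basis), exact)
```

The iteration converges quickly only when the gap between the k-th and (k+1)-th eigenvalues is large, which is the same split-spectrum premise the analysis relies on.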

Load-bearing premise

The loss is exactly quadratic and its Hessian has a clear gap separating a few large dominant eigenvalues from the bulk of smaller ones.

What would settle it

On a synthetic quadratic problem whose Hessian spectrum is known and split, compute η_t^* at each step and compare the sign of the observed change in alignment with the prediction on each side of η_t^*. Agreement supports the condition; an observed change of the opposite sign from the prediction would refute it.
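
A minimal harness for this kind of check might look as follows. It is a sketch under stated assumptions: the alignment measure (share of squared gradient norm in the first `k` coordinates), the additive Gaussian gradient noise, and the names `alignment`, `sign_agreement`, and `threshold_fn` are all hypothetical. The paper's η_t^* formula is not reproduced; the placeholder `toy_threshold` is the noiseless two-level value 2/(λ_dom + λ_bulk) and should be swapped for the paper's expression.

```python
import numpy as np

def alignment(grad, k):
    """Fraction of squared gradient norm in the first k (dominant) coordinates."""
    return np.sum(grad[:k] ** 2) / np.sum(grad**2)

def sign_agreement(eigs, k, x0, etas, threshold_fn, noise_std=0.0, seed=0):
    """Run SGD on 0.5 * x^T diag(eigs) x and return the fraction of steps on which
    the observed sign of the alignment change matches the prediction
    (decrease if eta < threshold, increase otherwise)."""
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    hits = 0
    for eta in etas:
        grad = eigs * x
        before = alignment(grad, k)
        threshold = threshold_fn(x, eigs, k)  # evaluated at the current iterate
        noisy_grad = grad + noise_std * rng.normal(size=x.shape)
        x = x - eta * noisy_grad
        after = alignment(eigs * x, k)
        predicted = -1.0 if eta < threshold else 1.0
        hits += int(np.sign(after - before) == predicted)
    return hits / len(etas)

# Placeholder threshold (assumption): noiseless two-level value, NOT the paper's eta_t^*.
def toy_threshold(x, eigs, k):
    return 2.0 / (eigs[0] + eigs[-1])

eigs = np.concatenate([np.full(2, 10.0), np.full(8, 1.0)])
x0 = np.random.default_rng(1).normal(size=10)
etas = [0.05, 0.19] * 3  # alternate below and above the toy threshold 2/11
rate = sign_agreement(eigs, 2, x0, etas, toy_threshold)
print(rate)
```

With `noise_std=0.0` the placeholder threshold is exact for this two-level spectrum, so the harness should report full agreement; with noise switched on, disagreement measures only how far the placeholder is from the true adaptive threshold, not a failure of the paper's claim.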

Original abstract

This paper explores the suspicious alignment phenomenon in stochastic gradient descent (SGD) under ill-conditioned optimization, where the Hessian spectrum splits into dominant and bulk subspaces. This phenomenon describes the behavior of gradient alignment in SGD updates. Specifically, during the initial phase of SGD updates, the alignment between the gradient and the dominant subspace tends to decrease. Subsequently, it enters a rising phase and eventually stabilizes in a high-alignment phase. The alignment is considered "suspicious" because, paradoxically, the projected gradient update along this highly-aligned dominant subspace proves ineffective at reducing the loss. The focus of this work is to give a fine-grained analysis in a high-dimensional quadratic setup about how step size selection produces this phenomenon. Our main contribution can be summarized as follows: We propose a step-size condition revealing that in low-alignment regimes, an adaptive critical step size η_t^* separates alignment-decreasing (η_t < η_t^*) from alignment-increasing (η_t > η_t^*) regimes, whereas in high-alignment regimes, the alignment is self-correcting and decreases regardless of the step size. We further show that under sufficient ill-conditioning, a step size interval exists where projecting the SGD updates to the bulk space decreases the loss while projecting them to the dominant space increases the loss, which explains a recent empirical observation that projecting gradient updates to the dominant subspace is ineffective. Finally, based on this adaptive step-size theory, we prove that for a constant step size and large initialization, SGD exhibits this distinct two-phase behavior: an initial alignment-decreasing phase, followed by stabilization at high alignment.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

private letter to a colleague

The paper derives an adaptive critical step size that splits alignment regimes in quadratic SGD and proves the two-phase trajectory for constant step sizes with large initialization, but the fixed-Hessian split is the load-bearing assumption.

The core contribution is the explicit adaptive threshold η_t^* that tells you when a step increases or decreases alignment in the low-alignment regime, plus the clean proof that constant step size plus large initialization produces the initial drop followed by high-alignment stabilization. They also show why bulk projections reduce loss while dominant ones can increase it under sufficient ill-conditioning. That matches the empirical pattern they cite and gives a mechanistic story inside the quadratic model that earlier work left vaguer.

The derivations look internally consistent on the terms they set up, and the separation into dominant and bulk eigenspaces is handled carefully enough to produce the regime switch without extra fitting parameters. The two-phase result for constant steps is the part that feels most directly usable for thinking about initialization and step-size schedules.

The main limitation is that everything is derived for a fixed quadratic loss whose Hessian never changes and whose spectral gap stays stable. Once curvature evolves or the landscape deviates from quadratic, the sign of the alignment update and the loss-reduction property of the projections are no longer guaranteed by the same threshold. The abstract does not report numerical checks on how sensitive the critical step size is to small perturbations of the Hessian, so the practical range of the result is still open.

This is worth sending to referees who work on optimization theory. Readers interested in SGD dynamics on ill-conditioned problems will find the step-size condition and the two-phase proof useful even if they treat the quadratic case as a starting point rather than a final model. I would bring it to a reading group focused on theoretical optimization and would cite the η_t^* construction if I needed a reference for alignment phase transitions. It deserves peer review because the central claims are new within the quadratic literature and the math is presented at a level that can be checked.

Referee Report

2 major / 1 minor

Summary. This paper analyzes the suspicious alignment phenomenon in SGD on high-dimensional quadratic losses whose Hessians split into dominant and bulk eigenspaces under sufficient ill-conditioning. It derives an adaptive critical step size η_t^* that separates alignment-decreasing (η_t < η_t^*) from alignment-increasing (η_t > η_t^*) regimes in low-alignment phases, and shows that alignment is self-correcting and decreases for any step size in high-alignment phases. It further proves that a step-size interval exists in which bulk-subspace projections reduce the loss while dominant-subspace projections increase it, and establishes that constant-step-size SGD with large initialization exhibits an initial alignment-decreasing phase followed by stabilization at high alignment.

Significance. If the derivations hold, the work supplies a precise, step-size-dependent explanation for observed alignment dynamics and the ineffectiveness of dominant-subspace updates, offering falsifiable predictions for the critical threshold η_t^* within the quadratic model. The closed-form analysis under a fixed spectral split is a strength that enables exact regime separation without post-hoc fitting.

major comments (2)

[§3.1] Quadratic dynamics and the derivation of η_t^*: the sign of the alignment update and the loss-reduction property of the bulk projection are shown to flip at η_t^* only when the Hessian spectrum maintains a fixed gap between dominant and bulk eigenvalues throughout training. Because the model is exactly quadratic, the gap is constant by construction, but the paper must state the minimal eigenvalue ratio (e.g., λ_dom / λ_bulk > 1 + δ) required for the threshold to remain valid; without this, the regime-separation claim is not fully quantified.
[§4] Two-phase behavior: the proof that constant step size and large initialization produce an initial alignment-decreasing phase followed by high-alignment stabilization relies on the adaptive η_t^* condition persisting across phases. It is unclear whether the transition time is bounded independently of dimension or whether additional restrictions on the step size relative to the bulk eigenvalues are needed; this is load-bearing for the final claim.

minor comments (1)

[Notation] The alignment measure (inner product between gradient and dominant subspace) and the orthogonal projections onto dominant/bulk subspaces should be defined in the notation section before the main theorems to improve readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful and constructive review. We address each major comment below and indicate the planned revisions.

Referee: [§3.1] Quadratic dynamics and the derivation of η_t^*: the sign of the alignment update and the loss-reduction property of the bulk projection are shown to flip at η_t^* only when the Hessian spectrum maintains a fixed gap between dominant and bulk eigenvalues throughout training. Because the model is exactly quadratic, the gap is constant by construction, but the paper must state the minimal eigenvalue ratio (e.g., λ_dom / λ_bulk > 1 + δ) required for the threshold to remain valid; without this, the regime-separation claim is not fully quantified.

Authors: We agree that the regime-separation claim benefits from an explicit minimal eigenvalue ratio. In the revised manuscript we will add in §3.1 the precise condition λ_dom / λ_bulk > 1 + δ (with δ > 0 derived from the requirement that η_t^* > 0 and the sign-flip inequalities hold) to quantify the sufficient ill-conditioning assumption already used in the derivations. revision: yes

Referee: [§4] Two-phase behavior: the proof that constant step size and large initialization produce an initial alignment-decreasing phase followed by high-alignment stabilization relies on the adaptive η_t^* condition persisting across phases. It is unclear whether the transition time is bounded independently of dimension or whether additional restrictions on the step size relative to the bulk eigenvalues are needed; this is load-bearing for the final claim.

Authors: The proof in §4 already establishes persistence of the adaptive condition under the fixed spectral split of the quadratic model. To make the dimension independence explicit, we will revise §4 to state the transition-time bound (which depends only on initial alignment, step size, and the fixed eigenvalue ratio) and the additional restriction η < 2/λ_bulk required for bulk-subspace stability. These clarifications will be added without changing the existing proof structure. revision: yes

Circularity Check

0 steps flagged

No circularity: critical step-size condition derived from quadratic dynamics without reduction to inputs

The paper derives the adaptive critical step size η_t^* and the alignment-decreasing/increasing regime separation directly from the projected SGD update equations in a fixed-Hessian quadratic model. The dominant/bulk split is an explicit modeling assumption, not a fitted quantity renamed as a prediction, and no load-bearing step reduces to a self-citation or self-definition. The two-phase behavior for constant step size follows from the same closed-form dynamics without circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claims rest on modeling the loss as a high-dimensional quadratic whose Hessian has a clear dominant-bulk spectral split; these are standard domain assumptions rather than derived quantities.

axioms (1)

Domain assumption: the optimization problem is modeled as a high-dimensional quadratic whose Hessian spectrum splits into dominant and bulk subspaces under sufficient ill-conditioning.
Invoked throughout the fine-grained step-size analysis and the two-phase behavior proof.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: "We propose a step-size condition revealing that in low-alignment regimes, an adaptive critical step size η_t^* separates alignment-decreasing (η_t < η_t^*) from alignment-increasing (η_t > η_t^*) regimes"

IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking · unclear
Paper passage: "Hessian spectrum splits into dominant and bulk subspaces... gap1 := λ_k - λ_{k+1} > 0"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.