Stochastic Gradient Descent in the Saddle-to-Saddle Regime of Deep Linear Networks
Pith reviewed 2026-05-10 18:37 UTC · model grok-4.3
The pith
In deep linear networks, the peak of SGD noise along each mode occurs before the mode's feature is fully learned.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Under the assumption of aligned and balanced weights, the training dynamics of SGD in deep linear networks decompose into a system of one-dimensional per-mode stochastic differential equations. From this decomposition it follows that the diffusion along a mode peaks before the corresponding feature is completely learned. Without label noise, the stationary distribution of each mode coincides with that of gradient flow; with label noise, it approximates a Boltzmann distribution. Simulations indicate that these relations hold qualitatively even without the alignment and balance assumptions.
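To make the precedence claim concrete, here is a minimal sketch under a loudly stated assumption: the per-mode diffusion is given the hypothetical profile D(w) = (s - w)^2 * w^p with p > 0, which vanishes at the saddle (w = 0) and at the learned value (w = s). This profile and the names `diffusion_profile` and `peak_location` are illustrative only and are not taken from the paper.

```python
def diffusion_profile(w, s, p):
    """Hypothetical per-mode diffusion D(w) = (s - w)^2 * w^p (assumed form)."""
    return (s - w) ** 2 * w ** p


def peak_location(s, p, num_points=10001):
    """Grid search for the w in [0, s] that maximizes the assumed profile."""
    grid = [s * i / (num_points - 1) for i in range(num_points)]
    return max(grid, key=lambda w: diffusion_profile(w, s, p))


s, p = 2.0, 1.5  # hypothetical target singular value and exponent
w_peak = peak_location(s, p)
w_analytic = s * p / (p + 2)  # stationary point of (s - w)^2 * w^p
print(w_peak, w_analytic)
```

Under this assumed profile the maximum sits at w = s * p / (p + 2), strictly below s, so a mode strength that grows monotonically toward s passes the diffusion peak before it is fully learned.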
What carries the argument
Exact decomposition of the SGD Langevin dynamics into independent one-dimensional per-mode SDEs under aligned and balanced weights.
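A one-dimensional per-mode SDE of this kind can be integrated with the Euler-Maruyama scheme. The sketch below is an illustration under assumptions, not the paper's equations: the drift L * (s - w) * w^(2 - 2/L) comes from substituting the balance condition into the gradient-drift sum quoted in the reference excerpts (the Itô correction is omitted), and the noise amplitude `noise_scale * |s - w|` is a hypothetical stand-in chosen only so that the noise vanishes at w = s.

```python
import math
import random


def balanced_drift(w, s, depth):
    """Assumed drift L * (s - w) * w^(2 - 2/L); w is clamped at 0 to keep the power real."""
    w_pos = max(w, 0.0)
    return depth * (s - w_pos) * w_pos ** (2.0 - 2.0 / depth)


def simulate_mode(w0, s, depth, noise_scale, dt, steps, seed=0):
    """Euler-Maruyama for dw = drift(w) dt + noise_scale * |s - w| dB (toy noise model)."""
    rng = random.Random(seed)
    w = w0
    path = [w]
    for _ in range(steps):
        dB = rng.gauss(0.0, math.sqrt(dt))  # Brownian increment over one step
        w = w + balanced_drift(w, s, depth) * dt + noise_scale * abs(s - w) * dB
        path.append(w)
    return path


# Noise-free run: a long plateau near the saddle at 0, then a rapid rise to s.
noiseless = simulate_mode(w0=1e-3, s=1.0, depth=3, noise_scale=0.0, dt=1e-2, steps=5000)
print(noiseless[-1])
```

Setting `noise_scale` above zero gives a stochastic path for the same mode. The clamp at zero is a design choice of this sketch that avoids complex values when noise pushes `w` below the saddle.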
If this is right
- SGD noise along each mode encodes the stage of feature acquisition for that mode.
- The overall saddle-to-saddle progression of training is not changed in character by the presence of SGD noise.
- Stationary distributions for individual modes can be written in closed form both with and without label noise.
- Qualitative predictions remain useful even when weights are not perfectly aligned or balanced.
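As a generic illustration of the Boltzmann form mentioned in the points above, the sketch below evaluates a density proportional to exp(-U(w) / T) on a grid and checks it against the known Gaussian variance T / k for a quadratic potential. The potential, the temperature `T`, the curvature `k`, and the function names are all assumptions made for this example; the paper's actual per-mode potential and effective temperature are not reproduced here.

```python
import math


def boltzmann_weights(potential, temperature, grid):
    """Normalized weights proportional to exp(-U(w) / T) on a uniform grid."""
    u_min = min(potential(w) for w in grid)  # shift for numerical stability
    raw = [math.exp(-(potential(w) - u_min) / temperature) for w in grid]
    total = sum(raw)
    return [r / total for r in raw]


def variance(grid, weights):
    """Variance of the discrete distribution defined by grid points and weights."""
    mean = sum(w * p for w, p in zip(grid, weights))
    return sum((w - mean) ** 2 * p for w, p in zip(grid, weights))


s, k, T = 1.0, 4.0, 0.1  # hypothetical target, curvature, and temperature
grid = [s - 3.0 + 6.0 * i / 6000 for i in range(6001)]
weights = boltzmann_weights(lambda w: 0.5 * k * (w - s) ** 2, T, grid)
print(variance(grid, weights), T / k)
```

Swapping in a different `potential` callable gives the corresponding Boltzmann weights without further changes.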
Where Pith is reading between the lines
- Tracking the variance of per-mode updates during training could serve as an online diagnostic of which features are about to be acquired.
- The per-mode SDE reduction offers a template for analyzing noise effects in other models whose effective dynamics admit similar low-dimensional projections.
- The Boltzmann approximation with label noise suggests a direct link between SGD stationary behavior and thermodynamic interpretations of generalization.
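The online diagnostic suggested in the first point above needs a running estimate of the variance of per-mode updates. One possible way to do this, sketched here with Welford's online algorithm, avoids storing the full update history. The class name `RunningVariance` and the sample update values are hypothetical.

```python
class RunningVariance:
    """Welford's online algorithm for the mean and sample variance of a stream."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)

    def variance(self):
        """Unbiased sample variance; 0.0 until at least two values are seen."""
        return self.m2 / (self.count - 1) if self.count > 1 else 0.0


tracker = RunningVariance()
for step_update in [0.10, 0.12, 0.08, 0.11]:  # hypothetical per-mode updates
    tracker.update(step_update)
print(tracker.variance())
```

In practice one tracker per mode, reset over a sliding window of training steps, would give a time-resolved variance trace.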
Load-bearing premise
The network weights remain aligned and balanced throughout training.
What would settle it
A numerical simulation of SGD on a deep linear network with deliberately misaligned initial weights that shows the timing of peak per-mode diffusion no longer precedes feature acquisition.
Original abstract
Deep linear networks (DLNs) are used as an analytically tractable model of the training dynamics of deep neural networks. While gradient descent in DLNs is known to exhibit saddle-to-saddle dynamics, the impact of stochastic gradient descent (SGD) noise on this regime remains poorly understood. We investigate the dynamics of SGD during training of DLNs in the saddle-to-saddle regime. We model the training dynamics as stochastic Langevin dynamics with anisotropic, state-dependent noise. Under the assumption of aligned and balanced weights, we derive an exact decomposition of the dynamics into a system of one-dimensional per-mode stochastic differential equations. This establishes that the maximal diffusion along a mode precedes the corresponding feature being completely learned. We also derive the stationary distribution of SGD for each mode: in the absence of label noise, its marginal distribution along specific features coincides with the stationary distribution of gradient flow, while in the presence of label noise it approximates a Boltzmann distribution. Finally, we confirm experimentally that the theoretical results hold qualitatively even without aligned or balanced weights. These results establish that SGD noise encodes information about the progression of feature learning but does not fundamentally alter the saddle-to-saddle dynamics.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript analyzes SGD training dynamics of deep linear networks in the saddle-to-saddle regime by modeling them as anisotropic, state-dependent Langevin dynamics. Under the assumption of aligned and balanced weights, it derives an exact decomposition of the dynamics into a system of independent one-dimensional per-mode SDEs. This decomposition is used to establish that the peak diffusion along each mode precedes complete learning of the corresponding feature. The stationary distribution for each mode is also derived: it coincides with the gradient-flow stationary distribution in the absence of label noise and approximates a Boltzmann distribution when label noise is present. Qualitative experimental results are reported to hold even when the alignment and balance assumptions are relaxed.
Significance. If the central derivation holds, the work supplies a precise analytical characterization of how SGD noise encodes information about feature-learning progression in a tractable model of deep networks. The exact per-mode SDE reduction under the stated assumptions constitutes a clear technical strength, as does the explicit stationary-distribution analysis. The experimental confirmation that qualitative behavior persists without the assumption broadens the result's relevance beyond the idealized setting.
major comments (2)
- [§3.2] §3.2, Eq. (15): The exact per-mode SDE decomposition and the consequent precedence of maximal diffusion before feature completion are derived under the maintained assumption that weights remain aligned and balanced. No argument is given that SGD dynamics starting from typical random initializations preserve this property throughout the saddle-to-saddle trajectory; this assumption is load-bearing for the central analytical claim.
- [§5] §5, Figures 3–5: The experimental validation shows only qualitative agreement when alignment and balance are violated. No quantitative metric (e.g., relative shift in diffusion-peak timing or correlation between theoretical and observed feature-learning completion times) is supplied to bound the tolerable degree of misalignment, limiting assessment of how far the precedence result extends beyond the assumption.
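One way the timing metric requested in the second major comment could be computed is sketched below. The definitions are assumptions made for illustration: "completion" is taken as the first step at which the mode strength reaches 95% of its target, and the function names and example traces are hypothetical.

```python
def peak_time(diffusion_trace):
    """Index of the largest value in a per-mode diffusion trace."""
    return max(range(len(diffusion_trace)), key=lambda t: diffusion_trace[t])


def completion_time(mode_trace, target, fraction=0.95):
    """First index where the mode strength reaches `fraction` of its target, else None."""
    for t, w in enumerate(mode_trace):
        if w >= fraction * target:
            return t
    return None


def relative_timing_shift(diffusion_trace, mode_trace, target, fraction=0.95):
    """(t_complete - t_peak) / t_complete; positive when the peak comes first."""
    t_peak = peak_time(diffusion_trace)
    t_done = completion_time(mode_trace, target, fraction)
    if t_done is None or t_done == 0:
        return None
    return (t_done - t_peak) / t_done


mode_trace = [0.0, 0.1, 0.4, 0.8, 0.96, 1.0]      # hypothetical mode strength
diffusion_trace = [0.0, 0.2, 0.9, 0.5, 0.1, 0.0]  # hypothetical diffusion estimate
print(relative_timing_shift(diffusion_trace, mode_trace, target=1.0))
```

Reporting this quantity per mode and per misalignment level would give a number that can be compared across the aligned and misaligned experiments.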
minor comments (2)
- [§2.1] §2.1: The definition of the state-dependent noise covariance matrix could be stated more explicitly to clarify its anisotropy relative to the standard isotropic Langevin case.
- Notation: The symbol for the per-mode diffusion coefficient is reused in both the SDE and the stationary-distribution sections without an explicit cross-reference, which can confuse readers.
Simulated Author's Rebuttal
We thank the referee for the constructive comments and positive evaluation of our manuscript. We address each major comment below.
-
Referee: [§3.2] §3.2, Eq. (15): The exact per-mode SDE decomposition and the consequent precedence of maximal diffusion before feature completion are derived under the maintained assumption that weights remain aligned and balanced. No argument is given that SGD dynamics starting from typical random initializations preserve this property throughout the saddle-to-saddle trajectory; this assumption is load-bearing for the central analytical claim.
Authors: We acknowledge that the aligned and balanced weights assumption is essential for the exact per-mode SDE decomposition in Eq. (15) and that the manuscript provides no rigorous argument that SGD from random initializations preserves this property. The derivation is presented under the assumption, with experiments in Section 5 showing that key qualitative features persist when it is relaxed. In revision we will add a discussion of the assumption's validity under SGD together with numerical checks confirming that alignment deviations remain small during the saddle-to-saddle phase.
revision: partial
-
Referee: [§5] §5, Figures 3–5: The experimental validation shows only qualitative agreement when alignment and balance are violated. No quantitative metric (e.g., relative shift in diffusion-peak timing or correlation between theoretical and observed feature-learning completion times) is supplied to bound the tolerable degree of misalignment, limiting assessment of how far the precedence result extends beyond the assumption.
Authors: We agree that quantitative metrics would allow a clearer assessment of robustness. We will revise Section 5 to include quantitative measures such as the relative timing shift between diffusion peaks and feature-learning completion, as well as correlation values between the theoretical per-mode predictions and the observed dynamics in the misaligned cases.
revision: yes
Circularity Check
No significant circularity; derivation conditional on explicit assumption without reduction to inputs by construction.
The paper explicitly invokes the assumption of aligned and balanced weights to obtain an exact decomposition of the SGD dynamics into independent one-dimensional per-mode SDEs. This assumption is stated as a prerequisite rather than derived from the target result. The precedence of maximal diffusion before complete feature learning is obtained by direct analysis of the resulting SDEs under that assumption. No parameters are fitted to a data subset and then presented as predictions of a related quantity, no load-bearing self-citations appear in the derivation chain, and no ansatz or uniqueness claim is imported from prior author work. The stationary-distribution results likewise follow from the same conditional SDEs. Experimental verification is reported as qualitative and separate from the analytic derivation. The overall chain is therefore self-contained against external benchmarks and exhibits no circularity.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Aligned and balanced weights
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
unclear: relation between the paper passage and the cited Recognition theorem.
Under the assumption of aligned and balanced weights, we derive an exact decomposition of the dynamics into a system of one-dimensional per-mode stochastic differential equations. This establishes that the maximal diffusion along a mode precedes the corresponding feature being completely learned.
-
IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction
unclear: relation between the paper passage and the cited Recognition theorem.
We model the training dynamics as stochastic Langevin dynamics with anisotropic, state-dependent noise.
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
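For index pairs $(i, j)$ and $(k, \ell)$, the $((i, j), (k, \ell))$ entry of $\mathbb{E}[V V^\top]$ is $\mathbb{E}[X_i X_j X_k X_\ell]$. By Wick/Isserlis for centered Gaussian vectors, $\mathbb{E}[X_i X_j X_k X_\ell] = \mathbb{E}[X_i X_j]\,\mathbb{E}[X_k X_\ell] + \mathbb{E}[X_i X_k]\,\mathbb{E}[X_j X_\ell] + \mathbb{E}[X_i X_\ell]\,\mathbb{E}[X_j X_k]$. Since $\mathbb{E}[X_a X_b] = \delta_{ab}$ for $X \sim \mathcal{N}(0, I_{d_0})$, this becomes $\mathbb{E}[X_i X_j X_k X_\ell] = \delta_{ij}\delta_{k\ell} + \delta_{ik}\delta_{j\ell} + \delta_{i\ell}\delta_{jk}$. The three terms correspond respectively to $\mathrm{vec}(I_{d_0})\,\mathrm{vec}(I_{d_0})^\top$, th...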
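The fourth-moment identity above can be checked exactly, without sampling. The sketch below assumes independent standard normal coordinates, so the left-hand side factorizes into univariate moments E[Z^n] = (n - 1)!! for even n and 0 for odd n. The function names are illustrative.

```python
from itertools import product


def univariate_moment(n):
    """E[Z^n] for Z ~ N(0, 1): 0 for odd n, (n - 1)!! for even n."""
    if n % 2:
        return 0
    result = 1
    for k in range(n - 1, 0, -2):
        result *= k
    return result


def fourth_moment_exact(indices, dim):
    """E[X_i X_j X_k X_l] via independence: product of per-coordinate moments."""
    moment = 1
    for coord in range(dim):
        moment *= univariate_moment(indices.count(coord))
    return moment


def fourth_moment_wick(i, j, k, l):
    """Sum over the three pairings, with E[X_a X_b] = delta_ab."""
    d = lambda a, b: 1 if a == b else 0
    return d(i, j) * d(k, l) + d(i, k) * d(j, l) + d(i, l) * d(j, k)


def max_abs_gap(dim):
    """Largest disagreement between the two formulas over all index tuples."""
    return max(
        abs(fourth_moment_exact(idx, dim) - fourth_moment_wick(*idx))
        for idx in product(range(dim), repeat=4)
    )


print(max_abs_gap(3))
```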
-
[2]
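Gradient drift $\mu^{\mathrm{grad}}_\alpha$ (aligned $\Rightarrow$ modewise GF). Population gradient blocks for squared loss with whitened inputs read $g_l = W_{>l}^\top (W - M) W_{<l}^\top$. Under alignment, $(W - M)_\alpha = (w_\alpha - s_\alpha)\, u_\alpha v_\alpha^\top \implies g_l^\alpha = (w_\alpha - s_\alpha)\, A_{l,\alpha}$. Hence $\mu^{\mathrm{grad}}_\alpha = -\sum_{l=1}^{L} \langle A_{l,\alpha}, g_l \rangle = (s_\alpha - w_\alpha) \sum_{l=1}^{L} \lVert A_{l,\alpha} \rVert_F^2 = (s_\alpha - w_\alpha) \sum_{l=1}^{L} \prod_{j \neq l} w_{j,\alpha}^2$. Imposing balance $w_{1,\alpha} = \cdots = w_{L,\alpha} = w_\alpha^{1/L}$ gives µ...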
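The effect of the balance condition on the sum-of-products expression can be checked numerically. In the sketch below, the closed form L * (s - w) * w^(2 - 2/L) is our own substitution of w_j = w^(1/L) into the sum above, not a quotation of the truncated text, and it assumes that the mode strength is the product of the per-layer scalars. Function names are hypothetical.

```python
import math


def modewise_gradient_drift(layer_scalars, s):
    """(s - w) * sum_l prod_{j != l} w_j^2, with w taken as the product of layer scalars."""
    w = math.prod(layer_scalars)
    total = 0.0
    for l in range(len(layer_scalars)):
        others = 1.0
        for j, w_j in enumerate(layer_scalars):
            if j != l:
                others *= w_j ** 2
        total += others
    return (s - w) * total


def balanced_drift(w, s, depth):
    """Closed form after substituting w_j = w^(1/L) for every layer (our derivation)."""
    return depth * (s - w) * w ** (2.0 - 2.0 / depth)


depth, w, s = 3, 0.5, 2.0  # hypothetical depth, mode strength, and target
layers = [w ** (1.0 / depth)] * depth
print(modewise_gradient_drift(layers, s), balanced_drift(w, s, depth))
```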
-
[3]
Itô drift $\mu^{\mathrm{Ito}}_\alpha$ (off-diagonal Hessian $\times$ covariance). Because $w_\alpha$ is multilinear in the $\{W_l\}$, diagonal Hessian blocks vanish, and only off-diagonal blocks contribute: $\mu^{\mathrm{Ito}}_\alpha = \frac{\eta}{2} \sum_{l \neq m} \langle \Sigma_{lm}, H_{lm}[w_\alpha] \rangle_F$. The mixed Hessian block $H_{lm}$ is the bilinear form (for $l > m$; the other case is symmetric) $D^2_{W_l, W_m} w_\alpha [H_l, H_m] = u_\alpha^\top W_{>l}\, H_l\, (W_{l-1} \cdots W_{m+1})\, H_m\, W_{<m}\, v_\alpha$. Th...
-
[4]
Diffusion coefficient $D_\alpha$ (mode-diagonal). The scalar diffusion along mode $\alpha$ is $D_\alpha = \eta\, a_\alpha^\top \Sigma\, a_\alpha = \eta \sum_{l,m} a_{l,\alpha}^\top \Sigma_{lm}\, a_{m,\alpha}$. Orthogonality of different modes under alignment implies $D_{\alpha\beta} = 0$ for $\alpha \neq \beta$ (no cross-mode diffusion). Data–mismatch part. A direct application of the vector identities yields, for each $(l, m)$, $a_{l,\alpha}^\top (A_l \otimes B_l \Delta)(I_{d_0^2} + C)(A_m \otimes B_m \Delta)^\top a_{m,\alpha} = 2\,(s_\alpha - w$ ...