pith. machine review for the scientific record.

arxiv: 2604.06366 · v1 · submitted 2026-04-07 · 💻 cs.LG · stat.ML

Stochastic Gradient Descent in the Saddle-to-Saddle Regime of Deep Linear Networks

Pith reviewed 2026-05-10 18:37 UTC · model grok-4.3

classification 💻 cs.LG stat.ML
keywords stochastic gradient descent · deep linear networks · saddle-to-saddle dynamics · stochastic differential equations · feature learning · Langevin dynamics · stationary distribution

The pith

In deep linear networks, the peak of SGD noise along each mode occurs before the mode's feature is fully learned.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper models SGD training of deep linear networks as Langevin dynamics with state-dependent noise and shows that, when weights stay aligned and balanced, the high-dimensional dynamics reduce exactly to a collection of independent one-dimensional stochastic differential equations, one per singular mode. This reduction makes it possible to prove that diffusion intensity along a given mode reaches its maximum before that mode's feature is completely acquired. The same per-mode equations also yield the stationary distribution for each coordinate: it matches the gradient-flow stationary measure in the absence of label noise and approximates a Boltzmann distribution when label noise is present. Experiments indicate these timing and distributional relations remain qualitatively intact even when the alignment and balance assumptions are dropped.

Core claim

Under the assumption of aligned and balanced weights, the training dynamics of SGD in deep linear networks can be decomposed into a system of one-dimensional per-mode stochastic differential equations. This establishes that the maximal diffusion along a mode precedes the corresponding feature being completely learned. The stationary distribution of each mode coincides with the gradient-flow distribution without label noise and approximates a Boltzmann distribution with label noise. These relations hold qualitatively in simulations even without the alignment and balance assumptions.

What carries the argument

Exact decomposition of the SGD Langevin dynamics into independent one-dimensional per-mode SDEs under aligned and balanced weights.
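The noiseless part of the per-mode picture can be sketched numerically. The snippet below is a minimal illustration, not the paper's code: it integrates only the deterministic drift of each mode with forward Euler, assuming the aligned and balanced drift takes the form L · w^(2 − 2/L) · (s − w), which is what substituting the balance condition into the sum-over-layers drift gives. The singular values, the initial amplitude `w0`, the step size and the time units are all hypothetical choices, and the noise term is left out because its closed form is not reproduced on this page.

```python
import numpy as np

def simulate_mode_flow(s, depth, w0=1e-3, dt=1e-3, steps=30000):
    """Forward-Euler integration of dw/dt = depth * w**(2 - 2/depth) * (s - w).

    One independent scalar ODE per mode; `s` holds the teacher singular values.
    The drift form is an assumption (aligned, balanced weights, no noise).
    """
    s = np.asarray(s, dtype=float)
    w = np.full(s.shape, w0)
    traj = np.empty((steps + 1, s.size))
    traj[0] = w
    for t in range(steps):
        w = w + dt * depth * w ** (2.0 - 2.0 / depth) * (s - w)
        traj[t + 1] = w
    return traj

def learning_times(traj, s, dt, frac=0.5):
    """First time each mode amplitude reaches frac * s (np.inf if it never does)."""
    s = np.asarray(s, dtype=float)
    crossed = traj >= frac * s            # shape (steps + 1, n_modes)
    first = crossed.argmax(axis=0)        # index of the first True per mode
    return np.where(crossed.any(axis=0), first * dt, np.inf)

if __name__ == "__main__":
    singular_values = [3.0, 2.0, 1.0]     # hypothetical teacher spectrum
    traj = simulate_mode_flow(singular_values, depth=4)
    print(learning_times(traj, singular_values, dt=1e-3))
```

Under these assumptions each mode sits near zero for a while and then rises quickly to its singular value, with larger singular values escaping first, which is the plateau-and-jump pattern the page calls saddle-to-saddle.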

If this is right

  • SGD noise along each mode encodes the stage of feature acquisition for that mode.
  • The overall saddle-to-saddle progression of training is not changed in character by the presence of SGD noise.
  • Stationary distributions for individual modes can be written in closed form both with and without label noise.
  • Qualitative predictions remain useful even when weights are not perfectly aligned or balanced.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Tracking the variance of per-mode updates during training could serve as an online diagnostic of which features are about to be acquired.
  • The per-mode SDE reduction offers a template for analyzing noise effects in other models whose effective dynamics admit similar low-dimensional projections.
  • The Boltzmann approximation with label noise suggests a direct link between SGD stationary behavior and thermodynamic interpretations of generalization.
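The first bullet above can be made concrete with a small estimator. This is a sketch under stated assumptions: it takes the modewise diffusion to be the learning rate times the gradient-noise covariance projected on a mode direction (the quantity η a⊤Σa shown in Figure 3), and it treats Σ as the covariance of per-sample gradients. The function name, the array layout, and the choice of the unbiased variance are illustrative, and how the mode direction `direction` is obtained is left open.

```python
import numpy as np

def modewise_diffusion(per_sample_grads, direction, lr):
    """Estimate lr * a^T Sigma a without forming the covariance matrix Sigma.

    per_sample_grads: array of shape (n_samples, n_params), one flattened
        gradient per training example (assumed layout).
    direction: array of shape (n_params,), the mode direction a.
    Uses the identity a^T Cov(g) a = Var(g . a).
    """
    g = np.asarray(per_sample_grads, dtype=float)
    a = np.asarray(direction, dtype=float)
    projections = g @ a                   # one scalar per sample
    return lr * projections.var(ddof=1)   # unbiased sample variance
```

Projecting first and taking a scalar variance avoids building an n_params × n_params covariance matrix, which is what would make this cheap enough to log during training.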

Load-bearing premise

The network weights remain aligned and balanced throughout training.

What would settle it

A numerical simulation of SGD on a deep linear network with deliberately misaligned initial weights that shows the timing of peak per-mode diffusion no longer precedes feature acquisition.

Figures

Figures reproduced from arXiv: 2604.06366 by Alexander Gietelink Oldenziel, Alexander Strang, Avi Semler, Guillaume Corlouer.

Figure 1: Predicting when modes are learned using the predicted time of maximum diffusion, in a 4-layer linear network trained with SGD. (a) The modes are learned in order of magnitude, with the time of learning being predicted by the time of maximum diffusion. (b) The diffusion along a mode peaks while a mode is being learned, and our theoretical prediction (see Equation 7) matches what is observed. The vertical li…
Figure 2: Saddle-to-saddle dynamics with different optimizers in a depth-6 linear network, with the sharp changes corresponding to increases in the numerical rank of the network. (a) Train loss plateaus with discrete optimizers. (b) Train loss plateaus with numerical simulation of their continuous counterparts. (c) Mode growth over training with gradient descent, showing that the 5 singular values of the teacher matri…
Figure 3: Comparison of diffusion along modes with mode amplitude, for (a) SGD and (b) an Euler-Maruyama simulation of anisotropic Langevin dynamics. In each column, the empirical value of $\eta\, a_\alpha(\theta)^\top \Sigma(\theta)\, a_\alpha(\theta)$ is shown on the left and Proposition 3.5’s theoretical prediction for $D_\alpha(\theta)$ is shown on the right. The shaded bands show the time that the corresponding mode is learned. In agreement with the theoretical predicti…
Figure 4: Comparison of the empirical distributions of the amplitude of the first mode $w_0$ at the end of training for SGD, anisotropic Gaussian, and isotropic Gaussian noise. In the absence of label noise, SGD (a) concentrates entirely on the value of the top singular value of the teacher matrix, but anisotropic noise (b) does not have this behavior. However, the variance of the distribution for anisotropic noise is …
Figure 5: End-of-training distribution in the presence of label noise (variance 0.1) of the amplitude of the first mode $w_0$ for SGD. The distribution changes from being concentrated at a point to having greater variance.
Figure 6: Measuring magnitude of cross modes over SGD training. For both a depth-2 and a depth-4 linear network, we observe that the majority of cross modes are small for most of training, except for a small number of cross modes which peak at times corresponding to modes being learned.
Figure 7: Measuring balance from balanced and unbalanced initialization. (a) From an unbalanced Gaussian initialization, there is not strong balance at the start of training, but as training continues balance increases. (b) When we enforce balance at initialization, it is approximately maintained, and for all of training the weights are significantly more balanced than at any point in the unbalanced-initi…
Figure 8: Frobenius norms of weight matrices over SGD training. Frobenius norms of layers increase over training, accounting for most of the decrease in normalized balance measure $r_l$ from standard initializations. To give a sense of the scale of the (im)balance, we also plot the unnormalized numerator of Equation 33 and also the Frobenius norm of each layer in …
Figure 9: Learning rate versus maximum diffusion for each mode. We observe a linear relationship between learning rate and the diffusion along each mode.
Figure 10: Batch size versus maximum diffusion for each mode. We observe a linear relationship between the reciprocal of batch size and the diffusion along each mode.
Section J.4 (DLN architecture): So far, all experiments shown are with a rectangular DLN architecture (rectangular means that all the weight matrices are square).
Figure 11: Empirical versus theoretical modewise diffusion for a non-rectangular DLN. The structure of the diffusion, including the location of the peaks and tending to zero once the mode is learned, is maintained.
Figure 12: Variance of the end-of-training distribution of the first mode versus simulation fineness. We vary the fineness (i.e., the $\Delta t$ parameter) of the Euler-Maruyama simulation of the stochastic gradient flow SDE (Equation 1). For finer simulations, the variance reduces.
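
The linear trends reported in Figures 9 and 10 are what the form of the modewise diffusion would lead one to expect at a fixed point in parameter space. The following is a sketch, assuming the diffusion is the projected quantity shown in Figure 3 and, as an additional assumption not stated on this page, that a minibatch gradient is the average of $B$ independently drawn per-sample gradients $g_i$ with covariance $\Sigma_1$:

```latex
\begin{align*}
\Sigma_B &= \operatorname{Cov}\!\Bigl(\tfrac{1}{B}\sum_{i=1}^{B} g_i\Bigr)
          = \frac{1}{B}\,\Sigma_1, \\
D_\alpha &= \eta\, a_\alpha^\top \Sigma_B\, a_\alpha
          = \frac{\eta}{B}\, a_\alpha^\top \Sigma_1\, a_\alpha .
\end{align*}
```

Under these assumptions $D_\alpha$ is proportional to $\eta$ and to $1/B$ at fixed $\theta$. The figures report the maximum over a training run, so this pointwise scaling is only a heuristic reading of them.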
Original abstract

Deep linear networks (DLNs) are used as an analytically tractable model of the training dynamics of deep neural networks. While gradient descent in DLNs is known to exhibit saddle-to-saddle dynamics, the impact of stochastic gradient descent (SGD) noise on this regime remains poorly understood. We investigate the dynamics of SGD during training of DLNs in the saddle-to-saddle regime. We model the training dynamics as stochastic Langevin dynamics with anisotropic, state-dependent noise. Under the assumption of aligned and balanced weights, we derive an exact decomposition of the dynamics into a system of one-dimensional per-mode stochastic differential equations. This establishes that the maximal diffusion along a mode precedes the corresponding feature being completely learned. We also derive the stationary distribution of SGD for each mode: in the absence of label noise, its marginal distribution along specific features coincides with the stationary distribution of gradient flow, while in the presence of label noise it approximates a Boltzmann distribution. Finally, we confirm experimentally that the theoretical results hold qualitatively even without aligned or balanced weights. These results establish that SGD noise encodes information about the progression of feature learning but does not fundamentally alter the saddle-to-saddle dynamics.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript analyzes SGD training dynamics of deep linear networks in the saddle-to-saddle regime by modeling them as anisotropic, state-dependent Langevin dynamics. Under the assumption of aligned and balanced weights, it derives an exact decomposition of the dynamics into a system of independent one-dimensional per-mode SDEs. This decomposition is used to establish that the peak diffusion along each mode precedes complete learning of the corresponding feature. The stationary distribution for each mode is also derived: it coincides with the gradient-flow stationary distribution in the absence of label noise and approximates a Boltzmann distribution when label noise is present. Qualitative experimental results are reported to hold even when the alignment and balance assumptions are relaxed.

Significance. If the central derivation holds, the work supplies a precise analytical characterization of how SGD noise encodes information about feature-learning progression in a tractable model of deep networks. The exact per-mode SDE reduction under the stated assumptions constitutes a clear technical strength, as does the explicit stationary-distribution analysis. The experimental confirmation that qualitative behavior persists without the assumption broadens the result's relevance beyond the idealized setting.

major comments (2)
  1. §3.2, Eq. (15): The exact per-mode SDE decomposition and the consequent precedence of maximal diffusion before feature completion are derived under the maintained assumption that weights remain aligned and balanced. No argument is given that SGD dynamics starting from typical random initializations preserve this property throughout the saddle-to-saddle trajectory; this assumption is load-bearing for the central analytical claim.
  2. §5, Figures 3–5: The experimental validation shows only qualitative agreement when alignment and balance are violated. No quantitative metric (e.g., relative shift in diffusion-peak timing or correlation between theoretical and observed feature-learning completion times) is supplied to bound the tolerable degree of misalignment, limiting assessment of how far the precedence result extends beyond the assumption.
minor comments (2)
  1. §2.1: The definition of the state-dependent noise covariance matrix could be stated more explicitly to clarify its anisotropy relative to the standard isotropic Langevin case.
  2. Notation: The symbol for the per-mode diffusion coefficient is reused in both the SDE and the stationary-distribution sections without an explicit cross-reference, which can confuse readers.
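
One way the quantitative metric requested in major comment 2 could look is sketched below. It is a hypothetical helper, not something taken from the manuscript: it locates the time of peak diffusion for one mode, the first time the mode amplitude reaches a fraction `frac` of its target singular value, and the relative gap between the two. The 0.99 completion threshold and the normalization by the completion time are arbitrary illustrative choices.

```python
import numpy as np

def peak_to_completion_shift(times, diffusion, amplitude, target, frac=0.99):
    """Return (t_peak, t_done, relative_shift) for one mode.

    t_peak: time at which the recorded diffusion is largest.
    t_done: first time the amplitude reaches frac * target.
    relative_shift: (t_done - t_peak) / t_done; positive when the diffusion
        peak comes before completion.
    Returns None if the mode never reaches the threshold (or does so at t = 0).
    """
    times = np.asarray(times, dtype=float)
    diffusion = np.asarray(diffusion, dtype=float)
    amplitude = np.asarray(amplitude, dtype=float)
    t_peak = times[np.argmax(diffusion)]
    reached = np.flatnonzero(amplitude >= frac * target)
    if reached.size == 0 or times[reached[0]] == 0.0:
        return None
    t_done = times[reached[0]]
    return t_peak, t_done, (t_done - t_peak) / t_done
```

Reporting this shift per mode, for runs with increasing degrees of deliberate misalignment, would give a number that can be compared across settings rather than a visual impression.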

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments and positive evaluation of our manuscript. We address each major comment below.

Point-by-point responses
  1. Referee: §3.2, Eq. (15): The exact per-mode SDE decomposition and the consequent precedence of maximal diffusion before feature completion are derived under the maintained assumption that weights remain aligned and balanced. No argument is given that SGD dynamics starting from typical random initializations preserve this property throughout the saddle-to-saddle trajectory; this assumption is load-bearing for the central analytical claim.

    Authors: We acknowledge that the aligned and balanced weights assumption is essential for the exact per-mode SDE decomposition in Eq. (15) and that the manuscript provides no rigorous argument that SGD from random initializations preserves this property. The derivation is presented under the assumption, with experiments in Section 5 showing that key qualitative features persist when it is relaxed. In revision we will add a discussion of the assumption's validity under SGD together with numerical checks confirming that alignment deviations remain small during the saddle-to-saddle phase. revision: partial

  2. Referee: §5, Figures 3–5: The experimental validation shows only qualitative agreement when alignment and balance are violated. No quantitative metric (e.g., relative shift in diffusion-peak timing or correlation between theoretical and observed feature-learning completion times) is supplied to bound the tolerable degree of misalignment, limiting assessment of how far the precedence result extends beyond the assumption.

    Authors: We agree that quantitative metrics would allow a clearer assessment of robustness. We will revise Section 5 to include quantitative measures such as the relative timing shift between diffusion peaks and feature-learning completion, as well as correlation values between the theoretical per-mode predictions and the observed dynamics in the misaligned cases. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation conditional on explicit assumption without reduction to inputs by construction.

Full rationale

The paper explicitly invokes the assumption of aligned and balanced weights to obtain an exact decomposition of the SGD dynamics into independent one-dimensional per-mode SDEs. This assumption is stated as a prerequisite rather than derived from the target result. The precedence of maximal diffusion before complete feature learning is obtained by direct analysis of the resulting SDEs under that assumption. No parameters are fitted to a data subset and then presented as predictions of a related quantity, no load-bearing self-citations appear in the derivation chain, and no ansatz or uniqueness claim is imported from prior author work. The stationary-distribution results likewise follow from the same conditional SDEs. Experimental verification is reported as qualitative and separate from the analytic derivation. The overall chain is therefore self-contained against external benchmarks and exhibits no circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The derivation depends on the domain assumption of aligned and balanced weights to reduce the dynamics to independent one-dimensional SDEs. No free parameters or new postulated entities are introduced in the abstract.

axioms (1)
  • domain assumption: aligned and balanced weights
    Invoked to obtain the exact decomposition of the high-dimensional SGD dynamics into a system of independent one-dimensional SDEs.
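
To see concretely what the balance part of this assumption buys, the aligned gradient drift quoted in the extracted references on this page can be collapsed to a single-variable expression. This is a sketch by direct algebra, assuming the drift $\mu^{\mathrm{grad}}_\alpha = (s_\alpha - w_\alpha)\sum_{l=1}^{L}\prod_{j \neq l} w_{j,\alpha}^{2}$ and the balance condition $w_{1,\alpha} = \cdots = w_{L,\alpha} = w_\alpha^{1/L}$ as they appear there; the final closed form is the result of this computation and is not copied from the truncated excerpt:

```latex
\begin{align*}
\prod_{j \neq l} w_{j,\alpha}^{2}
  &= \bigl(w_\alpha^{1/L}\bigr)^{2(L-1)}
   = w_\alpha^{\,2 - 2/L}, \\
\mu^{\mathrm{grad}}_\alpha
  &= (s_\alpha - w_\alpha) \sum_{l=1}^{L} \prod_{j \neq l} w_{j,\alpha}^{2}
   = L\, w_\alpha^{\,2 - 2/L}\, (s_\alpha - w_\alpha).
\end{align*}
```

For $L = 1$ this reduces to $s_\alpha - w_\alpha$. For $L \ge 2$ the drift vanishes both at $w_\alpha = 0$ and at $w_\alpha = s_\alpha$, so each mode depends only on its own amplitude and has a slow region near zero.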

pith-pipeline@v0.9.0 · 5514 in / 1174 out tokens · 28593 ms · 2026-05-10T18:37:40.133249+00:00 · methodology



Reference graph

Works this paper leans on

4 extracted references · 4 canonical work pages

  1. [1]

    By Wick/Isserlis for centered Gaussian vectors, $E[X_i X_j X_k X_\ell] = E[X_i X_j]E[X_k X_\ell] + E[X_i X_k]E[X_j X_\ell] + E[X_i X_\ell]E[X_j X_k]$

    For index pairs $(i, j)$ and $(k, \ell)$, the $((i, j), (k, \ell))$ entry of $E[V V^\top]$ is $E[X_i X_j X_k X_\ell]$. By Wick/Isserlis for centered Gaussian vectors, $E[X_i X_j X_k X_\ell] = E[X_i X_j]E[X_k X_\ell] + E[X_i X_k]E[X_j X_\ell] + E[X_i X_\ell]E[X_j X_k]$. Since $E[X_a X_b] = \delta_{ab}$ for $X \sim N(0, I_{d_0})$, this becomes $E[X_i X_j X_k X_\ell] = \delta_{ij}\delta_{k\ell} + \delta_{ik}\delta_{j\ell} + \delta_{i\ell}\delta_{jk}$. The three terms correspond respectively to $\mathrm{vec}(I_{d_0})\,\mathrm{vec}(I_{d_0})^\top$, th...

  2. [2]

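    Under alignment, $(W - M)_\alpha = (w_\alpha - s_\alpha)\, u_\alpha v_\alpha^\top \implies g^{l}_\alpha = (w_\alpha - s_\alpha)\, A_{l,\alpha}$

    Gradient drift $\mu^{\mathrm{grad}}_\alpha$ (aligned ⇒ modewise GF). Population gradient blocks for squared loss with whitened inputs read $g^{l} = W_{>l}^\top (W - M)\, W_{<l}^\top$. Under alignment, $(W - M)_\alpha = (w_\alpha - s_\alpha)\, u_\alpha v_\alpha^\top \implies g^{l}_\alpha = (w_\alpha - s_\alpha)\, A_{l,\alpha}$. Hence $\mu^{\mathrm{grad}}_\alpha = -\sum_{l=1}^{L} \langle A_{l,\alpha}, g^{l} \rangle = (s_\alpha - w_\alpha) \sum_{l=1}^{L} \|A_{l,\alpha}\|_F^2 = (s_\alpha - w_\alpha) \sum_{l=1}^{L} \prod_{j \neq l} w_{j,\alpha}^2$. Imposing balance $w_{1,\alpha} = \cdots = w_{L,\alpha} = w_\alpha^{1/L}$ gives µ...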

  3. [3]

    The mixed Hessian block $H_{lm}$ is the bilinear form (for $l > m$; the other case is symmetric) $D^2_{W_l, W_m} w_\alpha[H_l, H_m] = u_\alpha^\top W_{>l}\, H_l\, (W_{l-1} \cdots W_{m+1})\, H_m\, W_{<m}\, v_\alpha$

    Itô drift $\mu^{\mathrm{Ito}}_\alpha$ (off-diagonal Hessian × covariance). Because $w_\alpha$ is multilinear in the $\{W_l\}$, diagonal Hessian blocks vanish, and only off-diagonal blocks contribute: $\mu^{\mathrm{Ito}}_\alpha = \frac{\eta}{2} \sum_{l \neq m} \langle \Sigma_{lm}, H_{lm}[w_\alpha] \rangle_F$. The mixed Hessian block $H_{lm}$ is the bilinear form (for $l > m$; the other case is symmetric) $D^2_{W_l, W_m} w_\alpha[H_l, H_m] = u_\alpha^\top W_{>l}\, H_l\, (W_{l-1} \cdots W_{m+1})\, H_m\, W_{<m}\, v_\alpha$. Th...

  4. [4]

    Orthogonality of different modes under alignment implies $D_{\alpha\beta} = 0$ for $\alpha \neq \beta$ (no cross-mode diffusion)

    Diffusion coefficient $D_\alpha$ (mode-diagonal). The scalar diffusion along mode $\alpha$ is $D_\alpha = \eta\, a_\alpha^\top \Sigma\, a_\alpha = \eta \sum_{l,m} a_{l,\alpha}^\top \Sigma_{lm}\, a_{m,\alpha}$. Orthogonality of different modes under alignment implies $D_{\alpha\beta} = 0$ for $\alpha \neq \beta$ (no cross-mode diffusion). Data–mismatch part. A direct application of the vector identities yields, for each $(l, m)$, $a_{l,\alpha}^\top (A_l \otimes B_l \Delta)(I_{d_0^2} + C)(A_m \otimes B_m \Delta)^\top a_{m,\alpha}$ = 2 (sα − w ...
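
The Wick/Isserlis identity used in reference [1] is easy to check numerically. The snippet below is an illustrative sanity check rather than part of the paper: it evaluates the three-delta formula for a standard Gaussian vector and compares it with a Monte Carlo estimate. The function names, the dimension, the sample size and the seed are all arbitrary choices made for this example.

```python
import numpy as np

def isserlis_fourth_moment(i, j, k, l):
    """delta_ij*delta_kl + delta_ik*delta_jl + delta_il*delta_jk for X ~ N(0, I)."""
    return int(i == j and k == l) + int(i == k and j == l) + int(i == l and j == k)

def monte_carlo_fourth_moment(i, j, k, l, dim=3, n_samples=200_000, seed=0):
    """Sample estimate of E[X_i X_j X_k X_l] for X ~ N(0, I_dim)."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((n_samples, dim))
    return float(np.mean(x[:, i] * x[:, j] * x[:, k] * x[:, l]))

if __name__ == "__main__":
    for idx in [(0, 0, 0, 0), (0, 0, 1, 1), (0, 1, 0, 1), (0, 1, 2, 2)]:
        print(idx, isserlis_fourth_moment(*idx), monte_carlo_fourth_moment(*idx))
```

The all-equal index case picks up all three delta terms, while an index pattern with an unpaired coordinate picks up none; the sample estimates should land close to those integers.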