Random Matrix Theory of Early-Stopped Gradient Flow: A Transient BBP Scenario
Pith reviewed 2026-05-10 03:50 UTC · model grok-4.3
The pith
Anisotropic covariance in linear teacher-student models produces a transient window during gradient flow in which the signal eigenvalue separates from the noise bulk before being reabsorbed.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
In the two-block covariance model, the full time-dependent bulk spectrum of the symmetrized weight matrix is obtained through a 2×2 Dyson equation, while the outlier condition for a rank-one teacher follows from an explicit rank-two determinant formula. The resulting dynamics yield a transient Baik-Ben Arous-Péché transition in which the teacher spike emerges and is later reabsorbed into the bulk, depending on signal strength and the degree of anisotropy.
What carries the argument
The 2×2 Dyson equation for the resolvent of the symmetrized weight matrix, together with the rank-two determinant condition that locates the time-dependent outlier eigenvalue produced by the rank-one teacher.
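For orientation, here is a schematic of the two ingredients in placeholder notation; the symbols (block resolvents g_a, block fractions c_b, time-dependent variance profile S_ab(t), effective spike strength Θ(t), spike directions U) are illustrative stand-ins, not the paper's own:

```latex
% Placeholder notation, not the paper's: g_a block resolvents, c_b block
% fractions, S_{ab}(t) time-dependent variance profile, \Theta(t) effective
% spike strength, U spike directions, \rho_t the bulk density at time t.
\begin{aligned}
  g_a(z,t) &= \frac{1}{z - \sum_{b=1}^{2} S_{ab}(t)\, c_b\, g_b(z,t)},
  \qquad a \in \{1,2\},
  &&\text{(2$\times$2 Dyson closure for the bulk)}\\
  0 &= \det\!\big( I_2 - \Theta(t)\, U^{\top} G(z,t)\, U \big),
  \qquad z \notin \operatorname{supp}\rho_t,
  &&\text{(rank-two outlier condition)}
\end{aligned}
```

The second line is the standard low-rank-perturbation determinant in the style of Benaych-Georges and Nadakuditi [6]; the transient window is then the interval of t over which it admits a real root z = λ(t) outside the bulk support.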
If this is right
- Phase diagrams in the plane of signal strength versus anisotropy ratio delineate the three regimes of no spike, persistent spike, and transient spike.
- Finite-size simulations match the closed-form time-dependent eigenvalue predictions.
- Early stopping corresponds to halting training while the teacher eigenvalue remains isolated.
- The model supplies a minimal solvable account of early stopping as a transient spectral phenomenon driven by anisotropy and noise.
Where Pith is reading between the lines
- If real data exhibit comparable block anisotropy in their covariance, the optimal early-stopping time could be predicted directly from an estimate of the input covariance, without running the full optimizer.
- The same transient separation may appear in deeper networks whenever successive layers induce effective fast and slow feature directions.
- Related transient eigenvalue behavior could arise in other first-order methods whose effective covariance is anisotropic.
Load-bearing premise
The linear teacher-student setting together with a two-block anisotropic covariance model is rich enough to capture the essential transient spectral mechanism.
What would settle it
Simulate gradient flow on finite-N instances of the two-block linear model and check whether the largest eigenvalue of the weight matrix follows the predicted trajectory of temporary separation from the bulk edge at the analytically computed times.
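A minimal finite-N sketch of this check, in Python. Everything concrete below is an assumption for illustration: a one-layer linear student trained by Euler-discretized gradient flow on Gaussian inputs with a two-block diagonal covariance, a rank-one teacher, and (W + Wᵀ)/2 as the symmetrization convention; the paper's exact setup may differ.

```python
import numpy as np

# Sketch of the proposed finite-N check. Model choices (one-layer linear
# student, Euler-discretized gradient flow, the (W + W^T)/2 symmetrization,
# all dimensions and rates) are illustrative assumptions, not the paper's
# exact setup.

rng = np.random.default_rng(0)
N, P = 200, 400                        # weight dimension, number of samples
sigma_fast, sigma_slow = 2.0, 0.5      # two-block anisotropy of the inputs
theta, noise = 4.0, 1.0                # teacher signal strength, label noise
dt, steps = 0.02, 500                  # gradient-flow discretization

scales = np.r_[np.full(N // 2, sigma_fast), np.full(N // 2, sigma_slow)]
X = scales[:, None] * rng.standard_normal((N, P))    # Cov(x) = diag(scales**2)
u, v = rng.standard_normal(N), rng.standard_normal(N)
W_star = theta * np.outer(u / np.linalg.norm(u), v / np.linalg.norm(v))
Y = W_star @ X + noise * rng.standard_normal((N, P)) # rank-one teacher + noise

W = np.zeros((N, N))
top, edge = [], []
for _ in range(steps):
    W += dt * (Y - W @ X) @ X.T / P                  # Euler step of gradient flow
    eigs = np.linalg.eigvalsh((W + W.T) / 2)         # symmetrized weight matrix
    top.append(eigs[-1])                             # candidate outlier
    edge.append(eigs[-2])                            # crude proxy for the bulk edge

# A transient BBP window shows up as a stretch of steps where top[t] clearly
# separates from edge[t] before the gap closes again.
sep = np.array(top) - np.array(edge)
print("max separation:", sep.max(), "at step", int(sep.argmax()))
```

The analytical test would replace the crude second-eigenvalue proxy with the bulk edge computed from the 2×2 Dyson equation and compare the observed separation window against the analytically computed emergence and reabsorption times.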
Original abstract
Empirical studies of trained models often report a transient regime in which signal is detectable in a finite gradient descent time window before overfitting dominates. We provide an analytically tractable random-matrix model that reproduces this phenomenon for gradient flow in a linear teacher-student setting. In this framework, learning occurs when an isolated eigenvalue separates from a noisy bulk, before eventually disappearing in the overfitting regime. The key ingredient is anisotropy in the input covariance, which induces fast and slow directions in the learning dynamics. In a two-block covariance model, we derive the full time-dependent bulk spectrum of the symmetrized weight matrix through a 2×2 Dyson equation, and we obtain an explicit outlier condition for a rank-one teacher via a rank-two determinant formula. This yields a transient Baik-Ben Arous-Péché (BBP) transition: depending on signal strength and covariance anisotropy, the teacher spike may never emerge, emerge and persist, or emerge only during an intermediate time interval before being reabsorbed into the bulk. We map the corresponding phase diagrams and validate the theory against finite-size simulations. Our results provide a minimal solvable mechanism for early stopping as a transient spectral effect driven by anisotropy and noise.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper develops a random matrix theory analysis of gradient flow in a linear teacher-student model with two-block anisotropic input covariance. It derives the full time-dependent bulk spectrum of the symmetrized weight matrix from a 2×2 Dyson equation and obtains an explicit condition for the rank-one teacher outlier via a rank-two determinant formula. This produces phase diagrams showing a transient BBP transition: the teacher spike may never separate, separate and persist, or separate only transiently before reabsorption into the bulk, depending on signal strength and anisotropy. The results are validated against finite-size simulations.
Significance. If the central derivations hold exactly, the work supplies a minimal, solvable mechanism explaining empirically observed transient signal detectability as a spectral effect driven by anisotropy-induced fast/slow directions interacting with noise. The explicit time-dependent bulk spectrum, closed-form outlier condition, and mapped phase diagrams constitute falsifiable predictions; the direct simulation validation strengthens the RMT approach for dynamic high-dimensional learning. This is a clear strength for understanding early stopping without invoking non-linearities or data-specific structure.
major comments (2)
- [Derivation of time-dependent bulk spectrum (via 2×2 Dyson equation)] The central derivation of the bulk spectrum rests on a time-dependent 2×2 Dyson closure for the symmetrized resolvent. The paper must demonstrate that the hierarchy closes exactly when the rank-one teacher signal evolves and couples to the two-block covariance; any residual cross terms between the signal and the fast/slow noise directions would alter the predicted time window of outlier existence and thereby the boundaries of the transient BBP regime.
- [Outlier condition (rank-two determinant formula)] The rank-two determinant formula for the outlier inherits the closure assumption behind the bulk spectrum. Explicit verification is required that no additional interaction terms arise under signal-anisotropy coupling, since such terms would shift the reabsorption time and undermine the claim that the outlier is reabsorbed into the explicitly computed bulk.
minor comments (2)
- Ensure the symmetrization operation on the weight matrix is defined explicitly at first use, as the abstract refers to the 'symmetrized weight matrix' without prior definition.
- The phase diagrams would benefit from explicit labeling of the three regimes (never, persistent, transient) directly on the plots for immediate readability.
Simulated Author's Rebuttal
We thank the referee for the careful reading and for identifying the need for explicit verification of the Dyson closure and outlier condition. We address each major comment below and will incorporate additional details in the revision.
Point-by-point responses
Referee: [Derivation of time-dependent bulk spectrum (via 2×2 Dyson equation)] The central derivation of the bulk spectrum rests on a time-dependent 2×2 Dyson closure for the symmetrized resolvent. The paper must demonstrate that the hierarchy closes exactly when the rank-one teacher signal evolves and couples to the two-block covariance; any residual cross terms between the signal and the fast/slow noise directions would alter the predicted time window of outlier existence and thereby the boundaries of the transient BBP regime.
Authors: In Section 3.2 we obtain the 2×2 Dyson equation for the symmetrized resolvent by averaging over the Gaussian noise and exploiting the block-diagonal structure of the input covariance together with the rank-one character of the teacher. The signal enters only through a deterministic mean-field term; because the two covariance blocks are orthogonal, all cross terms between the signal direction and the fast/slow noise subspaces vanish identically in the large-N limit. Consequently the hierarchy closes exactly at the 2×2 level. To make this cancellation fully transparent we will add an appendix that expands the self-consistent equations term by term and shows the vanishing of higher-order contributions. revision: yes
Referee: [Outlier condition (rank-two determinant formula)] The rank-two determinant formula for the outlier inherits the closure assumption behind the bulk spectrum. Explicit verification is required that no additional interaction terms arise under signal-anisotropy coupling, since such terms would shift the reabsorption time and undermine the claim that the outlier is reabsorbed into the explicitly computed bulk.
Authors: The rank-two determinant condition for the outlier is obtained by requiring that the resolvent (already closed at the 2×2 level) possesses a pole outside the support of the bulk spectrum. Because the bulk spectrum itself is derived under exact closure, the same resolvent automatically encodes the signal-anisotropy coupling; no supplementary interaction terms appear. The reabsorption time is then fixed by the moment when this pole collides with the moving bulk edge. We will include the explicit verification of the absence of extra terms in the same new appendix. revision: yes
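To make the rebuttal's picture concrete (an outlier pole colliding with a moving bulk edge), here is a toy numerical solve of a 2×2 Dyson fixed point. The variance profile S and block fractions c are arbitrary stand-ins for the paper's time-dependent coefficients; only the mechanics of locating the bulk edge are illustrated.

```python
import numpy as np

# Toy 2x2 Dyson fixed point: g_a(z) = 1 / (z - sum_b S_ab c_b g_b(z)).
# S and c below are arbitrary assumptions standing in for the paper's
# time-dependent coefficients.

c = np.array([0.5, 0.5])                    # block fractions
S = np.array([[4.0, 2.0], [2.0, 1.0]])      # schematic block variance profile

def dyson_g(z, iters=500, damping=0.5):
    g = np.full(2, 1.0 / z)                 # start from the free resolvent
    for _ in range(iters):                  # damped fixed-point iteration
        g = (1 - damping) * g + damping / (z - S @ (c * g))
    return g

eta = 1e-3                                  # small imaginary regularizer
zs = np.linspace(-5.0, 5.0, 801) + 1j * eta
rho = np.array([-(c * dyson_g(z)).sum().imag / np.pi for z in zs])
right_edge = zs.real[rho > 1e-2].max()      # crude right bulk edge
print(f"approximate right bulk edge: {right_edge:.2f}")

# In the paper's setting the coefficients move with training time t; an
# outlier exists only while the rank-two determinant has a real root to the
# right of this edge, and reabsorption is the moment the root hits the edge.
```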
Circularity Check
No circularity: standard RMT closure on explicit two-block model
Full rationale
The paper states a linear teacher-student model with two-block anisotropic covariance, then applies the standard Dyson equation to the symmetrized resolvent (yielding the 2×2 closure for the bulk spectrum) and the rank-two determinant condition for the rank-one outlier. Both steps are direct algebraic consequences of the model definition and the usual resolvent identities; no parameter is fitted to data and then re-labeled as a prediction, no self-citation supplies a uniqueness theorem or ansatz, and the transient BBP phase diagram is obtained by solving the resulting explicit time-dependent equations. The derivation is therefore self-contained given the stated assumptions and does not merely restate its inputs.
Axiom & Free-Parameter Ledger
free parameters (2)
- signal strength
- covariance anisotropy
axioms (2)
- standard math: The symmetrized weight matrix dynamics are captured by a 2×2 Dyson equation for the bulk spectrum.
- domain assumption: Input covariance follows a two-block anisotropic model.
Reference graph
Works this paper leans on
- [1] Jinho Baik, Gérard Ben Arous, and Sandrine Péché. Phase transition of the largest eigenvalue for nonnull complex sample covariance matrices. The Annals of Probability, 33(5):1643–1697, 2005.
- [2] Charles H. Martin and Michael W. Mahoney. Implicit self-regularization in deep neural networks: Evidence from random matrix theory and implications for learning. Journal of Machine Learning Research, 22(165):1–73, 2021.
- [3] Matthias Thamm, Max Staats, and Bernd Rosenow. Random matrix analysis of deep neural network weight matrices. Physical Review E, 106(5):054124, 2022.
- [4] Max Staats, Matthias Thamm, and Bernd Rosenow. Boundary between noise and information applied to filtering neural network weight matrices. Physical Review E, 108(2):L022302, 2023.
- [5] David Yunis, Kumar Kshitij Patel, Samuel Wheeler, Pedro Savarese, Gal Vardi, Karen Livescu, Michael Maire, and Matthew R. Walter. Approaching deep learning through the spectral dynamics of weights. arXiv preprint arXiv:2408.11804, 2024.
- [6] Florent Benaych-Georges and Raj Rao Nadakuditi. The eigenvalues and eigenvectors of finite, low rank perturbations of large random matrices. Advances in Mathematics, 227(1):494–521, 2011.
- [7] Andrew M. Saxe, James L. McClelland, and Surya Ganguli. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. In International Conference on Learning Representations, 2014.
- [8] Madhu S. Advani, Andrew M. Saxe, and Haim Sompolinsky. High-dimensional dynamics of generalization error in neural networks. Neural Networks, 132:428–446, 2020.
- [9] Alnur Ali, J. Zico Kolter, and Ryan J. Tibshirani. A continuous-time view of early stopping for least squares regression. In Proceedings of the Twenty-Second International Conference on Artificial Intelligence and Statistics, volume 89 of Proceedings of Machine Learning Research, pages 1370–1378. PMLR, 2019.
- [10] Brian Richard Olsen, Sam Fatehmanesh, Frank Xiao, Adarsh Kumarappan, and Anirudh Gajula. From SGD to spectra: A theory of neural network weight dynamics. arXiv preprint arXiv:2507.12709, 2025.
- [11] Tony Bonnaire, Giulio Biroli, and Chiara Cammarota. The role of the time-dependent Hessian in high-dimensional optimization. Journal of Statistical Mechanics: Theory and Experiment, 2025(8):083401, 2025.
- [12] Jean Barbier, Florent Krzakala, Nicolas Macris, Léo Miolane, and Lenka Zdeborová. Optimal errors and phase transitions in high-dimensional generalized linear models. Proceedings of the National Academy of Sciences, 116(12):5451–5460, 2019.
- [13] Oskari H. Ajanki, László Erdős, and Torben Krüger. Universality for general Wigner-type matrices. Probability Theory and Related Fields, 169(3):667–727, 2017.
- [14] Oskari H. Ajanki, László Erdős, and Torben Krüger. Stability of the matrix Dyson equation and random matrices with correlations. Probability Theory and Related Fields, 173(1):293–373, 2019.
- [15] Antoine Maillard, Laura Foini, Alejandro Lage Castellanos, Florent Krzakala, Marc Mézard, and Lenka Zdeborová. High-temperature expansions and message passing algorithms. Journal of Statistical Mechanics: Theory and Experiment, 2019(11):113301, 2019.
- [16] Alexander Atanasov, Jacob A. Zavatone-Veth, and Cengiz Pehlevan. Scaling and renormalization in high-dimensional regression. arXiv preprint arXiv:2405.00592, 2024.
- [17] Blake Bordelon and Cengiz Pehlevan. Disordered dynamics in high dimensions: Connections to random matrices and machine learning. arXiv preprint arXiv:2601.01010, 2026.
- [18] Marc Potters and Jean-Philippe Bouchaud. A First Course in Random Matrix Theory: for Physicists, Engineers and Data Scientists. Cambridge University Press, 2020.
- [19] Arthur Jacot, Franck Gabriel, and Clément Hongler. Neural tangent kernel: Convergence and generalization in neural networks. In Advances in Neural Information Processing Systems, volume 31, pages 8571–8580, 2018.
- [20] Lechao Xiao, Jeffrey Pennington, and Samuel S. Schoenholz. Disentangling trainability and generalization in deep neural networks. In Proceedings of the 37th International Conference on Machine Learning, pages 10462–10472. PMLR, 2020.