pith. sign in

arxiv: 2511.02258 · v2 · submitted 2025-11-04 · 📊 stat.ML · cs.LG· math.PR· math.ST· stat.TH

Limit Theorems for Stochastic Gradient Descent in High-Dimensional Single-Layer Networks

Pith reviewed 2026-05-18 01:51 UTC · model grok-4.3

classification 📊 stat.ML cs.LGmath.PRmath.STstat.TH
keywords stochastic gradient descenthigh-dimensional scaling limitssingle-layer networksOrnstein-Uhlenbeck processphase diagraminformation exponenteffective dynamics
0
0 comments X

The pith

In the critical step-size regime for online SGD on high-dimensional single-layer networks, a stochastic correction term emerges that alters the phase diagram and reduces local dynamics to an Ornstein-Uhlenbeck process.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper studies the high-dimensional scaling limits of online stochastic gradient descent applied to single-layer networks. Below a critical scaling of the step size, the effective dynamics follow deterministic ballistic limits. At the critical scale, a correction term appears that modifies the overall phase diagram of the training process. Near fixed points in this regime the diffusive limits simplify to an Ornstein-Uhlenbeck process under the paper's stated conditions. These findings indicate that the information exponent governs sample complexity and that deterministic limits alone miss key stochastic fluctuations.

Core claim

The paper claims that below the critical regime the effective dynamics of SGD are governed by deterministic ballistic limits, whereas at the critical scale a new correction term emerges that changes the phase diagram. In this regime, near fixed points the corresponding diffusive SDE limits reduce to an Ornstein-Uhlenbeck process under certain conditions on the single-layer network and loss. The results illustrate how the information exponent controls sample complexity and the limitations of deterministic scaling limits in capturing stochastic fluctuations.

What carries the argument

The critical scaling regime of the step size, which supplies a stochastic correction to the effective dynamics and enables their local reduction to an Ornstein-Uhlenbeck process.

If this is right

  • The phase diagram of the learning dynamics is modified by the appearance of the correction term at the critical scale.
  • Sample complexity is governed by the information exponent once the step size enters the critical regime.
  • Local behavior near fixed points follows an Ornstein-Uhlenbeck process rather than the full diffusive SDE.
  • Deterministic scaling limits are insufficient to describe stochastic fluctuations in high-dimensional SGD.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The reduction to an Ornstein-Uhlenbeck process near equilibria may permit explicit calculations of fluctuation-induced escape times from suboptimal states.
  • Tuning the step size precisely to the critical value could be used to balance deterministic progress against controlled stochastic exploration in practice.
  • The same critical-regime analysis might apply to other online optimization algorithms that share similar effective dynamics.

Load-bearing premise

The high-dimensional scaling limits and effective dynamics of SGD continue to hold when the step size reaches the critical regime for single-layer networks.

What would settle it

Numerical simulations of single-layer networks trained by SGD at the critical step size in which the local trajectories near fixed points deviate from Ornstein-Uhlenbeck statistics would disprove the claimed reduction.

read the original abstract

This paper studies the high-dimensional scaling limits of online stochastic gradient descent (SGD). Building on the recent work of Ben Arous, Gheissari, and Jagannath on the effective dynamics of SGD, we study the critical scaling regime of the step size for single-layer networks. Below this critical regime, the effective dynamics are governed by deterministic (ballistic) limits, whereas at the critical scale, a new correction term emerges that changes the phase diagram. In this regime, near fixed points, the corresponding diffusive (SDE) limits of the effective dynamics reduce to an Ornstein-Uhlenbeck process under certain conditions. These results highlight how the information exponent controls sample complexity and illustrate the limitations of deterministic scaling limits in capturing stochastic fluctuations in high-dimensional learning dynamics.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript derives high-dimensional scaling limits for online SGD on single-layer networks, extending the effective-dynamics framework of Ben Arous, Gheissari, and Jagannath. It identifies a critical step-size regime: below this scale the limits are deterministic and ballistic, while at the critical scale a new correction term appears that modifies the phase diagram. Near fixed points the associated diffusive SDE limits reduce to an Ornstein-Uhlenbeck process under stated conditions. The results emphasize the role of the information exponent in governing sample complexity and the insufficiency of purely deterministic approximations for capturing fluctuations.

Significance. If the derivations hold, the work supplies a precise characterization of the ballistic-to-diffusive transition and the associated correction term, thereby refining theoretical predictions for SGD convergence and generalization in high dimensions. The explicit link to the information exponent and the OU reduction near equilibria offer concrete tools for analyzing when stochastic effects dominate, which could inform both algorithm design and the design of loss landscapes.

major comments (2)
  1. [Theorem 3.1 / §3] The central claim that a new correction term emerges at the critical scale and alters the phase diagram is load-bearing, yet the manuscript does not display the explicit form of this term or the modified drift/diffusion coefficients (e.g., in the statement of the main limit theorem). Without this, it is impossible to verify that the term is not an artifact of normalization or of the prior effective-dynamics assumptions.
  2. [§4.2] The reduction of the diffusive SDE to an Ornstein-Uhlenbeck process near fixed points is asserted 'under certain conditions,' but those conditions (linearization assumptions, spectral gap requirements, or bounds on the Hessian) are not stated explicitly or shown to be satisfied by the single-layer network loss. This omission affects the applicability of the OU approximation.
minor comments (2)
  1. [Abstract / §1] The abstract and introduction should include a short, self-contained statement of the precise scaling regime (step-size exponent relative to dimension and information exponent) rather than referring readers solely to the prior Ben Arous et al. framework.
  2. [§2] Notation for the effective drift, diffusion coefficient, and information exponent should be introduced once and used consistently; currently the same symbol appears to denote both the original and the corrected quantities in different sections.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thorough review and positive recommendation for minor revision. We address each major comment below and have prepared revisions to the manuscript accordingly.

read point-by-point responses
  1. Referee: [Theorem 3.1 / §3] The central claim that a new correction term emerges at the critical scale and alters the phase diagram is load-bearing, yet the manuscript does not display the explicit form of this term or the modified drift/diffusion coefficients (e.g., in the statement of the main limit theorem). Without this, it is impossible to verify that the term is not an artifact of normalization or of the prior effective-dynamics assumptions.

    Authors: We agree with the referee that the explicit form of the correction term is important for clarity and verifiability. In the revised version, we will update the statement of Theorem 3.1 to include the explicit expressions for the modified drift and diffusion coefficients arising from the critical scaling. These expressions are derived in the proof of the theorem from the high-dimensional asymptotic analysis of the SGD updates, and we will make them part of the main theorem statement to address this concern. revision: yes

  2. Referee: [§4.2] The reduction of the diffusive SDE to an Ornstein-Uhlenbeck process near fixed points is asserted 'under certain conditions,' but those conditions (linearization assumptions, spectral gap requirements, or bounds on the Hessian) are not stated explicitly or shown to be satisfied by the single-layer network loss. This omission affects the applicability of the OU approximation.

    Authors: Thank you for this comment. We acknowledge that the conditions for reducing to the Ornstein-Uhlenbeck process were not stated with sufficient explicitness. In the revision, we will add a precise statement of the conditions in §4.2, including the linearization of the effective dynamics around the fixed point, the requirement of a spectral gap in the linearized operator, and bounds on the Hessian of the loss function. We will also confirm that these conditions are satisfied for the single-layer network under the paper's assumptions on the information exponent and the activation function. revision: yes

Circularity Check

0 steps flagged

No significant circularity; extension of external prior framework

full rationale

The paper explicitly builds on the effective dynamics derived in Ben Arous, Gheissari, and Jagannath (distinct authors) and analyzes the critical step-size regime for single-layer networks. The ballistic-to-diffusive transition, emergence of the correction term altering the phase diagram, and reduction of the SDE limits to an Ornstein-Uhlenbeck process near fixed points are derived from the imposed scaling and linearization under the paper's stated conditions on the network and loss. No load-bearing step reduces by construction to a self-fit, self-citation chain, or ansatz smuggled from the same authors; the cited prior results function as independent external input rather than tautological re-derivation. The derivation chain therefore remains self-contained relative to its external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Review based on abstract only; the ledger reflects explicit dependencies stated in the abstract. The central claim rests on the prior effective-dynamics framework and the existence of a well-defined critical scaling regime.

axioms (1)
  • domain assumption The effective dynamics of SGD derived by Ben Arous, Gheissari, and Jagannath hold in the high-dimensional limit for single-layer networks.
    The paper explicitly builds on this recent work as the base for studying the critical regime.

pith-pipeline@v0.9.0 · 5661 in / 1385 out tokens · 53019 ms · 2026-05-18T01:51:36.164604+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

30 extracted references · 30 canonical work pages · 6 internal anchors

  1. [1]

    Andreas Anastasiou, Krishnakumar Balasubramanian, and Murat A. Erdogdu. Normal Approximation for Stochastic Gradient Descent via Non-Asymptotic Rates of Martingale CLT. InProceedings of the Thirty-Second Conference on Learning Theory, pages 115–137. PMLR, June 2019. ISSN: 2640-3498

  2. [2]

    A mean-field limit for certain deep neural networks

    Dyego Ara´ ujo, Roberto I. Oliveira, and Daniel Yukimura. A mean-field limit for certain deep neural networks, June 2019. arXiv:1906.00193 [math]

  3. [3]

    Online stochastic gradient descent on non- convex losses from high-dimensional inference.Journal of Machine Learning Research, 22(106):1–51, 2021

    Gerard Ben Arous, Reza Gheissari, and Aukosh Jagannath. Online stochastic gradient descent on non- convex losses from high-dimensional inference.Journal of Machine Learning Research, 22(106):1–51, 2021

  4. [4]

    High-dimensional limit theorems for SGD: Effective dynamics and critical scaling.Communications on Pure and Applied Mathematics, 77(3):2030– 2080, 2024

    G´ erard Ben Arous, Reza Gheissari, and Aukosh Jagannath. High-dimensional limit theorems for SGD: Effective dynamics and critical scaling.Communications on Pure and Applied Mathematics, 77(3):2030– 2080, 2024. eprint: https://onlinelibrary.wiley.com/doi/pdf/10.1002/cpa.22169

  5. [5]

    Dynamics of stochastic approximation algorithms

    Michel Bena¨ ım. Dynamics of stochastic approximation algorithms. In Jacques Az´ ema, Michel ´Emery, Michel Ledoux, and Marc Yor, editors,S´ eminaire de Probabilit´ es XXXIII, pages 1–68, Berlin, Heidelberg,

  6. [6]

    Springer, Berlin, Heidelberg, 1990

    Albert Benveniste, Michel M´ etivier, and Pierre Priouret.Adaptive Algorithms and Stochastic Approxi- mations. Springer, Berlin, Heidelberg, 1990

  7. [7]

    Biehl and H

    M. Biehl and H. Schwarze. Learning by on-line gradient descent.Journal of Physics A: Mathematical and General, 28(3):643, February 1995

  8. [8]

    On-line Learning and Stochastic Approximations

    L´ eon Bottou. On-line Learning and Stochastic Approximations. In David Saad, editor,On-Line Learning in Neural Networks, Publications of the Newton Institute, pages 9–42. Cambridge University Press, Cambridge, 1999

  9. [9]

    The high-dimensional asymptotics of first order methods with random data

    Michael Celentano, Chen Cheng, and Andrea Montanari. The high-dimensional asymptotics of first order methods with random data, December 2021. arXiv:2112.07572 [math]

  10. [10]

    On the Global Convergence of Gradient Descent for Over-parameterized Models using Optimal Transport

    Lenaic Chizat and Francis Bach. On the Global Convergence of Gradient Descent for Over-parameterized Models using Optimal Transport, October 2018. arXiv:1805.09545 [math]

  11. [11]

    Hitting the High- Dimensional Notes: An ODE for SGD learning dynamics on GLMs and multi-index models, August

    Elizabeth Collins-Woodfin, Courtney Paquette, Elliot Paquette, and Inbar Seroussi. Hitting the High- Dimensional Notes: An ODE for SGD learning dynamics on GLMs and multi-index models, August

  12. [12]

    arXiv:2308.08977 [math]

  13. [13]

    Paul Dupuis and Harold J. Kushner. Stochastic Approximation and Large Deviations: Upper Bounds and w.p.1 Convergence.SIAM Journal on Control and Optimization, 27(5):1108–1135, September 1989. Publisher: Society for Industrial and Applied Mathematics

  14. [14]

    Advani, Andrew M

    Sebastian Goldt, Madhu S. Advani, Andrew M. Saxe, Florent Krzakala, and Lenka Zdeborov´ a. Dynamics of stochastic gradient descent for two-layer neural networks in the teacher-student setup.Journal of Statistical Mechanics: Theory and Experiment, 2020(12):124010, December 2020. arXiv:1906.08632 [stat]

  15. [15]

    Nicholas J. A. Harvey, Christopher Liaw, Yaniv Plan, and Sikander Randhawa. Tight analyses for non- smooth stochastic gradient descent. InProceedings of the Thirty-Second Conference on Learning Theory, pages 1579–1613. PMLR, June 2019. ISSN: 2640-3498

  16. [16]

    Harold J. Kushner. Asymptotic behavior of stochastic approximation and large deviations. InThe 22nd IEEE Conference on Decision and Control, pages 75–81, December 1983

  17. [17]

    Diffusion Approximations for Online Principal Component Estimation and Global Convergence

    Chris Junchi Li, Mengdi Wang, Han Liu, and Tong Zhang. Diffusion Approximations for Online Principal Component Estimation and Global Convergence. InAdvances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017. 10

  18. [18]

    Stochastic Modified Equations and Dynamics of Stochastic Gradient Algorithms I: Mathematical Foundations

    Qianxiao Li, Cheng Tai, and Weinan E. Stochastic Modified Equations and Dynamics of Stochastic Gradient Algorithms I: Mathematical Foundations, November 2018. arXiv:1811.01558 [cs]

  19. [19]

    On the Validity of Modeling SGD with Stochastic Differential Equations (SDEs), June 2021

    Zhiyuan Li, Sadhika Malladi, and Sanjeev Arora. On the Validity of Modeling SGD with Stochastic Differential Equations (SDEs), June 2021. arXiv:2102.12470 [cs]

  20. [20]

    L. Ljung. Analysis of recursive stochastic algorithms.IEEE Transactions on Automatic Control, 22(4):551–575, August 1977. Conference Name: IEEE Transactions on Automatic Control

  21. [21]

    Stochastic Gradient Descent as Approximate Bayesian Inference

    Stephan Mandt, Matthew D. Hoffman, and David M. Blei. Stochastic Gradient Descent as Approximate Bayesian Inference, January 2018. arXiv:1704.04289 [stat]

  22. [22]

    D. L. McLeish. Functional and random central limit theorems for the Robbins-Munro process.Journal of Applied Probability, 13(1):148–154, March 1976

  23. [23]

    A mean field view of the landscape of two-layer neural networks.Proceedings of the National Academy of Sciences, 115(33):E7665–E7671, August 2018

    Song Mei, Andrea Montanari, and Phan-Minh Nguyen. A mean field view of the landscape of two-layer neural networks.Proceedings of the National Academy of Sciences, 115(33):E7665–E7671, August 2018. Publisher: Proceedings of the National Academy of Sciences

  24. [24]

    Stochastic Gradient Descent, Weighted Sampling, and the Randomized Kaczmarz algorithm

    Deanna Needell, Rachel Ward, and Nati Srebro. Stochastic Gradient Descent, Weighted Sampling, and the Randomized Kaczmarz algorithm. InAdvances in Neural Information Processing Systems, volume 27. Curran Associates, Inc., 2014

  25. [25]

    A Stochastic Approximation Method.The Annals of Mathematical Statistics, 22(3):400–407, September 1951

    Herbert Robbins and Sutton Monro. A Stochastic Approximation Method.The Annals of Mathematical Statistics, 22(3):400–407, September 1951. Publisher: Institute of Mathematical Statistics

  26. [26]

    Rotskoff and Eric Vanden-Eijnden

    Grant M. Rotskoff and Eric Vanden-Eijnden. Trainability and Accuracy of Neural Networks: An Inter- acting Particle System Approach.Communications on Pure and Applied Mathematics, 75(9):1889–1935, September 2022. arXiv:1805.00915 [stat]

  27. [27]

    David Saad and Sara A. Solla. Exact Solution for On-Line Learning in Multilayer Neural Networks. Physical Review Letters, 74(21):4337–4340, May 1995. Publisher: American Physical Society

  28. [28]

    Mean Field Analysis of Neural Networks: A Central Limit Theorem

    Justin Sirignano and Konstantinos Spiliopoulos. Mean Field Analysis of Neural Networks: A Central Limit Theorem, June 2019. arXiv:1808.09372 [math]

  29. [29]

    Online Stochastic Gradient Descent with Arbitrary Initialization Solves Non-smooth, Non-convex Phase Retrieval, October 2019

    Yan Shuo Tan and Roman Vershynin. Online Stochastic Gradient Descent with Arbitrary Initialization Solves Non-smooth, Non-convex Phase Retrieval, October 2019. arXiv:1910.12837 [stat]

  30. [30]

    Phase diagram of Stochastic Gradient Descent in high-dimensional two-layer neural networks, June 2023

    Rodrigo Veiga, Ludovic Stephan, Bruno Loureiro, Florent Krzakala, and Lenka Zdeborov´ a. Phase diagram of Stochastic Gradient Descent in high-dimensional two-layer neural networks, June 2023. arXiv:2202.00293 [stat]. 11