Large Spikes in Stochastic Gradient Descent: A Large-Deviations View (Gess, Heydecker)
Recognition: 2 theorem links (Lean)
Pith reviewed 2026-05-15 13:26 UTC · model grok-4.3
The pith
Large loss spikes in SGD escape sharp minima and reduce curvature to favor flatter solutions.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
In the NTK scaling for shallow fully connected networks, large loss spikes during SGD occur with at least polynomial probability and form the dominant mechanism by which sharp minima are escaped and curvature is reduced, thereby favoring flatter solutions; the catapult phase splits into inflationary and deflationary regimes set by an explicit log-drift criterion.
What carries the argument
Large-deviations analysis that splits the catapult phase into inflationary and deflationary regimes via an explicit log-drift criterion and identifies spikes as the primary escape route from sharp minima.
If this is right
- Spikes occur with polynomial probability in both inflationary and deflationary regimes.
- Spikes dominate escape from sharp minima and the associated curvature reduction.
- This process favors flatter solutions over sharp ones.
- Analogous spike behavior and escape dynamics hold for certain ReLU networks.
- Curriculum learning can be designed to exploit the identified regimes.
Where Pith is reading between the lines
- If spike dominance extends to deeper networks, it may clarify why overparameterized models generalize despite non-convex loss landscapes.
- Training runs could be monitored for the log-drift threshold to predict when spikes will appear and alter curvature.
- The regime split suggests that learning-rate schedules inducing controlled spikes could improve final flatness without explicit regularization.
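The monitoring idea above can be sketched numerically. Below is a minimal sketch of the log-drift statistic G(λ₀) = Σᵢ pᵢ log|1 − ηλ₀sᵢ²| from the paper's criterion, assuming η is the learning rate, λ₀ a top-curvature proxy, and sᵢ² per-sample factors; the function name and toy values are illustrative, not from the paper:

```python
import numpy as np

def log_drift(eta, lam0, s_squared, probs=None):
    """Empirical log-drift G(lam0) = sum_i p_i * log|1 - eta*lam0*s_i^2|.

    The sign of G indicates whether the multiplicative recursion
    mu(t+1) = (1 - eta*lam0*s(t)^2) * mu(t) contracts (G < 0) or
    expands (G > 0) on average."""
    s_squared = np.asarray(s_squared, dtype=float)
    if probs is None:
        probs = np.full(s_squared.shape, 1.0 / s_squared.size)
    # Guard against log(0) when eta*lam0*s^2 hits exactly 1.
    factors = np.maximum(np.abs(1.0 - eta * lam0 * s_squared), 1e-300)
    return float(np.sum(probs * np.log(factors)))

# Small step: every factor |1 - eta*lam0*s^2| stays below 1 -> contracting drift.
G_small = log_drift(eta=0.01, lam0=1.0, s_squared=[0.5, 1.0, 1.5])
# Large step: most factors exceed 1 -> expanding drift.
G_large = log_drift(eta=3.0, lam0=1.0, s_squared=[0.5, 1.0, 1.5])
```

Logging this statistic over minibatches would give a running estimate of which side of the regime split a training run sits on; the paper's inflationary/deflationary labels track this sign change.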
Load-bearing premise
The analysis is limited to shallow fully connected networks in the NTK scaling regime.
What would settle it
An empirical run on the same shallow network class in which large spikes are rare or fail to produce measurable curvature reduction would falsify the dominance claim.
Original abstract
Large loss spikes in stochastic gradient descent are studied through a rigorous large-deviations analysis for a shallow, fully connected network in the NTK scaling. In contrast to full-batch gradient descent, the catapult phase is shown to split into inflationary and deflationary regimes, determined by an explicit log-drift criterion. In both cases, large spikes are shown to be at least polynomially likely. In addition, these spikes are shown to be the dominant mechanism by which sharp minima are escaped and curvature is reduced, thereby favouring flatter solutions. Corresponding results are also obtained for certain ReLU networks, and implications for curriculum learning are derived.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper conducts a rigorous large-deviations analysis of large loss spikes during stochastic gradient descent for shallow fully connected networks in the NTK scaling regime. It demonstrates that the catapult phase splits into inflationary and deflationary regimes according to an explicit log-drift criterion, shows that large spikes occur with at least polynomial probability in both regimes, and claims that these spikes constitute the dominant mechanism for escaping sharp minima, thereby reducing curvature and favoring flatter solutions. Analogous results are derived for certain ReLU networks, along with implications for curriculum learning.
Significance. If the central claims hold, particularly the dominance of spikes in escaping sharp minima, this work provides a valuable theoretical framework linking SGD dynamics to the preference for flat minima, which is often associated with better generalization. The explicit criteria and polynomial likelihood results strengthen the analysis, and the extension to ReLU networks and curriculum learning adds practical relevance. However, the restriction to shallow networks in the NTK regime limits immediate broader impact.
major comments (1)
- [Abstract] The claim that spikes are the dominant mechanism for escaping sharp minima (Abstract) requires an explicit variational comparison: the infimum of the large-deviation rate function over spike trajectories must be shown to be strictly smaller than over non-spike alternatives such as gradual diffusion or noise-driven curvature reduction. The log-drift criterion and polynomial-likelihood results establish occurrence but do not include this rate-function comparison, leaving dominance dependent on an implicit assumption.
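The missing step can be stated compactly; the following is an illustrative sketch of the required variational comparison, not the paper's own notation:

```latex
% I: large-deviation rate functional over escape trajectories \phi
% S: spike trajectories; C: non-spike alternatives (gradual diffusion, etc.)
\inf_{\phi \in \mathcal{S}} I(\phi) \;<\; \inf_{\phi \in \mathcal{C}} I(\phi)
```

Under a large-deviation principle, this strict inequality would make non-spike escape exponentially unlikely relative to spike escape, which is exactly what the dominance claim requires.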
minor comments (1)
- [Section 3] Clarify the precise statement of the log-drift criterion when transitioning between inflationary and deflationary regimes to ensure the splitting is unambiguous for readers.
Simulated Author's Rebuttal
We thank the referee for the careful reading and constructive feedback on our manuscript. We address the single major comment below and will incorporate a revision to strengthen the dominance claim.
Point-by-point responses
-
Referee: [Abstract] The claim that spikes are the dominant mechanism for escaping sharp minima (Abstract) requires an explicit variational comparison: the infimum of the large-deviation rate function over spike trajectories must be shown to be strictly smaller than over non-spike alternatives such as gradual diffusion or noise-driven curvature reduction. The log-drift criterion and polynomial-likelihood results establish occurrence but do not include this rate-function comparison, leaving dominance dependent on an implicit assumption.
Authors: We agree that an explicit variational comparison would make the dominance claim fully rigorous. The current analysis shows that spikes occur with at least polynomial probability under the log-drift criterion in both regimes, and that the NTK dynamics make such jumps the natural escape route from sharp minima. However, we did not compute or bound the infima of the rate function over spike versus non-spike (e.g., gradual diffusion) paths. In the revision we will add this comparison, deriving or bounding the large-deviation rate for representative spike trajectories and showing it is strictly smaller than the rate for continuous paths that must fight the mean drift over long times. This will be placed in the large-deviations section and referenced from the abstract. revision: yes
Circularity Check
No significant circularity; derivation applies external large-deviations theory
full rationale
The paper derives the inflationary/deflationary split via an explicit log-drift criterion and establishes polynomial likelihood of spikes using standard large-deviations principles applied to the NTK scaling regime for shallow FC networks. The dominance claim for spike-mediated escape is presented as a consequence of the rate-function analysis rather than reducing to a self-definition, fitted parameter renamed as prediction, or load-bearing self-citation chain. No equations or steps exhibit the enumerated circular patterns; the work remains self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption NTK scaling regime for shallow fully connected networks
- standard math Standard large-deviations principles apply to the stochastic gradient process
Lean theorems connected to this paper
-
IndisputableMonolith/Foundation/ArithmeticFromLogic.lean — embed_eq_pow, embed_add (tag: echoes)
ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
$\mu(t) \approx \prod_{u \le t} \bigl(1 - \eta\lambda_0 s_i(u)^2\bigr)\,\mu_0$; $\log|\mu(t)| \approx \log|\mu_0| + \sum_{u \le t} \log\bigl|1 - \eta\lambda_0 s_i(u)^2\bigr|$; $G(\lambda_0) = \sum_i p_i \log\bigl|1 - \eta\lambda_0 s_i^2\bigr|$; $\vartheta(\lambda) = \sup\bigl\{\theta : \sum_i p_i \bigl|1 - \eta\lambda s_i^2\bigr|^{\theta} \le 1\bigr\}$
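The exponent ϑ(λ) in the passage above can be approximated numerically. A minimal bisection sketch follows, relying on the fact that the moment function θ ↦ Σᵢ pᵢ|1 − ηλsᵢ²|^θ is convex and equals 1 at θ = 0, so its sublevel set is an interval [0, θ*]; function name and toy values are illustrative, not from the paper:

```python
import numpy as np

def theta_threshold(eta, lam, s_squared, probs, hi=64.0, tol=1e-8):
    """Bisection estimate of theta(lam) = sup{theta >= 0 :
    sum_i p_i * |1 - eta*lam*s_i^2|**theta <= 1}.

    Convexity of the moment function (value 1 at theta = 0) makes
    {theta : moment(theta) <= 1} an interval [0, theta*]."""
    base = np.abs(1.0 - eta * lam * np.asarray(s_squared, dtype=float))
    probs = np.asarray(probs, dtype=float)

    def moment(theta):
        return float(np.sum(probs * base ** theta))

    if moment(hi) <= 1.0:  # constraint never binds below hi
        return hi
    lo = 0.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if moment(mid) <= 1.0:
            lo = mid
        else:
            hi = mid
    return lo

# Toy deflationary mixture: one contracting factor (0.25), one expanding (1.5).
theta_star = theta_threshold(eta=2.5, lam=1.0,
                             s_squared=[0.5, 1.0], probs=[0.5, 0.5])
```

In the toy mixture the contracting factor buys room for the expanding one, giving a strictly positive θ*; putting all mass on an expanding factor collapses the threshold to 0.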
-
IndisputableMonolith/Cost/FunctionalEquation.lean — washburn_uniqueness_aczel, Jcost_pos_of_ne_one (tag: echoes)
ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
large spikes are, up to exponentially unlikely events, the only way … curvature reduced (Proposition 4.2); slow escape exponentially improbable
What do these tags mean?
- matches: the paper's claim is directly supported by a theorem in the formal canon.
- supports: the theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: the paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: the paper appears to rely on the theorem as machinery.
- contradicts: the paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
[1] Atish Agarwala and Jeffrey Pennington. High dimensional analysis reveals conservative sharpening and a stochastic edge of stability. CoRR, abs/2404.19261, 2024.
[2] Arseniy Andreyev and Pierfrancesco Beneventano. Edge of stochastic stability: Revisiting the edge of stability for SGD. arXiv preprint arXiv:2412.20553, 2024.
[3] Sanjeev Arora, Zhiyuan Li, and Abhishek Panigrahi. Understanding gradient descent on the edge of stability in deep learning. In International Conference on Machine Learning, pages 948–1024. PMLR, 2022.
[4] Waïss Azizian, Franck Iutzeler, Jerome Malick, and Panayotis Mertikopoulos. What is the long-run distribution of stochastic gradient descent? A large deviations analysis. In International Conference on Machine Learning, pages 2168–2229. PMLR, 2024.
[5] Waïss Azizian, Franck Iutzeler, Jerome Malick, and Panayotis Mertikopoulos. The global convergence time of stochastic gradient descent in non-convex landscapes: Sharp estimates via large deviations. In International Conference on Machine Learning, pages 1982–2044. PMLR, 2025.
[6] Guy Blanc, Neha Gupta, Gregory Valiant, and Paul Valiant. Implicit regularization for deep neural networks driven by an Ornstein-Uhlenbeck like process. In Conference on Learning Theory, pages 483–513. PMLR, 2020.
[7] Etienne Boursier and Nicolas Flammarion. Early alignment in two-layer networks training is a two-edged sword. Journal of Machine Learning Research, 26(183):1–75, 2025.
[8] Etienne Boursier and Nicolas Flammarion. Simplicity bias and optimization threshold in two-layer ReLU networks. In ICML 2025 - International Conference on Machine Learning, volume 267, 2025.
[9] Dennis Chemnitz and Maximilian Engel. Characterizing dynamical stability of stochastic gradient descent in overparameterized learning. Journal of Machine Learning Research, 26(134):1–46, 2025.
[10] Lenaic Chizat, Edouard Oyallon, and Francis Bach. On lazy training in differentiable programming. Advances in Neural Information Processing Systems, 32, 2019.
[11] Damai Dai, Li Dong, Shuming Ma, Bo Zheng, Zhifang Sui, Baobao Chang, and Furu Wei. StableMoE: Stable routing strategy for mixture of experts. In ACL (1), pages 7085–7095, 2022.
[12] Alex Damian, Eshaan Nichani, and Jason D Lee. Self-stabilization: The implicit bias of gradient descent at the edge of stability. arXiv preprint arXiv:2209.15594, 2022.
[13] Léo Dana, Loucas Pillaud-Vivien, and Francis Bach. Convergence of shallow ReLU networks on weakly interacting data. In Neural Information Processing Systems 2025, 2025.
[14] Yann N Dauphin, Razvan Pascanu, Caglar Gulcehre, Kyunghyun Cho, Surya Ganguli, and Yoshua Bengio. Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. Advances in Neural Information Processing Systems, 27, 2014.
[15] Amir Dembo and Ofer Zeitouni. Large Deviations Techniques and Applications, volume 38. Springer Science & Business Media, 2009.
[16] Simon S Du, Xiyu Zhai, Barnabas Poczos, and Aarti Singh. Gradient descent provably optimizes over-parameterized neural networks. In International Conference on Learning Representations.
[17] Mathieu Even, Scott Pesme, Suriya Gunasekar, and Nicolas Flammarion. (S)GD over diagonal linear networks: Implicit bias, large stepsizes and edge of stability. Advances in Neural Information Processing Systems, 36:29406–29448, 2023.
[18] William Fedus, Barret Zoph, and Noam Shazeer. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. Journal of Machine Learning Research, 23(120):1–39, 2022.
[19] Zhengcong Fei, Mingyuan Fan, Changqian Yu, Debang Li, and Junshi Huang. Scaling diffusion transformers to 16 billion parameters. arXiv preprint arXiv:2407.11633, 2024.
[20] Jonathan Frankle, David J Schwab, and Ari S Morcos. The early phase of neural network training. In International Conference on Learning Representations.
[21] Rainer Gemulla, Erik Nijkamp, Peter J Haas, and Yannis Sismanis. Large-scale matrix factorization with distributed stochastic gradient descent. In Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 69–77, 2011.
[22] Justin Gilmer, Behrooz Ghorbani, Ankush Garg, Sneha Kudugunta, Behnam Neyshabur, David Cardoze, George Edward Dahl, Zachary Nado, and Orhan Firat. A loss curvature perspective on training instabilities of deep learning models. In International Conference on Learning Representations, 2022.
[23] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[24] Sepp Hochreiter and Jürgen Schmidhuber. Flat minima. Neural Computation, 9(1):1–42, 1997.
[25] Frank Hollander. Large Deviations, volume 14. American Mathematical Soc., 2000.
[26] Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Weinberger. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4700–4708, 2017.
[27] Hikaru Ibayashi and Masaaki Imaizumi. Quasi-potential theory for escape problem: Quantitative sharpness effect on SGD's escape from local minima. 2021.
[28] Arthur Jacot, Franck Gabriel, and Clément Hongler. Neural tangent kernel: Convergence and generalization in neural networks. Advances in Neural Information Processing Systems, 31, 2018.
[29] Stanisław Jastrzębski, Zachary Kenton, Devansh Arpit, Nicolas Ballas, Asja Fischer, Yoshua Bengio, and Amos Storkey. Three factors influencing minima in SGD. arXiv preprint arXiv:1711.04623, 2017.
[30] Nitish Shirish Keskar and Richard Socher. Improving generalization performance by switching from Adam to SGD. arXiv preprint arXiv:1712.07628, 2017.
[31] Guanghui Lan. First-order and Stochastic Optimization Methods for Machine Learning, volume 1. Springer, 2020.
[32] Guillaume Leclerc and Aleksander Madry. The two regimes of deep network training. arXiv preprint arXiv:2002.10376, 2020.
[33] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436–444, 2015.
[34] Yann LeCun, Léon Bottou, Genevieve B Orr, and Klaus-Robert Müller. Efficient backprop. In Neural Networks: Tricks of the Trade, pages 9–50. Springer, 2002.
[35] Aitor Lewkowycz, Yasaman Bahri, Ethan Dyer, Jascha Sohl-Dickstein, and Guy Gur-Ari. The large learning rate phase of deep learning: the catapult mechanism. arXiv preprint arXiv:2003.02218, 2020.
[36] Qianxiao Li, Cheng Tai, et al. Stochastic modified equations and adaptive stochastic gradient algorithms. In International Conference on Machine Learning, pages 2101–2110. PMLR, 2017.
[37] Yuanzhi Li and Yingyu Liang. Learning overparameterized neural networks via stochastic gradient descent on structured data. Advances in Neural Information Processing Systems, 31, 2018.
[38] Yuanzhi Li, Colin Wei, and Tengyu Ma. Towards explaining the regularization effect of initial large learning rate in training neural networks. Advances in Neural Information Processing Systems, 32, 2019.
[39] Zhiyuan Li, Tianhao Wang, and Sanjeev Arora. What happens after SGD reaches zero loss? A mathematical framework. In International Conference on Learning Representations, 2022.
[40] Shuang Liang and Guido Montufar. Gradient descent with large step sizes: Chaos and fractal convergence region. In The Fourteenth International Conference on Learning Representations, 2026.
[41] Lennart Ljung. Analysis of recursive stochastic algorithms. IEEE Transactions on Automatic Control, 22(4):551–575, 2003.
[42] Miao Lu, Beining Wu, Xiaodong Yang, and Difan Zou. Benign oscillation of stochastic gradient descent with large learning rate. In The Twelfth International Conference on Learning Representations, 2023.
[43] Pierre Marion and Lénaïc Chizat. Deep linear networks for regression are implicitly regularized towards flat minima. Advances in Neural Information Processing Systems, 37:76848–76900, 2024.
[44] David Meltzer, Min Chen, and Junyu Liu. Catapult dynamics and phase transitions in quadratic nets. Journal of Statistical Mechanics: Theory and Experiment, 2025(9):093406, 2025.
[45] Panayotis Mertikopoulos, Nadav Hallak, Ali Kavis, and Volkan Cevher. On the almost sure convergence of stochastic gradient descent in non-convex problems. Advances in Neural Information Processing Systems, 33:1117–1128, 2020.
[46] Scott Pesme, Loucas Pillaud-Vivien, and Nicolas Flammarion. Implicit bias of SGD for diagonal linear networks: a provable benefit of stochasticity. Advances in Neural Information Processing Systems, 34:29218–29230, 2021.
[47] Zihan Qiu, Zeyu Huang, Shuang Cheng, Yizhi Zhou, Zili Wang, Ivan Titov, and Jie Fu. Layerwise recurrent router for mixture-of-experts. In The Thirteenth International Conference on Learning Representations, 2025.
[48] Herbert Robbins and Sutton Monro. A stochastic approximation method. The Annals of Mathematical Statistics, pages 400–407, 1951.
[49] Stephen Roller, Sainbayar Sukhbaatar, Jason Weston, et al. Hash layers for large sparse models. Advances in Neural Information Processing Systems, 34:17555–17566, 2021.
[50] Sebastian Ruder. An overview of gradient descent optimization algorithms. arXiv preprint arXiv:1609.04747, 2016.
[51] Pedro Savarese, Itay Evron, Daniel Soudry, and Nathan Srebro. How do infinite width bounded norm networks look in function space? In Conference on Learning Theory, pages 2667–2690. PMLR, 2019.
[52] Jean Seo, Jaeyoon Kim, and Hyopil Shin. MoFE: Mixture of frozen experts architecture. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 3: Industry Track), pages 340–348, 2025.
[53] Anna Shalova, André Schlichting, and Mark Peletier. Singular-limit analysis of gradient descent with noise injection. arXiv preprint arXiv:2404.12293, 2024.
[54] Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538, 2017.
[55] Leslie N Smith and Nicholay Topin. Super-convergence: Very fast training of neural networks using large learning rates. In Artificial Intelligence and Machine Learning for Multi-Domain Operations Applications, volume 11006, pages 369–386. SPIE, 2019.
[56] W Stummer and K-Th Sturm. On exponentials of additive functionals of Markov processes. Stochastic Processes and their Applications, 85(1):45–60, 2000.
[57] Ling Team, Binwei Zeng, Chao Huang, Chao Zhang, Changxin Tian, Cong Chen, Dingnan Jin, Feng Yu, Feng Zhu, Feng Yuan, et al. Every FLOP counts: Scaling a 300B mixture-of-experts Ling LLM without premium GPUs. arXiv preprint arXiv:2503.05139, 2025.
[58] Bhavya Vasudeva, Jung Whan Lee, Vatsal Sharan, and Mahdi Soltanolkotabi. The rich and the simple: On the implicit bias of Adam and SGD. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025.
[59] David Williams. Probability with Martingales. Cambridge University Press, 1991.
[60] Francis Williams, Matthew Trager, Daniele Panozzo, Claudio Silva, Denis Zorin, and Joan Bruna. Gradient dynamics of shallow univariate ReLU networks. Advances in Neural Information Processing Systems, 32, 2019.
[61] Stephan Wojtowytsch. On the convergence of gradient descent training for two-layer ReLU-networks in the mean field regime. arXiv preprint arXiv:2005.13530, 2020.
[62] Lei Wu, Mingze Wang, and Weijie Su. The alignment property of SGD noise and how it helps select flat minima: A stability analysis. Advances in Neural Information Processing Systems, 35:4680–4693, 2022.
[63] Zeke Xie, Issei Sato, and Masashi Sugiyama. A diffusion theory for deep learning dynamics: Stochastic gradient descent exponentially favors flat minima. In International Conference on Learning Representations, 2021.
[64] Chen Xing, Devansh Arpit, Christos Tsirigotis, and Yoshua Bengio. A walk with SGD. arXiv preprint arXiv:1802.08770, 2018.
[65] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. In British Machine Vision Conference 2016. British Machine Vision Association, 2016.
[66] Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning (still) requires rethinking generalization. Communications of the ACM, 64(3):107–115, 2021.
[67] Libin Zhu, Chaoyue Liu, Adityanarayanan Radhakrishnan, and Mikhail Belkin. Quadratic models for understanding catapult dynamics of neural networks. In The Twelfth International Conference on Learning Representations.
[68] Libin Zhu, Chaoyue Liu, Adityanarayanan Radhakrishnan, and Mikhail Belkin. Catapults in SGD: spikes in the training loss and their impact on generalization through feature learning. In Forty-first International Conference on Machine Learning, 2024.