pith. machine review for the scientific record.

arxiv: 2603.10079 · v2 · submitted 2026-03-10 · 💻 cs.LG · math.PR

Recognition: 2 theorem links

· Lean Theorem

Large Spikes in Stochastic Gradient Descent: A Large-Deviations View

Authors on Pith · no claims yet

Pith reviewed 2026-05-15 13:26 UTC · model grok-4.3

classification 💻 cs.LG math.PR
keywords stochastic gradient descent · large deviations · loss spikes · catapult phase · NTK scaling · flat minima · curvature reduction

The pith

Large loss spikes in SGD escape sharp minima and reduce curvature to favor flatter solutions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines large loss spikes in stochastic gradient descent using large-deviations analysis on shallow fully connected networks in the NTK scaling regime. It shows that the catapult phase splits into inflationary and deflationary regimes according to an explicit log-drift criterion, with spikes occurring with at least polynomial probability in both cases. These spikes are established as the dominant process for escaping sharp minima while lowering curvature, which promotes flatter solutions. The same conclusions are reached for certain ReLU networks, along with implications for curriculum learning.

Core claim

In the NTK scaling for shallow fully connected networks, large loss spikes during SGD occur with at least polynomial probability and form the dominant mechanism by which sharp minima are escaped and curvature is reduced, thereby favoring flatter solutions; the catapult phase splits into inflationary and deflationary regimes set by an explicit log-drift criterion.

What carries the argument

Large-deviations analysis that splits the catapult phase into inflationary and deflationary regimes via an explicit log-drift criterion and identifies spikes as the primary escape route from sharp minima.
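Read schematically, and without reproducing the paper's definitions, the claim appears to combine a polynomial lower bound on the spike probability (the exponent ϑ(λ) plotted in Figure 3) with a sign condition on a log-drift quantity (the G(λ) of Figure 2). Which sign corresponds to which regime is an assumption here, not taken from the paper:

```latex
% Schematic only: \vartheta(\lambda) and G(\lambda) are the paper's objects (cf. Figures 2-3);
% their precise definitions, and which sign marks which regime, are assumptions in this sketch.
\[
  \mathbb{P}\bigl(\text{large loss spike}\bigr) \;\gtrsim\; n^{-\vartheta(\lambda)/2},
  \qquad
  \begin{cases}
    G(\lambda) > 0 & \text{inflationary regime},\\[2pt]
    G(\lambda) < 0 & \text{deflationary regime}.
  \end{cases}
\]
```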

If this is right

  • Spikes occur with at least polynomial probability in both inflationary and deflationary regimes.
  • Spikes dominate escape from sharp minima and the associated curvature reduction.
  • This process favors flatter solutions over sharp ones.
  • Analogous spike behavior and escape dynamics hold for certain ReLU networks.
  • Curriculum learning can be designed to exploit the identified regimes.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • If spike dominance extends to deeper networks, it may clarify why overparameterized models generalize despite non-convex loss landscapes.
  • Training runs could be monitored for the log-drift threshold to predict when spikes will appear and alter curvature; a minimal monitoring sketch follows this list.
  • The regime split suggests that learning-rate schedules inducing controlled spikes could improve final flatness without explicit regularization.
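As a concrete reading of the second bullet, here is a minimal, hypothetical monitoring sketch. The empirical mean of one-step log loss ratios is used as a stand-in for the paper's log-drift statistic, whose exact definition is not reproduced here; the 10x jump threshold and window size are arbitrary.

```python
# Hypothetical monitoring sketch (not the paper's procedure): track the running mean
# of one-step log loss ratios as an empirical "log-drift" proxy, and flag steps where
# the loss jumps by a large factor as candidate spikes.
import numpy as np

def monitor_run(losses, spike_factor=10.0, window=100):
    """losses: 1-D array of per-step training losses from an SGD run."""
    losses = np.asarray(losses, dtype=float)
    log_ratios = np.log(losses[1:] / losses[:-1])            # one-step log loss ratios
    kernel = np.ones(window) / window
    log_drift = np.convolve(log_ratios, kernel, mode="valid")  # trailing-window mean
    spikes = np.where(log_ratios > np.log(spike_factor))[0] + 1  # steps with >10x jumps
    return log_drift, spikes
```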

Load-bearing premise

The analysis is limited to shallow fully connected networks in the NTK scaling regime.

What would settle it

An empirical run on the same shallow network class in which large spikes are rare or fail to produce measurable curvature reduction would falsify the dominance claim.
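A minimal sketch of such a run, under assumptions not taken from the paper (tanh activation, single-sample SGD, an illustrative learning rate, a 10x loss jump as the spike criterion, and power iteration on Hessian-vector products as the sharpness measure):

```python
# Hedged experimental sketch (not the paper's setup): train a shallow, NTK-scaled
# fully connected network with single-sample SGD, record loss spikes, and track a
# sharpness proxy (top Hessian eigenvalue via power iteration). Rare spikes, or
# spikes with no accompanying curvature drop, would argue against the dominance
# claim. Widths, learning rate, and thresholds below are illustrative only.
import torch

torch.manual_seed(0)
n, d, m, lr, steps = 32, 16, 512, 2.0, 5000
X, y = torch.randn(n, d), torch.randn(n)

W = torch.randn(m, d, requires_grad=True)   # hidden weights
a = torch.randn(m, requires_grad=True)      # output weights
params = [W, a]

def f(x):                                   # NTK scaling: 1/sqrt(m) output factor
    return (torch.tanh(x @ W.t()) @ a) / m ** 0.5

def loss_on(idx):
    return 0.5 * ((f(X[idx]) - y[idx]) ** 2).mean()

def sharpness(iters=20):
    """Top Hessian eigenvalue of the full-batch loss, via power iteration on HVPs."""
    v = [torch.randn_like(p) for p in params]
    for _ in range(iters):
        g = torch.autograd.grad(loss_on(slice(None)), params, create_graph=True)
        gv = sum((gi * vi).sum() for gi, vi in zip(g, v))
        hv = torch.autograd.grad(gv, params)
        norm = torch.sqrt(sum((h ** 2).sum() for h in hv))
        v = [h / norm for h in hv]
    return norm.item()

losses, spikes, sharps = [], [], []
for t in range(steps):
    i = torch.randint(0, n, (1,))
    L = loss_on(i)
    for p, g in zip(params, torch.autograd.grad(L, params)):
        p.data -= lr * g
    losses.append(L.item())
    if t > 0 and losses[-1] > 10 * losses[-2]:
        spikes.append(t)                    # candidate large spike
    if t % 500 == 0:
        sharps.append((t, sharpness()))     # sharpness trace over training

print(f"{len(spikes)} spikes; sharpness trace: {sharps}")
```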

Figures

Figures reproduced from arXiv: 2603.10079 by Benjamin Gess, Daniel Heydecker.

Figure 1: Extension of the phase diagram [67, …]
Figure 2: Plots of G(λ) for the examples (1.17–1.18).
Figure 3: max(1, ϑ(λ)) (blue) and n^{−ϑ(λ)/2} with n = 10^{12} (red) for the dataset (1.28).
read the original abstract

Large loss spikes in stochastic gradient descent are studied through a rigorous large-deviations analysis for a shallow, fully connected network in the NTK scaling. In contrast to full-batch gradient descent, the catapult phase is shown to split into inflationary and deflationary regimes, determined by an explicit log-drift criterion. In both cases, large spikes are shown to be at least polynomially likely. In addition, these spikes are shown to be the dominant mechanism by which sharp minima are escaped and curvature is reduced, thereby favouring flatter solutions. Corresponding results are also obtained for certain ReLU networks, and implications for curriculum learning are derived.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper conducts a rigorous large-deviations analysis of large loss spikes during stochastic gradient descent for shallow fully connected networks in the NTK scaling regime. It demonstrates that the catapult phase splits into inflationary and deflationary regimes according to an explicit log-drift criterion, shows that large spikes occur with at least polynomial probability in both regimes, and claims that these spikes constitute the dominant mechanism for escaping sharp minima, thereby reducing curvature and favoring flatter solutions. Analogous results are derived for certain ReLU networks, along with implications for curriculum learning.

Significance. If the central claims hold, particularly the dominance of spikes in escaping sharp minima, this work provides a valuable theoretical framework linking SGD dynamics to the preference for flat minima, which is often associated with better generalization. The explicit criteria and polynomial likelihood results strengthen the analysis, and the extension to ReLU networks and curriculum learning adds practical relevance. However, the restriction to shallow networks in the NTK regime limits immediate broader impact.

major comments (1)
  1. [Abstract] The claim that spikes are the dominant mechanism for escaping sharp minima (Abstract) requires an explicit variational comparison: the infimum of the large-deviation rate function over spike trajectories must be shown to be strictly smaller than over non-spike alternatives such as gradual diffusion or noise-driven curvature reduction. The log-drift criterion and polynomial-likelihood results establish occurrence but do not include this rate-function comparison, leaving dominance dependent on an implicit assumption.
minor comments (1)
  1. [Section 3] Clarify the precise statement of the log-drift criterion when transitioning between inflationary and deflationary regimes to ensure the splitting is unambiguous for readers.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the careful reading and constructive feedback on our manuscript. We address the single major comment below and will incorporate a revision to strengthen the dominance claim.

read point-by-point responses
  1. Referee: [Abstract] The claim that spikes are the dominant mechanism for escaping sharp minima (Abstract) requires an explicit variational comparison: the infimum of the large-deviation rate function over spike trajectories must be shown to be strictly smaller than over non-spike alternatives such as gradual diffusion or noise-driven curvature reduction. The log-drift criterion and polynomial-likelihood results establish occurrence but do not include this rate-function comparison, leaving dominance dependent on an implicit assumption.

    Authors: We agree that an explicit variational comparison would make the dominance claim fully rigorous. The current analysis shows that spikes occur with at least polynomial probability under the log-drift criterion in both regimes, and that the NTK dynamics make such jumps the natural escape route from sharp minima. However, we did not compute or bound the infima of the rate function over spike versus non-spike (e.g., gradual diffusion) paths. In the revision we will add this comparison, deriving or bounding the large-deviation rate for representative spike trajectories and showing it is strictly smaller than the rate for continuous paths that must fight the mean drift over long times. This will be placed in the large-deviations section and referenced from the abstract. revision: yes
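In symbols, the missing comparison that both the referee and the simulated rebuttal point to would be an inequality between infima of the rate functional; the notation below is ours, not the paper's:

```latex
% Schematic: I denotes the large-deviations rate functional for the SGD dynamics,
% and the two infima run over escape trajectories with and without a large spike.
\[
  \inf_{\gamma \,\in\, \{\text{escape paths with a large spike}\}} I(\gamma)
  \;<\;
  \inf_{\gamma \,\in\, \{\text{escape paths without a spike}\}} I(\gamma),
\]
% so that, on large-deviations scales, spike-mediated escape from a sharp minimum
% dominates gradual (non-spike) escape.
```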

Circularity Check

0 steps flagged

No significant circularity; derivation applies external large-deviations theory

full rationale

The paper derives the inflationary/deflationary split via an explicit log-drift criterion and establishes polynomial likelihood of spikes using standard large-deviations principles applied to the NTK scaling regime for shallow FC networks. The dominance claim for spike-mediated escape is presented as a consequence of the rate-function analysis rather than reducing to a self-definition, fitted parameter renamed as prediction, or load-bearing self-citation chain. No equations or steps exhibit the enumerated circular patterns; the work remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claims rest on the NTK scaling limit for shallow networks and standard results from large-deviations theory; no free parameters or invented entities are introduced in the abstract.

axioms (2)
  • domain assumption NTK scaling regime for shallow fully connected networks
    Invoked to enable the large-deviations analysis of the loss dynamics.
  • standard math Standard large-deviations principles apply to the stochastic gradient process
    Used to derive the probability of large spikes and the log-drift criterion.
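For concreteness, the NTK scaling invoked in the first axiom is usually written, for a width-m shallow network, with a 1/√m output factor so that the tangent kernel concentrates as m grows; the paper's exact parametrization may differ from this standard form:

```latex
% Standard shallow NTK parametrization (assumed form, not quoted from the paper):
\[
  f_\theta(x) \;=\; \frac{1}{\sqrt{m}} \sum_{i=1}^{m} a_i\, \sigma(w_i \cdot x),
  \qquad \theta = (a_i, w_i)_{i=1}^{m},
\]
% and in the large-width limit the training dynamics are governed by the
% (nearly constant) neural tangent kernel
\[
  \Theta(x, x') \;=\; \nabla_\theta f_\theta(x) \cdot \nabla_\theta f_\theta(x').
\]
```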

pith-pipeline@v0.9.0 · 5397 in / 1233 out tokens · 49167 ms · 2026-05-15T13:26:17.905551+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

68 extracted references · 68 canonical work pages · 5 internal anchors

  1. [1] Atish Agarwala and Jeffrey Pennington. High dimensional analysis reveals conservative sharpening and a stochastic edge of stability. CoRR, abs/2404.19261, 2024.

  2. [2] Arseniy Andreyev and Pierfrancesco Beneventano. Edge of stochastic stability: Revisiting the edge of stability for SGD. arXiv preprint arXiv:2412.20553, 2024.

  3. [3] Sanjeev Arora, Zhiyuan Li, and Abhishek Panigrahi. Understanding gradient descent on the edge of stability in deep learning. In International Conference on Machine Learning, pages 948–1024. PMLR, 2022.

  4. [4] Waïss Azizian, Franck Iutzeler, Jerome Malick, and Panayotis Mertikopoulos. What is the long-run distribution of stochastic gradient descent? A large deviations analysis. In International Conference on Machine Learning, pages 2168–2229. PMLR, 2024.

  5. [5] Waïss Azizian, Franck Iutzeler, Jerome Malick, and Panayotis Mertikopoulos. The global convergence time of stochastic gradient descent in non-convex landscapes: Sharp estimates via large deviations. In International Conference on Machine Learning, pages 1982–2044. PMLR, 2025.

  6. [6] Guy Blanc, Neha Gupta, Gregory Valiant, and Paul Valiant. Implicit regularization for deep neural networks driven by an Ornstein-Uhlenbeck like process. In Conference on Learning Theory, pages 483–513. PMLR, 2020.

  7. [7] Etienne Boursier and Nicolas Flammarion. Early alignment in two-layer networks training is a two-edged sword. Journal of Machine Learning Research, 26(183):1–75, 2025.

  8. [8] Etienne Boursier and Nicolas Flammarion. Simplicity bias and optimization threshold in two-layer ReLU networks. In ICML 2025 - International Conference on Machine Learning, volume 267, 2025.

  9. [9] Dennis Chemnitz and Maximilian Engel. Characterizing dynamical stability of stochastic gradient descent in overparameterized learning. Journal of Machine Learning Research, 26(134):1–46, 2025.

  10. [10] Lenaic Chizat, Edouard Oyallon, and Francis Bach. On lazy training in differentiable programming. Advances in Neural Information Processing Systems, 32, 2019.

  11. [11] Damai Dai, Li Dong, Shuming Ma, Bo Zheng, Zhifang Sui, Baobao Chang, and Furu Wei. StableMoE: Stable routing strategy for mixture of experts. In ACL (1), pages 7085–7095, 2022.

  12. [12] Alex Damian, Eshaan Nichani, and Jason D Lee. Self-stabilization: The implicit bias of gradient descent at the edge of stability. arXiv preprint arXiv:2209.15594, 2022.

  13. [13] Léo Dana, Loucas Pillaud-Vivien, and Francis Bach. Convergence of shallow ReLU networks on weakly interacting data. In Neural Information Processing Systems 2025, 2025.

  14. [14] Yann N Dauphin, Razvan Pascanu, Caglar Gulcehre, Kyunghyun Cho, Surya Ganguli, and Yoshua Bengio. Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. Advances in Neural Information Processing Systems, 27, 2014.

  15. [15] Amir Dembo and Ofer Zeitouni. Large deviations techniques and applications, volume 38. Springer Science & Business Media, 2009.

  16. [16] Simon S Du, Xiyu Zhai, Barnabas Poczos, and Aarti Singh. Gradient descent provably optimizes over-parameterized neural networks. In International Conference on Learning Representations.

  17. [17] Mathieu Even, Scott Pesme, Suriya Gunasekar, and Nicolas Flammarion. (S)GD over diagonal linear networks: Implicit bias, large stepsizes and edge of stability. Advances in Neural Information Processing Systems, 36:29406–29448, 2023.

  18. [18] William Fedus, Barret Zoph, and Noam Shazeer. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. Journal of Machine Learning Research, 23(120):1–39, 2022.

  19. [19] Zhengcong Fei, Mingyuan Fan, Changqian Yu, Debang Li, and Junshi Huang. Scaling diffusion transformers to 16 billion parameters. arXiv preprint arXiv:2407.11633, 2024.

  20. [20] Jonathan Frankle, David J Schwab, and Ari S Morcos. The early phase of neural network training. In International Conference on Learning Representations.

  21. [21] Rainer Gemulla, Erik Nijkamp, Peter J Haas, and Yannis Sismanis. Large-scale matrix factorization with distributed stochastic gradient descent. In Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 69–77, 2011.

  22. [22] Justin Gilmer, Behrooz Ghorbani, Ankush Garg, Sneha Kudugunta, Behnam Neyshabur, David Cardoze, George Edward Dahl, Zachary Nado, and Orhan Firat. A loss curvature perspective on training instabilities of deep learning models. In International Conference on Learning Representations, 2022.

  23. [23] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.

  24. [24] Sepp Hochreiter and Jürgen Schmidhuber. Flat minima. Neural Computation, 9(1):1–42, 1997.

  25. [25] Frank Hollander. Large deviations, volume 14. American Mathematical Soc., 2000.

  26. [26] Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Weinberger. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4700–4708, 2017.

  27. [27] Hikaru Ibayashi and Masaaki Imaizumi. Quasi-potential theory for escape problem: Quantitative sharpness effect on SGD's escape from local minima. 2021.

  28. [28] Arthur Jacot, Franck Gabriel, and Clément Hongler. Neural tangent kernel: Convergence and generalization in neural networks. Advances in Neural Information Processing Systems, 31, 2018.

  29. [29] Stanisław Jastrzębski, Zachary Kenton, Devansh Arpit, Nicolas Ballas, Asja Fischer, Yoshua Bengio, and Amos Storkey. Three factors influencing minima in SGD. arXiv preprint arXiv:1711.04623, 2017.

  30. [30] Nitish Shirish Keskar and Richard Socher. Improving generalization performance by switching from Adam to SGD. arXiv preprint arXiv:1712.07628, 2017.

  31. [31] Guanghui Lan. First-order and stochastic optimization methods for machine learning, volume 1. Springer, 2020.

  32. [32] Guillaume Leclerc and Aleksander Madry. The two regimes of deep network training. arXiv preprint arXiv:2002.10376, 2020.

  33. [33] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436–444, 2015.

  34. [34] Yann LeCun, Léon Bottou, Genevieve B Orr, and Klaus-Robert Müller. Efficient backprop. In Neural Networks: Tricks of the Trade, pages 9–50. Springer, 2002.

  35. [35] Aitor Lewkowycz, Yasaman Bahri, Ethan Dyer, Jascha Sohl-Dickstein, and Guy Gur-Ari. The large learning rate phase of deep learning: the catapult mechanism. arXiv preprint arXiv:2003.02218, 2020.

  36. [36] Qianxiao Li, Cheng Tai, et al. Stochastic modified equations and adaptive stochastic gradient algorithms. In International Conference on Machine Learning, pages 2101–2110. PMLR, 2017.

  37. [37] Yuanzhi Li and Yingyu Liang. Learning overparameterized neural networks via stochastic gradient descent on structured data. Advances in Neural Information Processing Systems, 31, 2018.

  38. [38] Yuanzhi Li, Colin Wei, and Tengyu Ma. Towards explaining the regularization effect of initial large learning rate in training neural networks. Advances in Neural Information Processing Systems, 32, 2019.

  39. [39] Zhiyuan Li, Tianhao Wang, and Sanjeev Arora. What happens after SGD reaches zero loss? A mathematical framework. In International Conference on Learning Representations, 2022.

  40. [40] Shuang Liang and Guido Montufar. Gradient descent with large step sizes: Chaos and fractal convergence region. In The Fourteenth International Conference on Learning Representations, 2026.

  41. [41] Lennart Ljung. Analysis of recursive stochastic algorithms. IEEE Transactions on Automatic Control, 22(4):551–575, 2003.

  42. [42] Miao Lu, Beining Wu, Xiaodong Yang, and Difan Zou. Benign oscillation of stochastic gradient descent with large learning rate. In The Twelfth International Conference on Learning Representations, 2023.

  43. [43] Pierre Marion and Lénaïc Chizat. Deep linear networks for regression are implicitly regularized towards flat minima. Advances in Neural Information Processing Systems, 37:76848–76900, 2024.

  44. [44] David Meltzer, Min Chen, and Junyu Liu. Catapult dynamics and phase transitions in quadratic nets. Journal of Statistical Mechanics: Theory and Experiment, 2025(9):093406, 2025.

  45. [45] Panayotis Mertikopoulos, Nadav Hallak, Ali Kavis, and Volkan Cevher. On the almost sure convergence of stochastic gradient descent in non-convex problems. Advances in Neural Information Processing Systems, 33:1117–1128, 2020.

  46. [46] Scott Pesme, Loucas Pillaud-Vivien, and Nicolas Flammarion. Implicit bias of SGD for diagonal linear networks: a provable benefit of stochasticity. Advances in Neural Information Processing Systems, 34:29218–29230, 2021.

  47. [47] Zihan Qiu, Zeyu Huang, Shuang Cheng, Yizhi Zhou, Zili Wang, Ivan Titov, and Jie Fu. Layerwise recurrent router for mixture-of-experts. In The Thirteenth International Conference on Learning Representations, 2025.

  48. [48] Herbert Robbins and Sutton Monro. A stochastic approximation method. The Annals of Mathematical Statistics, pages 400–407, 1951.

  49. [49] Stephen Roller, Sainbayar Sukhbaatar, Jason Weston, et al. Hash layers for large sparse models. Advances in Neural Information Processing Systems, 34:17555–17566, 2021.

  50. [50] Sebastian Ruder. An overview of gradient descent optimization algorithms. arXiv preprint arXiv:1609.04747, 2016.

  51. [51] Pedro Savarese, Itay Evron, Daniel Soudry, and Nathan Srebro. How do infinite width bounded norm networks look in function space? In Conference on Learning Theory, pages 2667–2690. PMLR, 2019.

  52. [52] Jean Seo, Jaeyoon Kim, and Hyopil Shin. MoFE: Mixture of frozen experts architecture. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 3: Industry Track), pages 340–348, 2025.

  53. [53] Anna Shalova, André Schlichting, and Mark Peletier. Singular-limit analysis of gradient descent with noise injection. arXiv preprint arXiv:2404.12293, 2024.

  54. [54] Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538, 2017.

  55. [55] Leslie N Smith and Nicholay Topin. Super-convergence: Very fast training of neural networks using large learning rates. In Artificial Intelligence and Machine Learning for Multi-Domain Operations Applications, volume 11006, pages 369–386. SPIE, 2019.

  56. [56] W Stummer and K-Th Sturm. On exponentials of additive functionals of Markov processes. Stochastic Processes and their Applications, 85(1):45–60, 2000.

  57. [57] Ling Team, Binwei Zeng, Chao Huang, Chao Zhang, Changxin Tian, Cong Chen, Dingnan Jin, Feng Yu, Feng Zhu, Feng Yuan, et al. Every FLOP counts: Scaling a 300B mixture-of-experts Ling LLM without premium GPUs. arXiv preprint arXiv:2503.05139, 2025.

  58. [58] Bhavya Vasudeva, Jung Whan Lee, Vatsal Sharan, and Mahdi Soltanolkotabi. The rich and the simple: On the implicit bias of Adam and SGD. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025.

  59. [59] David Williams. Probability with martingales. Cambridge University Press, 1991.

  60. [60] Francis Williams, Matthew Trager, Daniele Panozzo, Claudio Silva, Denis Zorin, and Joan Bruna. Gradient dynamics of shallow univariate ReLU networks. Advances in Neural Information Processing Systems, 32, 2019.

  61. [61] Stephan Wojtowytsch. On the convergence of gradient descent training for two-layer ReLU-networks in the mean field regime. arXiv preprint arXiv:2005.13530, 2020.

  62. [62] Lei Wu, Mingze Wang, and Weijie Su. The alignment property of SGD noise and how it helps select flat minima: A stability analysis. Advances in Neural Information Processing Systems, 35:4680–4693, 2022.

  63. [63] Zeke Xie, Issei Sato, and Masashi Sugiyama. A diffusion theory for deep learning dynamics: Stochastic gradient descent exponentially favors flat minima. In International Conference on Learning Representations, 2021.

  64. [64] Chen Xing, Devansh Arpit, Christos Tsirigotis, and Yoshua Bengio. A walk with SGD. arXiv preprint arXiv:1802.08770, 2018.

  65. [65] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. In British Machine Vision Conference 2016. British Machine Vision Association, 2016.

  66. [66] Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning (still) requires rethinking generalization. Communications of the ACM, 64(3):107–115, 2021.

  67. [67] Libin Zhu, Chaoyue Liu, Adityanarayanan Radhakrishnan, and Mikhail Belkin. Quadratic models for understanding catapult dynamics of neural networks. In The Twelfth International Conference on Learning Representations.

  68. [68] Libin Zhu, Chaoyue Liu, Adityanarayanan Radhakrishnan, and Mikhail Belkin. Catapults in SGD: spikes in the training loss and their impact on generalization through feature learning. In Forty-first International Conference on Machine Learning, 2024.