pith. machine review for the scientific record.

arxiv: 2603.10079 · v2 · submitted 2026-03-10 · 💻 cs.LG · math.PR

Recognition: 2 theorem links

· Lean Theorem

Large Spikes in Stochastic Gradient Descent: A Large-Deviations View

Authors on Pith · no claims yet

Pith reviewed 2026-05-15 13:26 UTC · model grok-4.3

classification 💻 cs.LG math.PR
keywords stochastic gradient descent · large deviations · loss spikes · catapult phase · NTK scaling · flat minima · curvature reduction

The pith

Large loss spikes in SGD escape sharp minima and reduce curvature to favor flatter solutions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines large loss spikes in stochastic gradient descent using large-deviations analysis on shallow fully connected networks in the NTK scaling regime. It shows that the catapult phase splits into inflationary and deflationary regimes according to an explicit log-drift criterion, with spikes occurring with at least polynomial probability in both cases. These spikes are established as the dominant process for escaping sharp minima while lowering curvature, which promotes flatter solutions. The same conclusions are reached for certain ReLU networks, along with implications for curriculum learning.

Core claim

In the NTK scaling for shallow fully connected networks, large loss spikes during SGD occur with at least polynomial probability and form the dominant mechanism by which sharp minima are escaped and curvature is reduced, thereby favoring flatter solutions; the catapult phase splits into inflationary and deflationary regimes set by an explicit log-drift criterion.

What carries the argument

Large-deviations analysis that splits the catapult phase into inflationary and deflationary regimes via an explicit log-drift criterion and identifies spikes as the primary escape route from sharp minima.
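Read schematically, and without reproducing the paper's definitions, the claim appears to combine a polynomial lower bound on the spike probability (the exponent ϑ(λ) plotted in Figure 3) with a sign condition on a log-drift quantity (the G(λ) of Figure 2). Which sign corresponds to which regime is an assumption here, not taken from the paper:

```latex
% Schematic only: \vartheta(\lambda) and G(\lambda) are the paper's objects (cf. Figures 2-3);
% their precise definitions, and which sign marks which regime, are assumptions in this sketch.
\[
  \mathbb{P}\bigl(\text{large loss spike}\bigr) \;\gtrsim\; n^{-\vartheta(\lambda)/2},
  \qquad
  \begin{cases}
    G(\lambda) > 0 & \text{inflationary regime},\\[2pt]
    G(\lambda) < 0 & \text{deflationary regime}.
  \end{cases}
\]
```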

If this is right

  • Spikes occur with at least polynomial probability in both inflationary and deflationary regimes.
  • Spikes dominate escape from sharp minima and the associated curvature reduction.
  • This process favors flatter solutions over sharp ones.
  • Analogous spike behavior and escape dynamics hold for certain ReLU networks.
  • Curriculum learning can be designed to exploit the identified regimes.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • If spike dominance extends to deeper networks, it may clarify why overparameterized models generalize despite non-convex loss landscapes.
  • Training runs could be monitored for the log-drift threshold to predict when spikes will appear and alter curvature; a minimal monitoring sketch follows this list.
  • The regime split suggests that learning-rate schedules inducing controlled spikes could improve final flatness without explicit regularization.
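As a concrete reading of the second bullet, here is a minimal, hypothetical monitoring sketch. The empirical mean of one-step log loss ratios is used as a stand-in for the paper's log-drift statistic, whose exact definition is not reproduced here; the 10x jump threshold and window size are arbitrary.

```python
# Hypothetical monitoring sketch (not the paper's procedure): track the running mean
# of one-step log loss ratios as an empirical "log-drift" proxy, and flag steps where
# the loss jumps by a large factor as candidate spikes.
import numpy as np

def monitor_run(losses, spike_factor=10.0, window=100):
    """losses: 1-D array of per-step training losses from an SGD run."""
    losses = np.asarray(losses, dtype=float)
    log_ratios = np.log(losses[1:] / losses[:-1])            # one-step log loss ratios
    kernel = np.ones(window) / window
    log_drift = np.convolve(log_ratios, kernel, mode="valid")  # trailing-window mean
    spikes = np.where(log_ratios > np.log(spike_factor))[0] + 1  # steps with >10x jumps
    return log_drift, spikes
```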

Load-bearing premise

The analysis is limited to shallow fully connected networks in the NTK scaling regime.

What would settle it

An empirical run on the same shallow network class in which large spikes are rare or fail to produce measurable curvature reduction would falsify the dominance claim.
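A minimal sketch of such a run, under assumptions not taken from the paper (tanh activation, single-sample SGD, an illustrative learning rate, a 10x loss jump as the spike criterion, and power iteration on Hessian-vector products as the sharpness measure):

```python
# Hedged experimental sketch (not the paper's setup): train a shallow, NTK-scaled
# fully connected network with single-sample SGD, record loss spikes, and track a
# sharpness proxy (top Hessian eigenvalue via power iteration). Rare spikes, or
# spikes with no accompanying curvature drop, would argue against the dominance
# claim. Widths, learning rate, and thresholds below are illustrative only.
import torch

torch.manual_seed(0)
n, d, m, lr, steps = 32, 16, 512, 2.0, 5000
X, y = torch.randn(n, d), torch.randn(n)

W = torch.randn(m, d, requires_grad=True)   # hidden weights
a = torch.randn(m, requires_grad=True)      # output weights
params = [W, a]

def f(x):                                   # NTK scaling: 1/sqrt(m) output factor
    return (torch.tanh(x @ W.t()) @ a) / m ** 0.5

def loss_on(idx):
    return 0.5 * ((f(X[idx]) - y[idx]) ** 2).mean()

def sharpness(iters=20):
    """Top Hessian eigenvalue of the full-batch loss, via power iteration on HVPs."""
    v = [torch.randn_like(p) for p in params]
    for _ in range(iters):
        g = torch.autograd.grad(loss_on(slice(None)), params, create_graph=True)
        gv = sum((gi * vi).sum() for gi, vi in zip(g, v))
        hv = torch.autograd.grad(gv, params)
        norm = torch.sqrt(sum((h ** 2).sum() for h in hv))
        v = [h / norm for h in hv]
    return norm.item()

losses, spikes, sharps = [], [], []
for t in range(steps):
    i = torch.randint(0, n, (1,))
    L = loss_on(i)
    for p, g in zip(params, torch.autograd.grad(L, params)):
        p.data -= lr * g
    losses.append(L.item())
    if t > 0 and losses[-1] > 10 * losses[-2]:
        spikes.append(t)                    # candidate large spike
    if t % 500 == 0:
        sharps.append((t, sharpness()))     # sharpness trace over training

print(f"{len(spikes)} spikes; sharpness trace: {sharps}")
```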

Figures

Figures reproduced from arXiv: 2603.10079 by Benjamin Gess, Daniel Heydecker.

Figure 1: Extension of the phase diagram [67, …]
Figure 2: Plots of G(λ) for the examples (1.17–1.18).
Figure 3: max(1, ϑ(λ)) (blue) and n^{−ϑ(λ)/2} with n = 10^{12} (red) for the dataset (1.28).
read the original abstract

Large loss spikes in stochastic gradient descent are studied through a rigorous large-deviations analysis for a shallow, fully connected network in the NTK scaling. In contrast to full-batch gradient descent, the catapult phase is shown to split into inflationary and deflationary regimes, determined by an explicit log-drift criterion. In both cases, large spikes are shown to be at least polynomially likely. In addition, these spikes are shown to be the dominant mechanism by which sharp minima are escaped and curvature is reduced, thereby favouring flatter solutions. Corresponding results are also obtained for certain ReLU networks, and implications for curriculum learning are derived.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper conducts a rigorous large-deviations analysis of large loss spikes during stochastic gradient descent for shallow fully connected networks in the NTK scaling regime. It demonstrates that the catapult phase splits into inflationary and deflationary regimes according to an explicit log-drift criterion, shows that large spikes occur with at least polynomial probability in both regimes, and claims that these spikes constitute the dominant mechanism for escaping sharp minima, thereby reducing curvature and favoring flatter solutions. Analogous results are derived for certain ReLU networks, along with implications for curriculum learning.

Significance. If the central claims hold, particularly the dominance of spikes in escaping sharp minima, this work provides a valuable theoretical framework linking SGD dynamics to the preference for flat minima, which is often associated with better generalization. The explicit criteria and polynomial likelihood results strengthen the analysis, and the extension to ReLU networks and curriculum learning adds practical relevance. However, the restriction to shallow networks in the NTK regime limits immediate broader impact.

major comments (1)
  1. [Abstract] The claim that spikes are the dominant mechanism for escaping sharp minima (Abstract) requires an explicit variational comparison: the infimum of the large-deviation rate function over spike trajectories must be shown to be strictly smaller than over non-spike alternatives such as gradual diffusion or noise-driven curvature reduction. The log-drift criterion and polynomial-likelihood results establish occurrence but do not include this rate-function comparison, leaving dominance dependent on an implicit assumption.
minor comments (1)
  1. [Section 3] Clarify the precise statement of the log-drift criterion when transitioning between inflationary and deflationary regimes to ensure the splitting is unambiguous for readers.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the careful reading and constructive feedback on our manuscript. We address the single major comment below and will incorporate a revision to strengthen the dominance claim.

read point-by-point responses
  1. Referee: [Abstract] The claim that spikes are the dominant mechanism for escaping sharp minima (Abstract) requires an explicit variational comparison: the infimum of the large-deviation rate function over spike trajectories must be shown to be strictly smaller than over non-spike alternatives such as gradual diffusion or noise-driven curvature reduction. The log-drift criterion and polynomial-likelihood results establish occurrence but do not include this rate-function comparison, leaving dominance dependent on an implicit assumption.

    Authors: We agree that an explicit variational comparison would make the dominance claim fully rigorous. The current analysis shows that spikes occur with at least polynomial probability under the log-drift criterion in both regimes, and that the NTK dynamics make such jumps the natural escape route from sharp minima. However, we did not compute or bound the infima of the rate function over spike versus non-spike (e.g., gradual diffusion) paths. In the revision we will add this comparison, deriving or bounding the large-deviation rate for representative spike trajectories and showing it is strictly smaller than the rate for continuous paths that must fight the mean drift over long times. This will be placed in the large-deviations section and referenced from the abstract. revision: yes
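In symbols, the missing comparison that both the referee and the simulated rebuttal point to would be an inequality between infima of the rate functional; the notation below is ours, not the paper's:

```latex
% Schematic: I denotes the large-deviations rate functional for the SGD dynamics,
% and the two infima run over escape trajectories with and without a large spike.
\[
  \inf_{\gamma \,\in\, \{\text{escape paths with a large spike}\}} I(\gamma)
  \;<\;
  \inf_{\gamma \,\in\, \{\text{escape paths without a spike}\}} I(\gamma),
\]
% so that, on large-deviations scales, spike-mediated escape from a sharp minimum
% dominates gradual (non-spike) escape.
```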

Circularity Check

0 steps flagged

No significant circularity; derivation applies external large-deviations theory

full rationale

The paper derives the inflationary/deflationary split via an explicit log-drift criterion and establishes polynomial likelihood of spikes using standard large-deviations principles applied to the NTK scaling regime for shallow FC networks. The dominance claim for spike-mediated escape is presented as a consequence of the rate-function analysis rather than reducing to a self-definition, fitted parameter renamed as prediction, or load-bearing self-citation chain. No equations or steps exhibit the enumerated circular patterns; the work remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claims rest on the NTK scaling limit for shallow networks and standard results from large-deviations theory; no free parameters or invented entities are introduced in the abstract.

axioms (2)
  • domain assumption NTK scaling regime for shallow fully connected networks
    Invoked to enable the large-deviations analysis of the loss dynamics.
  • standard math Standard large-deviations principles apply to the stochastic gradient process
    Used to derive the probability of large spikes and the log-drift criterion.
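For concreteness, the NTK scaling invoked in the first axiom is usually written, for a width-m shallow network, with a 1/√m output factor so that the tangent kernel concentrates as m grows; the paper's exact parametrization may differ from this standard form:

```latex
% Standard shallow NTK parametrization (assumed form, not quoted from the paper):
\[
  f_\theta(x) \;=\; \frac{1}{\sqrt{m}} \sum_{i=1}^{m} a_i\, \sigma(w_i \cdot x),
  \qquad \theta = (a_i, w_i)_{i=1}^{m},
\]
% and in the large-width limit the training dynamics are governed by the
% (nearly constant) neural tangent kernel
\[
  \Theta(x, x') \;=\; \nabla_\theta f_\theta(x) \cdot \nabla_\theta f_\theta(x').
\]
```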

pith-pipeline@v0.9.0 · 5397 in / 1233 out tokens · 49167 ms · 2026-05-15T13:26:17.905551+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

68 extracted references · 68 canonical work pages · 5 internal anchors

  1. [1] Atish Agarwala and Jeffrey Pennington. High dimensional analysis reveals conservative sharpening and a stochastic edge of stability. CoRR, abs/2404.19261, 2024.

  2. [2] Arseniy Andreyev and Pierfrancesco Beneventano. Edge of stochastic stability: Revisiting the edge of stability for SGD. arXiv preprint arXiv:2412.20553, 2024.

  3. [3] Sanjeev Arora, Zhiyuan Li, and Abhishek Panigrahi. Understanding gradient descent on the edge of stability in deep learning. In International Conference on Machine Learning, pages 948–1024. PMLR, 2022.

  4. [4] Waïss Azizian, Franck Iutzeler, Jerome Malick, and Panayotis Mertikopoulos. What is the long-run distribution of stochastic gradient descent? A large deviations analysis. In International Conference on Machine Learning, pages 2168–2229. PMLR, 2024.

  5. [5] Waïss Azizian, Franck Iutzeler, Jerome Malick, and Panayotis Mertikopoulos. The global convergence time of stochastic gradient descent in non-convex landscapes: Sharp estimates via large deviations. In International Conference on Machine Learning, pages 1982–2044. PMLR, 2025.

  6. [6] Guy Blanc, Neha Gupta, Gregory Valiant, and Paul Valiant. Implicit regularization for deep neural networks driven by an Ornstein-Uhlenbeck like process. In Conference on Learning Theory, pages 483–513. PMLR, 2020.

  7. [7] Etienne Boursier and Nicolas Flammarion. Early alignment in two-layer networks training is a two-edged sword. Journal of Machine Learning Research, 26(183):1–75, 2025.

  8. [8] Etienne Boursier and Nicolas Flammarion. Simplicity bias and optimization threshold in two-layer ReLU networks. In ICML 2025 - International Conference on Machine Learning, volume 267, 2025.

  9. [9] Dennis Chemnitz and Maximilian Engel. Characterizing dynamical stability of stochastic gradient descent in overparameterized learning. Journal of Machine Learning Research, 26(134):1–46, 2025.

  10. [10] Lenaic Chizat, Edouard Oyallon, and Francis Bach. On lazy training in differentiable programming. Advances in Neural Information Processing Systems, 32, 2019.

  11. [11] Damai Dai, Li Dong, Shuming Ma, Bo Zheng, Zhifang Sui, Baobao Chang, and Furu Wei. StableMoE: Stable routing strategy for mixture of experts. In ACL (1), pages 7085–7095, 2022.

  12. [12] Alex Damian, Eshaan Nichani, and Jason D Lee. Self-stabilization: The implicit bias of gradient descent at the edge of stability. arXiv preprint arXiv:2209.15594, 2022.

  13. [13] Léo Dana, Loucas Pillaud-Vivien, and Francis Bach. Convergence of shallow ReLU networks on weakly interacting data. In Neural Information Processing Systems 2025, 2025.

  14. [14] Yann N Dauphin, Razvan Pascanu, Caglar Gulcehre, Kyunghyun Cho, Surya Ganguli, and Yoshua Bengio. Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. Advances in Neural Information Processing Systems, 27, 2014.

  15. [15] Amir Dembo and Ofer Zeitouni. Large deviations techniques and applications, volume 38. Springer Science & Business Media, 2009.

  16. [16] Simon S Du, Xiyu Zhai, Barnabas Poczos, and Aarti Singh. Gradient descent provably optimizes over-parameterized neural networks. In International Conference on Learning Representations.

  17. [17] Mathieu Even, Scott Pesme, Suriya Gunasekar, and Nicolas Flammarion. (S)GD over diagonal linear networks: Implicit bias, large stepsizes and edge of stability. Advances in Neural Information Processing Systems, 36:29406–29448, 2023.

  18. [18] William Fedus, Barret Zoph, and Noam Shazeer. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. Journal of Machine Learning Research, 23(120):1–39, 2022.

  19. [19] Zhengcong Fei, Mingyuan Fan, Changqian Yu, Debang Li, and Junshi Huang. Scaling diffusion transformers to 16 billion parameters. arXiv preprint arXiv:2407.11633, 2024.

  20. [20] Jonathan Frankle, David J Schwab, and Ari S Morcos. The early phase of neural network training. In International Conference on Learning Representations.

  21. [21] Rainer Gemulla, Erik Nijkamp, Peter J Haas, and Yannis Sismanis. Large-scale matrix factorization with distributed stochastic gradient descent. In Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 69–77, 2011.

  22. [22] Justin Gilmer, Behrooz Ghorbani, Ankush Garg, Sneha Kudugunta, Behnam Neyshabur, David Cardoze, George Edward Dahl, Zachary Nado, and Orhan Firat. A loss curvature perspective on training instabilities of deep learning models. In International Conference on Learning Representations, 2022.

  23. [23] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.

  24. [24] Sepp Hochreiter and Jürgen Schmidhuber. Flat minima. Neural Computation, 9(1):1–42, 1997.

  25. [25] Frank Hollander. Large deviations, volume 14. American Mathematical Soc., 2000.

  26. [26] Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Weinberger. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4700–4708, 2017.

  27. [27] Hikaru Ibayashi and Masaaki Imaizumi. Quasi-potential theory for escape problem: Quantitative sharpness effect on SGD's escape from local minima. 2021.

  28. [28] Arthur Jacot, Franck Gabriel, and Clément Hongler. Neural tangent kernel: Convergence and generalization in neural networks. Advances in Neural Information Processing Systems, 31, 2018.

  29. [29] Stanisław Jastrzębski, Zachary Kenton, Devansh Arpit, Nicolas Ballas, Asja Fischer, Yoshua Bengio, and Amos Storkey. Three factors influencing minima in SGD. arXiv preprint arXiv:1711.04623, 2017.

  30. [30] Nitish Shirish Keskar and Richard Socher. Improving generalization performance by switching from Adam to SGD. arXiv preprint arXiv:1712.07628, 2017.

  31. [31] Guanghui Lan. First-order and stochastic optimization methods for machine learning, volume 1. Springer, 2020.

  32. [32] Guillaume Leclerc and Aleksander Madry. The two regimes of deep network training. arXiv preprint arXiv:2002.10376, 2020.

  33. [33] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436–444, 2015.

  34. [34] Yann LeCun, Léon Bottou, Genevieve B Orr, and Klaus-Robert Müller. Efficient backprop. In Neural Networks: Tricks of the Trade, pages 9–50. Springer, 2002.

  35. [35] Aitor Lewkowycz, Yasaman Bahri, Ethan Dyer, Jascha Sohl-Dickstein, and Guy Gur-Ari. The large learning rate phase of deep learning: the catapult mechanism. arXiv preprint arXiv:2003.02218, 2020.

  36. [36] Qianxiao Li, Cheng Tai, et al. Stochastic modified equations and adaptive stochastic gradient algorithms. In International Conference on Machine Learning, pages 2101–2110. PMLR, 2017.

  37. [37] Yuanzhi Li and Yingyu Liang. Learning overparameterized neural networks via stochastic gradient descent on structured data. Advances in Neural Information Processing Systems, 31, 2018.

  38. [38] Yuanzhi Li, Colin Wei, and Tengyu Ma. Towards explaining the regularization effect of initial large learning rate in training neural networks. Advances in Neural Information Processing Systems, 32, 2019.

  39. [39] Zhiyuan Li, Tianhao Wang, and Sanjeev Arora. What happens after SGD reaches zero loss? A mathematical framework. In International Conference on Learning Representations, 2022.

  40. [40] Shuang Liang and Guido Montufar. Gradient descent with large step sizes: Chaos and fractal convergence region. In The Fourteenth International Conference on Learning Representations, 2026.

  41. [41] Lennart Ljung. Analysis of recursive stochastic algorithms. IEEE Transactions on Automatic Control, 22(4):551–575, 2003.

  42. [42] Miao Lu, Beining Wu, Xiaodong Yang, and Difan Zou. Benign oscillation of stochastic gradient descent with large learning rate. In The Twelfth International Conference on Learning Representations, 2023.

  43. [43] Pierre Marion and Lénaïc Chizat. Deep linear networks for regression are implicitly regularized towards flat minima. Advances in Neural Information Processing Systems, 37:76848–76900, 2024.

  44. [44] David Meltzer, Min Chen, and Junyu Liu. Catapult dynamics and phase transitions in quadratic nets. Journal of Statistical Mechanics: Theory and Experiment, 2025(9):093406, 2025.

  45. [45] Panayotis Mertikopoulos, Nadav Hallak, Ali Kavis, and Volkan Cevher. On the almost sure convergence of stochastic gradient descent in non-convex problems. Advances in Neural Information Processing Systems, 33:1117–1128, 2020.

  46. [46] Scott Pesme, Loucas Pillaud-Vivien, and Nicolas Flammarion. Implicit bias of SGD for diagonal linear networks: a provable benefit of stochasticity. Advances in Neural Information Processing Systems, 34:29218–29230, 2021.

  47. [47] Zihan Qiu, Zeyu Huang, Shuang Cheng, Yizhi Zhou, Zili Wang, Ivan Titov, and Jie Fu. Layerwise recurrent router for mixture-of-experts. In The Thirteenth International Conference on Learning Representations, 2025.

  48. [48] Herbert Robbins and Sutton Monro. A stochastic approximation method. The Annals of Mathematical Statistics, pages 400–407, 1951.

  49. [49] Stephen Roller, Sainbayar Sukhbaatar, Jason Weston, et al. Hash layers for large sparse models. Advances in Neural Information Processing Systems, 34:17555–17566, 2021.

  50. [50] Sebastian Ruder. An overview of gradient descent optimization algorithms. arXiv preprint arXiv:1609.04747, 2016.

  51. [51] Pedro Savarese, Itay Evron, Daniel Soudry, and Nathan Srebro. How do infinite width bounded norm networks look in function space? In Conference on Learning Theory, pages 2667–2690. PMLR, 2019.

  52. [52] Jean Seo, Jaeyoon Kim, and Hyopil Shin. MoFE: Mixture of frozen experts architecture. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 3: Industry Track), pages 340–348, 2025.

  53. [53] Anna Shalova, André Schlichting, and Mark Peletier. Singular-limit analysis of gradient descent with noise injection. arXiv preprint arXiv:2404.12293, 2024.

  54. [54] Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538, 2017.

  55. [55] Leslie N Smith and Nicholay Topin. Super-convergence: Very fast training of neural networks using large learning rates. In Artificial Intelligence and Machine Learning for Multi-Domain Operations Applications, volume 11006, pages 369–386. SPIE, 2019.

  56. [56] W Stummer and K-Th Sturm. On exponentials of additive functionals of Markov processes. Stochastic Processes and their Applications, 85(1):45–60, 2000.

  57. [57] Ling Team, Binwei Zeng, Chao Huang, Chao Zhang, Changxin Tian, Cong Chen, Dingnan Jin, Feng Yu, Feng Zhu, Feng Yuan, et al. Every FLOP counts: Scaling a 300B mixture-of-experts Ling LLM without premium GPUs. arXiv preprint arXiv:2503.05139, 2025.

  58. [58] Bhavya Vasudeva, Jung Whan Lee, Vatsal Sharan, and Mahdi Soltanolkotabi. The rich and the simple: On the implicit bias of Adam and SGD. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025.

  59. [59] David Williams. Probability with martingales. Cambridge University Press, 1991.

  60. [60] Francis Williams, Matthew Trager, Daniele Panozzo, Claudio Silva, Denis Zorin, and Joan Bruna. Gradient dynamics of shallow univariate ReLU networks. Advances in Neural Information Processing Systems, 32, 2019.

  61. [61] Stephan Wojtowytsch. On the convergence of gradient descent training for two-layer ReLU-networks in the mean field regime. arXiv preprint arXiv:2005.13530, 2020.

  62. [62] Lei Wu, Mingze Wang, and Weijie Su. The alignment property of SGD noise and how it helps select flat minima: A stability analysis. Advances in Neural Information Processing Systems, 35:4680–4693, 2022.

  63. [63] Zeke Xie, Issei Sato, and Masashi Sugiyama. A diffusion theory for deep learning dynamics: Stochastic gradient descent exponentially favors flat minima. In International Conference on Learning Representations, 2021.

  64. [64] Chen Xing, Devansh Arpit, Christos Tsirigotis, and Yoshua Bengio. A walk with SGD. arXiv preprint arXiv:1802.08770, 2018.

  65. [65] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. In British Machine Vision Conference 2016. British Machine Vision Association, 2016.

  66. [66] Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning (still) requires rethinking generalization. Communications of the ACM, 64(3):107–115, 2021.

  67. [67] Libin Zhu, Chaoyue Liu, Adityanarayanan Radhakrishnan, and Mikhail Belkin. Quadratic models for understanding catapult dynamics of neural networks. In The Twelfth International Conference on Learning Representations.

  68. [68] Libin Zhu, Chaoyue Liu, Adityanarayanan Radhakrishnan, and Mikhail Belkin. Catapults in SGD: spikes in the training loss and their impact on generalization through feature learning. In Forty-first International Conference on Machine Learning, 2024.