Generalization at the Edge of Stability
Pith reviewed 2026-05-10 02:33 UTC · model grok-4.3
The pith
A sharpness dimension derived from the full Hessian spectrum bounds generalization when training at the edge of stability.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By representing stochastic optimizers as random dynamical systems that converge to a fractal attractor set with smaller intrinsic dimension, the authors introduce the sharpness dimension inspired by Lyapunov dimension theory. They prove a generalization bound in terms of this dimension, showing that generalization in the chaotic regime depends on the complete Hessian spectrum and the structure of its partial determinants in a way that cannot be captured by the trace or spectral norm alone. Experiments on MLPs and transformers support the bound and illuminate grokking.
What carries the argument
The sharpness dimension, the intrinsic dimension of the fractal attractor in the random dynamical system model of the optimizer, computed from the partial determinants of the Hessian spectrum.
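The exact construction is not reproduced on this page, but its Lyapunov-dimension lineage suggests the following flavor: treat log|1 − ηhᵢ| (for Hessian eigenvalues hᵢ and step size η) as local expansion exponents of the gradient-descent map, and read a Kaplan–Yorke-style dimension off their partial sums, which play the role of log partial determinants. A minimal sketch, assuming this reading; the exponent formula, `sharpness_dimension`, and `lr` are illustrative assumptions, not the paper's definition:

```python
import math

def kaplan_yorke_dimension(exponents):
    """Kaplan-Yorke (Lyapunov) dimension from a list of exponents.

    Sorts exponents in decreasing order and returns j + S_j / |e_{j+1}|,
    where j is the largest index whose partial sum S_j is non-negative.
    The partial sums play the role of log partial determinants.
    """
    exps = sorted(exponents, reverse=True)
    s = 0.0
    for j, e in enumerate(exps):
        if s + e < 0:
            # Interpolate fractionally between j and j + 1.
            return j + s / abs(e)
        s += e
    return float(len(exps))

def sharpness_dimension(hessian_eigs, lr):
    """Hypothetical sketch: the GD map w -> w - lr*H*w has local
    exponents log|1 - lr*h| along each Hessian eigendirection h."""
    exponents = [math.log(abs(1.0 - lr * h))
                 for h in hessian_eigs if abs(1.0 - lr * h) > 0]
    return kaplan_yorke_dimension(exponents)
```

Under this sketch, a small step size makes every direction contracting and the dimension collapses to 0 (a point attractor), while a step size large enough to make the top direction expanding yields a fractional dimension between 1 and 2.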
If this is right
- Generalization bounds in chaotic regimes must incorporate the full Hessian spectrum and partial determinant structure.
- Training at the edge of stability reduces the effective sharpness dimension and thereby tightens the generalization bound.
- Grokking arises as a transition that lowers the sharpness dimension over the course of training.
- Bounds that rely solely on the trace or largest eigenvalue of the Hessian are incomplete for large learning rates.
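The last bullet can be made concrete with a toy example: two Hessian spectra can share the same trace and spectral norm yet differ in their partial determinants (read here as products of the k largest eigenvalues, an assumed interpretation), so no bound built from trace or top eigenvalue alone can separate them. The numbers are invented for illustration:

```python
from itertools import accumulate
from operator import mul

# Two hypothetical Hessian spectra, sorted in decreasing order,
# with identical trace (3.2) and identical spectral norm (2.05).
spec_a = [2.05, 1.00, 0.05, 0.05, 0.05]
spec_b = [2.05, 0.40, 0.40, 0.30, 0.05]

# Partial determinants: running products of the k largest eigenvalues.
pdet_a = list(accumulate(spec_a, mul))
pdet_b = list(accumulate(spec_b, mul))
# pdet_a and pdet_b agree at k = 1 (spectral norm) but diverge from k = 2 on.
```

Trace-based or spectral-norm-based sharpness measures assign these two spectra identical complexity; a partial-determinant-based measure does not.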
Where Pith is reading between the lines
- Optimizers might be designed to steer the partial Hessian determinants toward smaller sharpness dimension values.
- The same random dynamical system lens could be applied to analyze generalization in other iterative learning algorithms.
- Tracking partial Hessian determinants during training could become a practical predictor of final generalization.
Load-bearing premise
Stochastic optimizers operating at the edge of stability can be represented as random dynamical systems that converge to a fractal attractor set whose intrinsic dimension is captured by the Hessian spectrum.
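The premise is easy to make tangible in one dimension. On the toy loss L(w) = log cosh(w), gradient descent reads w ← w − η·tanh(w); below the stability threshold η = 2 the attractor is the single point w = 0, while above it the iterates settle onto a two-point set, the simplest non-point attractor. The loss and step sizes here are illustrative choices, not the paper's setup:

```python
import math

def gd_orbit(step, w0=0.7, steps=500):
    """Iterate the GD map w -> w - step * tanh(w), i.e. gradient descent
    with the given step size on the toy loss L(w) = log(cosh(w))."""
    w = w0
    orbit = []
    for _ in range(steps):
        w = w - step * math.tanh(w)
        orbit.append(w)
    return orbit

small = gd_orbit(0.5)  # stable regime: attractor is the point w = 0
large = gd_orbit(3.0)  # unstable regime: attractor is a period-2 orbit
```

With step 3.0 the fixed point w = 0 is unstable (the map's derivative there is 1 − 3 = −2), yet the iterates do not diverge; they converge to a bounded alternating pair ±a with 3·tanh(a) = 2a, a zero-dimensional but non-singleton attractor.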
What would settle it
Computing the sharpness dimension from the Hessian spectrum for a set of models trained at the edge of stability and finding that the observed generalization error does not track the predicted bound would falsify the claim.
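A minimal version of that test is a rank correlation between prediction and observation: compute the sharpness dimension for each trained model, measure each model's generalization gap, and check Kendall's tau. Tau near +1 across runs is consistent with the bound; tau near 0 or negative would be the falsifying outcome. The function is standard; the numbers below are illustrative placeholders, not the paper's data:

```python
def kendall_tau(x, y):
    """Kendall rank correlation (no tie correction): +1 when the
    orderings of x and y agree on every pair, -1 when fully reversed."""
    n = len(x)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (x[i] - x[j]) * (y[i] - y[j])
            if s > 0:
                concordant += 1
            elif s < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)

# Hypothetical per-run values: predicted sharpness dimension vs. observed gap.
dims = [3.1, 5.4, 2.2, 7.8, 4.0]
gaps = [0.04, 0.09, 0.02, 0.15, 0.06]
tau = kendall_tau(dims, gaps)  # perfectly concordant orderings give 1.0
```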
Original abstract
Training modern neural networks often relies on large learning rates, operating at the edge of stability, where the optimization dynamics exhibit oscillatory and chaotic behavior. Empirically, this regime often yields improved generalization performance, yet the underlying mechanism remains poorly understood. In this work, we represent stochastic optimizers as random dynamical systems, which often converge to a fractal attractor set (rather than a point) with a smaller intrinsic dimension. Building on this connection and inspired by Lyapunov dimension theory, we introduce a novel notion of dimension, coined the 'sharpness dimension', and prove a generalization bound based on this dimension. Our results show that generalization in the chaotic regime depends on the complete Hessian spectrum and the structure of its partial determinants, highlighting a complexity that cannot be captured by the trace or spectral norm considered in prior work. Experiments across various MLPs and transformers validate our theory while also providing new insights into the recently observed phenomenon of grokking.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript claims to model stochastic optimizers at the edge of stability as random dynamical systems that converge to fractal attractors of reduced intrinsic dimension. It introduces a 'sharpness dimension' constructed from the full Hessian spectrum and its partial determinants, proves a generalization bound based on this dimension, and shows through experiments on MLPs and transformers that generalization in the chaotic regime depends on this measure rather than trace or spectral norm, while also providing insights into grokking.
Significance. If the central results hold, the work provides a novel dynamical-systems perspective on generalization that incorporates the entire Hessian structure, going beyond prior sharpness measures. The experimental validation across architectures and the link to Lyapunov dimension theory are strengths that could explain empirical benefits of the edge-of-stability regime.
major comments (2)
- [Theoretical framework and main theorem] The proof of the generalization bound assumes convergence of SGD dynamics at the edge of stability to a fractal attractor whose intrinsic dimension equals the proposed sharpness dimension, but no lemma or theorem establishes existence, uniqueness, or dimension reduction for the specific random dynamical system considered.
- [Definition of sharpness dimension (around Eq. (5))] The sharpness dimension is defined via partial determinants of the Hessian spectrum; the manuscript does not verify that this construction satisfies monotonicity or countable stability, properties required for it to function as a dimension in the subsequent bound.
minor comments (2)
- [Abstract and Experiments] The abstract refers to 'various MLPs and transformers' without listing the exact architectures, depths, or hyperparameter ranges used in the experiments; these details should appear in the experimental section for reproducibility.
- [Notation and definitions] Notation for the partial determinants in the sharpness dimension could be clarified with a small worked example on a low-dimensional Hessian.
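The worked example the minor comment asks for can be tiny. For a 2×2 Hessian the eigenvalues follow from trace and determinant in closed form, and (reading "partial determinants" as products of the k largest eigenvalues, an assumption about the construction around Eq. (5)) the first partial determinant is the spectral norm while the last recovers det(H):

```python
import math

# Worked example on the 2x2 symmetric Hessian H = [[3, 1], [1, 2]].
a, b, d = 3.0, 1.0, 2.0
tr, det = a + d, a * d - b * b          # trace = 5, determinant = 5

# Closed-form eigenvalues of a symmetric 2x2 matrix.
disc = math.sqrt(tr * tr - 4 * det)
eig_hi = (tr + disc) / 2                # ~3.618
eig_lo = (tr - disc) / 2                # ~1.382

# Partial determinants: products of the k largest eigenvalues.
pdet_1 = eig_hi            # k = 1: the spectral norm
pdet_2 = eig_hi * eig_lo   # k = 2: equals det(H)
```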
Simulated Author's Rebuttal
We thank the referee for their constructive feedback and positive assessment of the work's significance. We address each major comment below, indicating planned revisions where appropriate.
Point-by-point responses
-
Referee: [Theoretical framework and main theorem] The proof of the generalization bound assumes convergence of SGD dynamics at the edge of stability to a fractal attractor whose intrinsic dimension equals the proposed sharpness dimension, but no lemma or theorem establishes existence, uniqueness, or dimension reduction for the specific random dynamical system considered.
Authors: We agree that the manuscript does not contain a dedicated lemma establishing existence, uniqueness, or dimension reduction for the specific random dynamical system modeling SGD at the edge of stability. The framework relies on this convergence as a modeling assumption, supported by empirical observations and connections to prior analyses of chaotic optimization dynamics. In the revision, we will explicitly flag this assumption in the theoretical framework section and add a new subsection with supporting numerical evidence from Lyapunov exponent computations and attractor dimension estimates on the considered models. We do not claim a full existence proof, which lies beyond the current scope. revision: partial
-
Referee: [Definition of sharpness dimension (around Eq. (5))] The sharpness dimension is defined via partial determinants of the Hessian spectrum; the manuscript does not verify that this construction satisfies monotonicity or countable stability, properties required for it to function as a dimension in the subsequent bound.
Authors: The sharpness dimension is constructed to parallel the Lyapunov dimension from dynamical systems theory, which satisfies monotonicity and countable stability. We will add a short proposition in the revised manuscript that directly verifies these properties for our definition by leveraging the ordering of Hessian eigenvalues and the multiplicative structure of the partial determinants. This verification follows standard arguments for spectrum-based dimensions and will be included prior to the generalization bound. revision: yes
Circularity Check
No significant circularity; derivation is self-contained
full rationale
The paper models stochastic optimizers as random dynamical systems converging to fractal attractors, introduces the sharpness dimension (inspired by Lyapunov dimension theory and constructed from the full Hessian spectrum and partial determinants), and derives a generalization bound from this dimension. The abstract and description provide no quoted equations or steps showing self-definition (e.g., sharpness dimension defined in terms of the bound), fitted inputs renamed as predictions, or load-bearing self-citations that reduce the central claim to a tautology. The modeling assumption and subsequent mathematical derivation appear independent of the target generalization result, with no evidence of the patterns that would trigger a positive circularity finding.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Stochastic optimizers at the edge of stability converge to a fractal attractor set with smaller intrinsic dimension.
invented entities (1)
- sharpness dimension (no independent evidence)
Forward citations
Cited by 1 Pith paper
-
Perturbation Dose Responses in Recursive LLM Loops: Raw Switching, Stochastic Floors, and Persistent Escape under Append, Replace, and Dialog Updates
In 30-step recursive LLM loops, append-mode persistent escape from source basins reaches 50% near 400 tokens under full history but plateaus below 50% under tail-clip memory policy, while replace-mode switching largel...
Reference graph
Works this paper leans on
- [1] Ahn, K., Bubeck, S., Chewi, S., Lee, Y. T., Suarez, F., and Zhang, Y. (2023a). Learning threshold neurons via edge of stability. Advances in Neural Information Processing Systems, 36:19540–19569.
- [2]
- [3] Ahn, K., Zhang, J., and Sra, S. (2022). Understanding the unstable convergence of gradient descent. In International Conference on Machine Learning. PMLR.
- [4] Andreeva, R., Dupuis, B., Sarkar, R., Birdal, T., and Simsekli, U. (2024). Topological generalization bounds for discrete-time stochastic optimization algorithms. Advances in Neural Information Processing Systems, 37.
- [5] Andreyev, A. and Beneventano, P. (2024). Edge of stochastic stability: Revisiting the edge of stability for SGD. arXiv preprint arXiv:2412.20553.
- [6] Arnold, L. (2006). Random dynamical systems. In Dynamical Systems: Lectures Given at the 2nd Session of the Centro Internazionale Matematico Estivo (CIME) held in Montecatini Terme, Italy, June 13–22, 1994. Springer.
- [7] Arora, S., Li, Z., and Panigrahi, A. (2022). Understanding gradient descent on the edge of stability in deep learning. In International Conference on Machine Learning, pages 948–1024. PMLR.
- [8] Birdal, T., Lou, A., Guibas, L. J., and Simsekli, U. (2021). Intrinsic dimension, persistent homology and generalization in neural networks. Advances in Neural Information Processing Systems, 34:6776–6789.
- [9] Bogachev, V. (2007). Measure Theory. Springer.
- [10] Cai, Y., Huang, H., Wen, H., Liu, D., Ma, Y., and Lyu, K. (2026). Does LLM pre-training typically occur at the edge of stability? In Workshop on Scientific Methods for Understanding Deep Learning.
- [11] Camuto, A., Deligiannidis, G., Erdogdu, M. A., Gurbuzbalaban, M., Simsekli, U., and Zhu, L. (2021). Fractal structure and generalization properties of stochastic optimization algorithms. Advances in Neural Information Processing Systems, 34:18774–18788.
- [12] Chaudhari, P., Choromanska, A., Soatto, S., LeCun, Y., Baldassi, C., Borgs, C., Chayes, J., Sagun, L., and Zecchina, R. (2019). Entropy-SGD: Biasing gradient descent into wide valleys. Journal of Statistical Mechanics: Theory and Experiment, 2019(12):124018.
- [13] Chemnitz, D. and Engel, M. (2025). Characterizing dynamical stability of stochastic gradient descent in overparameterized learning. Journal of Machine Learning Research, 26(134):1–46.
- [14] Chen, L. and Bruna, J. (2023). Beyond the edge of stability via two-step gradient updates. In International Conference on Machine Learning, pages 4330–4391. PMLR.
- [15]
- [16] Cohen, J. M., Kaur, S., Li, Y., Kolter, J. Z., and Talwalkar, A. (2021). Gradient descent on neural networks typically occurs at the edge of stability. arXiv preprint arXiv:2103.00065.
- [17] Crauel, H., Debussche, A., and Flandoli, F. (1997). Random attractors. Journal of Dynamics and Differential Equations, 9(2):307–341.
- [18] Crauel, H. and Flandoli, F. (1994). Attractors for random dynamical systems. Probability Theory and Related Fields, 100(3):365–393.
- [19]
- [20] Ding, L., Drusvyatskiy, D., Fazel, M., and Harchaoui, Z. (2024). Flat minima generalize for low-rank matrix recovery. Information and Inference: A Journal of the IMA.
- [21] Dinh, L., Pascanu, R., Bengio, S., and Bengio, Y. (2017). Sharp minima can generalize for deep nets. In International Conference on Machine Learning, pages 1019–1028. PMLR.
- [22] Dupuis, B., Deligiannidis, G., and Simsekli, U. (2023). Generalization bounds using data-dependent fractal dimensions. In International Conference on Machine Learning, pages 8922–8968. PMLR.
- [23] Dupuis, B., Viallard, P., Deligiannidis, G., and Simsekli, U. (2024). Uniform generalization bounds on data-dependent hypothesis sets via PAC-Bayesian theory on random sets. Journal of Machine Learning Research, 25(409).
- [24] Feng, D.-J. and Simon, K. (2022). Dimension estimates for iterated function systems and repellers. Part II. Ergodic Theory and Dynamical Systems, 42(11):3357–3392.
- [25]
- [26] Foster, D. J., Greenberg, S., Kale, S., Luo, H., Mohri, M., and Sridharan, K. (2019). Hypothesis set stability and generalization. Advances in Neural Information Processing Systems, 32.
- [27] Gatmiry, K., Li, Z., Ma, T., Reddi, S., Jegelka, S., and Chuang, C.-Y. (2023). What is the inductive bias of flatness regularization? A study of deep matrix factorization models. Advances in Neural Information Processing Systems, 36:28040–28052.
- [28] Ghorbani, B., Krishnan, S., and Xiao, Y. (2019). An investigation into neural net optimization via Hessian eigenvalue density. In International Conference on Machine Learning, pages 2232–2241. PMLR.
- [29] Ghosh, A., Cong, B., Yokota, R., Ravishankar, S., Wang, R., Tao, M., Khan, M. E., and Möllenhoff, T. (2025). Variational learning finds flatter solutions at the edge of stability. arXiv preprint arXiv:2506.12903.
- [30] Golub, G. H. and Welsch, J. H. (1969). Calculation of Gauss quadrature rules. Mathematics of Computation, 23.
- [31]
- [32] Halko, N., Martinsson, P.-G., and Tropp, J. A. (2011). Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM Review, 53(2):217–288.
- [33] Hochreiter, S. and Schmidhuber, J. (1994). Simplifying neural nets by discovering flat minima. Advances in Neural Information Processing Systems, 7.
- [34] Hodgkinson, L., Simsekli, U., Khanna, R., and Mahoney, M. (2022). Generalization bounds using lower tail exponents in stochastic optimizers. In International Conference on Machine Learning, pages 8774–8795. PMLR.
- [35] Hunt, B. R. (1996). Maximum local Lyapunov dimension bounds the box dimension of chaotic attractors. Nonlinearity, 9(4):845.
- [36] Izmailov, P., Podoprikhin, D., Garipov, T., Vetrov, D., and Wilson, A. G. (2018). Averaging weights leads to wider optima and better generalization. arXiv preprint arXiv:1803.05407.
- [37]
- [38]
- [39] Jiang, Y., Neyshabur, B., Mobahi, H., Krishnan, D., and Bengio, S. (2019b). Fantastic generalization measures and where to find them. ICLR 2020.
- [40] Kaddour, J., Liu, L., Silva, R., and Kusner, M. J. (2022). When do flat minima optimizers work? Advances in Neural Information Processing Systems, 35:16577–16595.
- [41] Kaplan, J. L. and Yorke, J. A. (2006). Chaotic behavior of multidimensional difference equations. In Functional Differential Equations and Approximation of Fixed Points: Proceedings, Bonn, July 1978. Springer.
- [42] Kaur, S., Cohen, J., and Lipton, Z. C. (2023). On the maximum Hessian eigenvalue and generalization. In Proceedings, pages 51–65. PMLR.
- [43] Kendall, M. G. (1938). A new measure of rank correlation. Biometrika.
- [44] Keskar, N. S., Mudigere, D., Nocedal, J., Smelyanskiy, M., and Tang, P. T. P. (2016). On large-batch training for deep learning: Generalization gap and sharp minima. arXiv preprint arXiv:1609.04836.
- [45] Lanczos, C. (1950). An iteration method for the solution of the eigenvalue problem of linear differential and integral operators. Journal of Research of the National Bureau of Standards, 45(4):255–282.
- [46] Lin, L., Saad, Y., and Yang, C. (2016). Approximating spectral densities of large matrices. SIAM Review, 58(1):34–65.
- [47] Liu, H., Xie, S. M., Li, Z., and Ma, T. (2023). Same pre-training loss, better downstream: Implicit bias matters for language models. In International Conference on Machine Learning, pages 22188–22214. PMLR.
- [48] Loshchilov, I. and Hutter, F. (2019). Decoupled weight decay regularization.
- [49] Ly, A. and Gong, P. (2025). Optimization on multifractal loss landscapes explains a diverse range of geometrical and dynamical properties of deep learning. Nature Communications, 16(1):3252.
- [50] Ma, C. and Ying, L. (2021). On linear stability of SGD and input-smoothness of neural networks. Advances in Neural Information Processing Systems, 34:16805–16817.
- [51] Merity, S., Xiong, C., Bradbury, J., and Socher, R. (2016). Pointer sentinel mixture models.
- [52] Molchanov, I. (2017). Theory of Random Sets. Number 87 in Probability Theory and Stochastic Modeling. Springer, second edition.
- [53] Mulayoff, R. and Michaeli, T. (2020). Unique properties of flat minima in deep networks. In International Conference on Machine Learning, pages 7108–7118. PMLR.
- [54] Nanda, N., Chan, L., Lieberum, T., Smith, J., and Steinhardt, J. (2023). Progress measures for grokking via mechanistic interpretability. In The Eleventh International Conference on Learning Representations.
- [55] Neyshabur, B., Bhojanapalli, S., McAllester, D., and Srebro, N. (2017). Exploring generalization in deep learning. Advances in Neural Information Processing Systems, 30.
- [56] Nguyen, T. H., Simsekli, U., Gurbuzbalaban, M., and Richard, G. (2019). First exit time analysis of stochastic gradient descent under heavy-tailed gradient noise. Advances in Neural Information Processing Systems, 32.
- [57]
- [58] Posch, H. A., Hoover, W. G., and Vesely, F. J. (1986). Canonical dynamics of the Nosé oscillator: Stability, order, and chaos. Physical Review A, 33(6):4253.
- [59] Power, A., Burda, Y., Edwards, H., Babuschkin, I., and Misra, V. (2022). Grokking: Generalization beyond overfitting on small algorithmic datasets. arXiv preprint arXiv:2201.02177.
- [60] Prieto, L., Barsbey, M., Mediano, P. A. M., and Birdal, T. (2025). Grokking at the edge of numerical stability. In The Thirteenth International Conference on Learning Representations.
- [61] Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., and Sutskever, I. (2019). Language models are unsupervised multitask learners.
- [62] Rubin, N., Seroussi, I., and Ringel, Z. (2024). Grokking as a first order phase transition in two layer networks. In The Twelfth International Conference on Learning Representations.
- [63] Sagun, L., Evci, U., Guney, V. U., Dauphin, Y., and Bottou, L. (2017). Empirical analysis of the Hessian of over-parametrized neural networks. arXiv preprint arXiv:1706.04454.
- [64] Sasdelli, M., Ajanthan, T., Chin, T.-J., and Carneiro, G. (2021). A chaos theory approach to understand neural network optimization. In 2021 Digital Image Computing: Techniques and Applications (DICTA), pages 1–10. IEEE.
- [65] Simsekli, U., Sagun, L., and Gurbuzbalaban, M. (2019). A tail-index analysis of stochastic gradient noise in deep neural networks. In International Conference on Machine Learning, pages 5827–5837. PMLR.
- [66] Simsekli, U., Sener, O., Deligiannidis, G., and Erdogdu, M. A. (2020). Hausdorff dimension, heavy tails, and generalization in neural networks. Advances in Neural Information Processing Systems, 33:5138–5151.
- [67] Singh Kalra, D., He, T., and Barkeshli, M. (2023). Universal sharpness dynamics in neural network training: Fixed point analysis, edge of stability, and route to chaos. arXiv e-prints, arXiv:2311.
- [68] Tsuzuku, Y., Sato, I., and Sugiyama, M. (2020). Normalized flat minima: Exploring scale invariant definition of flat minima for neural networks using PAC-Bayesian analysis. In International Conference on Machine Learning, pages 9636–9647. PMLR.
- [69]
- [70] Van Erven, T. and Harremos, P. (2014). Rényi divergence and Kullback–Leibler divergence. IEEE Transactions on Information Theory, 60(7):3797–3820.
- [71] Wang, Z., Li, Z., and Li, J. (2022). Analyzing sharpness along GD trajectory: Progressive sharpening and edge of stability. Advances in Neural Information Processing Systems, 35:9983–9994.
- [72] Wen, K., Li, Z., and Ma, T. (2023). Sharpness minimization algorithms do not only minimize sharpness to achieve better generalization. Advances in Neural Information Processing Systems, 36:1024–1035.
- [73] Wu, D., Xia, S.-T., and Wang, Y. (2020). Adversarial weight perturbation helps robust generalization. Advances in Neural Information Processing Systems, 33:2958–2969.
- [74]
- [75] Yao, Z., Gholami, A., Lei, Q., Keutzer, K., and Mahoney, M. W. (2018). Hessian-based analysis of large batch training and robustness to adversaries. Advances in Neural Information Processing Systems, 31.
- [76] Yunis, D. (2017). The Birkhoff ergodic theorem with applications. The University of Chicago.
- [77] Zhang, Y., Chen, C., Ding, T., Li, Z., Sun, R., and Luo, Z. (2024). Why transformers need Adam: A Hessian perspective. Advances in Neural Information Processing Systems, 37:131786–131823.
- [78] Zheng, Y., Zhang, R., and Mao, Y. (2021). Regularizing neural networks via adversarial model perturbation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8156–8165.
- [79]