arxiv: 1907.10732 · v1 · pith:4KB2W4QNnew · submitted 2019-07-24 · 💻 cs.LG · stat.ML

Hessian based analysis of SGD for Deep Nets: Dynamics and Generalization

Pith reviewed 2026-05-24 16:40 UTC · model grok-4.3

classification 💻 cs.LG stat.ML
keywords SGD · Hessian · deep neural networks · optimization dynamics · generalization bounds · stochastic gradients · scale invariance · adaptive step sizes

The pith

The Hessian of the training loss characterizes SGD dynamics through gradient moments and yields a scale-invariant generalization bound for deep nets.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper investigates three questions about SGD in deep networks by relating quantities derived from the Hessian of the training loss to the first and second moments of stochastic gradients. It characterizes the trajectories of SGD under fixed step sizes, adaptive step sizes, and diagonal preconditioning. It also constructs a generalization bound that stays invariant under parameter rescaling even though the Hessian itself is not. A reader would care because these relations tie the curvature of the loss directly to both the path taken during training and the final performance on unseen data.

Core claim

The authors show that the Hessian of the training loss is linked to the second moment of stochastic gradients, which in turn governs the stochastic dynamics of SGD for fixed and adaptive step sizes with diagonal preconditioning. They further derive a generalization bound expressed in terms of the Hessian that is invariant to scaling of the network parameters, supported by experiments on synthetic data, MNIST, and CIFAR-10 across varying batch sizes and label noise levels.

What carries the argument

The Hessian matrix of the training loss, which connects loss curvature to the second moment of stochastic gradients and supplies the basis for a scale-invariant bound.
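
A worked identity may make this link concrete. The following is a minimal sketch, assuming each per-sample loss is a negative log-likelihood ℓ_i(θ) = −log p_i(θ). The symbols M and H_p are chosen here to echo the figure captions; whether they coincide with the paper's exact definitions is an assumption.

```latex
\begin{align*}
\ell_i(\theta) &= -\log p_i(\theta), \qquad
\nabla \ell_i = -\frac{\nabla p_i}{p_i}, \\
\nabla^2 \ell_i &= \frac{\nabla p_i \nabla p_i^\top}{p_i^2} - \frac{\nabla^2 p_i}{p_i}
  = \nabla \ell_i \nabla \ell_i^\top - \frac{\nabla^2 p_i}{p_i}, \\
H_f(\theta) &= \frac{1}{n}\sum_{i=1}^n \nabla^2 \ell_i
  = \underbrace{\frac{1}{n}\sum_{i=1}^n \nabla \ell_i \nabla \ell_i^\top}_{M(\theta)}
  - \underbrace{\frac{1}{n}\sum_{i=1}^n \frac{\nabla^2 p_i}{p_i}}_{H_p(\theta)}.
\end{align*}
```

Under this reading, wherever all per-sample gradients are small, M is close to zero and H_f ≈ −H_p, which is consistent with the Figure 1 caption's remark that H_p stays close to −H_f after SGD converges.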

If this is right

  • SGD with fixed step sizes follows dynamics determined by the first and second moments of stochastic gradients.
  • Adaptive step sizes and diagonal preconditioning admit analogous characterizations using the same moments.
  • A generalization bound for deep nets can be stated directly from the Hessian in a form that remains unchanged under parameter scaling.
  • Empirical verification on MNIST and CIFAR-10 across batch sizes and label noise supports the characterizations.
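
The first bullet can be illustrated with a second-order Taylor expansion of one SGD step: E[f(θ − ηg)] − f(θ) ≈ −η‖μ‖² + (η²/2)·tr(H·M), where μ and M are the first and second moments of the stochastic gradient g. This is a standard identity, not the paper's theorem. The sketch below checks it on a hypothetical least-squares problem, where the expansion is exact because the loss is quadratic. All data, names, and the batch size of 1 are assumptions made for illustration.

```python
# Toy least-squares problem: f(theta) = (1/2n) * sum_i (a_i . theta - b_i)^2.
# All data and function names are illustrative, not taken from the paper.

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def loss(theta, A, b):
    return sum((dot(a, theta) - bi) ** 2 for a, bi in zip(A, b)) / (2 * len(A))

def per_sample_grads(theta, A, b):
    # Gradient of the i-th term 0.5 * (a_i . theta - b_i)^2.
    return [[(dot(a, theta) - bi) * x for x in a] for a, bi in zip(A, b)]

def expected_change_from_moments(theta, A, b, eta):
    n, d = len(A), len(theta)
    grads = per_sample_grads(theta, A, b)
    # First moment mu = E[g]; for uniform single-sample steps it is the full gradient.
    mu = [sum(g[j] for g in grads) / n for j in range(d)]
    # Second moment M = E[g g^T] and Hessian H = (1/n) * sum_i a_i a_i^T.
    M = [[sum(g[j] * g[k] for g in grads) / n for k in range(d)] for j in range(d)]
    H = [[sum(a[j] * a[k] for a in A) / n for k in range(d)] for j in range(d)]
    trace_HM = sum(H[j][k] * M[k][j] for j in range(d) for k in range(d))
    return -eta * dot(mu, mu) + 0.5 * eta ** 2 * trace_HM

def expected_change_by_enumeration(theta, A, b, eta):
    # Average f(theta - eta * g_i) - f(theta) over all n single-sample steps.
    grads = per_sample_grads(theta, A, b)
    base = loss(theta, A, b)
    steps = [loss([t - eta * gj for t, gj in zip(theta, g)], A, b) - base
             for g in grads]
    return sum(steps) / len(steps)

A = [[1.0, 0.0], [0.5, 1.0], [-1.0, 2.0], [2.0, -0.5]]
b = [1.0, 0.0, 2.0, -1.0]
theta = [0.3, -0.7]
eta = 0.05
predicted = expected_change_from_moments(theta, A, b, eta)
observed = expected_change_by_enumeration(theta, A, b, eta)
print(predicted, observed)
```

For a non-quadratic loss such as a deep network, the two numbers would agree only up to higher-order terms in η, so the match here should not be read as evidence about the paper's regime.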

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Step-size schedules could be chosen by tracking the evolving Hessian during training rather than by cross-validation alone.
  • The same Hessian-moment link might be used to compare the trajectories of other first-order methods such as momentum variants.
  • If the bound holds, it supplies a practical diagnostic for when overparameterized models are likely to generalize without explicit regularization terms.

Load-bearing premise

Quantities derived from the Hessian of the training loss alone are sufficient to characterize both the SGD dynamics and a scale-invariant generalization bound without further unstated assumptions on the loss landscape or data distribution.

What would settle it

An experiment on MNIST or CIFAR-10 in which the observed SGD trajectories with fixed or adaptive steps deviate measurably from the paths predicted by the first and second moments of the gradients relative to the computed Hessian, or in which test error violates the proposed Hessian-based bound.

Figures

Figures reproduced from arXiv: 1907.10732 by Arindam Banerjee, Qilong Gu, Tiancong Chen, Xinyan Li, Yingxue Zhou.

Figure 1: Eigen-spectrum dynamics of H_f(θ_t) (left), M_t (middle), and H_p(θ_t) (right). The network is trained on the Gauss-10 dataset with small batches containing one twentieth of the training samples (5/100). H_p remains significant even after SGD converges, and is close to −H_f(θ_t).

Figure 2: Dynamics of top 15 principal angles between …

Figure 3: Dynamics of top 15 principal angles between …

Figure 4: Dynamics of top 15 principal angles between …

Figure 5: Notations for layer-wise Hessian analysis.

Figure 6: Eigen-spectrum dynamics of H_h and G_h, h = 0, 1, 2, for networks trained on the Gauss-10 dataset. (a) and (b): small batches containing one twentieth of the training samples (5/100); (c) and (d): large batches containing half of the training samples (50/100). All G_h are positive semi-definite matrices whose top eigenvalues have the same order of magnitude, indicating that the top few large eigenvalues of H_f(θ_t…

Figure 7: Layer-wise eigenvector loadings for networks trained on the Gauss-10 dataset. (a) and (b): small …

Figure 8: Gauss-10: Dynamics of the loss f(θ_t) (left), the angle of two successive SGs cos(g_t, g_{t−1}) (middle), and the norm of the SGs ‖g_t‖₂ (right) at different quantiles of f(θ_t). From (21), every matrix G_h is positive semi-definite (PSD) since ∇²φ(θ), the Hessian of the logistic loss, is PSD. The definitions of G_h and H_h have been depicted in …

Figure 9: Dynamics of the distribution Δ_t(f) = f(θ_{t+1}) − f(θ_t) conditioned at θ_t for the Gauss-10 dataset trained with small batches containing one twentieth of the training samples (5/100). The red horizontal line highlights the value Δ_t(f) = 0. The loss-difference dynamics mainly consist of two phases: (1) the mean of Δ_t(f) decreases with an increase of variance (see (a): iteration 1 to 15, and (b): iteration 1 to 100)…

Figure 10: The dynamics of the variance of Δ_t(f) = f(θ_{t+1}) − f(θ_t) conditioned at θ_t during training. The variance increases sharply within a short period at the beginning, then continues to decrease until convergence. For both easy and hard problems with various batch sizes, the variance exhibits similar behavior.

Figure 11: Gauss-10, batch size 5: Distributions from 10,000 runs. Note that (b), (c) and (d) are scale-invariant

Figure 12: Eigen-spectrum dynamics of H_f(θ_t) (left), M_t (middle), and H_p(θ_t) (right) for the Gauss-10 dataset trained with large batches containing half of the training samples (50/100). H_p(θ_t) remains significant even after SGD converges, and is close to −H_f(θ_t).

Figure 13: Eigen-spectrum dynamics of H_f(θ_t) (left), M_t (middle), and H_p(θ_t) (right) for the Gauss-2 dataset. (a) and (b): small batches containing one twentieth of the training samples (5/100); (c) and (d): large batches containing half of the training samples (50/100). H_p(θ_t) remains significant even after SGD converges, and is close to −H_f(θ_t).

Figure 14: Dynamics of principal angles of top 15 eigenvector space between …

Figure 15: Dynamics of principal angles of top 5 eigenvector space between …

Figure 16: Davis-Kahan sin θ(u, v) = √(1 − (uᵀv)²) for the Gauss-10 dataset. (a) and (b): small batches containing one twentieth of the training samples (5/100); (c) and (d): large batches containing half of the training samples (50/100). sin θ(u, v) > 0.5 serves as extra evidence to suggest the primary subspaces spanned by the top eigenvectors of M_t and H_f(θ_t) significantly overlap as training proceeds. However…

Figure 17: Davis-Kahan sin θ(u, v) = √(1 − (uᵀv)²) for the Gauss-2 dataset. (a) and (b): small batches containing one twentieth of the training samples (5/100); (c) and (d): large batches containing half of the training samples (50/100). sin θ(u, v) > 0.5 serves as extra evidence to suggest the primary subspaces spanned by the top eigenvectors of M_t and H_f(θ_t) significantly overlap as training proceeds. However,…

Figure 18: Dynamics of the loss f(θ_t) (left), the angle of two successive SGs cos(g_t, g_{t−1}) (middle), and the norm of the SGs ‖g_t‖₂ (right) at different quantiles of f(θ_t) for the Gauss-10 dataset trained with large batches containing half of the training samples (50/100).

Figure 19: Dynamics of the first 100 iterations for the loss …

Figure 20: Dynamics of the loss f(θ_t) (left), the angle of two successive SGs cos(g_t, g_{t−1}) (middle), and the norm of the SGs ‖g_t‖₂ (right) at different quantiles of f(θ_t) for the Gauss-2 dataset. (a) and (b): small batches containing one twentieth of the training samples (5/100); (c) and (d): large batches containing half of the training samples (50/100).

Figure 21: Dynamics of the loss f(θ_t) (left), the angle of two successive SGs cos(g_t, g_{t−1}) (middle), and the norm of the SGs ‖g_t‖₂ (right) for the MNIST dataset, with networks of 3 to 6 hidden layers and various batch sizes. Networks have more than 100,000 parameters. SGD dynamics behave similarly to our Gauss-k datasets with random labels.

Figure 22: Dynamics of the loss f(θ_t) (left), the angle of two successive SGs cos(g_t, g_{t−1}) (middle), and the norm of the SGs ‖g_t‖₂ (right) for the CIFAR-10 dataset with different network architectures and batch sizes. Networks have approximately 1 million parameters. SGD dynamics exhibit similar behavior to Gauss-k datasets.

Figure 23: Dynamics of the distribution Δ_t(f) = f(θ_{t+1}) − f(θ_t) conditioned at θ_t for Gauss-10 datasets trained with large batches containing half of the training samples (50/100). The red horizontal line highlights the value Δ_t(f) = 0. The loss-difference dynamics mainly consist of two phases: (1) the mean of Δ_t(f) decreases with an increase of variance; (2) the mean of Δ_t(f) increases and reaches 0 while the variance shr…

Figure 24: Dynamics of the distribution Δ_t(f) = f(θ_{t+1}) − f(θ_t) conditioned at θ_t for Gauss-2 datasets trained with large batches containing half of the training samples (50/100). The red horizontal line highlights the value Δ_t(f) = 0. The loss-difference dynamics change from two phases to three phases as the difficulty level of the problem increases.

Figure 25: Distributions from 10,000 runs on Gauss-10 datasets trained with large batches containing half …
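
The Davis-Kahan quantity quoted in the Figure 16 and Figure 17 captions, sin θ(u, v) = √(1 − (uᵀv)²), can be computed as follows for two unit vectors. This is a minimal sketch: the 2×2 matrices are hypothetical stand-ins for a Hessian and a gradient second-moment matrix, and power iteration is used here only for self-containment (it is not claimed to be the paper's eigen-solver).

```python
import math

def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def top_eigenvector(S, iters=500):
    # Power iteration; assumes S is symmetric with a unique, positive
    # largest-magnitude eigenvalue (an assumption of this sketch).
    v = normalize([1.0] * len(S))
    for _ in range(iters):
        v = normalize([sum(sij * vj for sij, vj in zip(row, v)) for row in S])
    return v

def sin_theta(u, v):
    # sqrt(1 - (u^T v)^2) for unit vectors; invariant to the sign of u or v.
    c = sum(a * b for a, b in zip(u, v))
    return math.sqrt(max(0.0, 1.0 - c * c))

# Hypothetical 2x2 stand-ins for a Hessian and a gradient second-moment matrix.
H_toy = [[2.0, 0.0], [0.0, 1.0]]
M_toy = [[2.0, 0.3], [0.3, 1.0]]
u = top_eigenvector(H_toy)
v = top_eigenvector(M_toy)
print(sin_theta(u, v))
```

A value near 0 means the two top eigenvectors are nearly aligned, and a value near 1 means they are nearly orthogonal.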

Original abstract

While stochastic gradient descent (SGD) and variants have been surprisingly successful for training deep nets, several aspects of the optimization dynamics and generalization are still not well understood. In this paper, we present new empirical observations and theoretical results on both the optimization dynamics and generalization behavior of SGD for deep nets based on the Hessian of the training loss and associated quantities. We consider three specific research questions: (1) what is the relationship between the Hessian of the loss and the second moment of stochastic gradients (SGs)? (2) how can we characterize the stochastic optimization dynamics of SGD with fixed and adaptive step sizes and diagonal pre-conditioning based on the first and second moments of SGs? and (3) how can we characterize a scale-invariant generalization bound of deep nets based on the Hessian of the loss, which by itself is not scale invariant? We shed light on these three questions using theoretical results supported by extensive empirical observations, with experiments on synthetic data, MNIST, and CIFAR-10, with different batch sizes, and with different difficulty levels by synthetically adding random labels.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims to characterize (1) the relationship between the Hessian of the training loss and the second moment of stochastic gradients, (2) the stochastic dynamics of SGD (fixed and adaptive step sizes, diagonal preconditioning) in terms of first and second moments of stochastic gradients, and (3) a scale-invariant generalization bound for deep nets derived from the Hessian (which itself is not scale-invariant), supported by theory and experiments on synthetic data, MNIST, and CIFAR-10 with varying batch sizes and label noise.

Significance. If the derivations hold without hidden assumptions on loss homogeneity or gradient covariance, the work would supply concrete links between Hessian quantities and both optimization trajectories and generalization, with the scale-invariant bound being a potentially useful contribution; the experiments across difficulty levels provide some empirical grounding.

major comments (2)
  1. [Generalization bound section (near Eq. for the bound)] The scale-invariant generalization bound (research question 3) is presented as based on the Hessian, yet the construction appears to require the loss to be approximately homogeneous of degree 2 or explicit per-layer normalization; neither holds in general for ReLU networks with batch-norm and cross-entropy. This assumption is load-bearing for the central claim and must be stated explicitly with a concrete test (e.g., homogeneity check on the trained models).
  2. [Dynamics characterization (Eqs. relating moments to Hessian)] The dynamics characterization (research question 2) replaces the stochastic gradient covariance with the Hessian; the paper must quantify the remainder term and show it does not grow with depth or batch size, as this replacement is central to both the fixed-step and adaptive-step analyses.
minor comments (2)
  1. Clarify the precise definition of 'diagonal pre-conditioning' and how the first and second moments are estimated in the adaptive case.
  2. [Experimental section] The experiments on synthetic data should include a direct comparison of the proposed bound against existing scale-invariant bounds (e.g., those based on weight norms) to demonstrate improvement.
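
One way to run the homogeneity check requested in major comment 1 is to scale all parameters by a scalar c near 1 and estimate the degree k from L(c·θ) / L(θ) ≈ c^k. The sketch below is illustrative only: the bias-free two-layer ReLU net, the data, and both toy losses are assumptions, not the paper's models. It contrasts a loss that is exactly degree 2 with a logistic loss that is not homogeneous.

```python
import math

def relu(z):
    return z if z > 0 else 0.0

def net(theta, x):
    # Bias-free two-layer ReLU net with scalar input; theta = (w, v).
    w, v = theta
    return sum(vj * relu(wj * x) for wj, vj in zip(w, v))

def scale(theta, c):
    # Multiply every parameter by the same scalar c.
    return tuple([c * p for p in layer] for layer in theta)

def abs_loss(theta, data):
    # Mean absolute output (zero targets); degree 2 in theta for this net, c > 0.
    return sum(abs(net(theta, x)) for x, _ in data) / len(data)

def logistic_loss(theta, data):
    # Mean logistic loss with labels y in {-1, +1}; not homogeneous in theta.
    return sum(math.log1p(math.exp(-y * net(theta, x))) for x, y in data) / len(data)

def empirical_degree(loss, theta, data, c):
    # If loss(c * theta) == c**k * loss(theta), this returns k.
    return math.log(loss(scale(theta, c), data) / loss(theta, data)) / math.log(c)

theta = ([1.0, -0.5], [0.8, 1.2])
data = [(1.0, 1), (-2.0, 1), (0.5, -1), (3.0, 1)]
for c in (0.9, 0.95, 1.05, 1.1):
    print(c,
          empirical_degree(abs_loss, theta, data, c),
          empirical_degree(logistic_loss, theta, data, c))
```

On a real trained model the same `empirical_degree` helper would be applied to the training loss itself. An estimate far from 2 would indicate that the degree-2 assumption does not hold for that model.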

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and constructive comments. Below we respond point-by-point to the two major comments, indicating the revisions we will make.

  1. Referee: [Generalization bound section (near Eq. for the bound)] The scale-invariant generalization bound (research question 3) is presented as based on the Hessian, yet the construction appears to require the loss to be approximately homogeneous of degree 2 or explicit per-layer normalization; neither holds in general for ReLU networks with batch-norm and cross-entropy. This assumption is load-bearing for the central claim and must be stated explicitly with a concrete test (e.g., homogeneity check on the trained models).

    Authors: We agree that the scale-invariant bound derivation relies on the training loss behaving approximately homogeneously of degree 2, which arises from the combination of batch normalization (which normalizes scale per layer) and the positive homogeneity of ReLU activations. This property does not hold for arbitrary networks but is a reasonable approximation for the standard architectures and training regimes examined in the paper. In the revision we will (i) explicitly state the homogeneity assumption in the generalization section and (ii) add a concrete numerical verification: for each trained model we will report the empirical degree by evaluating L(c·θ) / L(θ) for several scalars c near 1 and confirming that the ratio is close to c². These checks will be performed on the MNIST and CIFAR-10 models already present in the experiments.

    Revision: yes

  2. Referee: [Dynamics characterization (Eqs. relating moments to Hessian)] The dynamics characterization (research question 2) replaces the stochastic gradient covariance with the Hessian; the paper must quantify the remainder term and show it does not grow with depth or batch size, as this replacement is central to both the fixed-step and adaptive-step analyses.

    Authors: The substitution of the stochastic-gradient second-moment matrix by the Hessian follows from the standard Gauss-Newton / Fisher approximation that becomes exact when the model predictions match the labels (or in the large-batch limit). We acknowledge that a rigorous bound on the remainder is desirable. In the revised manuscript we will add (i) an explicit expression for the remainder term involving third- and fourth-order derivatives and (ii) both theoretical scaling arguments and empirical measurements (on the same synthetic, MNIST, and CIFAR-10 networks) demonstrating that the relative size of the remainder stays bounded and does not increase materially with depth or batch size within the regimes studied. These additions will be placed immediately after the dynamics equations.

    Revision: yes
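
The Fisher-type relation invoked in the second response can be checked in its textbook form on logistic regression: the Hessian of the mean loss equals the second moment of per-sample gradients once each label is averaged over the model's own predictive distribution, while for fixed observed labels the two matrices generally differ. The sketch below uses hypothetical data and is not the paper's derivation. The gap it prints for observed labels is the kind of remainder the referee asks to be quantified.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def probs(theta, X):
    return [sigmoid(sum(t * xj for t, xj in zip(theta, x))) for x in X]

def weighted_outer_mean(weights, X):
    # (1/n) * sum_i weights[i] * x_i x_i^T
    n, d = len(X), len(X[0])
    return [[sum(w * x[j] * x[k] for w, x in zip(weights, X)) / n
             for k in range(d)] for j in range(d)]

def hessian(theta, X):
    # Hessian of the mean logistic loss; it does not depend on the labels.
    return weighted_outer_mean([p * (1 - p) for p in probs(theta, X)], X)

def second_moment(theta, X, y):
    # Second moment of per-sample gradients g_i = (p_i - y_i) * x_i, y_i in {0, 1}.
    return weighted_outer_mean(
        [(p - yi) ** 2 for p, yi in zip(probs(theta, X), y)], X)

def model_expected_second_moment(theta, X):
    # Same matrix with each label averaged over y_i ~ Bernoulli(p_i).
    return weighted_outer_mean(
        [p * (1 - p) ** 2 + (1 - p) * p ** 2 for p in probs(theta, X)], X)

def max_abs_diff(A, B):
    return max(abs(a - b) for ra, rb in zip(A, B) for a, b in zip(ra, rb))

X = [[1.0, 2.0], [-1.0, 0.5], [0.5, -1.5]]
y = [1, 0, 1]
theta = [0.4, -0.3]
H = hessian(theta, X)
gap_model = max_abs_diff(H, model_expected_second_moment(theta, X))
gap_observed = max_abs_diff(H, second_moment(theta, X, y))
print(gap_model, gap_observed)
```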

Circularity Check

0 steps flagged

No significant circularity in derivation chain.

full rationale

The paper's abstract and described claims present three research questions on relationships between Hessian and stochastic gradient moments, SGD dynamics characterization, and a scale-invariant generalization bound. These are addressed via theoretical results supported by empirical observations on synthetic data, MNIST, and CIFAR-10. No quoted equations or steps reduce by construction to fitted parameters, self-definitions, or self-citation chains; the characterizations are framed as independent derivations from first/second moments and Hessian quantities, with no load-bearing self-citations or ansatzes smuggled in. The derivation chain is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no identifiable free parameters, axioms, or invented entities; the central claims rest on unstated background assumptions about the loss and data that cannot be enumerated.

pith-pipeline@v0.9.0 · 5730 in / 1072 out tokens · 20777 ms · 2026-05-24T16:40:23.814108+00:00 · methodology

Reference graph

Works this paper leans on

82 extracted references · 82 canonical work pages · 20 internal anchors

  1. [1]

    On the Convergence Rate of Training Recurrent Neural Networks

    Zeyuan Allen-Zhu, Yuanzhi Li, and Zhao Song. On the convergence rate of training recurrent neural networks. arXiv:1810.12065, 2018

  2. [2]

    Methods of Information Geometry, volume 191

    Shun-ichi Amari and Hiroshi Nagaoka. Methods of Information Geometry, volume 191. 01 2000

  3. [3]

    A convergence analysis of gradient descent for deep linear neural networks

    Sanjeev Arora, Nadav Cohen, Noah Golowich, and Wei Hu. A convergence analysis of gradient descent for deep linear neural networks. In ICLR, 2019

  4. [4]

    On the Optimization of Deep Networks: Implicit Acceleration by Overparameterization

    Sanjeev Arora, Nadav Cohen, and Elad Hazan. On the optimization of deep networks: Implicit acceleration by overparameterization. arXiv:1802.06509, 2018

  5. [5]

    Spectrally-normalized margin bounds for neural networks

    Peter L Bartlett, Dylan J Foster, and Matus J Telgarsky. Spectrally-normalized margin bounds for neural networks. In NIPS, 2017

  6. [6]

    Bertsekas

    D.P. Bertsekas. Nonlinear Programming. Athena Scientific, 1999

  7. [7]

    Concentration inequalities: A nonasymptotic theory of independence

    St´ephane Boucheron, G´abor Lugosi, and Pascal Massart. Concentration inequalities: A nonasymptotic theory of independence. Oxford university press, 2013

  8. [8]

    SGD learns over- parameterized networks that provably generalize on linearly separable data

    Alon Brutzkus, Amir Globerson, Eran Malach, and Shai Shalev-Shwartz. SGD learns over- parameterized networks that provably generalize on linearly separable data. In ICLR, 2018

  9. [9]

    Casella and R.L

    G. Casella and R.L. Berger. Statistical Inference. Duxbury advanced series. Brooks/Cole Publishing Company, 1990

  10. [10]

    Entropy-SGD: Biasing Gradient Descent Into Wide Valleys

    Pratik Chaudhari, Anna Choromanska, Stefano Soatto, Yann LeCun, Carlo Baldassi, Christian Borgs, Jennifer Chayes, Levent Sagun, and Riccardo Zecchina. Entropy-sgd: Biasing gradient descent into wide valleys. arXiv:1611.01838, 2016

  11. [11]

    Stochastic gradient descent performs variational inference, converges to limit cycles for deep networks

    Pratik Chaudhari and Stefano Soatto. Stochastic gradient descent performs variational inference, converges to limit cycles for deep networks. In ICLR, 2018

  12. [12]

    Cover and Joy A

    Thomas M. Cover and Joy A. Thomas. Elements of Information Theory (Wiley Series in Telecommuni- cations and Signal Processing). Wiley-Interscience, New York, NY , USA, 2006

  13. [13]

    Davis and W

    C. Davis and W. Kahan. The rotation of eigenvectors by a perturbation. iii. SIAM Journal on Numerical Analysis, 7(1):1–46, 1970

  14. [14]

    Sharp Minima Can Generalize For Deep Nets

    Laurent Dinh, Razvan Pascanu, Samy Bengio, and Yoshua Bengio. Sharp minima can generalize for deep nets. arXiv:1703.04933, 2017

  15. [15]

    Gradient descent finds global minima of deep neural networks

    Simon S Du, Jason D Lee, Haochuan Li, Liwei Wang, and Xiyu Zhai. Gradient descent finds global minima of deep neural networks. In ICML, 2019. 27

  16. [16]

    Du, Jason D

    Simon S. Du, Jason D. Lee, Yuandong Tian, Barnab´as P´oczos, and Aarti Singh. Gradient descent learns one-hidden-layer CNN: Don’t be afraid of spurious local minima. In ICML, pages 1339–1348. PMLR, 10–15 Jul 2018

  17. [17]

    Du, Xiyu Zhai, Barnabas Poczos, and Aarti Singh

    Simon S. Du, Xiyu Zhai, Barnabas Poczos, and Aarti Singh. Gradient descent provably optimizes over-parameterized neural networks. In ICLR, 2019

  18. [18]

    An overview on the evolution and adoption of deep learning applications used in the industry

    Sourav Dutta. An overview on the evolution and adoption of deep learning applications used in the industry. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 8(4):e1257, 2018

  19. [19]

    Ghadimi and G

    S. Ghadimi and G. Lan. Stochastic first- and zeroth-order methods for nonconvex stochastic program- ming. SIAM Journal on Optimization, 23(4):2341–2368, 2013

  20. [20]

    An Investigation into Neural Net Optimization via Hessian Eigenvalue Density

    Behrooz Ghorbani, Shankar Krishnan, and Ying Xiao. An investigation into neural net optimization via hessian eigenvalue density. arXiv preprint arXiv:1901.10159, 2019

  21. [21]

    Golub and Charles F

    Gene H. Golub and Charles F. van Loan. Matrix Computations. JHU Press, fourth edition, 2013

  22. [22]

    Deep Learning

    Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016. http: //www.deeplearningbook.org

  23. [23]

    Gradient Descent Happens in a Tiny Subspace

    Guy Gur-Ari, Daniel A Roberts, and Ethan Dyer. Gradient descent happens in a tiny subspace. arXiv preprint arXiv:1812.04754, 2018

  24. [24]

    Naive Set Theory

    Paul Halmos. Naive Set Theory. Van Nostrand, 1960

  25. [25]

    Flat minima

    Sepp Hochreiter and J ¨urgen Schmidhuber. Flat minima. Neural Computation, 9(1):1–42, 1997

  26. [26]

    A tail inequality for quadratic forms of subgaussian random vectors

    Daniel Hsu, Sham Kakade, and Tong Zhang. A tail inequality for quadratic forms of subgaussian random vectors. Electronic Communications in Probability, 17:6 pp., 2012

  27. [27]

    Probability essentials

    Jean Jacod and Philip Protter. Probability essentials. Springer Science & Business Media, 2012

  28. [28]

    Stanislaw Jastrzebski, Zachary Kenton, Devansh Arpit, Nicolas Ballas, Asja Fischer, Yoshua Bengio, and Amos J. Storkey. Three factors influencing minima in SGD. In ICANN, 2018

  29. [29]

    Stanislaw Jastrzebski, Zachary Kenton, Nicolas Ballas, Asja Fischer, Yoshua Bengio, and Amos J. Storkey. Dnn’s sharpest directions along the sgd trajectory. arXiv:1807.05031, 2018

  30. [30]

    On large-batch training for deep learning: Generalization gap and sharp minima

    Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping Tak Peter Tang. On large-batch training for deep learning: Generalization gap and sharp minima. In ICLR, 2017

  31. [31]

    Kingma and Jimmy Ba

    Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015

  32. [32]

    Information theory and dynamical system predictability

    Richard Kleeman. Information theory and dynamical system predictability. Entropy, 13(3):612–649, 2011

  33. [33]

    Learning Multiple Layers of Features from Tiny Images

    Alex Krizhevsky. Learning Multiple Layers of Features from Tiny Images. Technical Report V ol. 1. No. 4., University of Toronto, 2009

  34. [34]

    An iteration method for the solution of the eigenvalue problem of linear differential and integral operators

    Cornelius Lanczos. An iteration method for the solution of the eigenvalue problem of linear differential and integral operators. J. Res. Natl. Bur. Stand. B, 45:255–282, 1950. 28

  35. [35]

    Pac-bayes & margins

    John Langford and John Shawe-Taylor. Pac-bayes & margins. In NIPS, 2003

  36. [36]

    Brownian Motion, Martingales, and Stochastic Calculus, volume 274

    Jean-Francois Le Gall. Brownian Motion, Martingales, and Stochastic Calculus, volume 274. Springer, 01 2016

  37. [37]

    Deep learning

    Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521:436 EP –, 05 2015

  38. [38]

    Gradient-based learning applied to document recognition

    Yann LeCun, L´eon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998

  39. [39]

    Probability in Banach Spaces: isoperimetry and processes

    Michel Ledoux and Michel Talagrand. Probability in Banach Spaces: isoperimetry and processes . Springer, Berlin, May 1991

  40. [40]

    Lehmann and G

    E.L. Lehmann and G. Casella. Theory of Point Estimation. Springer Verlag, 1998

  41. [41]

    Learning overparameterized neural networks via stochastic gradient descent on structured data

    Yuanzhi Li and Yingyu Liang. Learning overparameterized neural networks via stochastic gradient descent on structured data. In NIPS, 2018

  42. [42]

    The Power of Interpolation: Understanding the Effectiveness of SGD in Modern Over-parametrized Learning

    Siyuan Ma, Raef Bassily, and Mikhail Belkin. The power of interpolation: Understanding the effective- ness of sgd in modern over-parametrized learning. arXiv:1712.06559, 2017

  43. [43]

    Dougal Maclaurin, David Duvenaud, and Ryan P. Adams. Autograd: Effortless gradients in numpy. In ICML AutoML Workshop, 2015

  44. [44]

    Hoffman, and David M

    Stephan Mandt, Matthew D. Hoffman, and David M. Blei. Stochastic gradient descent as approximate bayesian inference. JMLR, 18(1):4873–4907, January 2017

  45. [45]

    New insights and perspectives on the natural gradient method

    James Martens. New insights and perspectives on the natural gradient method. arXiv:1412.1193, Dec 2014

  46. [46]

    Pac-bayesian model averaging

    David A McAllester. Pac-bayesian model averaging. In COLT. ACM, 1999

  47. [47]

    Estimating structured vector autoregressive models

    Igor Melnyk and Arindam Banerjee. Estimating structured vector autoregressive models. In Interna- tional Conference on Machine Learning, pages 830–839, 2016

  48. [48]

    Recent Advances in Deep Learning: An Overview

    Matiur Rahman Minar and Jibon Naher. Recent advances in deep learning: An overview. arXiv:1807.08169, 2018

  49. [49]

    Deterministic PAC-bayesian generalization bounds for deep networks via generalizing noise-resilience

    Vaishnavh Nagarajan and Zico Kolter. Deterministic PAC-bayesian generalization bounds for deep networks via generalizing noise-resilience. In ICLR, 2019

  50. [50]

    Nemirovski, A

    A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4):1574–1609, 2009

  51. [51]

    A PAC-Bayesian Approach to Spectrally-Normalized Margin Bounds for Neural Networks

    Behnam Neyshabur, Srinadh Bhojanapalli, David McAllester, and Nathan Srebro. A pac-bayesian approach to spectrally-normalized margin bounds for neural networks. arXiv:1707.09564, 2017

  52. [52]

    Exploring generalization in deep learning

    Behnam Neyshabur, Srinadh Bhojanapalli, David Mcallester, and Nati Srebro. Exploring generalization in deep learning. In NIPS, 2017

  53. [53]

    The Full Spectrum of Deepnet Hessians at Scale: Dynamics with SGD Training and Sample Size

    Vardan Papyan. The full spectrum of deep net hessians at scale: Dynamics with sample size. arXiv preprint arXiv:1811.07062, 2018. 29

  54. [54]

    Measurements of Three-Level Hierarchical Structure in the Outliers in the Spectrum of Deepnet Hessians

    Vardan Papyan. Measurements of three-level hierarchical structure in the outliers in the spectrum of deepnet hessians. arXiv preprint arXiv:1901.08244, 2019

  55. [55]

    Pearlmutter

    Barak A. Pearlmutter. Fast exact multiplication by the hessian. Neural Comput., 6(1):147–160, January 1994

  56. [56]

    Pedregosa, G

    F. Pedregosa, G. Varoquaux, A. Gramfort, V . Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V . Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duches- nay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011

  57. [57]

    A Scale Invariant Flatness Measure for Deep Network Minima

    Akshay Rangamani, Nam H Nguyen, Abhishek Kumar, Dzung Phan, Sang H Chin, and Trac D Tran. A scale invariant flatness measure for deep network minima. arXiv preprint arXiv:1902.02434, 2019

  58. [58]

    Radhakrishna Rao

    C. Radhakrishna Rao. Information and the accuracy attainable in the estimation of statistical parameters. . Bulletin of the Calcutta Mathematical Society, pages 81–89, 1945

  59. [59]

    Reddi, Satyen Kale, and Sanjiv Kumar

    Sashank J. Reddi, Satyen Kale, and Sanjiv Kumar. On the convergence of adam and beyond. In ICLR, 2018

  60. [60]

    A stochastic approximation method

    Herbert Robbins and Sutton Monro. A stochastic approximation method. Ann. Math. Statist., 22(3):400–407, 09 1951

  61. [61]

    Principles of mathematical analysis

    Walter Rudin. Principles of mathematical analysis. McGraw-Hill Book Co., New York, third edition. International Series in Pure and Applied Mathematics

  63. [63]

    Spurious local minima are common in two-layer relu neural networks

    Itay Safran and Ohad Shamir. Spurious local minima are common in two-layer relu neural networks. In ICML, 2018

  64. [64]

    Eigenvalues of the Hessian in Deep Learning: Singularity and Beyond

    Levent Sagun, Leon Bottou, and Yann LeCun. Eigenvalues of the hessian in deep learning: Singularity and beyond. arXiv:1611.07476, 2016

  65. [65]

    Empirical Analysis of the Hessian of Over-Parametrized Neural Networks

    Levent Sagun, Utku Evci, V. Ugur Güney, Yann Dauphin, and Léon Bottou. Empirical analysis of the hessian of over-parametrized neural networks. arXiv:1706.04454, 2017

  66. [66]

    Introduction to the gamma function

    Pascal Sebah and Xavier Gourdon. Introduction to the gamma function. 2002

  67. [67]

    Exponential Convergence Time of Gradient Descent for One-Dimensional Deep Linear Neural Networks

    Ohad Shamir. Exponential convergence time of gradient descent for one-dimensional deep linear neural networks. arXiv:1809.08587, 2018

  68. [68]

    A Bayesian perspective on generalization and stochastic gradient descent

    Samuel L. Smith and Quoc V. Le. A bayesian perspective on generalization and stochastic gradient descent. In ICLR, 2018

  69. [69]

    Escaping saddle points with adaptive gradient methods

    Matthew Staib, Sashank Reddi, Satyen Kale, Sanjiv Kumar, and Suvrit Sra. Escaping saddle points with adaptive gradient methods. In ICML, pages 5956–5965, 2019

  70. [70]

    Normalized Flat Minima: Exploring Scale Invariant Definition of Flat Minima for Neural Networks using PAC-Bayesian Analysis

    Yusuke Tsuzuku, Issei Sato, and Masashi Sugiyama. Normalized flat minima: Exploring scale invariant definition of flat minima for neural networks using pac-bayesian analysis. arXiv preprint arXiv:1901.04653, 2019

  71. [71]

    Local smoothness in variance reduced optimization

    Daniel Vainsencher, Han Liu, and Tong Zhang. Local smoothness in variance reduced optimization. In Advances in Neural Information Processing Systems 28, pages 2179–2187, 2015

  72. [72]

    Deep learning: A review

    Rocio Vargas, Amir Mosavi, and Ramon Ruiz. Deep learning: A review. Advances in Intelligent Systems and Computing, 5, 08 2017

  73. [73]

    Introduction to the non-asymptotic analysis of random matrices

    Roman Vershynin. Introduction to the non-asymptotic analysis of random matrices. Cambridge University Press, 2012

  74. [74]

    High-dimensional probability: An introduction with applications in data science

    Roman Vershynin. High-dimensional probability: An introduction with applications in data science, volume 47. Cambridge University Press, 2018

  75. [75]

    All of Statistics: A Concise Course in Statistical Inference

    Larry Wasserman. All of Statistics: A Concise Course in Statistical Inference. Springer Publishing Company, Incorporated, 2010

  76. [76]

    Probability with martingales

    David Williams. Probability with martingales. Cambridge University Press, 1991

  77. [77]

    A Walk with SGD

    Chen Xing, Devansh Arpit, Christos Tsirigotis, and Y. Bengio. A walk with sgd. arXiv:1802.08770, 02 2018

  78. [78]

    Positively Scale-Invariant Flatness of ReLU Neural Networks

    Mingyang Yi, Qi Meng, Wei Chen, Zhi-ming Ma, and Tie-Yan Liu. Positively scale-invariant flatness of relu neural networks. arXiv preprint arXiv:1903.02237, 2019

  79. [79]

    Y. Yu, T. Wang, and R. J. Samworth. A useful variant of the Davis–Kahan theorem for statisticians. Biometrika, 102(2):315–323, apr 2015

  80. [80]

    Small nonlinearities in activation functions create bad local minima in neural networks

    Chulhee Yun, Suvrit Sra, and Ali Jadbabaie. Small nonlinearities in activation functions create bad local minima in neural networks. arXiv:1802.03487, February 2018

Showing first 80 references.