Hessian based analysis of SGD for Deep Nets: Dynamics and Generalization

Arindam Banerjee; Qilong Gu; Tiancong Chen; Xinyan Li; Yingxue Zhou

arxiv: 1907.10732 · v1 · pith:4KB2W4QNnew · submitted 2019-07-24 · 💻 cs.LG · stat.ML

Pith reviewed 2026-05-24 16:40 UTC · model grok-4.3

classification: cs.LG, stat.ML

keywords: SGD, Hessian, deep neural networks, optimization dynamics, generalization bounds, stochastic gradients, scale invariance, adaptive step sizes

The pith

The Hessian of the training loss characterizes SGD dynamics through gradient moments and yields a scale-invariant generalization bound for deep nets.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper investigates three questions about SGD in deep networks by relating quantities derived from the Hessian of the training loss to the first and second moments of stochastic gradients. It characterizes the trajectories of SGD under fixed step sizes, adaptive step sizes, and diagonal preconditioning. It also constructs a generalization bound that stays invariant under parameter rescaling even though the Hessian itself is not. A reader would care because these relations tie the curvature of the loss directly to both the path taken during training and the final performance on unseen data.
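The remark that the Hessian itself is not invariant under parameter rescaling can be seen on a toy model. The sketch below assumes a two-parameter model with loss 0.5·(a·b·x − y)², which is an illustration and not taken from the paper. Rescaling (a, b) to (αa, b/α) leaves the prediction and the loss unchanged but changes the Hessian. The weighted quantity Σ θ_i² H_ii is shown only as one example of an invariant combination for this toy model; it is not claimed to be the paper's bound.

```python
import numpy as np

def loss(theta, x=1.0, y=2.0):
    a, b = theta
    return 0.5 * (a * b * x - y) ** 2

def hessian(theta, x=1.0, y=2.0):
    # Analytic Hessian of 0.5 * (a*b*x - y)^2 with respect to (a, b).
    a, b = theta
    r = a * b * x - y                      # residual
    off = r * x + a * b * x ** 2           # mixed second derivative
    return np.array([[(b * x) ** 2, off],
                     [off, (a * x) ** 2]])

def weighted_diag(theta):
    # Illustrative invariant for this toy model: sum_i theta_i^2 * H_ii.
    return float(np.sum(np.asarray(theta) ** 2 * np.diag(hessian(theta))))

alpha = 10.0
theta = np.array([1.0, 1.5])
theta_scaled = np.array([alpha * theta[0], theta[1] / alpha])  # same function

print("loss:          ", loss(theta), loss(theta_scaled))
print("trace(H):      ", np.trace(hessian(theta)), np.trace(hessian(theta_scaled)))
print("weighted diag: ", weighted_diag(theta), weighted_diag(theta_scaled))
```

The loss and the weighted diagonal agree across the two parameterizations, while the Hessian trace differs by more than an order of magnitude.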

Core claim

The authors show that the Hessian of the training loss is linked to the second moment of stochastic gradients, which in turn governs the stochastic dynamics of SGD for fixed and adaptive step sizes with diagonal preconditioning. They further derive a generalization bound expressed in terms of the Hessian that is invariant to scaling of the network parameters, supported by experiments on synthetic data, MNIST, and CIFAR-10 across varying batch sizes and label noise levels.

What carries the argument

The Hessian matrix of the training loss, which connects loss curvature to the second moment of stochastic gradients and supplies the basis for a scale-invariant bound.
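A minimal numerical sketch of this link, assuming plain logistic regression rather than a deep net. For this model the Hessian of the average loss is H = (1/n) Σ p_i(1 − p_i) x_i x_iᵀ and the uncentered second moment of per-sample gradients is M = (1/n) Σ (p_i − y_i)² x_i x_iᵀ. When labels are drawn from the model itself, E[(p_i − y_i)²] = p_i(1 − p_i), so M approaches H for large n; with labels independent of the inputs the two separate. The data, sizes, and names are illustrative assumptions and do not reproduce the paper's experiments.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 20000, 5
X = rng.normal(size=(n, d))
theta = rng.normal(size=d)
p = 1.0 / (1.0 + np.exp(-X @ theta))   # model probabilities p_i

# Hessian of the average logistic loss: (1/n) sum_i p_i (1 - p_i) x_i x_i^T
H = (X * (p * (1.0 - p))[:, None]).T @ X / n

def second_moment(y):
    # Per-sample gradients g_i = (p_i - y_i) x_i; M = (1/n) sum_i g_i g_i^T
    G = (p - y)[:, None] * X
    return G.T @ G / n

def relative_gap(M):
    # Relative Frobenius distance between M and the Hessian H.
    return np.linalg.norm(M - H) / np.linalg.norm(H)

y_model = (rng.random(n) < p).astype(float)      # labels drawn from the model
y_random = (rng.random(n) < 0.5).astype(float)   # labels independent of X

gap_model = relative_gap(second_moment(y_model))
gap_random = relative_gap(second_moment(y_random))
print(f"labels from model: {gap_model:.3f}   random labels: {gap_random:.3f}")
```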

If this is right

SGD with fixed step sizes follows dynamics determined by the first and second moments of stochastic gradients.
Adaptive step sizes and diagonal preconditioning admit analogous characterizations using the same moments.
A generalization bound for deep nets can be stated directly from the Hessian in a form that remains unchanged under parameter scaling.
Empirical verification on MNIST and CIFAR-10 across batch sizes and label noise supports the characterizations.
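How the first and second moments enter the dynamics can be sketched with the standard second-order expansion of one SGD step: E[f(θ − ηg)] − f(θ) ≈ −η‖∇f‖² + (η²/2)·tr(H·M), where M = E[ggᵀ]. The paper's own theorems are not reproduced on this page, so the code below only checks this textbook identity on least squares, where the loss is quadratic and the expansion is exact. The problem sizes and step size are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, eta = 200, 4, 0.05
X = rng.normal(size=(n, d))
y = rng.normal(size=n)
theta = rng.normal(size=d)

def f(t):
    # Average least-squares loss.
    return 0.5 * np.mean((X @ t - y) ** 2)

G = (X @ theta - y)[:, None] * X   # per-sample gradients, shape (n, d)
grad = G.mean(axis=0)              # first moment: the full gradient
M = G.T @ G / n                    # uncentered second moment of the gradients
H = X.T @ X / n                    # Hessian (constant for least squares)

# Exact expected loss change over a uniformly drawn sample (batch size 1).
exact = np.mean([f(theta - eta * g) for g in G]) - f(theta)

# Prediction from the first and second moments.
predicted = -eta * grad @ grad + 0.5 * eta ** 2 * np.trace(H @ M)

print(exact, predicted)
```

For a non-quadratic loss the two numbers differ by a third-order remainder in η, which is where the size of the step and the local change in curvature start to matter.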

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Step-size schedules could be chosen by tracking the evolving Hessian during training rather than by cross-validation alone.
The same Hessian-moment link might be used to compare the trajectories of other first-order methods such as momentum variants.
If the bound holds, it supplies a practical diagnostic for when overparameterized models are likely to generalize without explicit regularization terms.

Load-bearing premise

Quantities derived from the Hessian of the training loss alone are sufficient to characterize both the SGD dynamics and a scale-invariant generalization bound without further unstated assumptions on the loss landscape or data distribution.

What would settle it

An experiment on MNIST or CIFAR-10 in which the observed SGD trajectories with fixed or adaptive steps deviate measurably from the paths predicted by the first and second moments of the gradients relative to the computed Hessian, or in which test error violates the proposed Hessian-based bound.

Figures

Figures reproduced from arXiv: 1907.10732 by Arindam Banerjee, Qilong Gu, Tiancong Chen, Xinyan Li, Yingxue Zhou.

**Figure 1.** Eigen-spectrum dynamics of H_f(θ_t) (left), M_t (middle), and H_p(θ_t) (right). The network is trained on the Gauss-10 dataset with small batches containing one twentieth of the training samples (5/100). H_p remains significant even after SGD converges, and is close to −H_f(θ_t).

**Figure 2.** Dynamics of top 15 principal angles between…

**Figure 3.** Dynamics of top 15 principal angles between…

**Figure 4.** Dynamics of top 15 principal angles between…

**Figure 5.** Notations for layer-wise Hessian analysis.

**Figure 6.** Eigen-spectrum dynamics of H_h and G_h, h = 0, 1, 2, for networks trained on the Gauss-10 dataset. (a) and (b): small batches containing one twentieth of the training samples (5/100); (c) and (d): large batches containing half of the training samples (50/100). All G_h are positive semi-definite matrices whose top eigenvalues have the same order of magnitude, indicating that the top few large eigenvalues of H_f(θ_t…

**Figure 7.** Layer-wise eigenvector loadings for networks trained on the Gauss-10 dataset. (a) and (b): small…

**Figure 8.** Gauss-10: Dynamics of the loss f(θ_t) (left), the angle between two successive SGs, cos(g_t, g_{t−1}) (middle), and the norm of the SGs ‖g_t‖_2 (right) at different quantiles of f(θ_t).

**Figure 9.** Dynamics of the distribution Δ_t(f) = f(θ_{t+1}) − f(θ_t) conditioned at θ_t for the Gauss-10 dataset trained with small batches containing one twentieth of the training samples (5/100). The red horizontal line highlights the value Δ_t(f) = 0. The loss-difference dynamics mainly consist of two phases: (1) the mean of Δ_t(f) decreases with an increase of variance (see (a): iteration 1 to 15, and (b): iteration 1 to 100)…

**Figure 10.** The dynamics of the variance of Δ_t(f) = f(θ_{t+1}) − f(θ_t) conditioned at θ_t during training. The variance increases sharply within a short period at the beginning, then continues to decrease until convergence. For both easy and hard problems with various batch sizes, the variance exhibits similar behavior.

**Figure 11.** Gauss-10, batch size 5: Distributions from 10,000 runs. Note that (b), (c) and (d) are scale-invariant

**Figure 12.** Eigen-spectrum dynamics of H_f(θ_t) (left), M_t (middle), and H_p(θ_t) (right) for the Gauss-10 dataset trained with large batches containing half of the training samples (50/100). H_p(θ_t) remains significant even after SGD converges, and is close to −H_f(θ_t).

**Figure 13.** Eigen-spectrum dynamics of H_f(θ_t) (left), M_t (middle), and H_p(θ_t) (right) for the Gauss-2 dataset. (a) and (b): small batches containing one twentieth of the training samples (5/100); (c) and (d): large batches containing half of the training samples (50/100). H_p(θ_t) remains significant even after SGD converges, and is close to −H_f(θ_t).

**Figure 14.** Dynamics of principal angles of top 15 eigenvector space between…

**Figure 15.** Dynamics of principal angles of top 5 eigenvector space between…

**Figure 16.** Davis-Kahan sin θ(u, v) = √(1 − (uᵀv)²) for the Gauss-10 dataset. (a) and (b): small batches containing one twentieth of the training samples (5/100); (c) and (d): large batches containing half of the training samples (50/100). sin θ(u, v) > 0.5 serves as extra evidence that the primary subspaces spanned by the top eigenvectors of M_t and H_f(θ_t) overlap significantly as training proceeds. However…

**Figure 17.** Davis-Kahan sin θ(u, v) = √(1 − (uᵀv)²) for the Gauss-2 dataset. (a) and (b): small batches containing one twentieth of the training samples (5/100); (c) and (d): large batches containing half of the training samples (50/100). sin θ(u, v) > 0.5 serves as extra evidence that the primary subspaces spanned by the top eigenvectors of M_t and H_f(θ_t) overlap significantly as training proceeds. However,…

**Figure 18.** Dynamics of the loss f(θ_t) (left), the angle between two successive SGs, cos(g_t, g_{t−1}) (middle), and the norm of the SGs ‖g_t‖_2 (right) at different quantiles of f(θ_t) for the Gauss-10 dataset trained with large batches containing half of the training samples (50/100).

**Figure 19.** Dynamics of the first 100 iterations for the loss…

**Figure 20.** Dynamics of the loss f(θ_t) (left), the angle between two successive SGs, cos(g_t, g_{t−1}) (middle), and the norm of the SGs ‖g_t‖_2 (right) at different quantiles of f(θ_t) for the Gauss-2 dataset. (a) and (b): small batches containing one twentieth of the training samples (5/100); (c) and (d): large batches containing half of the training samples (50/100).

**Figure 21.** Dynamics of the loss f(θ_t) (left), the angle between two successive SGs, cos(g_t, g_{t−1}) (middle), and the norm of the SGs ‖g_t‖_2 (right) for MNIST, trained on networks with 3 to 6 hidden layers and various batch sizes. Networks have more than 100,000 parameters. SGD dynamics behave similarly to those on the Gauss-k datasets with random labels.

**Figure 22.** Dynamics of the loss f(θ_t) (left), the angle between two successive SGs, cos(g_t, g_{t−1}) (middle), and the norm of the SGs ‖g_t‖_2 (right) for the CIFAR-10 dataset with different network architectures and batch sizes. Networks have approximately 1 million parameters. SGD dynamics exhibit behavior similar to that on the Gauss-k datasets.

**Figure 23.** Dynamics of the distribution Δ_t(f) = f(θ_{t+1}) − f(θ_t) conditioned at θ_t for Gauss-10 datasets trained with large batches containing half of the training samples (50/100). The red horizontal line highlights the value Δ_t(f) = 0. The loss-difference dynamics mainly consist of two phases: 1) the mean of Δ_t(f) decreases with an increase of variance; 2) the mean of Δ_t(f) increases and reaches 0 while the variance shr…

**Figure 24.** Dynamics of the distribution Δ_t(f) = f(θ_{t+1}) − f(θ_t) conditioned at θ_t for Gauss-2 datasets trained with large batches containing half of the training samples (50/100). The red horizontal line highlights the value Δ_t(f) = 0. The loss-difference dynamics change from two phases to three phases as the difficulty level of the problem increases.

**Figure 25.** Distributions from 10,000 runs on Gauss-10 datasets trained with large batches containing half…
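Several captions above refer to principal angles between top eigenvector subspaces and to the Davis-Kahan sin θ(u, v) = √(1 − (uᵀv)²). A minimal sketch of how such quantities are commonly computed is given below, using the standard fact that the cosines of the principal angles are the singular values of UᵀV for orthonormal bases U and V. The helper names are hypothetical and the paper's own implementation is not shown on this page.

```python
import numpy as np

def top_eigvecs(S, k):
    # eigh returns eigenvalues in ascending order; keep the k largest.
    _, vecs = np.linalg.eigh(S)
    return vecs[:, -k:]

def principal_angles(U, V):
    # U, V: matrices whose orthonormal columns span the two subspaces.
    # Cosines of the principal angles are the singular values of U^T V.
    s = np.linalg.svd(U.T @ V, compute_uv=False)
    return np.arccos(np.clip(s, 0.0, 1.0))   # radians, ascending

def sin_theta(u, v):
    # Sine of the angle between two unit vectors, sqrt(1 - (u^T v)^2).
    return np.sqrt(max(0.0, 1.0 - float(u @ v) ** 2))

E = np.eye(4)
# Subspaces span{e1, e2} and span{e1, e3} share exactly one direction.
angles = principal_angles(E[:, [0, 1]], E[:, [0, 2]])

# The top-2 eigenspace of a diagonal matrix is span{e1, e2}.
S = np.diag([3.0, 2.0, 1.0, 0.5])
same = principal_angles(top_eigvecs(S, 2), E[:, [0, 1]])

print(angles, same, sin_theta(E[:, 0], E[:, 1]))
```

In practice the two inputs would be the top-k eigenvectors of the Hessian and of the second-moment matrix at the same iterate; for large networks those are usually obtained with iterative methods rather than a dense eigendecomposition.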

Original abstract

While stochastic gradient descent (SGD) and variants have been surprisingly successful for training deep nets, several aspects of the optimization dynamics and generalization are still not well understood. In this paper, we present new empirical observations and theoretical results on both the optimization dynamics and generalization behavior of SGD for deep nets based on the Hessian of the training loss and associated quantities. We consider three specific research questions: (1) what is the relationship between the Hessian of the loss and the second moment of stochastic gradients (SGs)? (2) how can we characterize the stochastic optimization dynamics of SGD with fixed and adaptive step sizes and diagonal pre-conditioning based on the first and second moments of SGs? and (3) how can we characterize a scale-invariant generalization bound of deep nets based on the Hessian of the loss, which by itself is not scale invariant? We shed light on these three questions using theoretical results supported by extensive empirical observations, with experiments on synthetic data, MNIST, and CIFAR-10, with different batch sizes, and with different difficulty levels by synthetically adding random labels.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper gives some empirical links between Hessian and gradient second moments plus SGD dynamics sketches, but the scale-invariant generalization bound rests on unstated assumptions about loss homogeneity that standard deep nets do not satisfy.

The paper's main contribution is a set of empirical connections between the Hessian and the second moment of stochastic gradients, together with characterizations of SGD dynamics under fixed and adaptive step sizes. The scale-invariant generalization bound is the weaker part.

The paper does well in posing three specific questions and running experiments on synthetic data, MNIST, and CIFAR-10 across batch sizes, with random labels added to vary difficulty. That yields a range of observations on how these quantities behave.

The soft spot is the generalization claim. The abstract notes that the Hessian is not scale-invariant and offers a bound based on it, but such a bound likely requires the loss to be roughly homogeneous of degree 2 or some form of layer-wise normalization, and neither holds for standard ReLU networks with batch norm and cross-entropy. The experiments do not appear to test whether those conditions are met, so the bound may apply only in limited regimes that are not representative of typical training. The dynamics characterization based on first and second moments also assumes that the stochastic gradient covariance can be tied to the Hessian without large remainder terms, which may not hold as depth grows.

The work is aimed at readers studying curvature-based views of deep-net optimization. Such a reader would get new observations to think about, even if the theory needs tightening. The paper deserves serious referee time because the questions are well posed and the experiments are reasonably broad, though the assumptions behind the generalization section will need scrutiny.

Referee Report

2 major / 2 minor

Summary. The paper claims to characterize (1) the relationship between the Hessian of the training loss and the second moment of stochastic gradients, (2) the stochastic dynamics of SGD (fixed and adaptive step sizes, diagonal preconditioning) in terms of first and second moments of stochastic gradients, and (3) a scale-invariant generalization bound for deep nets derived from the Hessian (which itself is not scale-invariant), supported by theory and experiments on synthetic data, MNIST, and CIFAR-10 with varying batch sizes and label noise.

Significance. If the derivations hold without hidden assumptions on loss homogeneity or gradient covariance, the work would supply concrete links between Hessian quantities and both optimization trajectories and generalization, with the scale-invariant bound being a potentially useful contribution; the experiments across difficulty levels provide some empirical grounding.

major comments (2)

[Generalization bound section (near Eq. for the bound)] The scale-invariant generalization bound (research question 3) is presented as based on the Hessian, yet the construction appears to require the loss to be approximately homogeneous of degree 2 or explicit per-layer normalization; neither holds in general for ReLU networks with batch-norm and cross-entropy. This assumption is load-bearing for the central claim and must be stated explicitly with a concrete test (e.g., homogeneity check on the trained models).
[Dynamics characterization (Eqs. relating moments to Hessian)] The dynamics characterization (research question 2) replaces the stochastic gradient covariance with the Hessian; the paper must quantify the remainder term and show it does not grow with depth or batch size, as this replacement is central to both the fixed-step and adaptive-step analyses.

minor comments (2)

Clarify the precise definition of 'diagonal pre-conditioning' and how the first and second moments are estimated in the adaptive case.
[Experimental section] The experiments on synthetic data should include a direct comparison of the proposed bound against existing scale-invariant bounds (e.g., those based on weight norms) to demonstrate improvement.
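To make the first minor comment concrete, one common reading of diagonal preconditioning is sketched below: each coordinate's step is divided by the square root of a running estimate of that coordinate's second moment, as in RMSProp or Adam. This is an assumption about what the term could mean, not the paper's definition. The function name, the bias correction, and the toy ill-conditioned quadratic are all hypothetical.

```python
import numpy as np

def preconditioned_sgd(grad_fn, theta0, eta=0.05, beta=0.9, eps=1e-8,
                       steps=200, seed=0):
    rng = np.random.default_rng(seed)
    theta = np.array(theta0, dtype=float)
    v = np.zeros_like(theta)               # running diagonal second moment
    for t in range(1, steps + 1):
        g = grad_fn(theta, rng)            # stochastic gradient
        v = beta * v + (1.0 - beta) * g * g
        v_hat = v / (1.0 - beta ** t)      # bias correction, as in Adam
        theta -= eta * g / (np.sqrt(v_hat) + eps)   # per-coordinate step
    return theta

def noisy_grad(theta, rng):
    # Gradient of 0.5 * (100 * t0^2 + t1^2) plus Gaussian noise (toy problem).
    return np.array([100.0, 1.0]) * theta + 0.1 * rng.normal(size=2)

def toy_loss(theta):
    return 0.5 * (100.0 * theta[0] ** 2 + theta[1] ** 2)

theta_start = np.array([1.0, 1.0])
theta_final = preconditioned_sgd(noisy_grad, theta_start)
print(toy_loss(theta_start), toy_loss(theta_final))
```

The estimate of the second moment here is an exponential moving average over the trajectory; an average over the mini-batch at a fixed iterate would be a different estimator, which is the ambiguity the referee asks the authors to resolve.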

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and constructive comments. Below we respond point-by-point to the two major comments, indicating the revisions we will make.

Referee: [Generalization bound section (near Eq. for the bound)] The scale-invariant generalization bound (research question 3) is presented as based on the Hessian, yet the construction appears to require the loss to be approximately homogeneous of degree 2 or explicit per-layer normalization; neither holds in general for ReLU networks with batch-norm and cross-entropy. This assumption is load-bearing for the central claim and must be stated explicitly with a concrete test (e.g., homogeneity check on the trained models).

Authors: We agree that the scale-invariant bound derivation relies on the training loss behaving approximately homogeneously of degree 2, which arises from the combination of batch normalization (which normalizes scale per layer) and the positive homogeneity of ReLU activations. This property does not hold for arbitrary networks but is a reasonable approximation for the standard architectures and training regimes examined in the paper. In the revision we will (i) explicitly state the homogeneity assumption in the generalization section and (ii) add a concrete numerical verification: for each trained model we will report the empirical degree by evaluating L(c·θ) / L(θ) for several scalars c near 1 and confirm that the ratio is close to c². These checks will be performed on the MNIST and CIFAR-10 models already present in the experiments. revision: yes
Referee: [Dynamics characterization (Eqs. relating moments to Hessian)] The dynamics characterization (research question 2) replaces the stochastic gradient covariance with the Hessian; the paper must quantify the remainder term and show it does not grow with depth or batch size, as this replacement is central to both the fixed-step and adaptive-step analyses.

Authors: The substitution of the stochastic-gradient second-moment matrix by the Hessian follows from the standard Gauss-Newton / Fisher approximation that becomes exact when the model predictions match the labels (or in the large-batch limit). We acknowledge that a rigorous bound on the remainder is desirable. In the revised manuscript we will add (i) an explicit expression for the remainder term involving third- and fourth-order derivatives and (ii) both theoretical scaling arguments and empirical measurements (on the same synthetic, MNIST, and CIFAR-10 networks) demonstrating that the relative size of the remainder stays bounded and does not increase materially with depth or batch size within the regimes studied. These additions will be placed immediately after the dynamics equations. revision: yes
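The homogeneity check proposed in the first response, evaluating L(c·θ)/L(θ) for c near 1, can be sketched as an estimate of the exponent k in L(c·θ) = c^k·L(θ). The code below assumes a toy bias-free two-layer ReLU network with made-up data and hypothetical names. Its output is exactly homogeneous of degree 2 in the parameters, which gives a sanity check for the estimator; the same estimator is then applied to a logistic loss on that network so that the two can be compared.

```python
import numpy as np

def empirical_degree(fn, theta, c):
    # If fn(c * theta) = c**k * fn(theta), then k = log(ratio) / log(c).
    return np.log(fn(c * theta) / fn(theta)) / np.log(c)

rng = np.random.default_rng(0)
d, h, n = 3, 8, 50
X = rng.normal(size=(n, d))
y = rng.choice([-1.0, 1.0], size=n)
theta = rng.normal(size=h * d + h)        # flattened (W1, w2), no biases

def net(t):
    # Bias-free two-layer ReLU network evaluated on all n inputs.
    W1, w2 = t[: h * d].reshape(h, d), t[h * d:]
    return np.maximum(X @ W1.T, 0.0) @ w2

def output_scale(t):
    # Mean absolute output: homogeneous of degree 2 in the parameters.
    return np.mean(np.abs(net(t)))

def logistic_loss(t):
    # Average logistic loss log(1 + exp(-y * f(x))).
    return np.mean(np.logaddexp(0.0, -y * net(t)))

deg_output = empirical_degree(output_scale, theta, 1.01)
deg_loss = empirical_degree(logistic_loss, theta, 1.01)
print(f"network output: {deg_output:.4f}   logistic loss: {deg_loss:.4f}")
```

Reporting the estimated exponent rather than the raw ratio makes results comparable across values of c. For this toy setup the logistic loss comes out below degree 2, so the check is informative only when run on the actual trained models.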

Circularity Check

0 steps flagged

No significant circularity in derivation chain.

The paper's abstract and described claims present three research questions on relationships between Hessian and stochastic gradient moments, SGD dynamics characterization, and a scale-invariant generalization bound. These are addressed via theoretical results supported by empirical observations on synthetic data, MNIST, and CIFAR-10. No quoted equations or steps reduce by construction to fitted parameters, self-definitions, or self-citation chains; the characterizations are framed as independent derivations from first/second moments and Hessian quantities, with no load-bearing self-citations or ansatzes smuggled in. The derivation chain is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no identifiable free parameters, axioms, or invented entities; the central claims rest on unstated background assumptions about the loss and data that cannot be enumerated.

pith-pipeline@v0.9.0 · 5730 in / 1072 out tokens · 20777 ms · 2026-05-24T16:40:23.814108+00:00 · methodology

Reference graph

Works this paper leans on

82 extracted references · 82 canonical work pages · 20 internal anchors

[1]

On the Convergence Rate of Training Recurrent Neural Networks

Zeyuan Allen-Zhu, Yuanzhi Li, and Zhao Song. On the convergence rate of training recurrent neural networks. arXiv:1810.12065, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[2]

Methods of Information Geometry, volume 191

Shun-ichi Amari and Hiroshi Nagaoka. Methods of Information Geometry, volume 191. 01 2000

work page 2000
[3]

A convergence analysis of gradient descent for deep linear neural networks

Sanjeev Arora, Nadav Cohen, Noah Golowich, and Wei Hu. A convergence analysis of gradient descent for deep linear neural networks. In ICLR, 2019

work page 2019
[4]

On the Optimization of Deep Networks: Implicit Acceleration by Overparameterization

Sanjeev Arora, Nadav Cohen, and Elad Hazan. On the optimization of deep networks: Implicit acceleration by overparameterization. arXiv:1802.06509, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[5]

Spectrally-normalized margin bounds for neural networks

Peter L Bartlett, Dylan J Foster, and Matus J Telgarsky. Spectrally-normalized margin bounds for neural networks. In NIPS, 2017

work page 2017
[6]

Bertsekas

D.P. Bertsekas. Nonlinear Programming. Athena Scientiﬁc, 1999

work page 1999
[7]

Concentration inequalities: A nonasymptotic theory of independence

St´ephane Boucheron, G´abor Lugosi, and Pascal Massart. Concentration inequalities: A nonasymptotic theory of independence. Oxford university press, 2013

work page 2013
[8]

SGD learns over- parameterized networks that provably generalize on linearly separable data

Alon Brutzkus, Amir Globerson, Eran Malach, and Shai Shalev-Shwartz. SGD learns over- parameterized networks that provably generalize on linearly separable data. In ICLR, 2018

work page 2018
[9]

Casella and R.L

G. Casella and R.L. Berger. Statistical Inference. Duxbury advanced series. Brooks/Cole Publishing Company, 1990

work page 1990
[10]

Entropy-SGD: Biasing Gradient Descent Into Wide Valleys

Pratik Chaudhari, Anna Choromanska, Stefano Soatto, Yann LeCun, Carlo Baldassi, Christian Borgs, Jennifer Chayes, Levent Sagun, and Riccardo Zecchina. Entropy-sgd: Biasing gradient descent into wide valleys. arXiv:1611.01838, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016
[11]

Stochastic gradient descent performs variational inference, converges to limit cycles for deep networks

Pratik Chaudhari and Stefano Soatto. Stochastic gradient descent performs variational inference, converges to limit cycles for deep networks. In ICLR, 2018

work page 2018
[12]

Cover and Joy A

Thomas M. Cover and Joy A. Thomas. Elements of Information Theory (Wiley Series in Telecommuni- cations and Signal Processing). Wiley-Interscience, New York, NY , USA, 2006

work page 2006
[13]

Davis and W

C. Davis and W. Kahan. The rotation of eigenvectors by a perturbation. iii. SIAM Journal on Numerical Analysis, 7(1):1–46, 1970

work page 1970
[14]

Sharp Minima Can Generalize For Deep Nets

Laurent Dinh, Razvan Pascanu, Samy Bengio, and Yoshua Bengio. Sharp minima can generalize for deep nets. arXiv:1703.04933, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[15]

Gradient descent ﬁnds global minima of deep neural networks

Simon S Du, Jason D Lee, Haochuan Li, Liwei Wang, and Xiyu Zhai. Gradient descent ﬁnds global minima of deep neural networks. In ICML, 2019. 27

work page 2019
[16]

Du, Jason D

Simon S. Du, Jason D. Lee, Yuandong Tian, Barnab´as P´oczos, and Aarti Singh. Gradient descent learns one-hidden-layer CNN: Don’t be afraid of spurious local minima. In ICML, pages 1339–1348. PMLR, 10–15 Jul 2018

work page 2018
[17]

Du, Xiyu Zhai, Barnabas Poczos, and Aarti Singh

Simon S. Du, Xiyu Zhai, Barnabas Poczos, and Aarti Singh. Gradient descent provably optimizes over-parameterized neural networks. In ICLR, 2019

work page 2019
[18]

An overview on the evolution and adoption of deep learning applications used in the industry

Sourav Dutta. An overview on the evolution and adoption of deep learning applications used in the industry. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 8(4):e1257, 2018

work page 2018
[19]

Ghadimi and G

S. Ghadimi and G. Lan. Stochastic ﬁrst- and zeroth-order methods for nonconvex stochastic program- ming. SIAM Journal on Optimization, 23(4):2341–2368, 2013

work page 2013
[20]

An Investigation into Neural Net Optimization via Hessian Eigenvalue Density

Behrooz Ghorbani, Shankar Krishnan, and Ying Xiao. An investigation into neural net optimization via hessian eigenvalue density. arXiv preprint arXiv:1901.10159, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1901
[21]

Golub and Charles F

Gene H. Golub and Charles F. van Loan. Matrix Computations. JHU Press, fourth edition, 2013

work page 2013
[22]

Deep Learning

Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016. http: //www.deeplearningbook.org

work page 2016
[23]

Gradient Descent Happens in a Tiny Subspace

Guy Gur-Ari, Daniel A Roberts, and Ethan Dyer. Gradient descent happens in a tiny subspace. arXiv preprint arXiv:1812.04754, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[24]

Naive Set Theory

Paul Halmos. Naive Set Theory. Van Nostrand, 1960

work page 1960
[25]

Flat minima

Sepp Hochreiter and J ¨urgen Schmidhuber. Flat minima. Neural Computation, 9(1):1–42, 1997

work page 1997
[26]

A tail inequality for quadratic forms of subgaussian random vectors

Daniel Hsu, Sham Kakade, and Tong Zhang. A tail inequality for quadratic forms of subgaussian random vectors. Electronic Communications in Probability, 17:6 pp., 2012

work page 2012
[27]

Probability essentials

Jean Jacod and Philip Protter. Probability essentials. Springer Science & Business Media, 2012

work page 2012
[28]

Stanislaw Jastrzebski, Zachary Kenton, Devansh Arpit, Nicolas Ballas, Asja Fischer, Yoshua Bengio, and Amos J. Storkey. Three factors inﬂuencing minima in SGD. In ICANN, 2018

work page 2018
[29]

Stanislaw Jastrzebski, Zachary Kenton, Nicolas Ballas, Asja Fischer, Yoshua Bengio, and Amos J. Storkey. Dnn’s sharpest directions along the sgd trajectory. arXiv:1807.05031, 2018

work page arXiv 2018
[30]

On large-batch training for deep learning: Generalization gap and sharp minima

Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping Tak Peter Tang. On large-batch training for deep learning: Generalization gap and sharp minima. In ICLR, 2017

work page 2017
[31]

Kingma and Jimmy Ba

Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015

work page 2015
[32]

Information theory and dynamical system predictability

Richard Kleeman. Information theory and dynamical system predictability. Entropy, 13(3):612–649, 2011

work page 2011
[33]

Learning Multiple Layers of Features from Tiny Images

Alex Krizhevsky. Learning Multiple Layers of Features from Tiny Images. Technical Report V ol. 1. No. 4., University of Toronto, 2009

work page 2009
[34]

An iteration method for the solution of the eigenvalue problem of linear differential and integral operators

Cornelius Lanczos. An iteration method for the solution of the eigenvalue problem of linear differential and integral operators. J. Res. Natl. Bur. Stand. B, 45:255–282, 1950. 28

work page 1950
[35]

Pac-bayes & margins

John Langford and John Shawe-Taylor. Pac-bayes & margins. In NIPS, 2003

work page 2003
[36]

Brownian Motion, Martingales, and Stochastic Calculus, volume 274

Jean-Francois Le Gall. Brownian Motion, Martingales, and Stochastic Calculus, volume 274. Springer, 01 2016

work page 2016
[37]

Deep learning

Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521:436 EP –, 05 2015

work page 2015
[38]

Gradient-based learning applied to document recognition

Yann LeCun, L´eon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998

work page 1998
[39]

Probability in Banach Spaces: isoperimetry and processes

Michel Ledoux and Michel Talagrand. Probability in Banach Spaces: isoperimetry and processes . Springer, Berlin, May 1991

work page 1991
[40]

Lehmann and G

E.L. Lehmann and G. Casella. Theory of Point Estimation. Springer Verlag, 1998

work page 1998
[41]

Learning overparameterized neural networks via stochastic gradient descent on structured data

Yuanzhi Li and Yingyu Liang. Learning overparameterized neural networks via stochastic gradient descent on structured data. In NIPS, 2018

work page 2018
[42]

The Power of Interpolation: Understanding the Effectiveness of SGD in Modern Over-parametrized Learning

Siyuan Ma, Raef Bassily, and Mikhail Belkin. The power of interpolation: Understanding the effective- ness of sgd in modern over-parametrized learning. arXiv:1712.06559, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[43]

Dougal Maclaurin, David Duvenaud, and Ryan P. Adams. Autograd: Effortless gradients in numpy. In ICML AutoML Workshop, 2015

work page 2015
[44]

Hoffman, and David M

Stephan Mandt, Matthew D. Hoffman, and David M. Blei. Stochastic gradient descent as approximate bayesian inference. JMLR, 18(1):4873–4907, January 2017

work page 2017
[45]

New insights and perspectives on the natural gradient method

James Martens. New insights and perspectives on the natural gradient method. arXiv:1412.1193, Dec 2014

work page arXiv 2014
[46]

Pac-bayesian model averaging

David A McAllester. Pac-bayesian model averaging. In COLT. ACM, 1999

work page 1999
[47]

Estimating structured vector autoregressive models

Igor Melnyk and Arindam Banerjee. Estimating structured vector autoregressive models. In Interna- tional Conference on Machine Learning, pages 830–839, 2016

work page 2016
[48]

Recent Advances in Deep Learning: An Overview

Matiur Rahman Minar and Jibon Naher. Recent advances in deep learning: An overview. arXiv:1807.08169, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[49]

Deterministic PAC-bayesian generalization bounds for deep networks via generalizing noise-resilience

Vaishnavh Nagarajan and Zico Kolter. Deterministic PAC-bayesian generalization bounds for deep networks via generalizing noise-resilience. In ICLR, 2019

work page 2019
[50] A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4):1574–1609, 2009

[51] Behnam Neyshabur, Srinadh Bhojanapalli, David McAllester, and Nathan Srebro. A PAC-Bayesian approach to spectrally-normalized margin bounds for neural networks. arXiv:1707.09564, 2017

[52] Behnam Neyshabur, Srinadh Bhojanapalli, David McAllester, and Nati Srebro. Exploring generalization in deep learning. In NIPS, 2017

[53] Vardan Papyan. The full spectrum of deep net Hessians at scale: Dynamics with sample size. arXiv preprint arXiv:1811.07062, 2018

[54] Vardan Papyan. Measurements of three-level hierarchical structure in the outliers in the spectrum of deepnet Hessians. arXiv preprint arXiv:1901.08244, 2019

[55] Barak A. Pearlmutter. Fast exact multiplication by the Hessian. Neural Comput., 6(1):147–160, January 1994

[56] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011

[57] Akshay Rangamani, Nam H Nguyen, Abhishek Kumar, Dzung Phan, Sang H Chin, and Trac D Tran. A scale invariant flatness measure for deep network minima. arXiv preprint arXiv:1902.02434, 2019

[58] C. Radhakrishna Rao. Information and the accuracy attainable in the estimation of statistical parameters. Bulletin of the Calcutta Mathematical Society, pages 81–89, 1945

[59] Sashank J. Reddi, Satyen Kale, and Sanjiv Kumar. On the convergence of Adam and beyond. In ICLR, 2018

[60] Herbert Robbins and Sutton Monro. A stochastic approximation method. Ann. Math. Statist., 22(3):400–407, September 1951

[61] Walter Rudin. Principles of mathematical analysis. McGraw-Hill Book Co., New York, third edition. International Series in Pure and Applied Mathematics

[63] Itay Safran and Ohad Shamir. Spurious local minima are common in two-layer ReLU neural networks. In ICML, 2018

[64] Levent Sagun, Léon Bottou, and Yann LeCun. Eigenvalues of the Hessian in deep learning: Singularity and beyond. arXiv:1611.07476, 2016

[65] Levent Sagun, Utku Evci, V. Ugur Güney, Yann Dauphin, and Léon Bottou. Empirical analysis of the Hessian of over-parametrized neural networks. arXiv:1706.04454, 2017

[66] Pascal Sebah and Xavier Gourdon. Introduction to the gamma function. 2002

[67] Ohad Shamir. Exponential convergence time of gradient descent for one-dimensional deep linear neural networks. arXiv:1809.08587, 2018

[68] Samuel L. Smith and Quoc V. Le. A Bayesian perspective on generalization and stochastic gradient descent. In ICLR, 2018

[69] Matthew Staib, Sashank Reddi, Satyen Kale, Sanjiv Kumar, and Suvrit Sra. Escaping saddle points with adaptive gradient methods. In ICML, pages 5956–5965, 2019

[70] Yusuke Tsuzuku, Issei Sato, and Masashi Sugiyama. Normalized flat minima: Exploring scale invariant definition of flat minima for neural networks using PAC-Bayesian analysis. arXiv preprint arXiv:1901.04653, 2019

[71] Daniel Vainsencher, Han Liu, and Tong Zhang. Local smoothness in variance reduced optimization. In Advances in Neural Information Processing Systems 28, pages 2179–2187, 2015

[72] Rocio Vargas, Amir Mosavi, and Ramon Ruiz. Deep learning: A review. Advances in Intelligent Systems and Computing, 5, August 2017

[73] Roman Vershynin. Introduction to the non-asymptotic analysis of random matrices. Cambridge University Press, 2012

[74] Roman Vershynin. High-dimensional probability: An introduction with applications in data science, volume 47. Cambridge University Press, 2018

[75] Larry Wasserman. All of Statistics: A Concise Course in Statistical Inference. Springer Publishing Company, Incorporated, 2010

[76] David Williams. Probability with martingales. Cambridge University Press, 1991

[77] Chen Xing, Devansh Arpit, Christos Tsirigotis, and Y. Bengio. A walk with SGD. arXiv:1802.08770, February 2018

[78] Mingyang Yi, Qi Meng, Wei Chen, Zhi-ming Ma, and Tie-Yan Liu. Positively scale-invariant flatness of ReLU neural networks. arXiv preprint arXiv:1903.02237, 2019

[79] Y. Yu, T. Wang, and R. J. Samworth. A useful variant of the Davis–Kahan theorem for statisticians. Biometrika, 102(2):315–323, April 2015

[80] Chulhee Yun, Suvrit Sra, and Ali Jadbabaie. Small nonlinearities in activation functions create bad local minima in neural networks. arXiv:1802.03487, February 2018

Showing first 80 references.
