An Investigation into Neural Net Optimization via Hessian Eigenvalue Density

An Investigation into Neural Net Optimization via Hessian Eigenvalue Density , Year = · 2019 · cs.LG · arXiv 1901.10159

5 Pith papers cite this work. Polarity classification is still indexing.

5 Pith papers citing it

open full Pith review browse 5 citing papers arXiv PDF

abstract

To understand the dynamics of optimization in deep neural networks, we develop a tool to study the evolution of the entire Hessian spectrum throughout the optimization process. Using this, we study a number of hypotheses concerning smoothness, curvature, and sharpness in the deep learning literature. We then thoroughly analyze a crucial structural feature of the spectra: in non-batch normalized networks, we observe the rapid appearance of large isolated eigenvalues in the spectrum, along with a surprising concentration of the gradient in the corresponding eigenspaces. In batch normalized networks, these two effects are almost absent. We characterize these effects, and explain how they affect optimization speed through both theory and experiments. As part of this work, we adapt advanced tools from numerical linear algebra that allow scalable and accurate estimation of the entire Hessian spectrum of ImageNet-scale neural networks; this technique may be of independent interest in other applications.

representative citing papers

Non-normal spectral signatures of instability in neural network training dynamics

cs.LG · 2026-05-22 · unverdicted · novelty 7.0

Non-normality in linearized optimizer update operators yields a pseudospectral bound where κ(V) warns of transient amplification before spectral radius indicates instability.

Scaling Laws for Transfer

cs.LG · 2021-02-02 · unverdicted · novelty 6.0

Effective data transferred from pre-training to fine-tuning is described by a power law in model parameter count and fine-tuning dataset size, acting like a multiplier on the fine-tuning data.

Language Models (Mostly) Know What They Know

cs.CL · 2022-07-11 · unverdicted · novelty 6.0

Language models show good calibration when asked to estimate the probability that their own answers are correct, with performance improving as models get larger.

A General Language Assistant as a Laboratory for Alignment

cs.CL · 2021-12-01 · conditional · novelty 6.0

Ranked preference modeling outperforms imitation learning for language model alignment and scales more favorably with model size.

Hessian based analysis of SGD for Deep Nets: Dynamics and Generalization

cs.LG · 2019-07-24 · unverdicted · novelty 4.0

Provides Hessian-based theoretical characterizations of SGD dynamics and a scale-invariant generalization bound for deep nets, backed by experiments on synthetic data, MNIST, and CIFAR-10.

citing papers explorer

Showing 5 of 5 citing papers.

Non-normal spectral signatures of instability in neural network training dynamics cs.LG · 2026-05-22 · unverdicted · none · ref 7 · internal anchor
Non-normality in linearized optimizer update operators yields a pseudospectral bound where κ(V) warns of transient amplification before spectral radius indicates instability.
Scaling Laws for Transfer cs.LG · 2021-02-02 · unverdicted · none · ref 76 · internal anchor
Effective data transferred from pre-training to fine-tuning is described by a power law in model parameter count and fine-tuning dataset size, acting like a multiplier on the fine-tuning data.
Language Models (Mostly) Know What They Know cs.CL · 2022-07-11 · unverdicted · none · ref 164
Language models show good calibration when asked to estimate the probability that their own answers are correct, with performance improving as models get larger.
A General Language Assistant as a Laboratory for Alignment cs.CL · 2021-12-01 · conditional · none · ref 106
Ranked preference modeling outperforms imitation learning for language model alignment and scales more favorably with model size.
Hessian based analysis of SGD for Deep Nets: Dynamics and Generalization cs.LG · 2019-07-24 · unverdicted · none · ref 20 · internal anchor
Provides Hessian-based theoretical characterizations of SGD dynamics and a scale-invariant generalization bound for deep nets, backed by experiments on synthetic data, MNIST, and CIFAR-10.

An Investigation into Neural Net Optimization via Hessian Eigenvalue Density

fields

years

verdicts

representative citing papers

citing papers explorer