An Investigation into Neural Net Optimization via Hessian Eigenvalue Density

Behrooz Ghorbani; Shankar Krishnan; Ying Xiao

arxiv: 1901.10159 · v1 · pith:NCJLQUIFnew · submitted 2019-01-29 · 💻 cs.LG · stat.ML


classification 💻 cs.LG stat.ML

keywords networks optimization hessian neural spectrum deep effects entire


To understand the dynamics of optimization in deep neural networks, we develop a tool to study the evolution of the entire Hessian spectrum throughout the optimization process. Using this, we study a number of hypotheses concerning smoothness, curvature, and sharpness in the deep learning literature. We then thoroughly analyze a crucial structural feature of the spectra: in non-batch normalized networks, we observe the rapid appearance of large isolated eigenvalues in the spectrum, along with a surprising concentration of the gradient in the corresponding eigenspaces. In batch normalized networks, these two effects are almost absent. We characterize these effects, and explain how they affect optimization speed through both theory and experiments. As part of this work, we adapt advanced tools from numerical linear algebra that allow scalable and accurate estimation of the entire Hessian spectrum of ImageNet-scale neural networks; this technique may be of independent interest in other applications.
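The abstract does not name the numerical linear algebra tools it adapts. One standard technique for estimating a full eigenvalue density from matrix-vector products alone is stochastic Lanczos quadrature, and the sketch below illustrates that idea under stated assumptions. It is not the authors' implementation. The toy matrix `H`, the function names `lanczos` and `spectral_density`, and all parameter values are hypothetical. In a real network, `matvec` would be a Hessian-vector product computed by automatic differentiation rather than a product with an explicit matrix.

```python
import numpy as np

def lanczos(matvec, dim, num_steps, rng):
    """Run Lanczos from a random unit vector; return the tridiagonal matrix T."""
    v = rng.standard_normal(dim)
    v /= np.linalg.norm(v)
    basis = [v]
    alphas, betas = [], []
    for j in range(num_steps):
        w = matvec(basis[j])
        alphas.append(basis[j] @ w)
        # Full reorthogonalization against every stored Lanczos vector.
        for q in basis:
            w -= (q @ w) * q
        beta = np.linalg.norm(w)
        if j == num_steps - 1 or beta < 1e-10:
            break
        betas.append(beta)
        basis.append(w / beta)
    return np.diag(alphas) + np.diag(betas, 1) + np.diag(betas, -1)

def spectral_density(matvec, dim, grid, num_steps=30, num_probes=10,
                     sigma=0.1, seed=0):
    """Estimate the eigenvalue density on `grid`, smoothed by Gaussians."""
    rng = np.random.default_rng(seed)
    density = np.zeros_like(grid)
    for _ in range(num_probes):
        T = lanczos(matvec, dim, num_steps, rng)
        nodes, vecs = np.linalg.eigh(T)
        weights = vecs[0, :] ** 2  # quadrature weights; they sum to 1
        for theta, wgt in zip(nodes, weights):
            density += wgt * np.exp(-(grid - theta) ** 2 / (2 * sigma ** 2)) \
                / (sigma * np.sqrt(2 * np.pi))
    return density / num_probes

# Hypothetical stand-in for a Hessian: a bulk of eigenvalues in [-1, 1]
# plus one large isolated eigenvalue at 10.
dim = 200
rng0 = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng0.standard_normal((dim, dim)))
eigs = np.concatenate([rng0.uniform(-1.0, 1.0, dim - 1), [10.0]])
H = (Q * eigs) @ Q.T

grid = np.linspace(-3.0, 13.0, 801)
density = spectral_density(lambda v: H @ v, dim, grid)
dx = grid[1] - grid[0]
total_mass = density.sum() * dx
outlier_mass = density[(grid > 9.0) & (grid < 11.0)].sum() * dx
```

In this toy setup the estimated density shows a bulk near zero and a small separate bump near 10, mirroring the kind of isolated eigenvalue the abstract describes. Each probe costs only `num_steps` matrix-vector products, which is why an approach of this type can scale to large models.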


Forward citations

Cited by 7 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Hierarchical Muon: Tiled Newton-Schulz Updates for Efficient Muon Optimization
math.NA 2026-06 unverdicted novelty 7.0

HiMuon partitions momentum-gradient matrices into T x T tiles, runs independent Newton-Schulz iterations on each tile, and reassembles the results, reducing leading cost to O(H W T K) while defining a local rather tha...
Non-normal spectral signatures of instability in neural network training dynamics
cs.LG 2026-05 unverdicted novelty 7.0

Non-normality in linearized optimizer update operators yields a pseudospectral bound where κ(V) warns of transient amplification before spectral radius indicates instability.
Exposing the Illusion of Erasure in Knowledge Editing for LLMs
cs.LG 2026-06 unverdicted novelty 6.0

Knowledge editing methods redistribute and suppress rather than overwrite facts in LLMs, creating narrow vulnerable regions in representation space that adversarial prompts can exploit.
Language Models (Mostly) Know What They Know
cs.CL 2022-07 unverdicted novelty 6.0

Language models show good calibration when asked to estimate the probability that their own answers are correct, with performance improving as models get larger.
A General Language Assistant as a Laboratory for Alignment
cs.CL 2021-12 conditional novelty 6.0

Ranked preference modeling outperforms imitation learning for language model alignment and scales more favorably with model size.
Scaling Laws for Transfer
cs.LG 2021-02 unverdicted novelty 6.0

Effective data transferred from pre-training to fine-tuning is described by a power law in model parameter count and fine-tuning dataset size, acting like a multiplier on the fine-tuning data.
Hessian based analysis of SGD for Deep Nets: Dynamics and Generalization
cs.LG 2019-07 unverdicted novelty 4.0

Provides Hessian-based theoretical characterizations of SGD dynamics and a scale-invariant generalization bound for deep nets, backed by experiments on synthetic data, MNIST, and CIFAR-10.