pith. sign in

An Investigation into Neural Net Optimization via Hessian Eigenvalue Density

5 Pith papers cite this work. Polarity classification is still indexing.

5 Pith papers citing it
abstract

To understand the dynamics of optimization in deep neural networks, we develop a tool to study the evolution of the entire Hessian spectrum throughout the optimization process. Using this, we study a number of hypotheses concerning smoothness, curvature, and sharpness in the deep learning literature. We then thoroughly analyze a crucial structural feature of the spectra: in non-batch normalized networks, we observe the rapid appearance of large isolated eigenvalues in the spectrum, along with a surprising concentration of the gradient in the corresponding eigenspaces. In batch normalized networks, these two effects are almost absent. We characterize these effects, and explain how they affect optimization speed through both theory and experiments. As part of this work, we adapt advanced tools from numerical linear algebra that allow scalable and accurate estimation of the entire Hessian spectrum of ImageNet-scale neural networks; this technique may be of independent interest in other applications.

fields

cs.LG 3 cs.CL 2

representative citing papers

Scaling Laws for Transfer

cs.LG · 2021-02-02 · unverdicted · novelty 6.0

Effective data transferred from pre-training to fine-tuning is described by a power law in model parameter count and fine-tuning dataset size, acting like a multiplier on the fine-tuning data.

Language Models (Mostly) Know What They Know

cs.CL · 2022-07-11 · unverdicted · novelty 6.0

Language models show good calibration when asked to estimate the probability that their own answers are correct, with performance improving as models get larger.

citing papers explorer

Showing 5 of 5 citing papers.

  • Non-normal spectral signatures of instability in neural network training dynamics cs.LG · 2026-05-22 · unverdicted · none · ref 7 · internal anchor

    Non-normality in linearized optimizer update operators yields a pseudospectral bound where κ(V) warns of transient amplification before spectral radius indicates instability.

  • Scaling Laws for Transfer cs.LG · 2021-02-02 · unverdicted · none · ref 76 · internal anchor

    Effective data transferred from pre-training to fine-tuning is described by a power law in model parameter count and fine-tuning dataset size, acting like a multiplier on the fine-tuning data.

  • Language Models (Mostly) Know What They Know cs.CL · 2022-07-11 · unverdicted · none · ref 164

    Language models show good calibration when asked to estimate the probability that their own answers are correct, with performance improving as models get larger.

  • A General Language Assistant as a Laboratory for Alignment cs.CL · 2021-12-01 · conditional · none · ref 106

    Ranked preference modeling outperforms imitation learning for language model alignment and scales more favorably with model size.

  • Hessian based analysis of SGD for Deep Nets: Dynamics and Generalization cs.LG · 2019-07-24 · unverdicted · none · ref 20 · internal anchor

    Provides Hessian-based theoretical characterizations of SGD dynamics and a scale-invariant generalization bound for deep nets, backed by experiments on synthetic data, MNIST, and CIFAR-10.