Empirical Analysis of the Hessian of Over-Parametrized Neural Networks

· 2017 · cs.LG · arXiv 1706.04454

Canonical reference. 100% of citing Pith papers cite this work as background.

23 Pith papers citing it

Background 100% of classified citations

abstract

We study the properties of common loss surfaces through their Hessian matrix. In particular, in the context of deep learning, we empirically show that the spectrum of the Hessian is composed of two parts: (1) a bulk centered near zero and (2) outliers away from the bulk. We present numerical evidence and mathematical justifications for the following conjectures laid out by Sagun et al. (2016): fixing the data, increasing the number of parameters merely scales the bulk of the spectrum; fixing the dimension and changing the data (for instance, adding more clusters or making the data less separable) only affects the outliers. We believe that our observations have striking implications for non-convex optimization in high dimensions. First, the flatness of such landscapes (which can be measured by the singularity of the Hessian) implies that classical notions of basins of attraction may be quite misleading, and that the discussion of wide/narrow basins may need a new perspective centered on over-parametrization and redundancy, which are able to create large connected components at the bottom of the landscape. Second, the dependence of a small number of large eigenvalues on the data distribution can be linked to the spectrum of the covariance matrix of gradients of model outputs. With this in mind, we may reevaluate the connections within the data-architecture-algorithm framework of a model, hoping that this would shed light on the geometry of high-dimensional and non-convex spaces in modern applications. In particular, we present a case that links the two observations: small- and large-batch gradient descent appear to converge to different basins of attraction, but we show that they are in fact connected through their flat region and so belong to the same basin.
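The bulk-plus-outliers picture in the abstract can be illustrated with a minimal sketch. It assumes a linear least-squares model rather than the deep networks studied in the paper, because the Hessian is then available in closed form: it equals the uncentered second-moment matrix of the output gradients, which for a linear model are just the inputs. The function name `spectrum_counts`, the problem sizes, the noise scales, and the two thresholds are all hypothetical choices made for this illustration, not values from the paper.

```python
import numpy as np


def spectrum_counts(n_params=50, n_samples=20, n_clusters=3, seed=0):
    """Count zero, small nonzero, and outlier Hessian eigenvalues (toy linear model)."""
    rng = np.random.default_rng(seed)

    # Clustered inputs: well-separated cluster means plus small within-cluster noise.
    means = 5.0 * rng.standard_normal((n_clusters, n_params))
    labels = np.arange(n_samples) % n_clusters
    X = means[labels] + 0.1 * rng.standard_normal((n_samples, n_params))

    # For f(x; w) = w @ x and loss (1 / 2N) * sum_i (f(x_i; w) - y_i)^2,
    # the Hessian in w is X^T X / N, independent of w and of the targets y.
    hessian = X.T @ X / n_samples
    eigvals = np.linalg.eigvalsh(hessian)

    # Illustrative thresholds (assumptions): numerical zero and "far from the bulk".
    zero_tol, outlier_tol = 1e-8, 1.0
    n_zero = int(np.sum(np.abs(eigvals) < zero_tol))
    n_outliers = int(np.sum(eigvals > outlier_tol))
    n_small = eigvals.size - n_zero - n_outliers
    return n_zero, n_small, n_outliers


if __name__ == "__main__":
    zeros, small, outliers = spectrum_counts()
    print(f"zero: {zeros}, small nonzero: {small}, outliers: {outliers}")
```

Because the Hessian here has rank at most the number of samples, over-parametrization (50 parameters, 20 samples) forces at least 30 zero eigenvalues, a crude stand-in for the flat directions the abstract describes. The large eigenvalues come from the cluster means, so changing `n_clusters` changes the outlier count while leaving the zero count untouched. This toy setting only mirrors the qualitative claim; it says nothing about how the spectrum behaves in nonlinear networks.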

citation-role summary

background 5

citation-polarity summary

background 5

representative citing papers

When and Why SignSGD Outperforms SGD: A Theoretical Study Based on $\ell_1$-norm Lower Bounds

cs.LG · 2026-05-07 · unverdicted · novelty 8.0

SignSGD provably beats SGD by a factor of d under sparse noise via matched ℓ1-norm upper and lower bounds, with an equivalent result for Muon on matrices, and this predicts faster GPT-2 pretraining.

Non-normal spectral signatures of instability in neural network training dynamics

cs.LG · 2026-05-22 · unverdicted · novelty 7.0

Non-normality in linearized optimizer update operators yields a pseudospectral bound where κ(V) warns of transient amplification before spectral radius indicates instability.

The Implicit Bias of Depth: From Neural Collapse to Softmax Codes

cs.LG · 2026-05-21 · unverdicted · novelty 7.0

Depth induces an implicit low-rank bias in deep unconstrained feature models trained with unregularized multiclass cross-entropy, promoting softmax codes over neural collapse via more efficient norm propagation.

AMUSE: Anytime Muon with Stable Gradient Evaluation

cs.LG · 2026-05-21 · unverdicted · novelty 7.0

AMUSE is a new optimizer integrating Muon orthogonalization with Schedule-Free averaging via adaptive interpolation for schedule-free anytime training that improves Pareto frontiers on vision and LLM tasks.

Backdoor Channels Hidden in Latent Space: Cryptographic Undetectability in Modern Neural Networks

cs.CR · 2026-05-13 · unverdicted · novelty 7.0

Backdoors can be realized as statistically natural latent directions in modern neural networks, achieving high attack success with negligible clean accuracy loss and resisting existing defenses.

Hessian Surgery: Class-Targeted Post-Hoc Rebalancing via Hessian Spike Perturbation

cs.LG · 2026-05-08 · unverdicted · novelty 7.0 · 2 refs

Hessian Surgery perturbs trained model weights along Hessian spike eigenvectors via a sensitivity matrix and constrained optimization to rebalance per-class accuracy on CIFAR-10 and ISIC-2019 without retraining.

Weight-space symmetry in deep networks gives rise to permutation saddles, connected by equal-loss valleys across the loss landscape

cs.LG · 2019-07-05 · conditional · novelty 7.0

Permutation symmetries generate permutation saddles and equal-loss valleys linking equivalent global minima, yielding a lower bound on symmetry-induced critical points.

Select-then-differentiate: Solving Bilevel Optimization with Manifold Lower-level Solution Sets

math.OC · 2026-05-09 · unverdicted · novelty 7.0

Optimistic bilevel optimization with manifold lower-level minimizers is differentiable if the optimistic selection is unique, yielding a pseudoinverse hyper-gradient and a convergent HG-MS algorithm whose rate depends on intrinsic manifold dimension.

Fast Gauss-Newton for Multiclass Cross-Entropy

cs.LG · 2026-05-07 · unverdicted · novelty 7.0

FGN is a positive semidefinite under-approximation of the multiclass GGN obtained by exact decomposition into true-vs-rest and within-competitor terms, exact for binary classification and implemented via matrix-free conjugate gradient on a whitened row-space system.

The Lifecycle of the Spectral Edge: From Gradient Learning to Weight-Decay Compression

cs.LG · 2026-04-08 · unverdicted · novelty 7.0

The spectral edge transitions from a gradient-driven functional direction before grokking to a perturbation-flat, ablation-critical compression axis at grokking, forming three universality classes predicted by a gap flow equation.

Spectral Edge Dynamics: An Analytical-Empirical Study of Phase Transitions in Neural Network Training

cs.LG · 2026-03-30 · unverdicted · novelty 6.0

Spectral gaps in the Gram matrix of parameter updates control phase transitions such as grokking in neural network training.

CrispEdit: Low-Curvature Projections for Scalable Non-Destructive LLM Editing

cs.LG · 2026-02-17 · unverdicted · novelty 6.0

CrispEdit edits LLMs via low-curvature projections using Bregman divergence and K-FAC approximations, achieving high edit success with under 1% average capability degradation.

Selection Plateau and a Sparsity-Dependent Hierarchy of Pruning Features

cs.LG · 2026-05-10 · unverdicted · novelty 6.0

All rank-monotone pruning scorers converge to identical accuracy at fixed sparsity, but non-monotone features with sparsity-dependent complexity can escape this plateau, as shown by the SICS hypothesis on ViT-Small/CIFAR-10.

Quantum Tilted Loss in Variational Optimization: Theory and Applications

quant-ph · 2026-05-04 · unverdicted · novelty 6.0

QTL unifies expectation-value minimization with CVaR and Gibbs heuristics under one tunable operator, amplifying gradients in structured cases while preserving global minima and shifting the bottleneck to measurement variance.

Generalization at the Edge of Stability

cs.LG · 2026-04-21 · unverdicted · novelty 6.0

Training at the edge of stability causes neural network optimizers to converge on fractal attractors whose effective dimension, measured via a new sharpness dimension from the Hessian spectrum, bounds generalization error in a way not captured by prior trace or norm measures.

Exploiting Correlations in Federated Learning: Opportunities and Practical Limitations

cs.IT · 2026-04-16 · unverdicted · novelty 6.0

A correlation-based taxonomy unifies existing FL compression methods, experiments show correlation strengths vary by task and architecture, and adaptive mode-switching designs are proposed to exploit this.

Escape dynamics and implicit bias of one-pass SGD in overparameterized quadratic networks

cond-mat.dis-nn · 2026-04-03 · unverdicted · novelty 6.0

In overparameterized quadratic networks, one-pass SGD escapes generalization plateaus only modestly faster and selects the initialization-closest zero-loss solution due to a conserved quantity in the overlap ODEs.

Anytime Training with Schedule-Free Spectral Optimization

cs.LG · 2026-05-21 · unverdicted · novelty 5.0

SF-NorMuon is a new schedule-free spectral optimizer that closes the gap with tuned AdamW on 125M-772M parameter models across 1-8x Chinchilla horizons while providing stationarity guarantees.

RMNP: Row-Momentum Normalized Preconditioning for Scalable Matrix-Based Optimization

cs.LG · 2026-03-20 · conditional · novelty 5.0

RMNP preconditions matrix updates via row-wise L2 normalization instead of Newton-Schulz iteration, reducing complexity to O(mn) while matching Muon's non-convex convergence rate and empirical performance.

On the Convergence Analysis of Muon

stat.ML · 2025-05-29 · unverdicted · novelty 5.0

Convergence analysis shows Muon outperforms gradient descent by exploiting low-rank structure in neural network Hessians.

Gradient Noise Convolution (GNC): Smoothing Loss Function for Distributed Large-Batch SGD

cs.LG · 2019-06-26 · unverdicted · novelty 5.0

GNC convolves stochastic gradient noise to smooth sharp minima in large-batch SGD, outperforming isotropic noise for better generalization in distributed deep learning.

Hessian based analysis of SGD for Deep Nets: Dynamics and Generalization

cs.LG · 2019-07-24 · unverdicted · novelty 4.0

Provides Hessian-based theoretical characterizations of SGD dynamics and a scale-invariant generalization bound for deep nets, backed by experiments on synthetic data, MNIST, and CIFAR-10.

Depth, Not Data: An Analysis of Hessian Spectral Bifurcation

cs.LG · 2026-01-31

citing papers explorer

Showing 23 of 23 citing papers.

When and Why SignSGD Outperforms SGD: A Theoretical Study Based on $\ell_1$-norm Lower Bounds · cs.LG · 2026-05-07 · unverdicted · none · ref 27
Non-normal spectral signatures of instability in neural network training dynamics · cs.LG · 2026-05-22 · unverdicted · none · ref 8 · internal anchor
The Implicit Bias of Depth: From Neural Collapse to Softmax Codes · cs.LG · 2026-05-21 · unverdicted · none · ref 103 · internal anchor
AMUSE: Anytime Muon with Stable Gradient Evaluation · cs.LG · 2026-05-21 · unverdicted · none · ref 5 · internal anchor
Backdoor Channels Hidden in Latent Space: Cryptographic Undetectability in Modern Neural Networks · cs.CR · 2026-05-13 · unverdicted · none · ref 28 · internal anchor
Hessian Surgery: Class-Targeted Post-Hoc Rebalancing via Hessian Spike Perturbation · cs.LG · 2026-05-08 · unverdicted · none · ref 1 · 2 links · internal anchor
Weight-space symmetry in deep networks gives rise to permutation saddles, connected by equal-loss valleys across the loss landscape · cs.LG · 2019-07-05 · conditional · none · ref 28 · internal anchor
Select-then-differentiate: Solving Bilevel Optimization with Manifold Lower-level Solution Sets · math.OC · 2026-05-09 · unverdicted · none · ref 9
Fast Gauss-Newton for Multiclass Cross-Entropy · cs.LG · 2026-05-07 · unverdicted · none · ref 33
The Lifecycle of the Spectral Edge: From Gradient Learning to Weight-Decay Compression · cs.LG · 2026-04-08 · unverdicted · none · ref 14
Spectral Edge Dynamics: An Analytical-Empirical Study of Phase Transitions in Neural Network Training · cs.LG · 2026-03-30 · unverdicted · none · ref 21 · internal anchor
CrispEdit: Low-Curvature Projections for Scalable Non-Destructive LLM Editing · cs.LG · 2026-02-17 · unverdicted · none · ref 22 · internal anchor
Selection Plateau and a Sparsity-Dependent Hierarchy of Pruning Features · cs.LG · 2026-05-10 · unverdicted · none · ref 21
Quantum Tilted Loss in Variational Optimization: Theory and Applications · quant-ph · 2026-05-04 · unverdicted · none · ref 48
Generalization at the Edge of Stability · cs.LG · 2026-04-21 · unverdicted · none · ref 63
Exploiting Correlations in Federated Learning: Opportunities and Practical Limitations · cs.IT · 2026-04-16 · unverdicted · none · ref 32
Escape dynamics and implicit bias of one-pass SGD in overparameterized quadratic networks · cond-mat.dis-nn · 2026-04-03 · unverdicted · none · ref 49
Anytime Training with Schedule-Free Spectral Optimization · cs.LG · 2026-05-21 · unverdicted · none · ref 53 · internal anchor
RMNP: Row-Momentum Normalized Preconditioning for Scalable Matrix-Based Optimization · cs.LG · 2026-03-20 · conditional · none · ref 29 · internal anchor
On the Convergence Analysis of Muon · stat.ML · 2025-05-29 · unverdicted · none · ref 20 · internal anchor
Gradient Noise Convolution (GNC): Smoothing Loss Function for Distributed Large-Batch SGD · cs.LG · 2019-06-26 · unverdicted · none · ref 20 · internal anchor
Hessian based analysis of SGD for Deep Nets: Dynamics and Generalization · cs.LG · 2019-07-24 · unverdicted · none · ref 65 · internal anchor
Depth, Not Data: An Analysis of Hessian Spectral Bifurcation · cs.LG · 2026-01-31 · unreviewed · ref 8 · internal anchor
