Eigenvalues of the Hessian in Deep Learning: Singularity and Beyond
We look at the eigenvalues of the Hessian of a loss function before and after training. The eigenvalue distribution is seen to be composed of two parts, the bulk which is concentrated around zero, and the edges which are scattered away from zero. We present empirical evidence for the bulk indicating how over-parametrized the system is, and for the edges that depend on the input data.
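The bulk-and-edges picture can be illustrated with a toy sketch. This is not the paper's experimental setup: it assumes a linear least-squares model with more parameters than data points, where the Hessian is known in closed form, and the sizes, random inputs, tolerance, and the helper name `hessian_spectrum` are all hypothetical choices made for illustration.

```python
import numpy as np

def hessian_spectrum(n_samples=20, n_params=100, seed=0):
    """Eigenvalues of the Hessian of a mean-squared-error loss for a linear model.

    Toy assumption: L(w) = (1 / 2n) * ||X w - y||^2, whose Hessian is
    X^T X / n and does not depend on w or y.
    """
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n_samples, n_params))  # synthetic inputs (assumption)
    H = X.T @ X / n_samples                     # n_params x n_params Hessian
    return np.linalg.eigvalsh(H)                # real eigenvalues, ascending

eigs = hessian_spectrum()
tol = 1e-8  # hypothetical threshold separating "zero" from "non-zero"
bulk = int(np.sum(np.abs(eigs) < tol))   # eigenvalues concentrated at zero
edges = eigs[np.abs(eigs) >= tol]        # eigenvalues scattered away from zero
print(f"{bulk} eigenvalues near zero, {edges.size} away from zero")
```

In this toy model the Hessian has rank at most `n_samples`, so the number of near-zero eigenvalues grows with the excess of parameters over data points, while the non-zero eigenvalues are determined entirely by the inputs `X`. Whether the same counting carries over to a deep network is an empirical question that the abstract addresses and this sketch does not.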
Forward citations
Cited by 12 Pith papers
-
The Implicit Bias of Depth: From Neural Collapse to Softmax Codes
Depth induces an implicit low-rank bias in deep unconstrained feature models trained with unregularized multiclass cross-entropy, promoting softmax codes over neural collapse via more efficient norm propagation.
-
AMUSE: Anytime Muon with Stable Gradient Evaluation
AMUSE is a new optimizer integrating Muon orthogonalization with Schedule-Free averaging via adaptive interpolation for schedule-free anytime training that improves Pareto frontiers on vision and LLM tasks.
-
Backdoor Channels Hidden in Latent Space: Cryptographic Undetectability in Modern Neural Networks
Backdoors can be realized as statistically natural latent directions in modern neural networks, achieving high attack success with negligible clean accuracy loss and resisting existing defenses.
-
Weight-space symmetry in deep networks gives rise to permutation saddles, connected by equal-loss valleys across the loss landscape
Permutation symmetries generate permutation saddles and equal-loss valleys linking equivalent global minima, yielding a lower bound on symmetry-induced critical points.
-
Mechanistic Anomaly Detection via Functional Attribution
Functional attribution with influence functions detects anomalous mechanisms in neural networks, achieving SOTA backdoor detection (average DER 0.93) on vision benchmarks and improvements on LLMs.
-
GIST: Targeted Data Selection for Instruction Tuning via Coupled Optimization Geometry
GIST recovers a task-specific low-dimensional subspace from validation gradients using SVD and scores training examples by their alignment within this coupled subspace for LoRA-based instruction tuning.
-
Low Rank Based Subspace Inference for the Laplace Approximation of Bayesian Neural Networks
Derives an optimal low-rank subspace for the Laplace approximation of Bayesian neural networks, and provides a scalable, better-performing variant and a new comparison metric.
-
AdaMeZO: Adam-style Zeroth-Order Optimizer for LLM Fine-tuning Without Maintaining the Moments
AdaMeZO adapts Adam moment estimates to zeroth-order LLM fine-tuning without extra memory storage, outperforming MeZO with up to 70% fewer forward passes.
-
Wolkowicz-Styan Upper Bound on the Hessian Eigenspectrum for Cross-Entropy Loss in Nonlinear Smooth Neural Networks
A closed-form upper bound on the maximum Hessian eigenvalue of cross-entropy loss is derived for smooth nonlinear neural networks.
-
RMNP: Row-Momentum Normalized Preconditioning for Scalable Matrix-Based Optimization
RMNP preconditions matrix updates via row-wise L2 normalization instead of Newton-Schulz iteration, reducing complexity to O(mn) while matching Muon's non-convex convergence rate and empirical performance.
-
On the Convergence Analysis of Muon
Convergence analysis shows Muon outperforms gradient descent by exploiting low-rank structure in neural network Hessians.
-
Hessian based analysis of SGD for Deep Nets: Dynamics and Generalization
Provides Hessian-based theoretical characterizations of SGD dynamics and a scale-invariant generalization bound for deep nets, backed by experiments on synthetic data, MNIST, and CIFAR-10.