SignSGD provably beats SGD by a factor of d under sparse noise via matched ℓ1-norm upper and lower bounds, with an equivalent result for Muon on matrices, and this predicts faster GPT-2 pretraining.
hub Canonical reference
Empirical Analysis of the Hessian of Over-Parametrized Neural Networks
Canonical reference. 100% of citing Pith papers cite this work as background.
abstract
We study the properties of common loss surfaces through their Hessian matrix. In particular, in the context of deep learning, we empirically show that the spectrum of the Hessian is composed of two parts: (1) the bulk, centered near zero, and (2) outliers away from the bulk. We present numerical evidence and mathematical justifications for the following conjectures laid out by Sagun et al. (2016): fixing the data, increasing the number of parameters merely scales the bulk of the spectrum; fixing the dimension and changing the data (for instance, adding more clusters or making the data less separable) only affects the outliers. We believe that our observations have striking implications for non-convex optimization in high dimensions. First, the flatness of such landscapes (which can be measured by the singularity of the Hessian) implies that classical notions of basins of attraction may be quite misleading, and that the discussion of wide/narrow basins may need a new perspective centered on over-parametrization and redundancy, which can create large connected components at the bottom of the landscape. Second, the dependence of a small number of large eigenvalues on the data distribution can be linked to the spectrum of the covariance matrix of gradients of model outputs. With this in mind, we may reevaluate the connections within the data-architecture-algorithm framework of a model, hoping that this would shed light on the geometry of high-dimensional and non-convex spaces in modern applications. In particular, we present a case that links the two observations: small- and large-batch gradient descent appear to converge to different basins of attraction, but we show that they are in fact connected through their flat region and so belong to the same basin.
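The bulk-plus-outliers picture in the abstract can be illustrated on a toy problem. The sketch below is not the paper's experiment: it assumes a linear model with mean-squared-error loss, for which the Hessian is exactly X^T X / n and coincides with the uncentered second moment of the gradients of the model output. The helper names (`linear_mse_hessian`, `make_clustered_inputs`) and all sizes are hypothetical choices for illustration. With more parameters than samples (d > n), at least d - n eigenvalues are zero, standing in for the flat bulk, while the nonzero eigenvalues depend only on the data.

```python
import numpy as np


def linear_mse_hessian(X):
    """Hessian of L(w) = ||X w - y||^2 / (2 n) for a linear model.

    For this toy loss the Hessian is X^T X / n, independent of w and y.
    """
    n = X.shape[0]
    return X.T @ X / n


def make_clustered_inputs(n, d, k, spread, rng):
    """Draw n points in d dimensions around k random cluster centers (toy data)."""
    centers = spread * rng.standard_normal((k, d))
    labels = rng.integers(0, k, size=n)
    return centers[labels] + rng.standard_normal((n, d))


rng = np.random.default_rng(0)
n, d, k = 20, 50, 3  # over-parametrized: more parameters (d) than samples (n)

X = make_clustered_inputs(n, d, k, spread=5.0, rng=rng)
H = linear_mse_hessian(X)

# eigvalsh returns the eigenvalues of a symmetric matrix in ascending order.
eigs = np.linalg.eigvalsh(H)

# Eigenvalues that are numerically zero correspond to flat directions.
num_zero = int(np.sum(np.abs(eigs) < 1e-8))

print("parameters:", d, "samples:", n)
print("numerically zero eigenvalues:", num_zero)
print("five largest eigenvalues:", eigs[-5:])
```

Changing `k` or `spread` alters only the nonzero part of the spectrum, while increasing `d` at fixed data adds more zero eigenvalues. A real network's Hessian also contains a term involving second derivatives of the model output, which this linear sketch omits.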
fields
cs.LG 29, math.OC 2, quant-ph 2, cond-mat.dis-nn 1, cond-mat.stat-mech 1, cs.CR 1, cs.IT 1, stat.ML 1
roles
background 5
polarities
background 5
representative citing papers
PCD is a new gradient-based optimizer for hierarchical multi-objective problems that prioritizes primary descent with minimal controlled distortion for secondary objectives via a single tau parameter.
Dead-Direction Signatures provide closed-form spectral readings of dead directions in network activations and gradients that track rank deficits at singular minima, offering a cheap directional alternative to SGLD-based LLC.
The normalized inverse-scale direction of LayerNorm's affine parameters is an exact algebraic kernel of the post-final-norm centred activation covariance for any input distribution in LayerNorm transformers.
Introduces thermodynamic free-energy signatures and spectral form factors from attention Laplacians for hallucination detection, with stability proofs, expressiveness results, a PAC bound, and empirical AUROC gains over baselines.
Muon outperforms Adam by reducing curvature penalty via lower Normalized Directional Sharpness, as shown via Taylor approximation on LLM training and proven on stylized quadratic problems with heterogeneous curvature.
Non-normality in linearized optimizer update operators yields a pseudospectral bound where κ(V) warns of transient amplification before spectral radius indicates instability.
Depth induces an implicit low-rank bias in deep unconstrained feature models trained with unregularized multiclass cross-entropy, promoting softmax codes over neural collapse via more efficient norm propagation.
AMUSE is a new optimizer integrating Muon orthogonalization with Schedule-Free averaging via adaptive interpolation for schedule-free anytime training that improves Pareto frontiers on vision and LLM tasks.
Backdoors can be embedded in ResNet and ViT models as statistically indistinguishable latent directions, reducing cryptographic undetectability to an intractable hypothesis test over parameter distributions.
Hessian Surgery perturbs trained model weights along Hessian spike eigenvectors via a sensitivity matrix and constrained optimization to rebalance per-class accuracy on CIFAR-10 and ISIC-2019 without retraining.
Permutation symmetries generate permutation saddles and equal-loss valleys linking equivalent global minima, yielding a lower bound on symmetry-induced critical points.
Optimistic bilevel optimization with manifold lower-level minimizers is differentiable if the optimistic selection is unique, yielding a pseudoinverse hyper-gradient and a convergent HG-MS algorithm whose rate depends on intrinsic manifold dimension.
FGN is a positive semidefinite under-approximation of the multiclass GGN obtained by exact decomposition into true-vs-rest and within-competitor terms, exact for binary classification and implemented via matrix-free conjugate gradient on a whitened row-space system.
The spectral edge transitions from a gradient-driven functional direction before grokking to a perturbation-flat, ablation-critical compression axis at grokking, forming three universality classes predicted by a gap flow equation.
Elitist (1+M) genetic algorithms follow the loss gradient via mutation-selection, slowed only by noise in the effective-rank directions of the Hessian rather than the full parameter count.
Presents a stochastic gradient algorithm for non-separable optimization with local convergence guarantees under smoothness assumptions.
Derives second-order path-kernel interpolation formulas for gradient descent, SGD, and momentum training, adding curvature terms and a concentration estimate around the expected prediction.
A geometric classification of stationary points on neuron-splitting plateaus in two-layer NN loss landscapes using the inner Hessian.
Momentum in Muon functions as a spectral filter on signal-plus-perturbation gradients, enlarging the gap to stabilize singular subspaces before orthogonalization and outperforming the reverse order.
Worker-average gaps in Local SGD serve as a Hessian-free estimator of the dominant sharp subspace by capturing gradient alignment with high-curvature directions.
Stochastic layer-wise Hessian trace estimator using Hutchinson method and Hessian-vector products detects label memorization in CNNs with high empirical power.
Spectral gaps in the Gram matrix of parameter updates control phase transitions such as grokking in neural network training.
CrispEdit edits LLMs via low-curvature projections using Bregman divergence and K-FAC approximations, achieving high edit success with under 1% average capability degradation.