Muon outperforms Adam by reducing curvature penalty via lower Normalized Directional Sharpness, as shown via Taylor approximation on LLM training and proven on stylized quadratic problems with heterogeneous curvature.
Towards quantifying the hessian structure of neural networks.arXiv preprint arXiv:2505.02809
5 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
roles
background 1polarities
background 1representative citing papers
Periodic outer-momentum restarts in two-phase optimizers exploit phase cancellation in a linearized NTK model to widen stable learning-rate and momentum ranges in language-model pretraining.
HETA is a new attribution framework for decoder-only LLMs that combines semantic transition vectors, Hessian-based sensitivity scores, and KL divergence to produce more faithful and human-aligned token attributions than prior methods.
RMNP preconditions matrix updates via row-wise L2 normalization instead of Newton-Schulz iteration, reducing complexity to O(mn) while matching Muon's non-convex convergence rate and empirical performance.
Convergence analysis shows Muon outperforms gradient descent by exploiting low-rank structure in neural network Hessians.
citing papers explorer
No citing papers match the current filters.