A modern look at the relationship between sharpness and generalization
5 Pith papers cite this work.
fields: cs.LG (5)
years: 2026 (5)
representative citing papers
Muon outperforms Adam by reducing curvature penalty via lower Normalized Directional Sharpness, as shown via Taylor approximation on LLM training and proven on stylized quadratic problems with heterogeneous curvature.
citing papers explorer
- Beyond Structural Symmetries: Linear Mode Connectivity via Neuron Identifiability
Neural networks admit large families of approximately equivalent solutions via neuron identifiability even without structural symmetry, enabling linear low-loss merging paths without prior alignment.
- Pointwise Generalization in Deep Neural Networks
Proposes pointwise Riemannian Dimension from feature eigenvalues to derive tighter, representation-aware generalization bounds for deep networks in the nonlinear regime.
- Beyond Perplexity: A Geometric and Spectral Study of Low-Rank Pre-Training
Low-rank pre-training methods converge to geometrically and spectrally distinct basins and show diverging activations compared to full-rank training at 60M-350M scales.
- Inside-Out: Measuring Generalization in Vision Transformers Through Inner Workings
Circuit-based metrics from Vision Transformer internals provide better label-free proxies for generalization under distribution shift than existing methods like model confidence.