3 Pith papers cite this work.
Fields: cs.LG (3)
Years: 2026 (3)
Citing papers
- Why Muon Outperforms Adam: A Curvature Perspective
  Muon outperforms Adam by reducing curvature penalty via lower Normalized Directional Sharpness, as shown via Taylor approximation on LLM training and proven on stylized quadratic problems with heterogeneous curvature.
- How Should LLMs Consume High-Quality Data? Optimal Data Scheduling via Quality-Aware Functional Scaling Laws
  Extends functional scaling laws with data quality to derive optimal joint scheduling, proposing Drop-Stable-Rampup that improves accuracy by +1.70 over WSD and +2.98 over cosine decay on a 15B MoE model.
- Negligible in Size, Significant in Effect: On Scale Vectors in Large Language Models
  Scale vectors in Pre-Norm LLMs aid optimization via preconditioning on linear layers rather than expressivity, and three lightweight modifications to them reduce terminal loss across model scales.