AMUSE is an optimizer that combines Muon orthogonalization with Schedule-Free averaging through adaptive interpolation, enabling schedule-free anytime training and improving Pareto frontiers on vision and LLM tasks.
arXiv preprint arXiv:2511.20626
5 Pith papers cite this work. Polarity classification is still indexing.
fields: cs.LG (5)
years: 2026 (5)
verdicts: UNVERDICTED (5)
roles: background (1)
polarities: background (1)
representative citing papers
citing papers explorer
- AMUSE: Anytime Muon with Stable Gradient Evaluation
  AMUSE is an optimizer that combines Muon orthogonalization with Schedule-Free averaging through adaptive interpolation, enabling schedule-free anytime training and improving Pareto frontiers on vision and LLM tasks.
- MONA: Muon Optimizer with Nesterov Acceleration for Scalable Language Model Training
  MONA integrates Nesterov acceleration into Muon's orthogonalization framework, reporting better convergence than Muon and AdamW on MoE models up to 68B parameters trained on 1T tokens and SOTA fine-tuning results.
- Dimension-Free Saddle-Point Escape in Muon
  Muon achieves dimension-free saddle-point escape through non-linear spectral shaping, resolvent calculus, and structural incoherence, yielding an algebraically dimension-free escape bound.
- Zeta: Dual Whitening for Matrix Optimization via Coordinate-Adaptive Preconditioning
  Zeta applies coordinate whitening followed by spectral whitening in a fixed order to reduce orthogonalization error in matrix optimization for neural networks.
- Can Entry-Wise Clipping Give Spectral Control of Stochastic Gradients?
  Entry-wise clipping achieves spectral control of gradients via localization under heavy-tailed contamination, with O(ε^{-4}) convergence and empirical savings on NanoGPT pretraining.