AMUSE: Anytime Muon with Stable Gradient Evaluation
Pith reviewed 2026-05-22 07:17 UTC · model grok-4.3
The pith
AMUSE uses a time-varying interpolation between Muon iterates and schedule-free averaging to retain fast bulk progress while suppressing oscillations, eliminating the need for learning rate schedules.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Muon orthogonalization accelerates progress along the low-curvature bulk subspace but amplifies noise in dominant directions, causing oscillations within the river-valley loss landscape. AMUSE integrates Muon's rapid bulk progress with the stabilizing effect of schedule-free averaging through a time-varying interpolation coefficient that initially evaluates gradients near the fast Muon sequence and later shifts toward the averaged sequence, thereby suppressing oscillations while preserving the bulk benefit and removing the requirement for explicit learning rate schedules.
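To make the orthogonalization step concrete, here is a minimal sketch that replaces a momentum matrix with an approximation of its orthogonal polar factor. It uses the plain cubic Newton-Schulz iteration as a stand-in; Muon implementations reportedly use a tuned higher-order variant, and the helper names and the example matrix are illustrative assumptions, not taken from the paper.

```python
# Sketch only: cubic Newton-Schulz iteration X <- 1.5 X - 0.5 X X^T X.
# This is a stand-in for Muon's orthogonalization, not the paper's code.

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(a):
    return [list(row) for row in zip(*a)]

def frobenius_norm(a):
    return sum(v * v for row in a for v in row) ** 0.5

def orthogonalize(m, steps=30):
    """Approximate the orthogonal polar factor of a full-rank matrix m."""
    norm = frobenius_norm(m)
    # After scaling, every singular value lies in (0, 1], where the iteration converges.
    x = [[v / norm for v in row] for row in m]
    for _ in range(steps):
        xxt_x = matmul(matmul(x, transpose(x)), x)
        x = [[1.5 * x[i][j] - 0.5 * xxt_x[i][j] for j in range(len(x[0]))]
             for i in range(len(x))]
    return x

# Hypothetical momentum with one large and one very small singular direction.
momentum = [[3.0, 0.1], [0.2, 0.05]]
update = orthogonalize(momentum)
gram = matmul(update, transpose(update))  # close to the identity if update is orthogonal
```

Because every singular value is pushed toward 1, the weak direction of `momentum` receives the same step size as the strong one. That equalization is one way to read the claim that orthogonalization boosts the bulk component while also amplifying whatever noise lives in each direction.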
What carries the argument
A time-varying interpolation coefficient between the Muon sequence and the schedule-free averaged sequence that starts Muon-like for rapid adaptation and shifts to the averaged sequence for oscillation suppression.
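The mechanism can be sketched on a toy problem. The code below is an assumption-laden illustration, not the paper's algorithm: `beta_schedule` is a hypothetical linear ramp (this page does not state the actual schedule), the averaging weights are uniform as in basic Schedule-Free SGD, and the orthogonalization is replaced by an elementwise sign, which is the exact polar factor only for a diagonal matrix with nonzero entries.

```python
def beta_schedule(t, ramp_steps, beta_max=0.9):
    """Hypothetical ramp: 0 means evaluate at the fast iterate, beta_max near the average."""
    return beta_max * min(t / ramp_steps, 1.0)

def sign(v):
    return (v > 0) - (v < 0)

def amuse_like_step(z, x, m, t, grad_fn, lr=0.05, mu=0.9, ramp_steps=100):
    beta_t = beta_schedule(t, ramp_steps)
    # Gradient evaluation point: interpolate between fast iterate z and average x.
    y = [(1 - beta_t) * zi + beta_t * xi for zi, xi in zip(z, x)]
    g = grad_fn(y)
    m = [mu * mi + gi for mi, gi in zip(m, g)]
    # Stand-in for orthogonalized momentum (sign of each coordinate).
    z = [zi - lr * sign(mi) for zi, mi in zip(z, m)]
    # Uniform running average of the fast iterates (assumed weights).
    c = 1.0 / (t + 1)
    x = [(1 - c) * xi + c * zi for xi, zi in zip(x, z)]
    return z, x, m

def toy_grad(w):
    # Toy river valley: steep wall along w[0], constant downhill slope along w[1].
    return [10.0 * w[0], -1.0]

z = [1.0, 0.0]      # fast (Muon-like) iterate
x = list(z)         # averaged iterate
m = [0.0, 0.0]      # momentum buffer
for t in range(1, 201):
    z, x, m = amuse_like_step(z, x, m, t, toy_grad)
```

After the loop, `z` and `x` can be compared coordinate by coordinate. The river coordinate of `x` lags that of `z` because it is an average of past fast iterates, which is the price paid for smoothing the wall coordinate.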
If this is right
- AMUSE delivers competitive or superior performance without any prescribed learning rate schedule.
- The method supports anytime training because the averaging component allows interruption at any iteration while retaining good results.
- The performance-iteration tradeoff improves consistently over both Schedule-Free AdamW and Muon across vision and LLM pretraining tasks.
- Rapid bulk progress is retained early while later stability reduces wasteful oscillation.
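The anytime property rests on the averaged sequence being a well-defined running mean at every step. As a worked equation, assuming uniform averaging weights as in basic Schedule-Free SGD (the weights used by AMUSE are not stated on this page and may differ), the recursion reduces to a plain mean of the fast iterates:

```latex
% Assumption: uniform weights c_{t+1} = 1/(t+1); z_i is the fast sequence, x_t the average.
x_1 = z_1, \qquad
x_{t+1} = \Bigl(1 - \tfrac{1}{t+1}\Bigr) x_t + \tfrac{1}{t+1}\, z_{t+1}
\quad\Longrightarrow\quad
x_{t+1} = \frac{1}{t+1} \sum_{i=1}^{t+1} z_i .
% Induction step: if x_t = \frac{1}{t}\sum_{i=1}^{t} z_i, then
% x_{t+1} = \frac{t}{t+1}\cdot\frac{1}{t}\sum_{i=1}^{t} z_i + \frac{1}{t+1}\, z_{t+1}.
```

No horizon appears in the recursion, so stopping at any iteration returns a valid average without a decay phase tuned to a known end point.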
Where Pith is reading between the lines
- Dynamic interpolation between fast and stable sequences may generalize to other optimizers that exhibit similar noise amplification in non-convex landscapes.
- An adaptive rule for choosing the interpolation schedule based on real-time noise estimates could further reduce manual tuning.
- The same river-valley perspective might suggest extensions that monitor curvature changes to adjust the interpolation rate automatically.
Load-bearing premise
The loss landscape behaves as a river-valley structure in which Muon's orthogonalization specifically boosts bulk-subspace progress while increasing dominant-direction noise that produces oscillations.
What would settle it
If direct measurements during training show that the time-varying interpolation does not reduce observed oscillations relative to plain Muon, or if AMUSE fails to improve the performance-iteration curve on standard vision or language pretraining benchmarks, the central claim would be falsified.
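One way such a direct measurement could look is sketched below. The two statistics (sign-flip rate and mean absolute step along a chosen direction) are illustrative choices rather than metrics defined by the paper, and the trajectories are synthetic.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def oscillation_stats(trajectory, direction):
    """Return (sign_flip_rate, mean_abs_step) of steps projected onto a unit direction."""
    steps = [dot([b - a for a, b in zip(p, q)], direction)
             for p, q in zip(trajectory, trajectory[1:])]
    # A flip is a pair of consecutive steps that move in opposite directions.
    flips = sum(1 for s, t in zip(steps, steps[1:]) if s * t < 0)
    flip_rate = flips / max(len(steps) - 1, 1)
    mean_abs = sum(abs(s) for s in steps) / len(steps)
    return flip_rate, mean_abs

# Synthetic trajectories: first coordinate is the river, second is the valley wall.
zigzag = [(float(t), (-1.0) ** t) for t in range(6)]   # bounces between the walls
smooth = [(float(t), 0.0) for t in range(6)]           # stays on the valley floor
wall = (0.0, 1.0)                                      # unit vector along the wall

print(oscillation_stats(zigzag, wall))  # (1.0, 2.0)
print(oscillation_stats(smooth, wall))  # (0.0, 0.0)
```

Applied to recorded iterates of Muon and AMUSE with `direction` set to an estimated dominant eigenvector, a lower flip rate and smaller mean step for AMUSE would be consistent with the claim, and the absence of such a gap would count against it.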
Original abstract
Modern deep learning commonly relies on AdamW with prescribed learning rate schedules, but recent works challenge both components: Schedule-Free optimization removes explicit schedules via iterate averaging, and Muon improves the update geometry by orthogonalizing momentum for matrix parameters. Despite Muon's strong empirical performance, its underlying mechanism remains partially understood. We study Muon through the river-valley loss landscape, where useful training progress occurs along a flat, low-curvature bulk subspace (the river), while high-curvature dominant directions form steep valley walls that induce oscillations. We empirically show that while Muon's orthogonalization accelerates river progress by increasing the bulk component, it also amplifies dominant-direction noise, causing oscillatory trajectories. Building on this, we propose Anytime MUon with Stable gradient Evaluation (AMUSE), which integrates Muon's rapid bulk progress with the stabilizing effect of Schedule-Free averaging. AMUSE uses a time-varying interpolation coefficient that initially evaluates gradients near the fast Muon sequence for rapid adaptation, then gradually shifts toward the stable averaged sequence to suppress valley-wall oscillations. As a result, AMUSE requires no learning rate schedules and supports anytime training. Across vision tasks and large language model pretraining, AMUSE consistently improves the performance-iteration Pareto frontier over (Schedule-Free) AdamW and Muon.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes AMUSE, which combines Muon's orthogonalized momentum updates with Schedule-Free iterate averaging via a time-varying interpolation coefficient between the fast Muon sequence and the stable averaged sequence. Motivated by a river-valley loss landscape in which Muon boosts bulk low-curvature progress but amplifies high-curvature oscillations, AMUSE is claimed to improve the performance-iteration Pareto frontier over Schedule-Free AdamW and Muon on vision tasks and LLM pretraining while requiring no learning-rate schedules and supporting anytime training.
Significance. If the empirical results and mechanistic account are substantiated with detailed diagnostics and ablations, the work would offer a practical advance in schedule-free optimization that builds directly on Muon and Schedule-Free methods, potentially simplifying training pipelines for both vision and language models.
major comments (2)
- [Mechanism and design rationale] The mechanistic explanation in the abstract and introduction relies on the river-valley model, yet no quantitative diagnostics are reported to validate that Muon's orthogonalization increases the bulk component while amplifying dominant-direction noise, or that the specific time-varying interpolation selectively suppresses oscillations without eroding bulk progress. Direct measurements such as projections of updates onto estimated top eigenvectors of the gradient covariance or metrics of trajectory oscillation amplitude, comparing Muon, Schedule-Free AdamW, and AMUSE, are absent; without them the design rationale remains an untested hypothesis rather than an empirically grounded claim.
- [Empirical evaluation] The central empirical claim of consistent Pareto-frontier improvement is stated in the abstract but the provided text contains no experimental details, datasets, model scales, number of runs, error bars, statistical significance tests, or ablation studies on the interpolation schedule. This absence makes it impossible to assess robustness or rule out incidental effects of the schedule.
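The projection diagnostic requested in the first major comment could be prototyped as follows. This is a sketch under stated assumptions: `covariance` is a made-up 2x2 stand-in for an estimated gradient covariance, power iteration recovers only the single top eigenvector, and the function names are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matvec(a, v):
    return [dot(row, v) for row in a]

def top_eigenvector(sym, iters=50):
    """Power iteration for the leading eigenvector of a symmetric PSD matrix."""
    v = [1.0] * len(sym)
    for _ in range(iters):
        w = matvec(sym, v)
        n = dot(w, w) ** 0.5
        v = [x / n for x in w]
    return v

def split_update(update, dominant):
    """Split an update into its part along a unit dominant direction and the bulk remainder."""
    coeff = dot(update, dominant)
    dom = [coeff * d for d in dominant]
    bulk = [u - d for u, d in zip(update, dom)]
    return dom, bulk

# Made-up stand-in for an estimated gradient covariance with one dominant direction.
covariance = [[10.0, 0.0], [0.0, 0.1]]
v = top_eigenvector(covariance)
dom, bulk = split_update([0.3, 0.4], v)
```

Tracking the norms of `dom` and `bulk` per step for each optimizer would give the comparison the referee asks for; with several dominant directions, the same split applies after projecting onto each estimated eigenvector in turn.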
minor comments (1)
- [Methods] Notation for the time-varying interpolation coefficient and its schedule should be defined explicitly with an equation or pseudocode in the methods section to allow reproduction.
Simulated Author's Rebuttal
We thank the referee for the constructive comments. We address each major point below and have revised the manuscript to incorporate additional diagnostics and experimental details.
Point-by-point responses
- Referee: [Mechanism and design rationale] The mechanistic explanation in the abstract and introduction relies on the river-valley model, yet no quantitative diagnostics are reported to validate that Muon's orthogonalization increases the bulk component while amplifying dominant-direction noise, or that the specific time-varying interpolation selectively suppresses oscillations without eroding bulk progress. Direct measurements such as projections of updates onto estimated top eigenvectors of the gradient covariance or metrics of trajectory oscillation amplitude, comparing Muon, Schedule-Free AdamW, and AMUSE, are absent; without them the design rationale remains an untested hypothesis rather than an empirically grounded claim.
  Authors: We agree that the river-valley account would be strengthened by the specific quantitative diagnostics noted. The manuscript already presents empirical support via performance gains and qualitative trajectory observations consistent with increased bulk progress under Muon, but we acknowledge the absence of the requested eigenvector projections and oscillation amplitude metrics. In the revised version we have added these direct measurements, including update projections onto estimated top eigenvectors of the gradient covariance and oscillation-amplitude comparisons across Muon, Schedule-Free AdamW, and AMUSE. The new results corroborate that orthogonalization boosts the bulk component while amplifying dominant-direction noise, and that the time-varying interpolation in AMUSE reduces oscillations without sacrificing bulk progress.
  Revision: yes
- Referee: [Empirical evaluation] The central empirical claim of consistent Pareto-frontier improvement is stated in the abstract but the provided text contains no experimental details, datasets, model scales, number of runs, error bars, statistical significance tests, or ablation studies on the interpolation schedule. This absence makes it impossible to assess robustness or rule out incidental effects of the schedule.
  Authors: We apologize for any impression that experimental details were missing from the reviewed version; the full manuscript contains an experimental section describing the vision and LLM pretraining setups. To fully address the concern we have expanded this section in the revision with explicit reporting of datasets, model scales, number of independent runs with different random seeds, error bars, statistical significance tests, and dedicated ablation studies on the interpolation schedule. The added ablations demonstrate that the reported Pareto-frontier gains are robust across schedule variations and not attributable to incidental effects.
  Revision: yes
Circularity Check
No circularity: empirical motivation and experimental claims remain independent of inputs
The paper motivates AMUSE from an empirical observation that Muon's orthogonalization boosts the bulk (river) component while amplifying dominant-direction oscillations in a posited river-valley landscape, then introduces a time-varying interpolation to stabilize it. No equations, fitted parameters, or predictions are presented that reduce to their own inputs by construction; the design choice is described as building directly on stated observations rather than self-referential definitions or renamed fits. Claims of improved Pareto frontiers are supported by reported experiments on vision and LLM tasks, which are externally falsifiable and do not rely on load-bearing self-citations or uniqueness theorems imported from the authors' prior work. The argument is therefore self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
free parameters (1)
- time-varying interpolation coefficient schedule
axioms (1)
- domain assumption: River-valley loss landscape model in which useful progress occurs along a flat low-curvature bulk subspace while high-curvature directions induce oscillations.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (tag: unclear)
  unclear: Relation between the paper passage and the cited Recognition theorem.
  Paper passage: We study Muon through the river-valley loss landscape, where useful training progress occurs along a flat, low-curvature bulk subspace (the river), while high-curvature dominant directions form steep valley walls that induce oscillations.
- IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean, alpha_pin_under_high_calibration (tag: unclear)
  unclear: Relation between the paper passage and the cited Recognition theorem.
  Paper passage: AMUSE uses a time-varying interpolation coefficient that initially evaluates gradients near the fast Muon sequence for rapid adaptation, then gradually shifts toward the stable averaged sequence
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.