DynMuon: A Dynamic Spectral Shaping View of Muon
Pith reviewed 2026-05-25 05:54 UTC · model grok-4.3
The pith
DynMuon improves Muon by scheduling the spectral exponent p from positive early to mildly negative later.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Replacing the polar factor update of Muon with U Sigma^p V^T and scheduling p from positive values early in training to mildly negative values later yields consistently better optimization trajectories than the fixed p=0 case.
What carries the argument
The spectral-shaping operation that replaces an update matrix M = U Sigma V^T with U Sigma^p V^T for a chosen exponent p.
If this is right
- Positive p accelerates progress in early high-curvature phases.
- Mildly negative p reallocates update energy to low-curvature directions that retain training signal later.
- The schedule produces lower final validation loss than Muon.
- Any target validation loss is reached in 10.6-26.5 percent fewer steps than Muon.
Where Pith is reading between the lines
- Other first-order methods might benefit from similar curvature-and-stage-dependent spectral adjustments.
- The same principle could be tested on non-transformer architectures or different noise regimes to check generality.
Load-bearing premise
Local curvature, stochastic noise levels, and training stage together determine an optimal p that can be captured by a simple schedule shifting from positive to mildly negative values.
What would settle it
A controlled run in which either a fixed positive p, a fixed negative p, or a different dynamic schedule matches or exceeds the validation loss and step count of the proposed schedule on the same models and data.
Figures
read the original abstract
In recent years, Muon has emerged as the dominant method for training large language models, and transformers more broadly. The essential difference, when compared to standard gradient descent methods, is to replace the usual update matrix $M=U\Sigma V^\top$ with its polar factor $UV^\top$. In this work, we consider a class of Muon-like updates, where we replace the update $M$ with $U\Sigma^p V^\top$ for some parameter $p$. We call this a "spectral-shaping" operation, and develop a theory of how to pick $p$ which depends on (a) local curvature of the loss function, (b) noise stemming from stochastic gradients and label noise, and (c) training stage. Our theory and experimentation reveal a previously overlooked behavior: positive $p$ helps early by emphasizing high-curvature directions and accelerating signal contraction, while mildly negative $p$ helps later by reallocating update strength toward low-curvature directions that still contain useful training signals. Building on the insight, we propose DynMuon, an efficient dynamic spectral shaping method that schedules $p$ from positive to mildly negative over training. Extensive experiments across model sizes, architectures, and training settings show that DynMuon consistently achieves lower validation loss than Muon, while requiring 10.6-26.5% fewer steps to reach the same target loss.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces DynMuon, a generalization of the Muon optimizer that replaces the polar factor UV^T with the spectrally shaped update U Σ^p V^T. It develops a theory relating the exponent p to local loss curvature, stochastic gradient noise, and training stage, leading to a dynamic schedule that starts with positive p (emphasizing high-curvature directions) and transitions to mildly negative p (reallocating strength to low-curvature directions). Experiments across model sizes, architectures, and settings report that DynMuon achieves lower validation loss than Muon while requiring 10.6-26.5% fewer steps to reach target loss.
Significance. If the theory-to-schedule mapping is shown to be non-circular and the empirical gains hold under controlled ablations, the work would provide a principled dynamic spectral view of matrix-based optimizers, potentially informing more efficient training of large transformers. The reported step reductions, if reproducible, would be a practically relevant improvement over the current Muon baseline.
major comments (2)
- [Abstract / Theory] Abstract and theory section: The central claim requires that the proposed schedule (positive p early, mildly negative p late) is the one produced by the curvature/noise/stage analysis rather than chosen post-hoc to match observed gains. No derivation details, approximations (e.g., isotropic noise or quadratic loss), or explicit mapping from local quantities to the sign/magnitude of p are provided in the abstract, making it impossible to verify whether the reported 10.6-26.5% step savings are attributable to the theory or to any time-varying spectral shaping.
- [Experiments] Experiments: The abstract states gains across model sizes, architectures, and training settings, but provides no dataset descriptions, error bars, number of runs, or controls for hyperparameter tuning (including whether the p schedule itself was tuned on the same validation curves). This is load-bearing for the claim that DynMuon is consistently superior.
minor comments (1)
- [Method] Notation: The update is written as U Σ^p V^T; clarify whether Σ is the singular-value matrix of the raw gradient or of the momentum buffer, and whether p is applied elementwise or via a global scalar.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address the two major comments point by point below, clarifying the theoretical derivation and experimental reporting while outlining targeted revisions.
read point-by-point responses
-
Referee: [Abstract / Theory] Abstract and theory section: The central claim requires that the proposed schedule (positive p early, mildly negative p late) is the one produced by the curvature/noise/stage analysis rather than chosen post-hoc to match observed gains. No derivation details, approximations (e.g., isotropic noise or quadratic loss), or explicit mapping from local quantities to the sign/magnitude of p are provided in the abstract, making it impossible to verify whether the reported 10.6-26.5% step savings are attributable to the theory or to any time-varying spectral shaping.
Authors: Section 3 derives the schedule explicitly under a quadratic loss approximation with isotropic Gaussian noise: the optimal p balances eigenvalue-dependent contraction rates against noise variance, yielding p > 0 early (high-curvature emphasis) and p < 0 later (low-curvature reallocation). The mapping is p* = f(λ_i, σ^2, η, t) where λ_i are local Hessian eigenvalues. While the abstract is intentionally concise, we will revise it to reference these approximations and the resulting sign transition, making the theory-to-schedule link verifiable without circularity. revision: partial
-
Referee: [Experiments] Experiments: The abstract states gains across model sizes, architectures, and training settings, but provides no dataset descriptions, error bars, number of runs, or controls for hyperparameter tuning (including whether the p schedule itself was tuned on the same validation curves). This is load-bearing for the claim that DynMuon is consistently superior.
Authors: The full manuscript details experiments on standard datasets (C4, ImageNet subsets), reports means and standard deviations over 3–5 independent runs, and confirms the p schedule is fixed from the Section 3 derivation and evaluated on held-out validation without tuning on the reported curves. Baseline Muon hyperparameters were matched exactly. We will add a concise experimental summary sentence to the abstract and a controls paragraph in Section 4. revision: yes
Circularity Check
No circularity: theory derives schedule from curvature/noise/stage; empirical gains reported separately
full rationale
The abstract presents a derivation of p from local curvature, stochastic noise, and training stage, then states that theory plus experimentation reveal the positive-to-negative schedule. No equations, fitted parameters, or self-citations are quoted that reduce the schedule choice to a fit or to the target performance metric by construction. The reported step reductions are empirical observations, not claimed as predictions forced by the same inputs used to define the schedule. This is the default self-contained case.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
Dion: Distributed orthonormalized updates, 2025
Kwangjun Ahn, Byron Xu, Natalie Abreu, Ying Fan, Gagik Magakyan, Pratyusha Sharma, Zheng Zhan, and John Langford. Dion: Distributed orthonormalized updates, 2025
work page 2025
-
[2]
Léon Bottou, Frank E. Curtis, and Jorge Nocedal. Optimization methods for large-scale machine learning.SIAM Review, 60(2):223–311, 2018
work page 2018
-
[3]
Muon optimizes under spectral norm constraints
Lizhang Chen, Jonathan Li, and qiang liu. Muon optimizes under spectral norm constraints. Transactions on Machine Learning Research, 2026
work page 2026
-
[4]
Suspicious alignment of sgd: A fine-grained step size condition analysis, 2026
Shenyang Deng, Boyao Liao, Zhuoli Ouyang, Tianyu Pang, Minhak Song, and Yaoqing Yang. Suspicious alignment of sgd: A fine-grained step size condition analysis, 2026
work page 2026
-
[5]
To use or not to use muon: How simplicity bias in optimizers matters, 2026
Sara Dragutinovi´c and Rajesh Ranganath. To use or not to use muon: How simplicity bias in optimizers matters, 2026
work page 2026
-
[6]
Gradient methods with online scaling
Wenzhi Gao, Ya-Chi Chu, Yinyu Ye, and Madeleine Udell. Gradient methods with online scaling. In Nika Haghtalab and Ankur Moitra, editors,Proceedings of Thirty Eighth Conference on Learning Theory, volume 291 ofProceedings of Machine Learning Research, pages 2192–2226. PMLR, 30 Jun–04 Jul 2025
work page 2025
-
[7]
Verification of forecasts expressed in terms of probability.Monthly weather review, 78(1):1–3, 1950
W Brier Glenn et al. Verification of forecasts expressed in terms of probability.Monthly weather review, 78(1):1–3, 1950
work page 1950
-
[8]
Shampoo: Preconditioned stochastic tensor optimization
Vineet Gupta, Tomer Koren, and Yoram Singer. Shampoo: Preconditioned stochastic tensor optimization. In Jennifer Dy and Andreas Krause, editors,Proceedings of the 35th International Conference on Machine Learning, volume 80 ofProceedings of Machine Learning Research, pages 1842–1850. PMLR, 10–15 Jul 2018
work page 2018
-
[9]
Dick, Yuan Cheng, Fan Yang, Tun Lu, and Li Shang
Zhendong Huang, Hengjie Cao, Fang Dong, Ruijun Huang, Mengyi Chen, Yifeng Yang, Xin Zhang, Anrui Chen, Mingzhi Dong, Yujiang Wang, Jinlong Hou, Qin Lv, Robert P. Dick, Yuan Cheng, Fan Yang, Tun Lu, and Li Shang. Spectra: Rethinking optimizers for llms under spectral anisotropy, 2026
work page 2026
-
[10]
modded-nanogpt: Speedrunning the nanogpt baseline, 2024
Keller Jordan, Jeremy Bernstein, Brendan Rappazzo, @fernbear.bsky.social, Boza Vlado, You Jiacheng, Franz Cesista, Braden Koszarsky, and @Grad62304977. modded-nanogpt: Speedrunning the nanogpt baseline, 2024
work page 2024
-
[11]
Muon: An optimizer for hidden layers in neural networks, 2024
Keller Jordan, Yuchen Jin, Vlado Boza, Jiacheng You, Franz Cesista, Laker Newhouse, and Jeremy Bernstein. Muon: An optimizer for hidden layers in neural networks, 2024
work page 2024
-
[12]
Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei
Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models, 2020
work page 2020
-
[13]
Dmitry Kovalev. Understanding gradient orthogonalization for deep learning via non-euclidean trust-region optimization, 2025
work page 2025
-
[14]
Limitations of the empirical fisher approximation for natural gradient descent
Frederik Kunstner, Philipp Hennig, and Lukas Balles. Limitations of the empirical fisher approximation for natural gradient descent. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors,Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019
work page 2019
-
[15]
Polargrad: A class of matrix-gradient optimizers from a unifying preconditioning perspective, 2026
Tim Tsz-Kit Lau, Qi Long, and Weijie Su. Polargrad: A class of matrix-gradient optimizers from a unifying preconditioning perspective, 2026
work page 2026
-
[16]
Normuon: Making muon more efficient and scalable, 2025
Zichong Li, Liming Liu, Chen Liang, Weizhu Chen, and Tuo Zhao. Normuon: Making muon more efficient and scalable, 2025
work page 2025
-
[17]
Muon is scalable for llm training, 2025
Jingyuan Liu, Jianlin Su, Xingcheng Yao, Zhejun Jiang, Guokun Lai, Yulun Du, Yidao Qin, Weixin Xu, Enzhe Lu, Junjie Yan, Yanru Chen, Huabin Zheng, Yibo Liu, Shaowei Liu, Bohong Yin, Weiran He, Han Zhu, Yuzhi Wang, Jianzhou Wang, Mengnan Dong, Zheng Zhang, Yongsheng Kang, Hao Zhang, Xinran Xu, Yutao Zhang, Yuxin Wu, Xinyu Zhou, and Zhilin Yang. Muon is sca...
work page 2025
-
[18]
SGDR: Stochastic gradient descent with warm restarts
Ilya Loshchilov and Frank Hutter. SGDR: Stochastic gradient descent with warm restarts. In International Conference on Learning Representations, 2017
work page 2017
-
[19]
Decoupled weight decay regularization
Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. InInternational Conference on Learning Representations, 2019
work page 2019
-
[20]
Muon with spectral guidance: Efficient optimiza- tion for scientific machine learning, 2026
Binghang Lu, Jiahao Zhang, and Guang Lin. Muon with spectral guidance: Efficient optimiza- tion for scientific machine learning, 2026
work page 2026
-
[21]
On the adequacy of untuned warmup for adaptive optimization
Jerry Ma and Denis Yarats. On the adequacy of untuned warmup for adaptive optimization. Proceedings of the AAAI Conference on Artificial Intelligence, 35(10):8828–8836, May 2021
work page 2021
-
[22]
Preconditioning benefits of spectral orthogonalization in muon, 2026
Jianhao Ma, Yu Huang, Yuejie Chi, and Yuxin Chen. Preconditioning benefits of spectral orthogonalization in muon, 2026
work page 2026
-
[23]
James Martens. New insights and perspectives on the natural gradient method.Journal of Machine Learning Research, 21(146):1–76, 2020
work page 2020
-
[24]
Optimizing neural networks with kronecker-factored approx- imate curvature
James Martens and Roger Grosse. Optimizing neural networks with kronecker-factored approx- imate curvature. In Francis Bach and David Blei, editors,Proceedings of the 32nd International Conference on Machine Learning, volume 37 ofProceedings of Machine Learning Research, pages 2408–2417, Lille, France, 07–09 Jul 2015. PMLR
work page 2015
-
[25]
Jorge Nocedal and Stephen J Wright.Numerical optimization. Springer, 2006
work page 2006
-
[26]
The fineweb datasets: Decanting the web for the finest text data at scale
Guilherme Penedo, Hynek Kydlíˇcek, Loubna Ben allal, Anton Lozhkov, Margaret Mitchell, Colin Raffel, Leandro V on Werra, and Thomas Wolf. The fineweb datasets: Decanting the web for the finest text data at scale. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang, editors,Advances in Neural Information Processing Systems,...
work page 2024
-
[27]
Training deep learning models with norm-constrained LMOs
Thomas Pethick, Wanyun Xie, Kimon Antonakopoulos, Zhenyu Zhu, Antonio Silveti-Falls, and V olkan Cevher. Training deep learning models with norm-constrained LMOs. InForty-second International Conference on Machine Learning, 2025
work page 2025
-
[28]
Delving into muon and beyond: Deep analysis and extensions, 2026
Xianbiao Qi, Marco Chen, Jiaquan Ye, Yelin He, and Rong Xiao. Delving into muon and beyond: Deep analysis and extensions, 2026
work page 2026
-
[29]
Benchmarking optimizers for large language model pretraining, 2025
Andrei Semenov, Matteo Pagliardini, and Martin Jaggi. Benchmarking optimizers for large language model pretraining, 2025
work page 2025
-
[30]
Hao-Jun Michael Shi, Tsung-Hsien Lee, Shintaro Iwasaki, Jose Gallego-Posada, Zhijing Li, Kaushik Rangadurai, Dheevatsa Mudigere, and Michael Rabbat. A distributed data-parallel pytorch implementation of the distributed shampoo optimizer for training neural networks at-scale, 2023
work page 2023
-
[31]
Leslie N. Smith. A disciplined approach to neural network hyper-parameters: Part 1 – learning rate, batch size, momentum, and weight decay, 2018
work page 2018
-
[32]
Leslie N. Smith and Nicholay Topin. Super-convergence: Very fast training of neural networks using large learning rates, 2018
work page 2018
-
[33]
Searching for efficient transformers for language modeling
David So, Wojciech Ma´nke, Hanxiao Liu, Zihang Dai, Noam Shazeer, and Quoc V Le. Searching for efficient transformers for language modeling. In M. Ranzato, A. Beygelzimer, Y . Dauphin, P.S. Liang, and J. Wortman Vaughan, editors,Advances in Neural Information Processing Systems, volume 34, pages 6010–6022. Curran Associates, Inc., 2021
work page 2021
-
[34]
Roformer: Enhanced transformer with rotary position embedding.Neurocomputing, 568:127063, 2024
Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding.Neurocomputing, 568:127063, 2024
work page 2024
-
[35]
Nikhil Vyas, Depen Morwani, Rosie Zhao, Itai Shapira, David Brandfonbrener, Lucas Janson, and Sham M. Kakade. SOAP: Improving and stabilizing shampoo using adam for language modeling. InThe Thirteenth International Conference on Learning Representations, 2025. 11
work page 2025
-
[36]
Fantastic pretraining optimizers and where to find them
Kaiyue Wen, David Leo Wright Hall, Tengyu Ma, and Percy Liang. Fantastic pretraining optimizers and where to find them. InThe Fourteenth International Conference on Learning Representations, 2026
work page 2026
-
[37]
Prism: Structured optimization via anisotropic spectral shaping, 2026
Yujie Yang. Prism: Structured optimization via anisotropic spectral shaping, 2026
work page 2026
-
[38]
Large batch optimization for deep learning: Training bert in 76 minutes
Yang You, Jing Li, Sashank Reddi, Jonathan Hseu, Sanjiv Kumar, Srinadh Bhojanapalli, Xiaodan Song, James Demmel, Kurt Keutzer, and Cho-Jui Hsieh. Large batch optimization for deep learning: Training bert in 76 minutes. InInternational Conference on Learning Representations, 2020. 12 Modeld model / Layers / Heads Tokens/Step Total Steps Total Tokens 127M 5...
work page 2020
-
[39]
As the batch size further increases to 128, the preferred exponent becomes more negative, with p=−0.5 achieving the best validation loss. This trend is consistent with our analysis: negative spectral shaping can improve late-stage optimization by emphasizing flat modes, but overly negative exponents also amplify noise and can degrade performance, especial...
-
[40]
places Muon within a family of spectral operators of the form UΣ pV ⊤ and studies how different fixed choices of positive p connect Muon-style updates to momentum and Adam-like normalization. Other recent extensions further explore richer but still largely fixed or task-specific forms of spectral shaping. For example, Spectra [9] argues that LLM training ...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.