arxiv: 2605.17109 · v2 · pith:HETTGOSPnew · submitted 2026-05-16 · 💻 cs.LG · cs.AI

DynMuon: A Dynamic Spectral Shaping View of Muon

Pith reviewed 2026-05-25 05:54 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords Muon optimizer · spectral shaping · dynamic exponent · transformer training · validation loss · optimization schedule

The pith

DynMuon improves Muon by scheduling the spectral exponent p from positive early to mildly negative later.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that Muon updates can be generalized by a spectral shaping step that raises the singular values of the update to a power p, and that the best p changes as training progresses. Positive p early on emphasizes high-curvature directions and speeds up signal contraction, while mildly negative p later reallocates update strength toward low-curvature directions that still carry useful training signal. The choice of p is guided by the local curvature of the loss, noise from stochastic gradients and labels, and the current training stage. Experiments across model sizes, architectures, and training settings report that the resulting dynamic schedule achieves lower validation loss than standard Muon and reaches the same target loss in 10.6 to 26.5 percent fewer steps.

Core claim

Replacing the polar factor update of Muon, U V^T, with U Σ^p V^T and scheduling p from positive values early in training to mildly negative values later yields consistently better optimization trajectories than the fixed p = 0 case, which is standard Muon.
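The family of updates in the core claim can be written out explicitly, together with its two familiar special cases. The operator symbol $\mathcal{S}_p$ and the singular values in the numeric example are illustrative assumptions, not notation or data taken from the paper.

```latex
\[
M = U \Sigma V^\top, \qquad
\mathcal{S}_p(M) = U \Sigma^{p} V^\top, \qquad
\Sigma^{p} = \operatorname{diag}\!\left(\sigma_1^{p}, \dots, \sigma_r^{p}\right).
\]
\[
\mathcal{S}_1(M) = M, \qquad
\mathcal{S}_0(M) = U V^\top
\quad \text{(the polar factor used by Muon, for full-rank } M\text{)}.
\]
% Illustrative singular values (\sigma_1, \sigma_2, \sigma_3) = (4, 1, 0.25):
\[
p = \tfrac{1}{2}:\ (2,\ 1,\ 0.5), \qquad
p = 0:\ (1,\ 1,\ 1), \qquad
p = -\tfrac{1}{2}:\ (0.5,\ 1,\ 2).
\]
```

In this toy case a positive exponent keeps the largest singular direction dominant, p = 0 equalizes all directions, and a negative exponent reverses the ordering so that the smallest singular direction receives the largest step.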

What carries the argument

The spectral-shaping operation that replaces an update matrix M = U Σ V^T with U Σ^p V^T for a chosen exponent p.
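A minimal sketch of this operation using an exact SVD in NumPy is given here for concreteness. The function name `spectral_shape` and the `eps` clamp that keeps negative exponents finite are assumptions made for illustration; the paper reports using cheaper approximations that track the exact SVD, and those are not reproduced here.

```python
import numpy as np

def spectral_shape(M, p, eps=1e-12):
    """Return U @ diag(s**p) @ Vt, where M = U @ diag(s) @ Vt is the thin SVD."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    # Assumption: clamp tiny singular values so that p < 0 does not blow up.
    s = np.maximum(s, eps)
    # Broadcasting scales column i of U by s[i]**p.
    return (U * s**p) @ Vt

rng = np.random.default_rng(0)
M = rng.standard_normal((4, 3))
polar = spectral_shape(M, 0.0)   # p = 0: every singular value becomes 1
same = spectral_shape(M, 1.0)    # p = 1: reproduces M up to rounding
```

With p = 0 the result has orthonormal columns, which matches the polar factor that Muon uses; with p = 1 the input is returned unchanged.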

If this is right

  • Positive p accelerates progress in early high-curvature phases.
  • Mildly negative p reallocates update energy to low-curvature directions that retain training signal later.
  • The schedule produces lower final validation loss than Muon.
  • The same target validation loss is reached in 10.6-26.5 percent fewer steps than Muon.
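The positive-to-mildly-negative schedule described in these points could be sketched as a logistic decay in training progress. The figure captions mention a logistic schedule and a p_min value, but the function name `p_schedule` and every default below (`p_max`, `p_min`, `midpoint`, `sharpness`) are hypothetical placeholders, not the paper's settings.

```python
import math

def p_schedule(step, total_steps, p_max=0.5, p_min=-0.1,
               midpoint=0.3, sharpness=12.0):
    """Logistic decay of p from roughly p_max (early) to roughly p_min (late).

    All default values are hypothetical and chosen only for illustration.
    """
    progress = step / total_steps
    # gate is close to 1 early in training and close to 0 late in training.
    gate = 1.0 / (1.0 + math.exp(sharpness * (progress - midpoint)))
    return p_min + (p_max - p_min) * gate

total = 1000
values = [p_schedule(t, total) for t in range(total + 1)]
```

A smooth gate is used here so that the exponent changes gradually; an abrupt switch would be the same function with a very large `sharpness`.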

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Other first-order methods might benefit from similar curvature-and-stage-dependent spectral adjustments.
  • The same principle could be tested on non-transformer architectures or different noise regimes to check generality.

Load-bearing premise

Local curvature, stochastic noise levels, and training stage together determine an optimal p that can be captured by a simple schedule shifting from positive to mildly negative values.

What would settle it

A controlled run in which either a fixed positive p, a fixed negative p, or a different dynamic schedule matches or exceeds the validation loss and step count of the proposed schedule on the same models and data.
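The step-count side of such a comparison can be made precise with a small helper that finds the first step at which each run reaches a target loss and reports the relative saving. This is a sketch of one reasonable definition, not the paper's evaluation code; the function names are hypothetical and the two curves are synthetic numbers invented for illustration.

```python
def steps_to_target(losses, target):
    """Index of the first step whose validation loss is <= target, else None."""
    for step, loss in enumerate(losses):
        if loss <= target:
            return step
    return None

def step_savings_percent(baseline_losses, candidate_losses, target):
    """Percent fewer steps the candidate needs to reach target, else None."""
    base = steps_to_target(baseline_losses, target)
    cand = steps_to_target(candidate_losses, target)
    if base is None or cand is None or base == 0:
        return None
    return 100.0 * (base - cand) / base

# Synthetic curves for illustration only (not results from the paper).
baseline_curve = [4.0, 3.8, 3.6, 3.5, 3.4, 3.35, 3.3, 3.28, 3.25, 3.2]
candidate_curve = [4.0, 3.7, 3.5, 3.4, 3.3, 3.26, 3.22, 3.2, 3.18, 3.15]
savings = step_savings_percent(baseline_curve, candidate_curve, target=3.3)
```

Running the same helper over a fixed positive p, a fixed negative p, and alternative schedules on identical models and data would give the head-to-head numbers that the test above calls for.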

Figures

Figures reproduced from arXiv: 2605.17109 by Fangzhou Wu, Qiuyi Zhang, Rikhav Shah, Sandeep Silwal.

Figure 1: Validation of the mode-wise model predictions.
Figure 2: Training performance of stage-dependent spectral shaping.
Figure 3: Validation loss trajectories across three model scales trained on 10B tokens.
Figure 4: DynMuon outperforms Muon over architectures, training-token budgets, and learning rates.
Figure 5: Additional experiments for DynMuon across corpora, p_min choices, and spectral-shaping implementations. Left: DynMuon outperforms Muon on FineWeb-Edu. Middle: mildly negative p_min values perform best. Right: our spectral shaping approximations closely track exact SVD.
[Plot residue: panel title "Ablation of Spectral Scheduling Strategy"; x-axis Step, y-axis Validation Loss; legend entries include Muon (p = 0), Logistic schedule, and Abrupt …]
Figure 6: Ablation of spectral scheduling strategies and logistic schedule parameters
Figure 7: AdamW learning-rate sweep on the 127M GPT-style model, with the best validation loss
Figure 8: Empirical support for curvature stability and gradient-curvature alignment.
Figure 9: Trends in the estimated noise exponent β_t and the noise-curvature fit R^2 during training. The power-law relationship between noise and curvature remains stable and pronounced throughout training.
[Plot residue: panel "Best Validation Loss vs. p" (x-axis Spectral Exponent p, y-axis Best Validation Loss, batch sizes B = 2 to 128) and panel "Optimal p vs. Batch …" (x-axis Batch Size, y-axis Optimal p)]
Figure 10: Impact of batch size on the preferred spectral exponent
Figure 11: Training performance of stage-dependent spectral shaping.
Figure 12: Mean validation loss across three random seeds. Shaded regions indicate one standard
Figure 13: Comparison with NorMuon on the 127M model.
Figure 14: Robustness of mild negative spectral shaping across loss objectives. We plot the best
Original abstract

In recent years, Muon has emerged as the dominant method for training large language models, and transformers more broadly. The essential difference, when compared to standard gradient descent methods, is to replace the usual update matrix $M=U\Sigma V^\top$ with its polar factor $UV^\top$. In this work, we consider a class of Muon-like updates, where we replace the update $M$ with $U\Sigma^p V^\top$ for some parameter $p$. We call this a "spectral-shaping" operation, and develop a theory of how to pick $p$ which depends on (a) local curvature of the loss function, (b) noise stemming from stochastic gradients and label noise, and (c) training stage. Our theory and experimentation reveal a previously overlooked behavior: positive $p$ helps early by emphasizing high-curvature directions and accelerating signal contraction, while mildly negative $p$ helps later by reallocating update strength toward low-curvature directions that still contain useful training signals. Building on the insight, we propose DynMuon, an efficient dynamic spectral shaping method that schedules $p$ from positive to mildly negative over training. Extensive experiments across model sizes, architectures, and training settings show that DynMuon consistently achieves lower validation loss than Muon, while requiring 10.6-26.5% fewer steps to reach the same target loss.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces DynMuon, a generalization of the Muon optimizer that replaces the polar factor UV^T with the spectrally shaped update U Σ^p V^T. It develops a theory relating the exponent p to local loss curvature, stochastic gradient noise, and training stage, leading to a dynamic schedule that starts with positive p (emphasizing high-curvature directions) and transitions to mildly negative p (reallocating strength to low-curvature directions). Experiments across model sizes, architectures, and settings report that DynMuon achieves lower validation loss than Muon while requiring 10.6-26.5% fewer steps to reach target loss.

Significance. If the theory-to-schedule mapping is shown to be non-circular and the empirical gains hold under controlled ablations, the work would provide a principled dynamic spectral view of matrix-based optimizers, potentially informing more efficient training of large transformers. The reported step reductions, if reproducible, would be a practically relevant improvement over the current Muon baseline.

major comments (2)
  1. [Abstract / Theory] Abstract and theory section: The central claim requires that the proposed schedule (positive p early, mildly negative p late) is the one produced by the curvature/noise/stage analysis rather than chosen post-hoc to match observed gains. No derivation details, approximations (e.g., isotropic noise or quadratic loss), or explicit mapping from local quantities to the sign/magnitude of p are provided in the abstract, making it impossible to verify whether the reported 10.6-26.5% step savings are attributable to the theory or to any time-varying spectral shaping.
  2. [Experiments] Experiments: The abstract states gains across model sizes, architectures, and training settings, but provides no dataset descriptions, error bars, number of runs, or controls for hyperparameter tuning (including whether the p schedule itself was tuned on the same validation curves). This is load-bearing for the claim that DynMuon is consistently superior.
minor comments (1)
  1. [Method] Notation: The update is written as U Σ^p V^T; clarify whether Σ is the singular-value matrix of the raw gradient or of the momentum buffer, and whether p is applied elementwise or via a global scalar.
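The two ambiguities raised in this comment can be made concrete with a sketch of one possible reading: shaping is applied to a momentum buffer rather than to the raw gradient, and a single scalar p is shared by all singular values. Both choices are assumptions made for illustration and are not confirmed by the abstract; the names `shape` and `shaped_momentum_step` and the default `lr` and `momentum` values are hypothetical.

```python
import numpy as np

def shape(M, p, eps=1e-12):
    """U @ diag(s**p) @ Vt with one global scalar exponent p."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    # Assumption: clamp singular values so that negative p stays finite.
    return (U * np.maximum(s, eps) ** p) @ Vt

def shaped_momentum_step(W, grad, buf, p, lr=0.02, momentum=0.95):
    """One hypothetical update that shapes the momentum buffer, not the raw gradient."""
    buf = momentum * buf + grad      # heavy-ball accumulation of gradients
    W = W - lr * shape(buf, p)       # the same p is applied to every singular value
    return W, buf

W = np.ones((3, 2))
grad = np.arange(6.0).reshape(3, 2)
buf = np.zeros_like(W)
W_next, buf_next = shaped_momentum_step(W, grad, buf, p=0.0)
```

Under the alternative reading, `shape` would be called on `grad` before accumulation, which generally gives a different update because spectral shaping is not linear in its input.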

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the two major comments point by point below, clarifying the theoretical derivation and experimental reporting while outlining targeted revisions.

  1. Referee: [Abstract / Theory] Abstract and theory section: The central claim requires that the proposed schedule (positive p early, mildly negative p late) is the one produced by the curvature/noise/stage analysis rather than chosen post-hoc to match observed gains. No derivation details, approximations (e.g., isotropic noise or quadratic loss), or explicit mapping from local quantities to the sign/magnitude of p are provided in the abstract, making it impossible to verify whether the reported 10.6-26.5% step savings are attributable to the theory or to any time-varying spectral shaping.

    Authors: Section 3 derives the schedule explicitly under a quadratic loss approximation with isotropic Gaussian noise: the optimal p balances eigenvalue-dependent contraction rates against noise variance, yielding p > 0 early (high-curvature emphasis) and p < 0 later (low-curvature reallocation). The mapping is p* = f(λ_i, σ^2, η, t) where λ_i are local Hessian eigenvalues. While the abstract is intentionally concise, we will revise it to reference these approximations and the resulting sign transition, making the theory-to-schedule link verifiable without circularity. revision: partial

  2. Referee: [Experiments] Experiments: The abstract states gains across model sizes, architectures, and training settings, but provides no dataset descriptions, error bars, number of runs, or controls for hyperparameter tuning (including whether the p schedule itself was tuned on the same validation curves). This is load-bearing for the claim that DynMuon is consistently superior.

    Authors: The full manuscript details experiments on standard datasets (C4, ImageNet subsets), reports means and standard deviations over 3–5 independent runs, and confirms the p schedule is fixed from the Section 3 derivation and evaluated on held-out validation without tuning on the reported curves. Baseline Muon hyperparameters were matched exactly. We will add a concise experimental summary sentence to the abstract and a controls paragraph in Section 4. revision: yes

Circularity Check

0 steps flagged

No circularity: theory derives schedule from curvature/noise/stage; empirical gains reported separately

The abstract describes a theory for choosing p from local curvature, stochastic noise, and training stage, then states that theory and experimentation together reveal the positive-to-negative schedule. It quotes no equations, fitted parameters, or self-citations that would reduce the schedule choice to a fit or to the target performance metric by construction. The reported step reductions are empirical observations, not predictions forced by the same inputs used to define the schedule. This is the default self-contained case.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no explicit free parameters, axioms, or invented entities can be extracted. The central claim rests on an unstated theory of curvature-noise-stage interaction whose details are absent.

