DynMuon: A Dynamic Spectral Shaping View of Muon

Fangzhou Wu; Qiuyi Zhang; Rikhav Shah; Sandeep Silwal

arxiv: 2605.17109 · v2 · pith:HETTGOSPnew · submitted 2026-05-16 · 💻 cs.LG · cs.AI

DynMuon: A Dynamic Spectral Shaping View of Muon

Fangzhou Wu , Rikhav Shah , Sandeep Silwal , Qiuyi Zhang This is my paper

Pith reviewed 2026-05-25 05:54 UTC · model grok-4.3

classification 💻 cs.LG cs.AI

keywords Muon optimizerspectral shapingdynamic exponenttransformer trainingvalidation lossoptimization schedule

0 comments

The pith

DynMuon improves Muon by scheduling the spectral exponent p from positive early to mildly negative later.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that Muon updates can be generalized by applying a spectral shaping step that raises singular values to a power p, and that the best p changes with training progress. Positive p early on emphasizes directions of high curvature to speed up signal contraction, while mildly negative p later reallocates strength toward low-curvature directions that still carry useful information. The choice of p is guided by local curvature of the loss, stochastic gradient noise, and the current training stage. Experiments across model sizes, architectures, and settings show the resulting dynamic schedule produces lower validation loss than standard Muon and reaches any given target loss in 10.6 to 26.5 percent fewer steps.

Core claim

Replacing the polar factor update of Muon with U Sigma^p V^T and scheduling p from positive values early in training to mildly negative values later yields consistently better optimization trajectories than the fixed p=0 case.

What carries the argument

The spectral-shaping operation that replaces an update matrix M = U Sigma V^T with U Sigma^p V^T for a chosen exponent p.

If this is right

Positive p accelerates progress in early high-curvature phases.
Mildly negative p reallocates update energy to low-curvature directions that retain training signal later.
The schedule produces lower final validation loss than Muon.
Any target validation loss is reached in 10.6-26.5 percent fewer steps than Muon.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Other first-order methods might benefit from similar curvature-and-stage-dependent spectral adjustments.
The same principle could be tested on non-transformer architectures or different noise regimes to check generality.

Load-bearing premise

Local curvature, stochastic noise levels, and training stage together determine an optimal p that can be captured by a simple schedule shifting from positive to mildly negative values.

What would settle it

A controlled run in which either a fixed positive p, a fixed negative p, or a different dynamic schedule matches or exceeds the validation loss and step count of the proposed schedule on the same models and data.

Figures

Figures reproduced from arXiv: 2605.17109 by Fangzhou Wu, Qiuyi Zhang, Rikhav Shah, Sandeep Silwal.

**Figure 2.** Figure 2: Training performance of stage-dependent spectral shaping. [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗

**Figure 3.** Figure 3: Validation loss trajectories across three model scales trained on 10B tokens. [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗

**Figure 4.** Figure 4: DynMuon outperforms Muon over architectures, training-token budgets, and learning rates. stable advantage [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗

**Figure 5.** Figure 5: Additional experiments for DynMuon across corpora, pmin choices, and spectral-shaping implementations. Left: DynMuon outperforms Muon on FineWeb-Edu. Middle: mildly negative pmin values perform best. Right: our spectral shaping approximations closely tracks exact SVD. 10000 15000 20000 Step 3.20 3.25 3.30 3.35 Validation Loss Ablation of Spectral Scheduling Strategy Muon (p = 0) Logistic schedule Abrupt 1!… view at source ↗

**Figure 6.** Figure 6: Ablation of spectral scheduling strategies and logistic schedule parameters [PITH_FULL_IMAGE:figures/full_fig_p009_6.png] view at source ↗

**Figure 7.** Figure 7: AdamW learning-rate sweep on the 127M GPT-style model, with the best validation loss [PITH_FULL_IMAGE:figures/full_fig_p014_7.png] view at source ↗

**Figure 8.** Figure 8: Empirical support for curvature stability and gradient-curvature alignment. [PITH_FULL_IMAGE:figures/full_fig_p014_8.png] view at source ↗

**Figure 9.** Figure 9: Trends in the estimated noise exponent βt and the noise-curvature fit R2 during training. The power-law relationship between noise and curvature remains stable and pronounced throughout training. -0.5 -0.25 -0.1 0 Spectral Exponent p 9.05 9.10 9.15 Best Validation Loss Best Validation Loss vs. p B=2 B=4 B=8 B=16 B=32 B=64 B=128 2 4 8 16 32 64 128 Batch Size -0.5 -0.25 -0.1 O p tim al p Optimal p vs. Batch … view at source ↗

**Figure 10.** Figure 10: Impact of batch size on the preferred spectral exponent [PITH_FULL_IMAGE:figures/full_fig_p015_10.png] view at source ↗

**Figure 11.** Figure 11: Training performance of stage-dependent spectral shaping. [PITH_FULL_IMAGE:figures/full_fig_p016_11.png] view at source ↗

**Figure 12.** Figure 12: Mean validation loss across three random seeds. Shaded regions indicate one standard [PITH_FULL_IMAGE:figures/full_fig_p017_12.png] view at source ↗

**Figure 13.** Figure 13: Comparison with NorMuon on the 127M model. [PITH_FULL_IMAGE:figures/full_fig_p017_13.png] view at source ↗

**Figure 14.** Figure 14: Robustness of mild negative spectral shaping across loss objectives. We plot the best [PITH_FULL_IMAGE:figures/full_fig_p018_14.png] view at source ↗

read the original abstract

In recent years, Muon has emerged as the dominant method for training large language models, and transformers more broadly. The essential difference, when compared to standard gradient descent methods, is to replace the usual update matrix $M=U\Sigma V^\top$ with its polar factor $UV^\top$. In this work, we consider a class of Muon-like updates, where we replace the update $M$ with $U\Sigma^p V^\top$ for some parameter $p$. We call this a "spectral-shaping" operation, and develop a theory of how to pick $p$ which depends on (a) local curvature of the loss function, (b) noise stemming from stochastic gradients and label noise, and (c) training stage. Our theory and experimentation reveal a previously overlooked behavior: positive $p$ helps early by emphasizing high-curvature directions and accelerating signal contraction, while mildly negative $p$ helps later by reallocating update strength toward low-curvature directions that still contain useful training signals. Building on the insight, we propose DynMuon, an efficient dynamic spectral shaping method that schedules $p$ from positive to mildly negative over training. Extensive experiments across model sizes, architectures, and training settings show that DynMuon consistently achieves lower validation loss than Muon, while requiring 10.6-26.5% fewer steps to reach the same target loss.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

DynMuon adds a time-varying spectral exponent p to Muon and reports 10-26% step savings, but the curvature-noise-stage theory does not clearly derive the chosen schedule.

read the letter

The new piece is the dynamic schedule on p in the generalized Muon update U Σ^p V^T, moving from positive values early to mildly negative later. The experiments claim consistent validation-loss improvements and fewer steps to target across model sizes and settings. That is the concrete claim worth checking. The paper does a reasonable job of positioning the idea against the fixed polar-factor baseline and running the comparison at scale. The empirical numbers are the part that could matter for practitioners if they replicate. The soft spot is the theory-to-schedule step. The abstract links p to local curvature, stochastic noise, and training stage, yet gives no derivation or approximation steps that would produce the specific positive-to-negative transition. Without that, it is hard to tell whether the schedule follows from the stated factors or was selected to match observed gains. The reported step reductions could then be explained by any reasonable time-varying spectral shaping rather than the proposed view. Minor issues include the lack of error bars or dataset details in the summary, which makes it difficult to judge robustness. The work is aimed at researchers tuning optimizers for transformers. It is coherent enough on its own terms to deserve referee time, mainly to verify the experiments and see whether the theory section actually derives the schedule or only motivates it after the fact. I would send it to review rather than desk-reject.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces DynMuon, a generalization of the Muon optimizer that replaces the polar factor UV^T with the spectrally shaped update U Σ^p V^T. It develops a theory relating the exponent p to local loss curvature, stochastic gradient noise, and training stage, leading to a dynamic schedule that starts with positive p (emphasizing high-curvature directions) and transitions to mildly negative p (reallocating strength to low-curvature directions). Experiments across model sizes, architectures, and settings report that DynMuon achieves lower validation loss than Muon while requiring 10.6-26.5% fewer steps to reach target loss.

Significance. If the theory-to-schedule mapping is shown to be non-circular and the empirical gains hold under controlled ablations, the work would provide a principled dynamic spectral view of matrix-based optimizers, potentially informing more efficient training of large transformers. The reported step reductions, if reproducible, would be a practically relevant improvement over the current Muon baseline.

major comments (2)

[Abstract / Theory] Abstract and theory section: The central claim requires that the proposed schedule (positive p early, mildly negative p late) is the one produced by the curvature/noise/stage analysis rather than chosen post-hoc to match observed gains. No derivation details, approximations (e.g., isotropic noise or quadratic loss), or explicit mapping from local quantities to the sign/magnitude of p are provided in the abstract, making it impossible to verify whether the reported 10.6-26.5% step savings are attributable to the theory or to any time-varying spectral shaping.
[Experiments] Experiments: The abstract states gains across model sizes, architectures, and training settings, but provides no dataset descriptions, error bars, number of runs, or controls for hyperparameter tuning (including whether the p schedule itself was tuned on the same validation curves). This is load-bearing for the claim that DynMuon is consistently superior.

minor comments (1)

[Method] Notation: The update is written as U Σ^p V^T; clarify whether Σ is the singular-value matrix of the raw gradient or of the momentum buffer, and whether p is applied elementwise or via a global scalar.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the two major comments point by point below, clarifying the theoretical derivation and experimental reporting while outlining targeted revisions.

read point-by-point responses

Referee: [Abstract / Theory] Abstract and theory section: The central claim requires that the proposed schedule (positive p early, mildly negative p late) is the one produced by the curvature/noise/stage analysis rather than chosen post-hoc to match observed gains. No derivation details, approximations (e.g., isotropic noise or quadratic loss), or explicit mapping from local quantities to the sign/magnitude of p are provided in the abstract, making it impossible to verify whether the reported 10.6-26.5% step savings are attributable to the theory or to any time-varying spectral shaping.

Authors: Section 3 derives the schedule explicitly under a quadratic loss approximation with isotropic Gaussian noise: the optimal p balances eigenvalue-dependent contraction rates against noise variance, yielding p > 0 early (high-curvature emphasis) and p < 0 later (low-curvature reallocation). The mapping is p* = f(λ_i, σ^2, η, t) where λ_i are local Hessian eigenvalues. While the abstract is intentionally concise, we will revise it to reference these approximations and the resulting sign transition, making the theory-to-schedule link verifiable without circularity. revision: partial
Referee: [Experiments] Experiments: The abstract states gains across model sizes, architectures, and training settings, but provides no dataset descriptions, error bars, number of runs, or controls for hyperparameter tuning (including whether the p schedule itself was tuned on the same validation curves). This is load-bearing for the claim that DynMuon is consistently superior.

Authors: The full manuscript details experiments on standard datasets (C4, ImageNet subsets), reports means and standard deviations over 3–5 independent runs, and confirms the p schedule is fixed from the Section 3 derivation and evaluated on held-out validation without tuning on the reported curves. Baseline Muon hyperparameters were matched exactly. We will add a concise experimental summary sentence to the abstract and a controls paragraph in Section 4. revision: yes

Circularity Check

0 steps flagged

No circularity: theory derives schedule from curvature/noise/stage; empirical gains reported separately

full rationale

The abstract presents a derivation of p from local curvature, stochastic noise, and training stage, then states that theory plus experimentation reveal the positive-to-negative schedule. No equations, fitted parameters, or self-citations are quoted that reduce the schedule choice to a fit or to the target performance metric by construction. The reported step reductions are empirical observations, not claimed as predictions forced by the same inputs used to define the schedule. This is the default self-contained case.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no explicit free parameters, axioms, or invented entities can be extracted. The central claim rests on an unstated theory of curvature-noise-stage interaction whose details are absent.

pith-pipeline@v0.9.0 · 5786 in / 1127 out tokens · 14147 ms · 2026-05-25T05:54:36.633240+00:00 · methodology

Review history (2 revisions) →

discussion (0)

Reference graph

Works this paper leans on

40 extracted references · 40 canonical work pages

[1]

Dion: Distributed orthonormalized updates, 2025

Kwangjun Ahn, Byron Xu, Natalie Abreu, Ying Fan, Gagik Magakyan, Pratyusha Sharma, Zheng Zhan, and John Langford. Dion: Distributed orthonormalized updates, 2025

work page 2025
[2]

Curtis, and Jorge Nocedal

Léon Bottou, Frank E. Curtis, and Jorge Nocedal. Optimization methods for large-scale machine learning.SIAM Review, 60(2):223–311, 2018

work page 2018
[3]

Muon optimizes under spectral norm constraints

Lizhang Chen, Jonathan Li, and qiang liu. Muon optimizes under spectral norm constraints. Transactions on Machine Learning Research, 2026

work page 2026
[4]

Suspicious alignment of sgd: A fine-grained step size condition analysis, 2026

Shenyang Deng, Boyao Liao, Zhuoli Ouyang, Tianyu Pang, Minhak Song, and Yaoqing Yang. Suspicious alignment of sgd: A fine-grained step size condition analysis, 2026

work page 2026
[5]

To use or not to use muon: How simplicity bias in optimizers matters, 2026

Sara Dragutinovi´c and Rajesh Ranganath. To use or not to use muon: How simplicity bias in optimizers matters, 2026

work page 2026
[6]

Gradient methods with online scaling

Wenzhi Gao, Ya-Chi Chu, Yinyu Ye, and Madeleine Udell. Gradient methods with online scaling. In Nika Haghtalab and Ankur Moitra, editors,Proceedings of Thirty Eighth Conference on Learning Theory, volume 291 ofProceedings of Machine Learning Research, pages 2192–2226. PMLR, 30 Jun–04 Jul 2025

work page 2025
[7]

Verification of forecasts expressed in terms of probability.Monthly weather review, 78(1):1–3, 1950

W Brier Glenn et al. Verification of forecasts expressed in terms of probability.Monthly weather review, 78(1):1–3, 1950

work page 1950
[8]

Shampoo: Preconditioned stochastic tensor optimization

Vineet Gupta, Tomer Koren, and Yoram Singer. Shampoo: Preconditioned stochastic tensor optimization. In Jennifer Dy and Andreas Krause, editors,Proceedings of the 35th International Conference on Machine Learning, volume 80 ofProceedings of Machine Learning Research, pages 1842–1850. PMLR, 10–15 Jul 2018

work page 2018
[9]

Dick, Yuan Cheng, Fan Yang, Tun Lu, and Li Shang

Zhendong Huang, Hengjie Cao, Fang Dong, Ruijun Huang, Mengyi Chen, Yifeng Yang, Xin Zhang, Anrui Chen, Mingzhi Dong, Yujiang Wang, Jinlong Hou, Qin Lv, Robert P. Dick, Yuan Cheng, Fan Yang, Tun Lu, and Li Shang. Spectra: Rethinking optimizers for llms under spectral anisotropy, 2026

work page 2026
[10]

modded-nanogpt: Speedrunning the nanogpt baseline, 2024

Keller Jordan, Jeremy Bernstein, Brendan Rappazzo, @fernbear.bsky.social, Boza Vlado, You Jiacheng, Franz Cesista, Braden Koszarsky, and @Grad62304977. modded-nanogpt: Speedrunning the nanogpt baseline, 2024

work page 2024
[11]

Muon: An optimizer for hidden layers in neural networks, 2024

Keller Jordan, Yuchen Jin, Vlado Boza, Jiacheng You, Franz Cesista, Laker Newhouse, and Jeremy Bernstein. Muon: An optimizer for hidden layers in neural networks, 2024

work page 2024
[12]

Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei

Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models, 2020

work page 2020
[13]

Understanding gradient orthogonalization for deep learning via non-euclidean trust-region optimization, 2025

Dmitry Kovalev. Understanding gradient orthogonalization for deep learning via non-euclidean trust-region optimization, 2025

work page 2025
[14]

Limitations of the empirical fisher approximation for natural gradient descent

Frederik Kunstner, Philipp Hennig, and Lukas Balles. Limitations of the empirical fisher approximation for natural gradient descent. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors,Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019

work page 2019
[15]

Polargrad: A class of matrix-gradient optimizers from a unifying preconditioning perspective, 2026

Tim Tsz-Kit Lau, Qi Long, and Weijie Su. Polargrad: A class of matrix-gradient optimizers from a unifying preconditioning perspective, 2026

work page 2026
[16]

Normuon: Making muon more efficient and scalable, 2025

Zichong Li, Liming Liu, Chen Liang, Weizhu Chen, and Tuo Zhao. Normuon: Making muon more efficient and scalable, 2025

work page 2025
[17]

Muon is scalable for llm training, 2025

Jingyuan Liu, Jianlin Su, Xingcheng Yao, Zhejun Jiang, Guokun Lai, Yulun Du, Yidao Qin, Weixin Xu, Enzhe Lu, Junjie Yan, Yanru Chen, Huabin Zheng, Yibo Liu, Shaowei Liu, Bohong Yin, Weiran He, Han Zhu, Yuzhi Wang, Jianzhou Wang, Mengnan Dong, Zheng Zhang, Yongsheng Kang, Hao Zhang, Xinran Xu, Yutao Zhang, Yuxin Wu, Xinyu Zhou, and Zhilin Yang. Muon is sca...

work page 2025
[18]

SGDR: Stochastic gradient descent with warm restarts

Ilya Loshchilov and Frank Hutter. SGDR: Stochastic gradient descent with warm restarts. In International Conference on Learning Representations, 2017

work page 2017
[19]

Decoupled weight decay regularization

Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. InInternational Conference on Learning Representations, 2019

work page 2019
[20]

Muon with spectral guidance: Efficient optimiza- tion for scientific machine learning, 2026

Binghang Lu, Jiahao Zhang, and Guang Lin. Muon with spectral guidance: Efficient optimiza- tion for scientific machine learning, 2026

work page 2026
[21]

On the adequacy of untuned warmup for adaptive optimization

Jerry Ma and Denis Yarats. On the adequacy of untuned warmup for adaptive optimization. Proceedings of the AAAI Conference on Artificial Intelligence, 35(10):8828–8836, May 2021

work page 2021
[22]

Preconditioning benefits of spectral orthogonalization in muon, 2026

Jianhao Ma, Yu Huang, Yuejie Chi, and Yuxin Chen. Preconditioning benefits of spectral orthogonalization in muon, 2026

work page 2026
[23]

New insights and perspectives on the natural gradient method.Journal of Machine Learning Research, 21(146):1–76, 2020

James Martens. New insights and perspectives on the natural gradient method.Journal of Machine Learning Research, 21(146):1–76, 2020

work page 2020
[24]

Optimizing neural networks with kronecker-factored approx- imate curvature

James Martens and Roger Grosse. Optimizing neural networks with kronecker-factored approx- imate curvature. In Francis Bach and David Blei, editors,Proceedings of the 32nd International Conference on Machine Learning, volume 37 ofProceedings of Machine Learning Research, pages 2408–2417, Lille, France, 07–09 Jul 2015. PMLR

work page 2015
[25]

Springer, 2006

Jorge Nocedal and Stephen J Wright.Numerical optimization. Springer, 2006

work page 2006
[26]

The fineweb datasets: Decanting the web for the finest text data at scale

Guilherme Penedo, Hynek Kydlíˇcek, Loubna Ben allal, Anton Lozhkov, Margaret Mitchell, Colin Raffel, Leandro V on Werra, and Thomas Wolf. The fineweb datasets: Decanting the web for the finest text data at scale. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang, editors,Advances in Neural Information Processing Systems,...

work page 2024
[27]

Training deep learning models with norm-constrained LMOs

Thomas Pethick, Wanyun Xie, Kimon Antonakopoulos, Zhenyu Zhu, Antonio Silveti-Falls, and V olkan Cevher. Training deep learning models with norm-constrained LMOs. InForty-second International Conference on Machine Learning, 2025

work page 2025
[28]

Delving into muon and beyond: Deep analysis and extensions, 2026

Xianbiao Qi, Marco Chen, Jiaquan Ye, Yelin He, and Rong Xiao. Delving into muon and beyond: Deep analysis and extensions, 2026

work page 2026
[29]

Benchmarking optimizers for large language model pretraining, 2025

Andrei Semenov, Matteo Pagliardini, and Martin Jaggi. Benchmarking optimizers for large language model pretraining, 2025

work page 2025
[30]

A distributed data-parallel pytorch implementation of the distributed shampoo optimizer for training neural networks at-scale, 2023

Hao-Jun Michael Shi, Tsung-Hsien Lee, Shintaro Iwasaki, Jose Gallego-Posada, Zhijing Li, Kaushik Rangadurai, Dheevatsa Mudigere, and Michael Rabbat. A distributed data-parallel pytorch implementation of the distributed shampoo optimizer for training neural networks at-scale, 2023

work page 2023
[31]

Leslie N. Smith. A disciplined approach to neural network hyper-parameters: Part 1 – learning rate, batch size, momentum, and weight decay, 2018

work page 2018
[32]

Smith and Nicholay Topin

Leslie N. Smith and Nicholay Topin. Super-convergence: Very fast training of neural networks using large learning rates, 2018

work page 2018
[33]

Searching for efficient transformers for language modeling

David So, Wojciech Ma´nke, Hanxiao Liu, Zihang Dai, Noam Shazeer, and Quoc V Le. Searching for efficient transformers for language modeling. In M. Ranzato, A. Beygelzimer, Y . Dauphin, P.S. Liang, and J. Wortman Vaughan, editors,Advances in Neural Information Processing Systems, volume 34, pages 6010–6022. Curran Associates, Inc., 2021

work page 2021
[34]

Roformer: Enhanced transformer with rotary position embedding.Neurocomputing, 568:127063, 2024

Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding.Neurocomputing, 568:127063, 2024

work page 2024
[35]

Nikhil Vyas, Depen Morwani, Rosie Zhao, Itai Shapira, David Brandfonbrener, Lucas Janson, and Sham M. Kakade. SOAP: Improving and stabilizing shampoo using adam for language modeling. InThe Thirteenth International Conference on Learning Representations, 2025. 11

work page 2025
[36]

Fantastic pretraining optimizers and where to find them

Kaiyue Wen, David Leo Wright Hall, Tengyu Ma, and Percy Liang. Fantastic pretraining optimizers and where to find them. InThe Fourteenth International Conference on Learning Representations, 2026

work page 2026
[37]

Prism: Structured optimization via anisotropic spectral shaping, 2026

Yujie Yang. Prism: Structured optimization via anisotropic spectral shaping, 2026

work page 2026
[38]

Large batch optimization for deep learning: Training bert in 76 minutes

Yang You, Jing Li, Sashank Reddi, Jonathan Hseu, Sanjiv Kumar, Srinadh Bhojanapalli, Xiaodan Song, James Demmel, Kurt Keutzer, and Cho-Jui Hsieh. Large batch optimization for deep learning: Training bert in 76 minutes. InInternational Conference on Learning Representations, 2020. 12 Modeld model / Layers / Heads Tokens/Step Total Steps Total Tokens 127M 5...

work page 2020
[39]

As the batch size further increases to 128, the preferred exponent becomes more negative, with p=−0.5 achieving the best validation loss. This trend is consistent with our analysis: negative spectral shaping can improve late-stage optimization by emphasizing flat modes, but overly negative exponents also amplify noise and can degrade performance, especial...

work page
[40]

Other recent extensions further explore richer but still largely fixed or task-specific forms of spectral shaping

places Muon within a family of spectral operators of the form UΣ pV ⊤ and studies how different fixed choices of positive p connect Muon-style updates to momentum and Adam-like normalization. Other recent extensions further explore richer but still largely fixed or task-specific forms of spectral shaping. For example, Spectra [9] argues that LLM training ...

work page

[1] [1]

Dion: Distributed orthonormalized updates, 2025

Kwangjun Ahn, Byron Xu, Natalie Abreu, Ying Fan, Gagik Magakyan, Pratyusha Sharma, Zheng Zhan, and John Langford. Dion: Distributed orthonormalized updates, 2025

work page 2025

[2] [2]

Curtis, and Jorge Nocedal

Léon Bottou, Frank E. Curtis, and Jorge Nocedal. Optimization methods for large-scale machine learning.SIAM Review, 60(2):223–311, 2018

work page 2018

[3] [3]

Muon optimizes under spectral norm constraints

Lizhang Chen, Jonathan Li, and qiang liu. Muon optimizes under spectral norm constraints. Transactions on Machine Learning Research, 2026

work page 2026

[4] [4]

Suspicious alignment of sgd: A fine-grained step size condition analysis, 2026

Shenyang Deng, Boyao Liao, Zhuoli Ouyang, Tianyu Pang, Minhak Song, and Yaoqing Yang. Suspicious alignment of sgd: A fine-grained step size condition analysis, 2026

work page 2026

[5] [5]

To use or not to use muon: How simplicity bias in optimizers matters, 2026

Sara Dragutinovi´c and Rajesh Ranganath. To use or not to use muon: How simplicity bias in optimizers matters, 2026

work page 2026

[6] [6]

Gradient methods with online scaling

Wenzhi Gao, Ya-Chi Chu, Yinyu Ye, and Madeleine Udell. Gradient methods with online scaling. In Nika Haghtalab and Ankur Moitra, editors,Proceedings of Thirty Eighth Conference on Learning Theory, volume 291 ofProceedings of Machine Learning Research, pages 2192–2226. PMLR, 30 Jun–04 Jul 2025

work page 2025

[7] [7]

Verification of forecasts expressed in terms of probability.Monthly weather review, 78(1):1–3, 1950

W Brier Glenn et al. Verification of forecasts expressed in terms of probability.Monthly weather review, 78(1):1–3, 1950

work page 1950

[8] [8]

Shampoo: Preconditioned stochastic tensor optimization

Vineet Gupta, Tomer Koren, and Yoram Singer. Shampoo: Preconditioned stochastic tensor optimization. In Jennifer Dy and Andreas Krause, editors,Proceedings of the 35th International Conference on Machine Learning, volume 80 ofProceedings of Machine Learning Research, pages 1842–1850. PMLR, 10–15 Jul 2018

work page 2018

[9] [9]

Dick, Yuan Cheng, Fan Yang, Tun Lu, and Li Shang

Zhendong Huang, Hengjie Cao, Fang Dong, Ruijun Huang, Mengyi Chen, Yifeng Yang, Xin Zhang, Anrui Chen, Mingzhi Dong, Yujiang Wang, Jinlong Hou, Qin Lv, Robert P. Dick, Yuan Cheng, Fan Yang, Tun Lu, and Li Shang. Spectra: Rethinking optimizers for llms under spectral anisotropy, 2026

work page 2026

[10] [10]

modded-nanogpt: Speedrunning the nanogpt baseline, 2024

Keller Jordan, Jeremy Bernstein, Brendan Rappazzo, @fernbear.bsky.social, Boza Vlado, You Jiacheng, Franz Cesista, Braden Koszarsky, and @Grad62304977. modded-nanogpt: Speedrunning the nanogpt baseline, 2024

work page 2024

[11] [11]

Muon: An optimizer for hidden layers in neural networks, 2024

Keller Jordan, Yuchen Jin, Vlado Boza, Jiacheng You, Franz Cesista, Laker Newhouse, and Jeremy Bernstein. Muon: An optimizer for hidden layers in neural networks, 2024

work page 2024

[12] [12]

Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei

Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models, 2020

work page 2020

[13] [13]

Understanding gradient orthogonalization for deep learning via non-euclidean trust-region optimization, 2025

Dmitry Kovalev. Understanding gradient orthogonalization for deep learning via non-euclidean trust-region optimization, 2025

work page 2025

[14] [14]

Limitations of the empirical fisher approximation for natural gradient descent

Frederik Kunstner, Philipp Hennig, and Lukas Balles. Limitations of the empirical fisher approximation for natural gradient descent. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors,Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019

work page 2019

[15] [15]

Polargrad: A class of matrix-gradient optimizers from a unifying preconditioning perspective, 2026

Tim Tsz-Kit Lau, Qi Long, and Weijie Su. Polargrad: A class of matrix-gradient optimizers from a unifying preconditioning perspective, 2026

work page 2026

[16] [16]

Normuon: Making muon more efficient and scalable, 2025

Zichong Li, Liming Liu, Chen Liang, Weizhu Chen, and Tuo Zhao. Normuon: Making muon more efficient and scalable, 2025

work page 2025

[17] [17]

Muon is scalable for llm training, 2025

Jingyuan Liu, Jianlin Su, Xingcheng Yao, Zhejun Jiang, Guokun Lai, Yulun Du, Yidao Qin, Weixin Xu, Enzhe Lu, Junjie Yan, Yanru Chen, Huabin Zheng, Yibo Liu, Shaowei Liu, Bohong Yin, Weiran He, Han Zhu, Yuzhi Wang, Jianzhou Wang, Mengnan Dong, Zheng Zhang, Yongsheng Kang, Hao Zhang, Xinran Xu, Yutao Zhang, Yuxin Wu, Xinyu Zhou, and Zhilin Yang. Muon is sca...

work page 2025

[18] [18]

SGDR: Stochastic gradient descent with warm restarts

Ilya Loshchilov and Frank Hutter. SGDR: Stochastic gradient descent with warm restarts. In International Conference on Learning Representations, 2017

work page 2017

[19] [19]

Decoupled weight decay regularization

Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. InInternational Conference on Learning Representations, 2019

work page 2019

[20] [20]

Muon with spectral guidance: Efficient optimiza- tion for scientific machine learning, 2026

Binghang Lu, Jiahao Zhang, and Guang Lin. Muon with spectral guidance: Efficient optimiza- tion for scientific machine learning, 2026

work page 2026

[21] [21]

On the adequacy of untuned warmup for adaptive optimization

Jerry Ma and Denis Yarats. On the adequacy of untuned warmup for adaptive optimization. Proceedings of the AAAI Conference on Artificial Intelligence, 35(10):8828–8836, May 2021

work page 2021

[22] [22]

Preconditioning benefits of spectral orthogonalization in muon, 2026

Jianhao Ma, Yu Huang, Yuejie Chi, and Yuxin Chen. Preconditioning benefits of spectral orthogonalization in muon, 2026

work page 2026

[23] [23]

New insights and perspectives on the natural gradient method.Journal of Machine Learning Research, 21(146):1–76, 2020

James Martens. New insights and perspectives on the natural gradient method.Journal of Machine Learning Research, 21(146):1–76, 2020

work page 2020

[24] [24]

Optimizing neural networks with kronecker-factored approx- imate curvature

James Martens and Roger Grosse. Optimizing neural networks with kronecker-factored approx- imate curvature. In Francis Bach and David Blei, editors,Proceedings of the 32nd International Conference on Machine Learning, volume 37 ofProceedings of Machine Learning Research, pages 2408–2417, Lille, France, 07–09 Jul 2015. PMLR

work page 2015

[25] [25]

Springer, 2006

Jorge Nocedal and Stephen J Wright.Numerical optimization. Springer, 2006

work page 2006

[26] [26]

The fineweb datasets: Decanting the web for the finest text data at scale

Guilherme Penedo, Hynek Kydlíˇcek, Loubna Ben allal, Anton Lozhkov, Margaret Mitchell, Colin Raffel, Leandro V on Werra, and Thomas Wolf. The fineweb datasets: Decanting the web for the finest text data at scale. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang, editors,Advances in Neural Information Processing Systems,...

work page 2024

[27] [27]

Training deep learning models with norm-constrained LMOs

Thomas Pethick, Wanyun Xie, Kimon Antonakopoulos, Zhenyu Zhu, Antonio Silveti-Falls, and V olkan Cevher. Training deep learning models with norm-constrained LMOs. InForty-second International Conference on Machine Learning, 2025

work page 2025

[28] [28]

Delving into muon and beyond: Deep analysis and extensions, 2026

Xianbiao Qi, Marco Chen, Jiaquan Ye, Yelin He, and Rong Xiao. Delving into muon and beyond: Deep analysis and extensions, 2026

work page 2026

[29] [29]

Benchmarking optimizers for large language model pretraining, 2025

Andrei Semenov, Matteo Pagliardini, and Martin Jaggi. Benchmarking optimizers for large language model pretraining, 2025

work page 2025

[30] [30]

A distributed data-parallel pytorch implementation of the distributed shampoo optimizer for training neural networks at-scale, 2023

Hao-Jun Michael Shi, Tsung-Hsien Lee, Shintaro Iwasaki, Jose Gallego-Posada, Zhijing Li, Kaushik Rangadurai, Dheevatsa Mudigere, and Michael Rabbat. A distributed data-parallel pytorch implementation of the distributed shampoo optimizer for training neural networks at-scale, 2023

work page 2023

[31] [31]

Leslie N. Smith. A disciplined approach to neural network hyper-parameters: Part 1 – learning rate, batch size, momentum, and weight decay, 2018

work page 2018

[32] [32]

Smith and Nicholay Topin

Leslie N. Smith and Nicholay Topin. Super-convergence: Very fast training of neural networks using large learning rates, 2018

work page 2018

[33] [33]

Searching for efficient transformers for language modeling

David So, Wojciech Ma´nke, Hanxiao Liu, Zihang Dai, Noam Shazeer, and Quoc V Le. Searching for efficient transformers for language modeling. In M. Ranzato, A. Beygelzimer, Y . Dauphin, P.S. Liang, and J. Wortman Vaughan, editors,Advances in Neural Information Processing Systems, volume 34, pages 6010–6022. Curran Associates, Inc., 2021

work page 2021

[34] [34]

Roformer: Enhanced transformer with rotary position embedding.Neurocomputing, 568:127063, 2024

Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding.Neurocomputing, 568:127063, 2024

work page 2024

[35] [35]

Nikhil Vyas, Depen Morwani, Rosie Zhao, Itai Shapira, David Brandfonbrener, Lucas Janson, and Sham M. Kakade. SOAP: Improving and stabilizing shampoo using adam for language modeling. InThe Thirteenth International Conference on Learning Representations, 2025. 11

work page 2025

[36] [36]

Fantastic pretraining optimizers and where to find them

Kaiyue Wen, David Leo Wright Hall, Tengyu Ma, and Percy Liang. Fantastic pretraining optimizers and where to find them. InThe Fourteenth International Conference on Learning Representations, 2026

work page 2026

[37] [37]

Prism: Structured optimization via anisotropic spectral shaping, 2026

Yujie Yang. Prism: Structured optimization via anisotropic spectral shaping, 2026

work page 2026

[38] [38]

Large batch optimization for deep learning: Training bert in 76 minutes

Yang You, Jing Li, Sashank Reddi, Jonathan Hseu, Sanjiv Kumar, Srinadh Bhojanapalli, Xiaodan Song, James Demmel, Kurt Keutzer, and Cho-Jui Hsieh. Large batch optimization for deep learning: Training bert in 76 minutes. InInternational Conference on Learning Representations, 2020. 12 Modeld model / Layers / Heads Tokens/Step Total Steps Total Tokens 127M 5...

work page 2020

[39] [39]

As the batch size further increases to 128, the preferred exponent becomes more negative, with p=−0.5 achieving the best validation loss. This trend is consistent with our analysis: negative spectral shaping can improve late-stage optimization by emphasizing flat modes, but overly negative exponents also amplify noise and can degrade performance, especial...

work page

[40] [40]

Other recent extensions further explore richer but still largely fixed or task-specific forms of spectral shaping

places Muon within a family of spectral operators of the form UΣ pV ⊤ and studies how different fixed choices of positive p connect Muon-style updates to momentum and Adam-like normalization. Other recent extensions further explore richer but still largely fixed or task-specific forms of spectral shaping. For example, Spectra [9] argues that LLM training ...

work page