The Spectral Dynamics and Noise Geometry of Muon

Mahmoud Abdelmoneum; Pierfrancesco Beneventano; Tomaso Poggio

arxiv: 2606.08388 · v1 · pith:OMM7TXM4new · submitted 2026-06-07 · 💻 cs.LG · math.OC · stat.ML

Pith reviewed 2026-06-27 18:30 UTC · model grok-4.3

classification 💻 cs.LG · math.OC · stat.ML

keywords Muon optimizer · polar update · singular value dynamics · spectral bias · matrix gradient · entropy maximization · neural network optimization · regression model

The pith

Muon replaces matrix gradients with polar factors to flatten singular spectra while preserving directions, maximizing one-step entropy under alignment assumptions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that Muon's replacement of a gradient matrix by its polar factor creates an optimization bias toward flat singular-value spectra. Under explicit alignment assumptions, this choice maximizes one-step entropy among certain bounded updates that use the gradient's singular directions without adapting to the current weight spectrum. In an underdetermined regression model, the continuous-time dynamics show the normalized spectrum moving toward equal nonzero singular values when a measurement-dependent condition holds. Experiments confirm the flattening effect separate from simple rescaling, and show Muon improving some pretraining tasks but not others depending on the regime. A reader would care because this explains the conditions under which Muon's bias toward keeping many directions active is helpful rather than a universal advantage.

Core claim

Muon replaces a matrix gradient G=UΣV^T by its polar factor UV^T. This keeps the singular directions selected by the gradient but makes the update spectrum flat. Under explicit alignment assumptions, the polar update is the one-step entropy-maximizing choice among bounded updates that use the gradient singular directions and do not adapt to the current weight spectrum. In an underdetermined regression model, exact singular-value dynamics for continuous-time Muon are derived, identifying a measurement-dependent condition under which the normalized spectrum moves toward equal nonzero singular values. This geometry rules out a common low-rank interpretation because, at fixed Frobenius norm, Muon's distinguished state has a flat spectrum, whereas nuclear-norm minimization favors spectral concentration.

What carries the argument

The polar factor UV^T of the gradient G=UΣV^T, which keeps singular directions from the gradient but forces a flat spectrum in the update.
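
A minimal NumPy sketch of the polar step described above, assuming the polar factor is computed from a thin SVD. The function name `polar_factor` and the random stand-in gradient are illustrative and do not come from the paper; practical Muon implementations typically approximate the factor iteratively, which is not shown here.

```python
import numpy as np

def polar_factor(G):
    """Return U @ V^T from the thin SVD G = U diag(s) V^T."""
    U, _, Vt = np.linalg.svd(G, full_matrices=False)
    return U @ Vt

rng = np.random.default_rng(0)
G = rng.standard_normal((4, 3))   # hypothetical stand-in for a matrix gradient
P = polar_factor(G)

s_G = np.linalg.svd(G, compute_uv=False)   # spectrum of the raw gradient
s_P = np.linalg.svd(P, compute_uv=False)   # spectrum of the polar update

# For a full-rank G every singular value of P equals 1 (flat spectrum),
# and the inner product <G, P> equals the nuclear norm of G.
inner = float(np.sum(G * P))
```

The singular vectors of `P` are those of `G`; only the singular values are replaced by ones, which is the sense in which directions are kept while the spectrum is flattened.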

If this is right

In underdetermined regression, the normalized spectrum moves toward equal nonzero singular values under the identified measurement-dependent condition.
At fixed Frobenius norm, Muon's state has a flat spectrum, distinct from nuclear-norm minimization which favors concentration.
Controlled matrix-sensing experiments recover the predicted flattening trend and separate the effect from gradient rescaling.
In small NanoGPT pretraining, Muon preserves stable rank, shows a broad learning-rate plateau, and improves validation loss relative to AdamW.
In a matched small-ViT control, the performance ranking reverses, showing the effect is regime-dependent.
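
The contrast between a flat spectrum and nuclear-norm minimization can be checked on two hand-picked spectra with the same Frobenius norm. This is an illustrative sketch: the helper names and the choice r = 4 are assumptions, and stable rank is taken to be the usual ratio of squared Frobenius norm to squared spectral norm. By Cauchy-Schwarz the nuclear norm is at most sqrt(r) times the Frobenius norm, with equality exactly at the flat spectrum, so the flat state maximizes the nuclear norm on the Frobenius sphere instead of minimizing it.

```python
import numpy as np

def nuclear_norm(s):
    """Sum of singular values."""
    return float(np.sum(s))

def stable_rank(s):
    """||W||_F^2 / ||W||_2^2 written in terms of the singular values s."""
    return float(np.sum(s**2) / np.max(s)**2)

r = 4
flat = np.full(r, 1.0 / np.sqrt(r))             # equal singular values, ||W||_F = 1
concentrated = np.array([1.0, 0.0, 0.0, 0.0])   # rank one, ||W||_F = 1

for name, s in [("flat", flat), ("concentrated", concentrated)]:
    print(name, nuclear_norm(s), stable_rank(s))
```

The flat spectrum has nuclear norm 2 and stable rank 4; the rank-one spectrum has nuclear norm 1 and stable rank 1.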

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The entropy-max property under alignment may imply Muon favors more uniform parameter usage in tasks with diverse or high-dimensional features.
The identified dynamics could be tested in other continuous-time matrix optimizers to see if polar steps produce similar flattening.
Regime dependence suggests checking Muon on problems where maintaining activity across many spectral directions aids generalization.
The distinction from low-rank biases may connect to understanding when flat-spectrum updates help avoid premature concentration in training.

Load-bearing premise

The explicit alignment assumptions between weights and gradients that are required for the entropy-maximization proof.
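
To see why alignment matters, here is a worked special case, offered as a sketch and not as the paper's theorem. It assumes that the weight and the gradient share singular frames, that the descent step W − η·polar(G) equals W + η·UV^T in those frames (so every singular value grows by η), and that entropy means the Shannon entropy of the normalized singular values. The summary above does not state the paper's exact functional, so all three are assumptions.

```latex
% Assumptions (not from the source): W = U diag(\sigma) V^T, step = +\eta U V^T,
% H(p) = -\sum_i p_i \log p_i with p_i = \sigma_i / S and S = \sum_j \sigma_j.
\[
  \sigma_i' = \sigma_i + \eta, \qquad
  p_i' = \frac{\sigma_i + \eta}{S + r\eta}
       = (1 - t)\,p_i + t\,\frac{1}{r}, \qquad
  t = \frac{r\eta}{S + r\eta} \in (0, 1).
\]
% The new normalized spectrum is a convex combination of the old one and the
% uniform distribution, so concavity of H gives
\[
  H(p') \;\ge\; (1 - t)\,H(p) + t \log r \;\ge\; H(p),
\]
% with equality only when p is already uniform.
```

When the frames are misaligned the singular values no longer shift additively, so this argument does not carry over directly; that is the gap the alignment assumptions fill.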

What would settle it

Whether the singular values of weights in continuous-time Muon applied to an underdetermined regression problem flatten toward equality when the measurement-dependent condition holds.
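
One way to probe this numerically is a small discrete-time experiment. The sketch below makes several assumptions that are not taken from the paper: plain polar-step descent without momentum as a stand-in for continuous-time Muon, random Gaussian sensing matrices, a small random initialization, and Shannon entropy of the normalized singular values as the flatness measure. All function names and default parameters are hypothetical.

```python
import numpy as np

def polar_factor(G):
    """U @ V^T from the thin SVD of G."""
    U, _, Vt = np.linalg.svd(G, full_matrices=False)
    return U @ Vt

def spectral_entropy(W):
    """Shannon entropy of the normalized singular values of W."""
    s = np.linalg.svd(W, compute_uv=False)
    p = s / s.sum()
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

def run_muon_sensing(m=6, n=6, num_meas=12, steps=300, lr=1e-2, seed=0):
    """Polar-step descent on an underdetermined matrix-sensing loss."""
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((num_meas, m, n))      # sensing matrices A_k
    W_star = rng.standard_normal((m, n))           # hypothetical ground truth
    y = np.einsum("kij,ij->k", A, W_star)          # y_k = <A_k, W_star>
    W = 1e-3 * rng.standard_normal((m, n))         # small initialization
    history = []
    for _ in range(steps):
        resid = np.einsum("kij,ij->k", A, W) - y
        G = np.einsum("k,kij->ij", resid, A)       # gradient of 0.5 * sum(resid**2)
        W = W - lr * polar_factor(G)               # polar step, no momentum
        history.append(spectral_entropy(W))
    return W, np.array(history)

W_final, entropy_trace = run_muon_sensing()
```

Plotting `entropy_trace` against the step index, and repeating over seeds and sensing ensembles, would show whether the normalized spectrum moves toward the flat value log(min(m, n)) in a given configuration. With 12 measurements and 36 unknowns the problem is underdetermined, matching the regime discussed above.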

Figures

Figures reproduced from arXiv: 2606.08388 by Mahmoud Abdelmoneum, Pierfrancesco Beneventano, Tomaso Poggio.

**Figure 1.** Regime reversal. Muon preserves stable rank in small NanoGPT, while AdamW wins in a matched small-ViT/CIFAR-10 control. This is directional evidence for regime dependence, not a pure modality intervention.

Contributions. We make four contributions. • A one-step spectral-bias theorem. Under explicit alignment assumptions, we prove that the polar profile maximizes first-order spectral-entropy gain among boun…

**Figure 2.** Geometry of the polar regularizer. Left: singular-value trajectories under the projected self-polar flow of Theorem 1; rates $\alpha_i \in [0, 1]$ are determined by the measurement geometry. Right: level sets of $R_{\mathrm{pw}}(\sigma_1, \sigma_2) = -\sum_{i \neq j} \log(\sigma_i + \sigma_j)$ on the Frobenius sphere; the $R_{\mathrm{pw}}$-minimizer coincides with the flat-spectrum point, while the nuclear-norm minimizer lies at a corner. The figure illustrates the variation…
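
The right-panel claim of Figure 2 can be reproduced on a grid for two singular values. This is a sketch under the assumption that the sum over i ≠ j runs over ordered pairs, so for r = 2 the pairwise functional reduces to −2·log(σ1 + σ2); the grid resolution and variable names are illustrative.

```python
import numpy as np

# Parametrize the nonnegative quarter of the unit Frobenius circle.
theta = np.linspace(0.0, np.pi / 2, 1001)
s1, s2 = np.cos(theta), np.sin(theta)

# For r = 2 the sum over ordered pairs i != j has two equal terms.
r_pw = -2.0 * np.log(s1 + s2)
nuclear = s1 + s2

theta_pw = theta[np.argmin(r_pw)]       # minimizer of the pairwise functional
theta_nuc = theta[np.argmin(nuclear)]   # minimizer of the nuclear norm
```

On this grid the pairwise functional is minimized at θ = π/4, where σ1 = σ2, while the nuclear norm is minimized at an endpoint of the arc, where one singular value vanishes. For r = 2 the two objectives are monotone transforms of each other with opposite sign, which makes the opposition explicit.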

**Figure 3.** Nuclear-norm gap. Left: large instance ($p=50$, $d=20$); Muon stabilizes at $\|W_{\mathrm{Muon}}\|_* \approx 20.3$ vs. CVXPY minimum 14.1 (1.44×, non-diminishing). Right: 10-seed family ($p=n=6$, $d=10$); gap ranges 1.29×–2.02× (mean 1.59×). Zero seeds converge to the nuclear-norm minimum.

6 Experiments. We exercise the dynamics of Theorems 1–4 as sanity checks. Full panels appear in Appendix D. Nuclear-norm falsification (matrix sensing…

**Figure 4.** Diagnostic checks of Theorem 4 (square-$G$ regime). (a) $S(\mu)$ formula matches $r(r-1)/(4\sigma_0^2)$ at $r \in \{2, 4, 8, 16\}$ (algebraic verification, not a stochastic-optimizer test). (b) AR(1) momentum factor $(1-\beta)/(1+\beta)$ vs. naive $(1-\beta)^2$ for $\beta = 0.95$. (c) $B_{\mathrm{crit}} = 2\sigma^2 S(\mu)/r$ across spectrum types. The right-panel legend labels (“flat spectrum”, “concentrated spectrum”) are from an earlier convention and may be mislea…
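
The momentum factor in panel (b) of Figure 4 can be recovered from a standard stationary-variance calculation. The sketch assumes momentum is an exponential moving average driven by i.i.d. gradient noise of variance σ²; the caption does not spell out the paper's momentum convention, so this is illustrative only.

```latex
% Assumption (not from the source): m_t = \beta m_{t-1} + (1-\beta) g_t,
% with g_t i.i.d., Var(g_t) = \sigma^2, and m_t stationary.
\[
  \operatorname{Var}(m) = \beta^2 \operatorname{Var}(m) + (1-\beta)^2 \sigma^2
  \;\Longrightarrow\;
  \operatorname{Var}(m) = \frac{(1-\beta)^2}{1-\beta^2}\,\sigma^2
                        = \frac{1-\beta}{1+\beta}\,\sigma^2 .
\]
\[
  \beta = 0.95:\qquad
  \frac{1-\beta}{1+\beta} = \frac{0.05}{1.95} \approx 0.0256,
  \qquad (1-\beta)^2 = 0.0025 .
\]
```

The naive factor ignores the accumulation of past noise terms and is smaller by a factor of 1/(1 − β²), roughly 10 at β = 0.95.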

**Figure 5.** NanoGPT (124M) layerwise spectral profiling, Muon vs. AdamW over 5,000 steps.

**Figure 6.** Sanity check of Theorem 1 dynamics in the general affine matrix-sensing setting (…

**Figure 7.** Frame alignment $\|\sin\Theta\|_F$ during Muon training across configurations ($d \in \{10, 20, 50\}$, 10 seeds each). Misalignment remains $< 0.1$ post-transient.

[Plot panels, recovered labels only: x-axis “Training Step”; panel titles “d = 10 (3065/6000 positive in 2nd half)” and “d = 20 (2964/6000 positive in 2nd half)”; reference “zero line”; remaining text truncated.]

**Figure 8.** Spectral-flattening sign quantity $\sum_i \alpha_i q_i$ tracked over training. Negative on average during the early flattening phase, then fluctuates near zero with magnitude $\sim 10^{-6}$ once the spectrum is near-flat (where $q_i \to 0$); per-seed positive-fraction counts in the second half of training are roughly 51%, 49%, 45%, consistent with the criterion governing the approach to the flat spectrum rather than the asymptotic …

**Figure 9.** ATSR phase diagram. Left: coupled decay; ATSR explodes at $\lambda \ge 10^{-2}$. Right: decoupled decay; ATSR nearly constant for $\lambda \le 10^{-2}$. Phase boundary at $\lambda \approx 10^{-2}$.

[Plot panel (a), recovered labels only: title “ATSR vs. Weight Decay: Corollary 3”; x-axis “Weight decay”; y-axis “ATSR (lower = better concurrent acquisition)”; annotations “= 0.01 phase boundary”, “7.54× (coupled)”, “2.28× (decoupled)”; legend “Coupled WD (destructive)”, “Decoupled WD (benign)”, “Baselin…”]

**Figure 10.** Consolidated empirical validation. (a) ATSR vs. $\lambda$ for coupled and decoupled decay. (b) $S(\mu)$ formula verification across $r \in \{2, 4, 8, 16\}$.

**Figure 11.** Diagnostics panel: combined view of $\|\sin\Theta\|_F$, $\sum_i \alpha_i q_i$, and per-step singular-value increments across training, supporting the assumptions of Theorems 1–3. Placeholder: figures/fig delta P.pdf not yet generated; see CHANGES.md. Expected content: $\delta P(t)$ trajectories across 10 seeds

**Figure 12.** Direct measurement of the gauge-invariant polar misalignment

Original abstract

Muon replaces a matrix gradient $G=U\Sigma V^\top$ by its polar factor $UV^\top$. This keeps the singular directions selected by the gradient, but makes the update spectrum flat. We study the optimization bias created by this operation. Under explicit alignment assumptions, we prove that the polar update is the one-step entropy-maximizing choice among bounded updates that use the gradient singular directions and do not adapt to the current weight spectrum. In an underdetermined regression model, we derive exact singular-value dynamics for continuous-time Muon and identify a measurement-dependent condition under which the normalized spectrum moves toward equal nonzero singular values. This geometry also rules out a common low-rank interpretation: at fixed Frobenius norm, Muon's distinguished state has a flat spectrum, whereas nuclear-norm minimization favors spectral concentration. Controlled matrix-sensing experiments separate the effect from simple gradient rescaling, show that norm-matched gradient descent does not reproduce Muon, and recover the predicted flattening trend across broad ablations. In small NanoGPT pretraining, Muon preserves stable rank, has a broad learning-rate plateau, and improves validation loss relative to AdamW; in a matched small-ViT control, the ranking reverses. The resulting picture is regime-dependent: Muon is not universally superior, but its flat-spectrum bias can help when many spectral directions need to remain active.
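
The distinction between the polar step and simple gradient rescaling can be illustrated directly. The sketch below assumes that "norm-matched" means rescaling the gradient to the Frobenius norm of the polar update; the abstract does not say which norm is matched, and the random matrix and variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
G = rng.standard_normal((5, 5))        # hypothetical stand-in for a matrix gradient
U, s, Vt = np.linalg.svd(G)

polar = U @ Vt                          # polar update: all singular values equal 1
# Rescaled gradient with the same Frobenius norm as the polar update.
matched = G * (np.linalg.norm(polar) / np.linalg.norm(G))

s_polar = np.linalg.svd(polar, compute_uv=False)
s_matched = np.linalg.svd(matched, compute_uv=False)

# Ratio of largest to smallest singular value of each update.
spread_polar = s_polar.max() / s_polar.min()
spread_matched = s_matched.max() / s_matched.min()
```

Rescaling multiplies every singular value by the same constant, so `spread_matched` equals the spread of `G` itself, while `spread_polar` is 1. Matching the norm therefore cannot reproduce the flat update spectrum.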

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Muon gets exact singular-value dynamics in regression plus an entropy-max proof under alignment assumptions, but those assumptions lack quantitative bounds and the gains stay regime-specific.

The main points are the derivation of continuous-time singular-value dynamics for Muon in an underdetermined regression model, the measurement-dependent flattening condition, and the proof that the polar update maximizes entropy among bounded updates that follow gradient singular directions without adapting to the current weight spectrum.

The dynamics derivation and the flattening condition are new relative to earlier Muon papers. The entropy-max result under the stated assumptions is a clean geometric claim. The matrix-sensing experiments do separate the polar operation from simple gradient rescaling and show that norm-matched gradient descent does not reproduce the same behavior. The NanoGPT and small-ViT runs are honest about the outcome depending on the architecture.

The soft spots sit mainly with the alignment assumptions required for the entropy claim. No bound is given on how much misalignment between gradient and weight singular vectors can be tolerated before the one-step optimality fails, so the practical reach of that result is unclear. The regression model ties the flattening condition to quantities defined inside the model, which raises a moderate circularity issue even if the authors label it a prediction. The pretraining controls are small-scale, so the stable-rank and loss observations need checking at larger sizes.

This paper is for researchers working on optimizer geometry or spectral bias in matrix updates. Readers who want a mechanistic account of why flat-spectrum updates can preserve rank in some settings will find the geometry and the controlled ablations useful. It has enough formal content and targeted experiments to deserve a serious referee.

Referee Report

3 major / 2 minor

Summary. The paper studies Muon, which replaces a matrix gradient G = U Σ V^T by its polar factor UV^T. Under explicit alignment assumptions, it proves this is the one-step entropy-maximizing choice among bounded updates that use the gradient's singular directions without adapting to the current weight spectrum. In an underdetermined regression model it derives exact continuous-time singular-value dynamics and a measurement-dependent condition for the normalized spectrum to flatten toward equal nonzero singular values. Experiments in matrix sensing separate the effect from gradient rescaling, while small NanoGPT and ViT pretraining runs show regime-dependent benefits for stable rank and validation loss.

Significance. If the central claims hold, the work supplies a geometric account of Muon's flat-spectrum bias that distinguishes it from both simple rescaling and nuclear-norm minimization, together with falsifiable predictions for spectral dynamics. The controlled matrix-sensing ablations and the explicit one-step optimality result under stated assumptions are strengths that add to the manuscript's contribution to understanding optimization geometry in deep learning.

major comments (3)

[Abstract] Abstract and the entropy-maximization statement: the one-step optimality result is conditioned on 'explicit alignment assumptions' whose necessity, quantitative scope (e.g., allowable misalignment angle between gradient and weight singular vectors), and verification in the experimental regimes are not supplied; without such bounds the claimed justification for the polar update does not apply when the assumptions fail even moderately.
[§4] §4 (underdetermined regression model): the measurement-dependent condition for spectral flattening is derived within the same model used to define the dynamics; this creates a circularity risk for the prediction that the normalized spectrum moves toward equal nonzero singular values, as the condition may reduce to quantities already fixed by the regression setup.
[Matrix-sensing experiments] Matrix-sensing experiments (controlled ablations): while they separate Muon from norm-matched gradient descent, the reported flattening trend is shown only for the specific sensing matrices and noise levels chosen; no quantitative check is given that the alignment assumptions required by the theory hold in these runs, weakening the link between the proof and the observed geometry.
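
The quantitative alignment check requested in these comments could take the following form. This is a sketch of one possible diagnostic and not the authors' measurement: it compares only the top-r left singular subspaces of the weight and the gradient through their principal angles, and the function name and interface are hypothetical. The paper's own ∥sin Θ∥_F and its gauge-invariant polar misalignment may be defined differently.

```python
import numpy as np

def sin_theta_fro(W, G, r):
    """Frobenius norm of sin(principal angles) between top-r left singular subspaces."""
    Uw = np.linalg.svd(W, full_matrices=False)[0][:, :r]
    Ug = np.linalg.svd(G, full_matrices=False)[0][:, :r]
    # Singular values of Uw^T Ug are the cosines of the principal angles.
    cosines = np.clip(np.linalg.svd(Uw.T @ Ug, compute_uv=False), 0.0, 1.0)
    return float(np.sqrt(np.sum(1.0 - cosines**2)))

rng = np.random.default_rng(2)
W = rng.standard_normal((6, 4))
aligned = sin_theta_fro(W, -2.0 * W, r=4)     # same column space as W

E1 = np.zeros((4, 4)); E1[0, 0] = 1.0         # top left singular vector e_1
E2 = np.zeros((4, 4)); E2[1, 1] = 1.0         # top left singular vector e_2
orthogonal = sin_theta_fro(E1, E2, r=1)
```

The aligned pair gives a value near 0 and the orthogonal pair gives 1. Logging such a quantity along training runs would show how far the experiments sit from exact alignment.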

minor comments (2)

Notation for the polar factor UV^T and the singular-value dynamics should be introduced with an explicit equation number on first use to improve traceability.
The NanoGPT and ViT results would benefit from reporting the precise hyperparameter ranges explored for the learning-rate plateau claim.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the careful reading and constructive feedback. We address each major comment below and will revise the manuscript accordingly to strengthen the presentation of the alignment assumptions and their empirical verification.

Referee: [Abstract] Abstract and the entropy-maximization statement: the one-step optimality result is conditioned on 'explicit alignment assumptions' whose necessity, quantitative scope (e.g., allowable misalignment angle between gradient and weight singular vectors), and verification in the experimental regimes are not supplied; without such bounds the claimed justification for the polar update does not apply when the assumptions fail even moderately.

Authors: We agree that the alignment assumptions require explicit quantification and verification. In the revision we will add a dedicated subsection deriving the necessity of the assumptions together with quantitative bounds on the allowable misalignment angle (in terms of the angle between the singular vectors of the gradient and the current weights) under which the one-step entropy-maximization result continues to hold. We will also report empirical measurements of these angles in both the matrix-sensing and pretraining experiments to confirm that the assumptions are satisfied in the reported regimes. revision: yes
Referee: [§4] §4 (underdetermined regression model): the measurement-dependent condition for spectral flattening is derived within the same model used to define the dynamics; this creates a circularity risk for the prediction that the normalized spectrum moves toward equal nonzero singular values, as the condition may reduce to quantities already fixed by the regression setup.

Authors: The continuous-time dynamics are derived exactly from the model, but the flattening condition is expressed solely in terms of the fixed measurement matrix and noise level; it is therefore independent of the evolving singular-value trajectory and yields a falsifiable prediction for any given sensing matrix. Nevertheless, to remove any appearance of circularity we will revise §4 to separate the model definition from the derived condition more explicitly and add a short remark on how the condition can be checked directly from the measurements. revision: partial
Referee: [Matrix-sensing experiments] Matrix-sensing experiments (controlled ablations): while they separate Muon from norm-matched gradient descent, the reported flattening trend is shown only for the specific sensing matrices and noise levels chosen; no quantitative check is given that the alignment assumptions required by the theory hold in these runs, weakening the link between the proof and the observed geometry.

Authors: We acknowledge the gap. In the revised manuscript we will include quantitative checks (singular-vector angles between gradient and weights) for the matrix-sensing runs, confirming that the alignment assumptions hold under the chosen sensing matrices and noise levels. This will directly tie the observed flattening to the theoretical result. revision: yes

Circularity Check

0 steps flagged

No circularity: derivations are self-contained mathematical proofs and model-specific dynamics.

full rationale

The paper states a conditional proof under explicit alignment assumptions for the entropy-maximization property and separately derives singular-value dynamics inside an underdetermined regression model, identifying a measurement-dependent flattening condition. Neither step reduces a claimed prediction to a fitted input by construction, nor relies on self-citation load-bearing, ansatz smuggling, or renaming. The alignment assumptions are external prerequisites for the one-step result rather than outputs of the same equations; the regression-model condition is derived from the model's own measurement process without circular re-use of the target flattening as an input. The overall chain therefore remains independent of its conclusions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Only the abstract is available, so the ledger is necessarily incomplete; the key unstated premise is the set of explicit alignment assumptions used for the optimality proof.

axioms (1)

domain assumption Explicit alignment assumptions between gradient singular directions and weights
Invoked to prove the polar update is the one-step entropy-maximizing choice.

A Scope Comparison with Concurrent Work

Table 2: Scope comparison: this paper versus concurrent work on Muon and Muon-like optimizers. Referenced from Section 3.

| Work | Setting | Main theorem | Relation |
| --- | --- | --- | --- |
| […] | Classification | Max-margin implicit bias | Complementary |
| […] | Classification | Max-margin (homogeneous) | Complementary |
| […] | LoRA (reg.) | Uniform spectral growth | Consistent w/ Thm. 1 |
| […] | Bilinear cls. | Equal-rate PC learning | Consistent w/ Thm. 1(iv) |
| […] | LLM pretraining | Practical scaling | Empirical motivation |
| […] | Polar-map SGD | Convergence-rate $B_{\mathrm{crit}}$ | Complementary to Thm. 4 |
| This paper | Regression (MSE) | Cond. spectral dyn. on M | - |

B Proofs of Main Results

B.1 Full Proof of Theorem 1: Spectral Dynamics

We prove the four parts in order. The pairwise spectral functional is

$$R_{\mathrm{pw}}(W) = -\sum_{i \neq j} \log\bigl(\sigma_i(W) + \sigma_j(W)\bigr). \tag{12}$$

Setup. $W \in \mathbb{R}^{m \times n}$ has SVD $W = U \Sigma V^\top$, $\Sigma = \mathrm{diag}(\sigma_1, \dots, \sigma_r$...

[1] Mahmoud Abdelmoneum, Pierfrancesco Beneventano, and Tomaso Poggio. pAI/MSc: ML theory research with humans on the loop. 2026. arXiv:2604.20622 [cs.AI]; DOI: https://doi.org/10.48550/arXiv.2604.20622

[2] Keller Jordan, Yuchen Jin, Vlado Boza, You Jiacheng, Franz Cesista, Laker Newhouse, and Jeremy Bernstein. Muon: An optimizer for hidden layers in neural networks. Blog post, https://kellerjordan.github.io/posts/muon/, 2024. Reference implementation: https://github.com/KellerJordan/Muon; used in the modded-nanogpt speedrun benchmark https://github.com/Kel...

[3] Jingyuan Liu, Jianlin Su, Xingcheng Yao, Zhejun Jiang, Guokun Lai, Yulun Du, et al. Muon is scalable for LLM training, 2025.

[4] Lisa Messeri and Molly J. Crockett. Artificial intelligence and illusions of understanding in scientific research. Nature, 627(8002):49–58, 2024.

[5] Pierfrancesco Beneventano, Riccardo Neumarker, Theodoros Evgeniou, Marc Gong Bacvanski, Kushagra Tiwary, Emanuele Rimoldi, Mehdi Hajoub, Yulu Gan, Qianli Liao, Mahmoud Abdelmoneum, et al. Agent systems for academic research automation. In ICML 2026 AI for Science Workshop, 2026.

[6] Daniel Soudry, Elad Hoffer, Mor Shpigel Nacson, Suriya Gunasekar, and Nathan Srebro. The implicit bias of gradient descent on separable data. Journal of Machine Learning Research, 19(70):1–57, 2018.

[7] Suriya Gunasekar, Blake Woodworth, Srinadh Bhojanapalli, Behnam Neyshabur, and Nathan Srebro. Implicit regularization in matrix factorization. In Advances in Neural Information Processing Systems (NeurIPS), 2017.

[8] Fan Wu and Patrick Rebeschini. Implicit regularization in matrix sensing via mirror descent. In Advances in Neural Information Processing Systems (NeurIPS), 2021.

[9] Suriya Gunasekar, Jason Lee, Daniel Soudry, and Nathan Srebro. Characterizing implicit bias in terms of optimization geometry. In Proceedings of the 35th International Conference on Machine Learning (ICML), 2018.

[10] Jeremy Bernstein and Laker Newhouse. Old optimizer, new norm: An anthology, 2024.

[11] Mark Tuddenham, Adam Prügel-Bennett, and Jonathan Hare. Orthogonalising gradients to speed up neural network optimisation, 2022. arXiv preprint.

[12] Jiaxiang Li and Mingyi Hong. A note on the convergence of Muon, 2025. arXiv preprint.

[13] Dmitry Kovalev. Understanding gradient orthogonalization for deep learning via non-Euclidean trust-region optimization, 2025. arXiv preprint.

[14] Lizhang Chen, Jonathan Li, and Qiang Liu. Muon optimizes under spectral norm constraints, 2025.

[15] Nicholas J. Higham. Functions of Matrices: Theory and Computation. Society for Industrial and Applied Mathematics, Philadelphia, PA, 2008.

[16] Noah Amsel, David Persson, Christopher Musco, and Robert M. Gower. The polar express: Optimal matrix sign methods and their application to the Muon algorithm, 2025.

[17] Tim Tsz-Kit Lau, Qi Long, and Weijie Su. PolarGrad: A class of matrix-gradient optimizers from a unifying preconditioning perspective, 2025.

[18] Damek Davis and Dmitriy Drusvyatskiy. When do spectral gradient updates help in deep learning?, 2025.

[19] Wei Shen, Ruichuan Huang, Minhui Huang, Cong Shen, and Jiawei Zhang. On the convergence analysis of Muon, 2025.

[20] David Yunis, Kumar Kshitij Patel, Samuel Wheeler, Pedro Savarese, Gal Vardi, Karen Livescu, Michael Maire, and Matthew R. Walter. Approaching deep learning through the spectral dynamics of weights, 2024.

[21] Brian Richard Olsen, Sam Fatehmanesh, Frank Xiao, Adarsh Kumarappan, and Anirudh Gajula. From SGD to spectra: A theory of neural network weight dynamics, 2025.

[22] Chen Fan, Mark Schmidt, and Christos Thrampoulidis. Implicit bias of spectral descent and Muon on multiclass separable data, 2025.

[23] Eitan Gronich and Gal Vardi. The implicit bias of Adam and Muon on smooth homogeneous neural networks, 2026.

[24] Changmin Kang, Jihun Yun, Baekrok Shin, Yeseul Cho, and Chulhee Yun. Uniform spectral growth and convergence of Muon in LoRA-style matrix factorization, 2026.

[25] Bhavya Vasudeva, Puneesh Deora, Yize Zhao, Vatsal Sharan, and Christos Thrampoulidis. How Muon’s spectral design benefits generalization: A study on imbalanced data, 2025.

[26] Naoki Sato, Hiroki Naganuma, and Hideaki Iiduka. Convergence bound and critical batch size of Muon optimizer, 2025.

[27] Chongjie Si, Debing Zhang, and Wei Shen. AdaMuon: Adaptive Muon optimizer, 2025. arXiv preprint.

[28] Yuxuan Lou and Yang You. OrScale: Orthogonalised optimization with layer-wise trust-ratio scaling, 2026. arXiv preprint.

[29] Jueun Kim, Baekrok Shin, Jihun Yun, Beomhan Baek, Minhak Song, and Chulhee Yun. AMUSE: Anytime Muon with stable gradient evaluation, 2026. arXiv preprint.

[30] Peng Cheng, Jiucheng Zang, Qingnan Li, Liheng Ma, Yufei Cui, Yingxue Zhang, Boxing Chen, Ming Jian, and Wen Tong. TrasMuon: Trust-region adaptive scaling for orthogonalized momentum optimizers, 2026. arXiv preprint.

[31] Clarissa Lauditi, Cengiz Pehlevan, and Blake Bordelon. Spectral dynamics in deep networks: Feature learning, outlier escape, and learning rate transfer. 2026. arXiv preprint 2605.07870. DOI 10.48550/arXiv.2605.07870

[32] Essential AI, Ishaan Shah, Anthony M. Polloreno, Karl Stratos, Philip Monk, Adarsh Chaluvaraju, Andrew Hojel, Andrew Ma, Anil Thomas, Ashish Tanwer, Darsh J. Shah, Khoi Nguyen, Kurt Smith, Michael Callahan, Michael Pust, Mohit Parmar, Peter Rushton, Platon Mazarakis, Ritvik Kapila, Saurabh Srivastava, Somanshu Singla, Tim Romanski, Yash Vanjani, and Ashis... 2025.

[33] Sam McCandlish, Jared Kaplan, Dario Amodei, and OpenAI Dota Team. An empirical model of large-batch training, 2018.

[34] Samuel L. Smith, Pieter-Jan Kindermans, Chris Ying, and Quoc V. Le. Don’t decay the learning rate, increase the batch size. In Proceedings of the International Conference on Learning Representations (ICLR), 2018.

[35] Hanlin Zhang, Depen Morwani, Nikhil Vyas, Jingfeng Wu, Difan Zou, Udaya Ghai, Dean Foster, and Sham Kakade. How does critical batch size scale in pre-training?, 2024.

[36] William Merrill, Shane Arora, Dirk Groeneveld, and Hannaneh Hajishirzi. Critical batch size revisited: A simple empirical approach to large-batch language model training, 2025.

[37] Zhehang Du and Weijie Su. The Newton-Muon optimizer, 2026. arXiv preprint.

[38] Ziyue Liu, Ruijie Zhang, Zhengyang Wang, Yequan Zhao, Yupeng Su, Zi Yang, and Zheng Zhang. Muon 2: Boosting Muon via adaptive second-moment preconditioning, 2026. arXiv preprint.

[39] Tien-Phat Nguyen, Truong Nguyen, Minh-Phuc Truong, Tuc Nguyen, James Bailey, and Trung Le. Spectral flattening is all Muon needs: How orthogonalization controls learning rate and convergence, 2026. arXiv preprint.

[40] Zakhar Shumaylov, Nathaël Da Costa, Peter Zaika, Bálint Mucsányi, Alex Massucco, Yoav Gelberg, Carola-Bibiane Schönlieb, Yarin Gal, and Philipp Hennig. Muon is not that special: Random or inverted spectra work just as well, 2026. arXiv preprint.

[41] Tetiana Parshakova, Ahmed Khaled, Michael Crawshaw, Guillaume Garrigos, and Robert M. Gower. Muon does not converge on convex Lipschitz functions, 2026. arXiv preprint.

[42] Chongyu Fan, Gaowen Liu, Mingyi Hong, Ramana Rao Kompella, and Sijia Liu. Rethinking Muon beyond pretraining: Spectral failures and high-pass remedies for VLA and RLVR, 2026. arXiv preprint.

[43] Ben S. Southworth, Shuai Jiang, Daniel McBride, Eric C. Cyr, and Stephen Thomas. Muon in vision transformers: Optimizer-recipe interactions and gradient spectra, 2026. arXiv preprint.

[44] Jun Yan, Weiquan Huang, Jiankai Zuo, Yujian Mo, Xi Fang, Chengliang Wu, and Zeming Wei. When Muon optimizer meets adversarial training: A theoretical and empirical study, 2026. arXiv preprint.

[45] Jihwan Kim and Chenglin Fan. DP-Muon: Differentially private optimization via matrix-orthogonalized momentum, 2026. arXiv preprint.

[46] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In Proceedings of the International Conference on Learning Representations (ICLR), 2019.

[47] Roy Mathias. Perturbation bounds for the polar decomposition. SIAM Journal on Matrix Analysis and Applications, 14(2):588–597, 1993.

[48] G. W. Stewart and Ji-Guang Sun. Matrix Perturbation Theory. Academic Press, 1990.

A Scope Comparison with Concurrent Work

Table 2: Scope comparison: this paper versus concurrent work on Muon and Muon-like optimizers. Referenced from Section 3.

Work | Setting | Main theorem | Relation
[…] | Classification | Max-margin implicit bias | Complementary
[…] | Classification | Max-margin (homogeneous) | Complementary
[…] | LoRA (reg.) | Uniform spectral growth | Consistent w/ Thm. 1
[…] | Bilinear cls. | Equal-rate PC learning | Consistent w/ Thm. 1(iv)
[…] | LLM pretraining | Practical scaling | Empirical motivation
[…] | Polar-map SGD | Convergence-rate B_crit | Complementary to Thm. 4
This paper | Regression (MSE) | Cond. spectral dyn. on M | -

B Proofs of Main Results

B.1 Full Proof of Theorem 1: Spectral Dynamics

We prove the four parts in order. The pairwise spectral functional is

\[
R_{\mathrm{pw}}(W) = -\sum_{i \neq j} \log\bigl(\sigma_i(W) + \sigma_j(W)\bigr). \tag{12}
\]

Setup. W ∈ R^{m×n} has SVD W = UΣV^⊤, Σ = diag(σ_1, …, σ_r...
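As a rough numerical illustration of the pairwise spectral functional in Eq. (12), the sketch below evaluates it from the singular values of a matrix. It assumes the logarithm applies to the sum σ_i(W) + σ_j(W) and that the sum runs over ordered pairs i ≠ j of nonzero singular values. The function name `pairwise_spectral_functional` and the tolerance `tol` are hypothetical choices for this example and do not come from the paper.

```python
import numpy as np


def pairwise_spectral_functional(W, tol=1e-12):
    """Evaluate -sum_{i != j} log(sigma_i + sigma_j) over nonzero singular values.

    Assumption: singular values at or below `tol` are treated as zero and dropped,
    so the sum runs over the numerical rank of W.
    """
    sigma = np.linalg.svd(W, compute_uv=False)
    sigma = sigma[sigma > tol]
    # Matrix of all pairwise sums sigma_i + sigma_j.
    pair_sums = sigma[:, None] + sigma[None, :]
    # Keep only the off-diagonal entries, i.e. ordered pairs with i != j.
    off_diag = ~np.eye(len(sigma), dtype=bool)
    return -np.sum(np.log(pair_sums[off_diag]))


# Two 2x2 matrices with the same Frobenius norm (equal to 1).
W_flat = np.eye(2) / np.sqrt(2.0)                  # equal singular values
W_spiky = np.diag([0.9, np.sqrt(1.0 - 0.9 ** 2)])  # unequal singular values

print(pairwise_spectral_functional(W_flat))
print(pairwise_spectral_functional(W_spiky))
```

In this two-dimensional case the functional is smaller for the flat spectrum than for the unequal one at the same Frobenius norm, which matches the direction of the flattening bias described in the paper's summary.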