pith. machine review for the scientific record.

arxiv: 2605.13079 · v1 · submitted 2026-05-13 · 💻 cs.LG · cs.AI

Recognition: no theorem link

Spectral Flattening Is All Muon Needs: How Orthogonalization Controls Learning Rate and Convergence


Pith reviewed 2026-05-14 19:50 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords Muon · spectral flattening · orthogonalization · Newton-Schulz · learning rate · momentum · gradient descent · convergence

The pith

Muon orthogonalizes its momentum buffer to flatten the gradient spectrum, allowing stable learning rates scaled to the average singular value rather than the largest.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Muon performs Newton-Schulz orthogonalization on the momentum buffer before each parameter update, setting all singular values to one. This spectral flattening changes the stability bound for the learning rate from the maximum singular value of the gradient, which constrains standard SGD, to the average singular value. The paper also shows that the same flattening acts as a preconditioner that improves the convergence factor under a Kronecker-factored curvature assumption, with the gain depending on the gradient covariance spectrum. Experiments demonstrate that Muon remains stable at step sizes where SGD diverges immediately and attains accuracy targets several epochs sooner even when using identical learning rates. The results supply a geometric reason for the observed advantages of this orthogonalization step.

Core claim

By replacing the singular values of the momentum buffer with ones via Newton-Schulz iterations, Muon flattens the spectrum so that its maximal stable step size scales with the average singular value of the gradient instead of the largest. Recast as a preconditioned method, Muon improves the effective convergence factor in proportion to the spread in the spectrum of the gradient covariance under a Kronecker-factored model of curvature.

What carries the argument

Newton-Schulz orthogonalization of the momentum buffer that replaces every singular value with one, flattening the spectrum for the update direction.
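As a concrete illustration (not the paper's code), the flattening step can be sketched with the classical cubic Newton-Schulz iteration; production Muon implementations typically use a tuned quintic polynomial, but the effect on the spectrum is the same:

```python
import numpy as np

def newton_schulz(G, steps=10):
    """Drive all singular values of G toward 1 (spectral flattening).

    Cubic iteration X <- 1.5*X - 0.5*X @ X.T @ X. Because it is a
    polynomial in X X^T applied to X, it rescales each singular value
    independently while leaving the singular vectors untouched.
    """
    X = G / np.linalg.norm(G)  # Frobenius normalization: all singular values now in (0, 1]
    for _ in range(steps):
        X = 1.5 * X - 0.5 * X @ X.T @ X
    return X

# Demo on a matrix with a spread-out spectrum (condition number 15).
rng = np.random.default_rng(0)
U, _ = np.linalg.qr(rng.normal(size=(4, 4)))
V, _ = np.linalg.qr(rng.normal(size=(4, 4)))
G = U @ np.diag([3.0, 1.0, 0.5, 0.2]) @ V.T
s = np.linalg.svd(newton_schulz(G, steps=12), compute_uv=False)
# s is now close to [1, 1, 1, 1]: the spectrum has been flattened.
```

No SVD is computed inside the loop; the matrix polynomial flattens every singular value in (0, 1] toward 1 implicitly, which is why the step is cheap enough to run before each update.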

If this is right

  • Muon tolerates learning rates proportional to the average gradient singular value without divergence.
  • The convergence speed-up scales with the non-uniformity of the gradient covariance spectrum.
  • Training reaches target accuracy in fewer epochs at the same step size compared to SGD.
  • Stability holds even when the loss landscape would cause standard gradient descent to diverge quickly.
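The stability bullets can be checked on a toy problem (illustrative, not from the paper): matrix least squares L(W) = 0.5·||XW − Y||²_F, where plain gradient descent is stable only for η < 2/λmax(XᵀX), while an orthogonalized update has unit spectral norm and so keeps every step bounded.

```python
import numpy as np

# Hypothetical toy (not from the paper): matrix least squares
# L(W) = 0.5 * ||X W - Y||_F^2. SGD is stable only for
# eta < 2 / lambda_max(X^T X); a Muon-style orthogonalized update
# has unit spectral norm, so its steps stay bounded regardless.
rng = np.random.default_rng(0)
U, _ = np.linalg.qr(rng.normal(size=(2, 2)))
V, _ = np.linalg.qr(rng.normal(size=(2, 2)))
X = U @ np.diag([10.0, 1.0]) @ V.T       # lambda_max(X^T X) = 100
W_star = rng.normal(size=(2, 3))
Y = X @ W_star

def loss(W):
    return 0.5 * np.sum((X @ W - Y) ** 2)

def grad(W):
    return X.T @ (X @ W - Y)

def orthogonalize(G):
    # Exact polar factor via SVD stands in for Newton-Schulz here.
    Us, _, Vt = np.linalg.svd(G, full_matrices=False)
    return Us @ Vt

eta = 0.5                                 # 25x above SGD's 2/lambda_max = 0.02
W_sgd, W_orth = np.zeros((2, 3)), np.zeros((2, 3))
for _ in range(30):
    W_sgd = W_sgd - eta * grad(W_sgd)
    W_orth = W_orth - eta * orthogonalize(grad(W_orth))

sgd_loss, orth_loss = loss(W_sgd), loss(W_orth)
# sgd_loss has exploded: each step multiplies the dominant error mode
# by |1 - eta * 100| = 49. orth_loss remains bounded, since every
# orthogonalized step has Frobenius norm at most eta * sqrt(2).
```

This only demonstrates boundedness at a step size where SGD diverges; it does not reproduce the paper's precise scaling with the average singular value.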

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar spectral flattening could be applied to other momentum-based methods to increase their stable learning-rate range.
  • Testing the method on architectures where the Kronecker-factored assumption fails would reveal whether the convergence benefit generalizes.
  • The geometric view suggests combining Muon with curvature-aware preconditioners for additional gains.
  • Extensions to non-convex settings may require verifying that the average singular value still governs stability.

Load-bearing premise

The claimed improvement to the effective convergence factor relies on modeling the loss curvature with a Kronecker-factored approximation.

What would settle it

Observing divergence of Muon at a learning rate set to the average singular value of the gradient, particularly when the singular values vary widely, would contradict the stability result.

Figures

Figures reproduced from arXiv: 2605.13079 by James Bailey, Minh-Phuc Truong, Tien-Phat Nguyen, Trung Le, Truong Nguyen, Tuc Nguyen.

Figure 1. Training loss curves under different learning rates. At higher learning rates, SGD diverges.
Figure 2. Layer-wise values of λmax(X⊤X) for trained CifarNet models with and without Batch Normalization. Batch Normalization substantially reduces the spectral scale of layer inputs, especially in deeper layers.
Figure 3. Training loss and validation accuracy of CifarNet with and without FrobNorm under Muon.
Figure 4. Training and Validation Accuracy vs. Epoch. Muon accelerates the learning process.
Figure 5. Training Loss vs. Step. Muon exhibits a consistently steeper loss descent compared to SGD.
Figure 6. Empirical Convergence Ratio (rt). A smaller rt implies a faster linear convergence rate. Muon maintains a consistently lower rt throughout training, supporting the preconditioned convergence improvement in Theorem 5.
Figure 7. At a high learning rate, SGD shows a rapid increase in both quantities, indicating instability.
Figure 7. Early training dynamics (first 50 steps) of Gradient and Parameter norms.
Figure 8. Epochs required to reach specific validation accuracy thresholds. Lower bars indicate faster convergence.
Figure 9. Training behavior of a GPT-2-style Transformer trained with FrobNorm at a high learning rate.
Figure 10. Training and Validation Accuracy vs. Epoch with best learning rates (Muon at 0.1, SGD at 0.01).
Figure 11. Training Loss vs. Step with best learning rates (Muon at 0.1, SGD at 0.01).
Figure 12. Epochs to reach validation accuracy thresholds with best learning rates (Muon at 0.1, SGD at 0.01).
Figure 13. Empirical Convergence Ratio (rt) with best learning rates (Muon at 0.1, SGD at 0.01). Muon maintains a consistently lower rt throughout training.
Original abstract

Muon orthogonalizes the momentum buffer before each update, replacing its singular values with ones via Newton-Schulz iterations. This simple change lets Muon tolerate far larger learning rates and converge faster than other optimizers, but why? We show that the mechanism is spectral flattening, and develop two results around it. First, we prove that Muon's maximal stable step size scales with the average singular value of the gradient rather than the largest, which bottlenecks standard gradient descent. Second, we recast Muon as a preconditioned gradient method and show, under a Kronecker-factored curvature model, that it improves the effective convergence factor, with the improvement controlled by the spectrum of the gradient covariance. Extensive experiments validate both results: Muon remains stable at learning rates that cause SGD to diverge within the first few iterations, and reaches accuracy milestones several epochs earlier even at identical step sizes. Taken together, our results offer a principled, geometric explanation for Muon's empirical success.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that Muon's orthogonalization of the momentum buffer via Newton-Schulz iterations achieves spectral flattening, allowing the maximal stable step size to scale with the average singular value of the gradient (rather than the largest, which limits standard gradient descent). It further recasts Muon as a preconditioned method and shows, under a Kronecker-factored curvature model, an improved effective convergence factor controlled by the spectrum of the gradient covariance. Experiments demonstrate stability at learning rates where SGD diverges and faster convergence to accuracy milestones at matched step sizes.

Significance. If the central derivations hold, the work supplies a geometric account of Muon's empirical advantages, linking orthogonalization directly to step-size stability and convergence rates. This could guide principled extensions of spectral preconditioning in deep-learning optimizers. The explicit proofs and controlled experiments are strengths, though the Kronecker assumption and approximation quality limit immediate generality.

major comments (2)
  1. [stability proof] Stability result (first main theorem): the proof that the maximal step size scales with the average singular value assumes exact unit singular values post-orthogonalization. The implementation uses a fixed number of Newton-Schulz iterations, which leave residual approximation error for ill-conditioned inputs; this error can restore dependence on the largest singular value, undermining the claimed scaling.
  2. [convergence analysis] Convergence-factor improvement (second main result): the derivation is performed under an explicit Kronecker-factored curvature model for the loss landscape. If this model is chosen or fitted to reproduce observed behavior, the reported improvement reduces in part to the modeling assumption rather than an independent geometric consequence of spectral flattening.
minor comments (2)
  1. [experiments] The manuscript should report the exact number of Newton-Schulz iterations used in all experiments together with measured residual norms on the singular values.
  2. [figures] Error bars or multiple random seeds are not described for the stability and convergence plots; their inclusion would strengthen the empirical claims.
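The residual measurement requested in minor comment 1 is straightforward to script. A sketch, with the cubic Newton-Schulz iteration standing in for whatever polynomial and iteration count the paper actually uses:

```python
import numpy as np

def ns_residual(G, steps):
    """Max deviation of the singular values from 1 after `steps`
    cubic Newton-Schulz iterations (the paper's exact polynomial
    may differ -- this is a measurement sketch, not its code)."""
    X = G / np.linalg.norm(G)
    for _ in range(steps):
        X = 1.5 * X - 0.5 * X @ X.T @ X
    return float(np.max(np.abs(np.linalg.svd(X, compute_uv=False) - 1.0)))

# The residual depends on conditioning: ill-conditioned momentum
# buffers need more iterations to reach a given tolerance.
rng = np.random.default_rng(0)
U, _ = np.linalg.qr(rng.normal(size=(8, 8)))
V, _ = np.linalg.qr(rng.normal(size=(8, 8)))
G = U @ np.diag(np.geomspace(100.0, 1.0, 8)) @ V.T   # condition number 100
residuals = {k: ns_residual(G, k) for k in (5, 10, 20)}
# residuals[k] shrinks toward 0 as k grows; the smallest singular
# value is the slowest to converge, so it sets the residual.
```

Reporting such residuals alongside the iteration count would directly address whether the approximation error is small enough to preserve the average-singular-value scaling.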

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help clarify the scope of our theoretical results. We address each major point below with clarifications and proposed revisions.

Point-by-point responses
  1. Referee: Stability result (first main theorem): the proof that the maximal step size scales with the average singular value of the gradient assumes exact unit singular values post-orthogonalization. The implementation uses a fixed number of Newton-Schulz iterations, which leave residual approximation error for ill-conditioned inputs; this error can restore dependence on the largest singular value, undermining the claimed scaling.

    Authors: We acknowledge that the stability theorem is derived under the assumption of exact orthogonalization (unit singular values). The practical implementation employs a fixed number of Newton-Schulz iterations (typically 5), which yields an approximation. In the revised manuscript we will add a dedicated subsection analyzing the residual error of the iteration: a convergence bound for Newton-Schulz on matrices with bounded condition number, together with empirical measurements showing that, for the gradient spectra encountered in the networks studied, the largest singular value after 5 iterations deviates from 1 by less than 0.01. This error preserves the claimed scaling with the average singular value up to a small constant factor. Additional ablation experiments will quantify stability as a function of iteration count. revision: partial

  2. Referee: Convergence-factor improvement (second main result): the derivation is performed under an explicit Kronecker-factored curvature model for the loss landscape. If this model is chosen or fitted to reproduce observed behavior, the reported improvement reduces in part to the modeling assumption rather than an independent geometric consequence of spectral flattening.

    Authors: The Kronecker-factored curvature model is a standard modeling choice in the analysis of preconditioned and second-order methods (as in K-FAC and related work) and is not fitted to the observed optimizer behavior. Under this model, the improvement in the effective convergence factor follows directly from the spectral flattening property of orthogonalization; the derivation is therefore a geometric consequence conditional on the curvature structure, which we state explicitly. The experiments provide separate empirical support by demonstrating stability and faster convergence at learning rates where the theory predicts gains, without relying on model fitting. In revision we will expand the discussion of the modeling assumptions and their justification from the literature, while clarifying that the geometric insight is tied to the model. revision: partial

Circularity Check

0 steps flagged

No significant circularity in the derivation chain

Full rationale

The paper's two central results—a proof that Muon's maximal stable step size scales with the average (rather than largest) singular value of the gradient, and an improvement in effective convergence factor under a Kronecker-factored curvature model—are derived from the explicit geometric effect of replacing singular values with ones via orthogonalization and from the stated analytical model, respectively. These steps rely on standard matrix properties and conditional assumptions presented as independent of the target claims, without any reduction of predictions to fitted parameters, self-definitional loops, or load-bearing self-citations by construction. The analysis treats the idealized orthogonalization case for the stability bound (standard in convergence proofs) and does not smuggle ansatzes or rename known results; the derivations remain self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The second theoretical result rests on the domain assumption of a Kronecker-factored curvature model; the first result on spectral scaling appears to follow from linear-algebra properties of orthogonal matrices and does not introduce additional free parameters or invented entities.

axioms (1)
  • domain assumption Kronecker-factored curvature model
    Invoked to recast Muon as a preconditioned gradient method and to derive the improvement in effective convergence factor controlled by the gradient covariance spectrum.
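For readers unfamiliar with the assumption, a Kronecker-factored curvature model takes the Hessian to be H = A ⊗ B for positive-definite factors A and B, which completely determines its spectrum from the factor spectra. A quick numerical check (illustrative only, not tied to the paper's specific factors):

```python
import numpy as np

rng = np.random.default_rng(0)

def spd(n):
    # Random symmetric positive-definite factor.
    M = rng.normal(size=(n, n))
    return M @ M.T + n * np.eye(n)

A, B = spd(3), spd(4)
H = np.kron(A, B)  # Kronecker-factored curvature model: H = A (x) B

# The eigenvalues of A (x) B are exactly the pairwise products
# lambda_i(A) * lambda_j(B), so the factor spectra fix the whole
# curvature spectrum -- the structure the second theorem leans on.
eigs_H = np.sort(np.linalg.eigvalsh(H))
eigs_prod = np.sort(np.outer(np.linalg.eigvalsh(A),
                             np.linalg.eigvalsh(B)).ravel())
print(np.allclose(eigs_H, eigs_prod))  # True
```

This is what makes the model tractable for the convergence analysis, and also why the claimed improvement inherits whatever error the factored approximation makes on real loss landscapes.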

pith-pipeline@v0.9.0 · 5481 in / 1299 out tokens · 56606 ms · 2026-05-14T19:50:59.028164+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

15 extracted references · 15 canonical work pages · 2 internal anchors

  1. [1] Jeremy Bernstein and Laker Newhouse. Old optimizer, new norm: An anthology. arXiv preprint arXiv:2409.20325, 2024.
  2. [2] Lizhang Chen, Jonathan Li, and Qiang Liu. Muon optimizes under spectral norm constraints. arXiv preprint arXiv:2506.15054, 2025.
  3. [3] Michael Crawshaw, Chirag Modi, Mingrui Liu, and Robert M. Gower. An exploration of non-Euclidean gradient descent: Muon and its many variants. arXiv preprint arXiv:2510.09827, 2025.
  4. [4] Damek Davis and Dmitriy Drusvyatskiy. When do spectral gradient updates help in deep learning? arXiv preprint arXiv:2512.04299, 2025.
  5. [5] Aman Gupta, Rafael Celente, Abhishek Shivanna, Daniel Thomas Braithwaite, Gregory Dexter, Shao Tang, Hiroto Udagawa, Daniel Silva, Rohan Ramanath, and Sathiya Keerthi. On quantizing the state of the Muon optimizer. arXiv preprint arXiv:2509.23106, 2025.
  6. [6] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  7. [7] Dmitry Kovalev. Understanding gradient orthogonalization for deep learning via non-Euclidean trust-region optimization. arXiv preprint arXiv:2503.12645, 2025.
  8. [8] Jiaxiang Li and Mingyi Hong. A note on the convergence of Muon and further. arXiv preprint arXiv:2502.02900, 2025.
  9. [9] Jingyuan Liu, Jianlin Su, Xingcheng Yao, Zhejun Jiang, Guokun Lai, Yulun Du, Yidao Qin, et al. Muon is scalable for LLM training. arXiv preprint arXiv:2502.16982, 2025.
  10. [10] Sushant Mehta, Raj Dandekar, Rajat Dandekar, and Sreedath Panat. Muon: Training and trade-offs with latent attention and MoE. arXiv preprint arXiv:2509.24406, 2025.
  11. [11] Saurabh Page, Advait Joshi, and S. S. Sonawane. Muonall: Muon variant for efficient finetuning of large language models. arXiv preprint arXiv:2511.06086, 2025.
  12. [12] Ishaan Shah, Anthony M. Polloreno, Karl Stratos, Philip Monk, Adarsh Chaluvaraju, Andrew Hojel, Andrew Ma, Anil Thomas, Ashish Tanwer, Darsh J. Shah, et al. Practical efficiency of Muon for pretraining. arXiv preprint arXiv:2505.02222, 2025.
  13. [13] Chongjie Si, Debing Zhang, and Wei Shen. AdaMuon: Adaptive Muon optimizer. arXiv preprint arXiv:2507.11005, 2025.
  14. [14] Amund Tveit, Bjørn Remseth, and Arve Skogvold. Muon optimizer accelerates grokking. arXiv preprint arXiv:2504.16041, 2025.
  15. [15] Minxin Zhang, Yuxuan Liu, and Hayden Schaeffer. Adam improves Muon: Adaptive moment estimation with orthogonalized momentum. arXiv preprint arXiv:2602.17080, 2026.