Scale-Invariant Neural Network Optimization: Norm Geometry and Heavy-Tailed Noise

Jiayu Zhang; Tianyi Lin

Scale-invariant first-order methods that use the spectral norm for matrix optimization under heavy-tailed noise require Omega(min{m,n} epsilon^{-(3p-2)/(p-1)}) oracle calls when max{m,n}/(min{m,n})^2 is large enough.

Reviewed by Pith at T0; open to challenge. T0 means a machine referee read the full paper against a public rubric.

T0 review · grok-4.3

2026-06-30 18:25 UTC pith:FPEVKMNQ

arxiv 2605.18528 v2 pith:FPEVKMNQ submitted 2026-05-18 math.OC cs.LG

keywords scale-invariant optimization, heavy-tailed noise, stochastic nonconvex optimization, matrix norms, neural network training, oracle complexity, dimension dependence, Scion method

verification ladder T0 review T1 audit T2 compute T3 formal T4 reserved

The pith

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper studies nonconvex stochastic optimization over rectangular matrices equipped with general norms, where the objective is to reach an epsilon-stationary point and the stochastic gradients have only a finite p-th moment. It proves that any scale-invariant first-order method using the spectral norm must incur a linear dependence on the smaller dimension min{m,n} in its oracle complexity whenever the matrix is sufficiently tall or wide. A batched Scion procedure matches this lower bound exactly, while a transported Scion procedure improves the exponent when the Hessian is additionally Lipschitz. These results matter because neural-network training routinely employs scale-invariant updates and encounters heavy-tailed gradient noise, so the bounds quantify unavoidable dimension-dependent costs for this practical class of methods.

Core claim

In nonconvex smooth stochastic optimization over R^{m x n} with p-th-moment heavy-tailed noise, when max{m,n}/(min{m,n})^2 is large enough, every scale-invariant first-order method that uses the spectral norm requires Omega(min{m,n} epsilon^{-(3p-2)/(p-1)}) oracle calls to reach an epsilon-stationary point. A batched Scion method attains the matching upper bound O(min{m,n} epsilon^{-(3p-2)/(p-1)}). When the Hessian is Lipschitz continuous, a transported Scion method further improves the rate to O(min{m,n} epsilon^{-(5p-3)/(2p-2)}).
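A short check, using only the two exponents stated in the core claim, suggests that the Hessian-Lipschitz assumption lowers the exponent of 1/epsilon by the same amount for every admissible p:

```latex
\frac{3p-2}{p-1} - \frac{5p-3}{2p-2}
  = \frac{(6p-4) - (5p-3)}{2(p-1)}
  = \frac{p-1}{2(p-1)}
  = \frac{1}{2},
\qquad p > 1.
```

At p = 2, for instance, the two rates read epsilon^{-4} and epsilon^{-7/2}.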

What carries the argument

The restricted class of scale-invariant first-order methods equipped with the spectral norm, together with the batched and transported Scion algorithms constructed to achieve the matching and improved rates.
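This review does not reproduce the paper's algorithms, so the following is only a minimal sketch of what a normalized spectral-norm step can look like, not the paper's batched or transported Scion. It assumes one reading of scale invariance (the step is unchanged when the gradient is multiplied by a positive constant) and uses the standard fact that the linear minimization oracle over the unit spectral-norm ball is -U V^T from the SVD. The names `spectral_lmo`, `batched_spectral_step`, and `lr` are hypothetical.

```python
import numpy as np


def spectral_lmo(grad: np.ndarray) -> np.ndarray:
    """Return argmin of <grad, D> over the unit spectral-norm ball, i.e. -U V^T."""
    u, _, vt = np.linalg.svd(grad, full_matrices=False)
    return -(u @ vt)


def batched_spectral_step(x: np.ndarray, grad_samples: np.ndarray, lr: float) -> np.ndarray:
    """One normalized step using the mean of a mini-batch of stochastic gradients.

    Hypothetical helper for illustration only; not the paper's batched Scion.
    """
    g_bar = np.mean(grad_samples, axis=0)
    return x + lr * spectral_lmo(g_bar)


rng = np.random.default_rng(0)
m, n = 6, 3
x = rng.standard_normal((m, n))
grads = rng.standard_normal((8, m, n))  # 8 synthetic stochastic gradient samples

step_a = batched_spectral_step(x, grads, lr=0.1) - x
step_b = batched_spectral_step(x, 1000.0 * grads, lr=0.1) - x  # rescaled gradients

# The two steps coincide, and each has spectral norm equal to lr.
print(np.allclose(step_a, step_b), np.linalg.norm(step_a, 2))
```

Because only the singular vectors of the averaged gradient enter the update, the step length is set entirely by `lr`, which is the property that makes such updates insensitive to gradient scale.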

Load-bearing premise

The lower bound holds only inside the restricted family of scale-invariant first-order methods rather than for arbitrary first-order methods.

What would settle it

Exhibiting any scale-invariant first-order method that reaches an epsilon-stationary point in o(min{m,n} epsilon^{-(3p-2)/(p-1)}) oracle calls on a sequence of tall or wide matrices with heavy-tailed noise would falsify the claimed lower bound.

If this is right

Scale-invariant methods cannot escape a linear factor of the smaller matrix dimension in their complexity under the stated noise model.
The batched Scion method saturates the lower bound for any p greater than 1.
Hessian Lipschitz continuity permits a strictly better exponent even while preserving scale invariance and spectral-norm geometry.
The transported Scion procedure remains compatible with the practical heuristics used in neural-network training.
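To see how the stated exponents behave as the tails get heavier (p closer to 1), the rates from the core claim can be tabulated. This is a small sketch; the function names are hypothetical and the formulas are copied from the rates quoted above.

```python
from fractions import Fraction


def batched_exponent(p: Fraction) -> Fraction:
    """Exponent a in epsilon^{-a} for the matching lower and upper bound: (3p-2)/(p-1)."""
    return (3 * p - 2) / (p - 1)


def transported_exponent(p: Fraction) -> Fraction:
    """Exponent a in epsilon^{-a} under a Lipschitz Hessian: (5p-3)/(2p-2)."""
    return (5 * p - 3) / (2 * p - 2)


for p in (Fraction(2), Fraction(3, 2), Fraction(5, 4), Fraction(11, 10)):
    a = batched_exponent(p)
    b = transported_exponent(p)
    print(f"p={float(p):.2f}  batched={float(a):.2f}  transported={float(b):.2f}")
```

Both exponents grow without bound as p approaches 1, so the cost of heavier tails dominates quickly even before the min{m,n} factor is taken into account.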

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The dimension dependence may force practitioners to increase batch sizes or layer widths in a coordinated way for very unbalanced layers.
The transported construction suggests that injecting curvature information can partially offset the penalty paid for heavy tails.
Similar lower-bound techniques could apply to other scale-invariant problems such as matrix factorization or attention layers.
Empirical verification on networks with deliberately unbalanced layer dimensions would test whether the theoretical gap between batched and transported Scion appears in wall-clock time.
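The lower bound applies when max{m,n}/(min{m,n})^2 is large enough, and the review notes that the threshold is not quantified. The sketch below therefore only computes that quantity for a few layer shapes; the shapes and the name `aspect_quantity` are illustrative assumptions, not taken from the paper.

```python
def aspect_quantity(m: int, n: int) -> float:
    """Return max{m, n} / (min{m, n})**2, the ratio in the lower-bound condition."""
    small, large = min(m, n), max(m, n)
    return large / small**2


# Hypothetical layer shapes, chosen only to show the scale of the ratio.
shapes = {
    "square 1024 x 1024": (1024, 1024),
    "wide 1024 x 4096": (1024, 4096),
    "embedding-like 50000 x 64": (50000, 64),
    "vector-like 4096 x 8": (4096, 8),
}

for name, (m, n) in shapes.items():
    print(f"{name:28s} ratio = {aspect_quantity(m, n):.4f}")
```

The ratio exceeds 1 only when the larger dimension exceeds the square of the smaller one, so square or mildly rectangular layers sit far from that regime while very thin layers can be well inside it.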

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit.

Desk Editor's Note

The paper pins down a dimension-dependent lower bound for scale-invariant first-order methods under heavy-tailed noise and matches it with batched and transported Scion algorithms.

The main takeaway is a lower bound: under p-th-moment heavy-tailed noise, any scale-invariant first-order method using the spectral norm needs Ω(min{m,n} ε^{-(3p-2)/(p-1)}) oracle calls on matrix problems when max{m,n}/(min{m,n})^2 is large enough. The authors match this rate with a batched Scion method and improve it to O(min{m,n} ε^{-(5p-3)/(2p-2)}) with a transported Scion variant that relies on Hessian Lipschitz continuity.

What stands out as new is the explicit aspect-ratio dependence in the lower bound and the two algorithm constructions that stay inside the scale-invariant class while handling the noise. The paper also folds in some practical heuristics and runs them on neural nets of varying sizes.

It does a solid job linking scale-invariance (for hyperparameter transfer across model sizes) to the matrix norm geometry and to heavy-tailed gradients, both of which matter in practice. The rates are concrete and dimension-aware, which is useful when thinking about large models.

The softer parts are threefold. The lower bound applies only to scale-invariant methods, not to arbitrary first-order ones; that restriction is deliberate for the neural-net setting but narrows the claim. The faster transported rate needs Hessian Lipschitz continuity, an assumption that rarely holds exactly for neural losses. And the experiments are only sketched, so it is hard to judge whether the theoretical gains show up in wall-clock time or against strong baselines.

This is for people working on theoretical guarantees for deep-learning optimizers. Anyone who cares about worst-case rates under realistic noise models will find the bounds and constructions worth reading. The work is coherent on its own terms and the central claims line up without obvious circularity, so it deserves a serious referee to check the proofs and the experimental details.

Referee Report

0 major / 3 minor

Summary. The manuscript studies nonconvex stochastic optimization over matrices in R^{m×n} equipped with general norms, focusing on scale-invariant first-order methods under p-th moment heavy-tailed noise. It establishes a dimension-dependent lower bound of Ω(min{m,n} ε^{-(3p-2)/(p-1)}) oracle calls for any scale-invariant method using the spectral norm when max{m,n}/(min{m,n})^2 is sufficiently large. A batched Scion method is shown to achieve a matching O(min{m,n} ε^{-(3p-2)/(p-1)}) upper bound. Under an additional Hessian-Lipschitz assumption, a transported Scion method improves the rate to O(min{m,n} ε^{-(5p-3)/(2p-2)}). The work concludes with practical heuristics and empirical evaluations on neural network architectures of varying sizes.

Significance. If the stated bounds hold, the paper supplies the first explicit dimension-dependent complexity characterization for the restricted but practically relevant class of scale-invariant methods, together with matching algorithms and a higher-order improvement. The explicit construction of the lower-bound function class, the matching upper bounds for the proposed Scion variants, and the empirical validation across model sizes constitute concrete strengths that advance the theoretical understanding of norm geometry and heavy-tailed noise in neural network optimization.

minor comments (3)

[Abstract and §1] The abstract states the lower bound applies specifically to scale-invariant methods with the spectral norm, yet the setup is introduced for general norms; a brief clarification in §1 or §2 on why the lower-bound construction is specialized to the spectral case (and whether analogous results hold for other norms) would improve readability.
[§4] Notation for the batched and transported Scion methods is introduced without an explicit algorithmic listing or pseudocode in the main text; adding a compact algorithm box (e.g., Algorithm 1) would make the distinction between the two variants and the role of the transport map easier to follow.
[Theorem 3.1 (or equivalent)] The condition “when max{m,n}/(min{m,n})^2 is large enough” is used for the lower bound but is not quantified with an explicit threshold; stating the minimal aspect-ratio requirement (even as a sufficiently large constant) would make the result statement self-contained.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive summary of the manuscript, recognition of its significance in providing the first explicit dimension-dependent complexity results for scale-invariant methods under heavy-tailed noise, and recommendation of minor revision. We appreciate the constructive feedback.

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained

The paper establishes a dimension-dependent lower bound on oracle complexity for any scale-invariant first-order method (under the stated condition on m,n and spectral norm) and then constructs batched Scion and transported Scion algorithms that achieve matching or improved upper bounds. These are standard worst-case complexity results over a function class and noise model; the abstract and strongest claim explicitly restrict the lower-bound class to scale-invariant methods, which is a modeling choice rather than a self-referential definition. No equations reduce a claimed rate to a fitted parameter, no load-bearing self-citation chains appear, and no ansatz or renaming is invoked to force the result. The derivation chain is therefore independent of its own outputs.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 2 invented entities

Claims rest on the p-th moment bounded noise model and the definition of scale-invariance; no free parameters are visible in the abstract. The Scion methods are newly introduced algorithmic constructions.
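As a rough illustration of what a p-th moment bound without finite variance looks like, the sketch below samples scalar Student-t noise, whose absolute moment of order p is finite only for p below the degrees of freedom. The distribution and the value `df = 1.8` are illustrative assumptions; the paper's noise model concerns stochastic gradients and its lower-bound construction is not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)
df = 1.8  # Student-t degrees of freedom; E|X|^p is finite only for p < df
noise = rng.standard_t(df, size=200_000)

for p in (1.2, 1.5, 2.0):
    status = "finite" if p < df else "infinite"
    empirical = np.mean(np.abs(noise) ** p)
    print(f"p={p}: empirical E|X|^p = {empirical:.3f} (population moment is {status})")
```

For p = 2 the empirical average is dominated by a handful of extreme draws and does not settle as the sample grows, which is the behaviour the p-th moment assumption with p < 2 is meant to accommodate.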

axioms (2)

domain assumption: Stochastic gradients satisfy a p-th moment bound for some p > 1.
Explicitly stated as the noise model for the complexity results.
domain assumption: The problem is nonconvex smooth stochastic optimization over matrices equipped with a general norm.
Defines the setting in which scale-invariance and the spectral norm are considered.

invented entities (2)

batched Scion method (no independent evidence)
purpose: Scale-invariant algorithm achieving the matching upper bound.
Newly proposed to attain the lower-bound rate.
transported Scion method (no independent evidence)
purpose: Variant that exploits Hessian Lipschitz continuity for an improved rate.
Introduced to obtain the higher-order smoothness bound.

pith-pipeline@v0.9.1-grok · 5875 in / 1622 out tokens · 44308 ms · 2026-06-30T18:25:16.878685+00:00 · methodology

Original abstract

A growing lesson from neural network optimization is that optimizer design should respect how the model is parametrized. Scale-invariant methods become important because their normalized layerwise updates can not only support hyperparameter transfer across model sizes but exploit input-output matrix norm geometry. At the same time, stochastic gradient noises in deep learning are often far from sub-Gaussian and may exhibit heavy tails. These crucial observations have shaped recent algorithmic principles for training neural networks, yet their joint theoretical consequences remain underexplored. In particular, it is unclear what dimension dependence is unavoidable for scale-invariant methods with general input-output matrix norm, and whether higher-order smoothness can accelerate training under heavy-tailed noise. We study these questions through nonconvex smooth stochastic optimization over $\mathbb{R}^{m\times n}$ with general norms, where the goal is to achieve an $\epsilon$-stationary point under $p^{\mathrm{th}}$-moment heavy-tailed noise. Our first contribution is a dimension-dependent lower bound: when $\frac{\max\{m,n\}}{(\min\{m,n\})^2}$ is large enough, any scale-invariant first-order method with spectral norm requires $\Omega(\min\{m, n\}\epsilon^{-\frac{3p-2}{p-1}})$ oracle calls. We prove that a batched Scion method with spectral norm achieves the matching upper bound of $O(\min\{m, n\}\epsilon^{-\frac{3p-2}{p-1}})$. To exploit higher-order smoothness, we propose a transported Scion method and improve the bound to $O(\min\{m, n\}\epsilon^{-\frac{5p-3}{2p-2}})$ when the norm is spectral and the Hessian is Lipschitz. Finally, we incorporate practical heuristics into our transported method and evaluate it across multiple architectures and model sizes, demonstrating its flexibility and compatibility in training neural networks.

[1] [1]

K. Ahn, N. Amsel, and J. Langford. Dion2: A simple method to shrink matrix in M uon. ArXiv Preprint: 2512.16928, 2025 a

work page arXiv 2025

[2] [2]

K. Ahn, B. Xu, N. Abreu, Y. Fan, G. Magakyan, P. Sharma, Z. Zhan, and J. Langford. Dion: Distributed orthonormalized updates. ArXiv Preprint: 2504.05295, 2025 b

work page arXiv 2025

[3] [3]

Amsel, D

N. Amsel, D. Persson, C. Musco, and R. M. Gower. The P olar E xpress: Optimal matrix sign methods and their application to the M uon algorithm. In ICLR, 2026. URL https://openreview.net/forum?id=yRtgZ1K8hO

work page 2026

[4] [4]

K. An, Y. Liu, R. Pan, Y. Ren, S. Ma, D. Goldfarb, and T. Zhang. ASGO : Adaptive structured gradient optimization. In NeurIPS, 2025. URL https://openreview.net/forum?id=fru52tkjHf

work page 2025

[5] [5]

Arjevani, Y

Y. Arjevani, Y. Carmon, J. C. Duchi, D. J. Foster, N. Srebro, and B. Woodworth. Lower bounds for non-convex stochastic optimization. Mathematical Programming, 199 0 (1): 0 165--214, 2023

work page 2023

[6] [6]

J. Ba, J. R. Kiros, and G. E. Hinton. Layer normalization. In NIPS Workshop on Deep Learning Symposium, 2016. URL https://openreview.net/forum?id=BJLa_ZC9

work page 2016

[7] [7]

K. Ball. An elementary introduction to modern convex geometry. In Silvio Levy, editor, Flavors of Geometry, volume 31 of Mathematical Sciences Research Institute Publications, pages 1--58. Cambridge University Press, 1997

work page 1997

[8] [8]

K. Ball, E. A. Carlen, and E. H. Lieb. Sharp uniform convexity and smoothness inequalities for trace norms. Inventiones Mathematicae, 115 0 (1): 0 463--482, 1994

work page 1994

[9] [9]

Balles and P

L. Balles and P. Hennig. Dissecting A dam: The sign, magnitude and variance of stochastic gradients. In ICML, pages 404--413. PMLR, 2018

work page 2018

[10] [10]

Longformer: The Long-Document Transformer

I. Beltagy, M. E. Peters, and A. Cohan. Longformer: The long-document transformer. ArXiv Preprint: 2004.05150, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2004

[11] [11]

Bernstein and L

J. Bernstein and L. Newhouse. Old optimizer, new norm: An anthology. In NeurIPS Workshop on Optimization for Machine Learning, 2024. URL https://openreview.net/forum?id=ux18f5nOpD

[12]

J. Bernstein and L. Newhouse. Modular duality in deep learning. In ICML, pages 3920--3930. PMLR, 2025

[13]

J. Bernstein, Y-X. Wang, K. Azizzadenesheli, and A. Anandkumar. SignSGD: Compressed optimisation for non-convex problems. In ICML, pages 560--569. PMLR, 2018

[14]

J. Bernstein, J. Zhao, K. Azizzadenesheli, and A. Anandkumar. SignSGD with majority vote is communication efficient and fault tolerant. In ICLR, 2019. URL https://openreview.net/forum?id=BJxhijAcY7

[15]

S. Bubeck, N. Cesa-Bianchi, and G. Lugosi. Bandits with heavy tail. IEEE Transactions on Information Theory, 59(11):7711--7717, 2013

[16]

D. E. Carlson, E. Collins, Y-P. Hsieh, L. Carin, and V. Cevher. Preconditioned spectral descent for deep learning. In NeurIPS, pages 2971--2979, 2015

[17]

L. Chen, B. Liu, K. Liang, and Q. Liu. Lion secretly solves a constrained optimization: As Lyapunov predicts. In ICLR, 2024. URL https://openreview.net/forum?id=e4xS9ZarDr

[18]

L. Chen, J. Li, and Q. Liu. Muon optimizes under spectral norm constraints. In NeurIPS Workshop on Optimization for Machine Learning, 2025. URL https://openreview.net/forum?id=bBSq533vFH

[19]

X. Chen, C. Liang, D. Huang, E. Real, K. Wang, H. Pham, X. Dong, T. Luong, C-J. Hsieh, Y. Lu, et al. Symbolic discovery of optimization algorithms. In NeurIPS, pages 49205--49233, 2023

[20]

S. Chezhegov, K. Yaroslav, A. Semenov, A. Beznosikov, A. Gasnikov, S. Horváth, M. Takáč, and E. Gorbunov. Clipping improves Adam-Norm and AdaGrad-Norm when the noise is heavy-tailed. In ICML, pages 10269--10333. PMLR, 2025

[21]

K. Cho, B. Van Merriënboer, Ç. Gulçehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In EMNLP, pages 1724--1734, 2014

[22]

S. Choudhury, X. Cheng, M. Takáč, S. Na, and M. Kolar. Muon with Nesterov momentum: Heavy-tailed noise and (randomized) inexact polar decomposition. ArXiv Preprint: 2605.06884, 2026

[23]

A. Cutkosky and H. Mehta. Momentum improves normalized SGD. In ICML, pages 2260--2268. PMLR, 2020

[24]

A. Cutkosky and H. Mehta. High-probability bounds for non-convex stochastic optimization with heavy tails. In NeurIPS, pages 4883--4895, 2021

[25]

F. D'Angelo, M. Andriushchenko, A. Varre, and N. Flammarion. Why do we need weight decay in modern deep learning? In NeurIPS, pages 23191--23223, 2024

[26]

T. Dao. FlashAttention-2: Faster attention with better parallelism and work partitioning. In ICLR, 2024. URL https://openreview.net/forum?id=mZn2Xyh9Ec

[27]

D. Davis and D. Drusvyatskiy. When do spectral gradient updates help in deep learning? ArXiv Preprint: 2512.04299, 2025

[28]

A. Defazio, X. Yang, H. Mehta, K. Mishchenko, A. Khaled, and A. Cutkosky. The road less scheduled. In NeurIPS, pages 9974--10007, 2024

[29]

N. S. Dey, B. C. Zhang, L. Noci, M. Li, B. Bordelon, S. Bergsma, C. Pehlevan, B. Hanin, and J. Hestness. Don't be lazy: CompleteP enables compute-efficient deep transformers. In NeurIPS, 2025. URL https://openreview.net/forum?id=lMU2kaMANl

[30]

S. Diao, Y. Yang, Y. Fu, X. Dong, D. Su, M. Kliegl, Z. Chen, P. Belcak, Y. Suhara, H. Yin, et al. Nemotron-CLIMB: Clustering-based iterative data mixture bootstrapping for language model pre-training. ArXiv Preprint: 2504.13161, 2025

[31]

T. Dozat. Incorporating Nesterov momentum into Adam. In ICLR Workshop Track, 2016. URL https://openreview.net/forum?id=OM0jvwB8jIp57ZJjtNEZ

[32]

S. Dragutinović and R. Ranganath. To use or not to use Muon: How simplicity bias in optimizers matters. In ICLR Workshop on Scientific Methods for Understanding Deep Learning, 2026. URL https://openreview.net/forum?id=GsZtgQf3IM

[33]

Z. Du and W. Su. The Newton-Muon optimizer. ArXiv Preprint: 2604.01472, 2026

[34]

J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, 12:2121--2159, 2011

[35]

S. S. Duvvuri, F. Devvrit, R. Anil, C-J. Hsieh, and I. S. Dhillon. Combining axes preconditioners through Kronecker approximation for deep learning. In ICLR, 2024. URL https://openreview.net/forum?id=8j9hz8DVi8

[36]

C. Fan, M. Schmidt, and C. Thrampoulidis. Implicit bias of spectral descent and Muon on multiclass separable data. In NeurIPS, 2026. URL https://openreview.net/forum?id=Zn2ajV1kTQ

[37]

S. Ghadimi and G. Lan. Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization, 23(4):2341--2368, 2013

[38]

X. Glorot and Y. Bengio. Understanding the difficulty of training deep feedforward neural networks. In AISTATS, pages 249--256. PMLR, 2010

[39]

D. Goldfarb, Y. Ren, and A. Bahamou. Practical quasi-Newton methods for training deep neural networks. In NeurIPS, pages 2386--2396, 2020

[40]

W. Gong, J. Zazo, Q. Luo, P. Wang, J. Hensman, and C. Ma. ARO: A new lens on matrix optimization for large models. ArXiv Preprint: 2602.09006, 2026

[41]

A. Gonon, A-A. Mușat, and N. Boumal. Insights on Muon from simple quadratics. ArXiv Preprint: 2602.11948, 2026

[42]

E. Gorbunov, A. Sadiev, M. Danilova, S. Horváth, G. Gidel, P. Dvurechensky, A. Gasnikov, and P. Richtárik. High-probability convergence for composite and distributed stochastic minimization and variational inequalities with heavy-tailed noise. In ICML, pages 15951--16070. PMLR, 2024

[43]

E. Gronich and G. Vardi. The implicit bias of Adam and Muon on smooth homogeneous neural networks. ArXiv Preprint: 2602.16340, 2026

[44]

R. Grosse and J. Martens. A Kronecker-factored approximate Fisher matrix for convolution layers. In ICML, pages 573--582. PMLR, 2016

[45]

Y. Gu and Z. Xie. MANO: Restriking manifold optimization for LLM training. ArXiv Preprint: 2601.23000, 2026

[46]

V. Gupta, T. Koren, and Y. Singer. Shampoo: Preconditioned stochastic tensor optimization. In ICML, pages 1842--1850. PMLR, 2018

[47]

M. Gurbuzbalaban, U. Simsekli, and L. Zhu. The heavy-tail phenomenon in SGD. In ICML, pages 3964--3975. PMLR, 2021

[48]

C. He, Z. Deng, and Z. Lu. Low-rank orthogonalization for large-scale matrix optimization with applications to foundation model training. ArXiv Preprint: 2509.11983, 2025

[49]

K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, pages 770--778. IEEE, 2016

[50]

A. Henry, P. R. Dachapally, S. S. Pawar, and Y. Chen. Query-key normalization for transformers. In Findings of the ACL: EMNLP, pages 4246--4253, 2020

[51]

G. E. Hinton, S. Osindero, and Y. W. Teh. A fast learning algorithm for deep belief nets. Neural Computation, 18(7):1527--1554, 2006

[52]

S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735--1780, 1997

[53]

D. Hsu and S. Sabato. Heavy-tailed regression with a generalized median-of-means. In ICML, pages 37--45. PMLR, 2014

[54]

F. Huang, Y. Luo, and S. Chen. LiMuon: Light and fast Muon optimizer for large models. ArXiv Preprint: 2509.14562, 2025

[55]

F. Hübler, I. Fatkhullin, and N. He. From gradient clipping to normalization for heavy tailed SGD. In AISTATS, pages 2413--2421. PMLR, 2025

[56]

S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, pages 448--456. PMLR, 2015

[57]

R. A. Jacobs, M. I. Jordan, S. J. Nowlan, and G. E. Hinton. Adaptive mixtures of local experts. Neural Computation, 3(1):79--87, 1991

[58]

R. Jiang, D. Maladkar, and A. Mokhtari. Provable complexity improvement of AdaGrad over SGD: Upper and lower bounds in stochastic non-convex optimization. In COLT, pages 3124--3158. PMLR, 2025

[59]

K. Jordan. 94\ ArXiv Preprint: 2404.00498, 2024

[60]

K. Jordan, Y. Jin, V. Boza, J. You, F. Cesista, L. Newhouse, and J. Bernstein. Muon: An optimizer for hidden layers in neural networks, 2024. URL https://kellerjordan.github.io/posts/muon/

[61]

S. P. Karimireddy, Q. Rebjock, S. Stich, and M. Jaggi. Error feedback fixes SignSGD and other gradient compression schemes. In ICML, pages 3252--3261. PMLR, 2019

[62]

A. Karpathy. nanochat: The best ChatGPT that $100 can buy, 2025. URL https://github.com/karpathy/nanochat

[63]

G. Y. Kim and M-h. Oh. Convergence of Muon with Newton-Schulz. In ICLR, 2026. URL https://openreview.net/forum?id=lJSfxtLpLm

[64]

D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In ICLR, 2015. URL https://openreview.net/forum?id=8gmWwjFyLj

[65]

D. Kovalev. Understanding gradient orthogonalization for deep learning via non-Euclidean trust-region optimization. ArXiv Preprint: 2503.12645, 2025

[66]

A. Krizhevsky. Learning multiple layers of features from tiny images. Technical report, Department of Computer Science, University of Toronto, April 2009. URL https://www.cs.toronto.edu/~kriz/learning-features-2009-TR.pdf

[67]

A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In NeurIPS, pages 1097--1105, 2012

[68]

F. Kunstner and F. Bach. Scaling laws for gradient descent and sign descent for linear bigram models under Zipf's law. In NeurIPS, 2025. URL https://openreview.net/forum?id=VUbwLjLkws

[69]

F. Kunstner, J. Chen, J. W. Lavington, and M. Schmidt. Noise is not the main factor behind the gap between SGD and Adam on transformers, but sign descent might be. In ICLR, 2023. URL https://openreview.net/forum?id=a65YK0cqH8g

[70]

F. Kunstner, A. Milligan, R. Yadav, M. Schmidt, and A. Bietti. Heavy-tailed class imbalance and why Adam outperforms gradient descent on language models. In NeurIPS, pages 30106--30148, 2024

[71]

T. Large, Y. Liu, M. Huh, P. Isola, H. Bahng, and J. Bernstein. Scalable optimization in the modular norm. In NeurIPS, pages 73501--73548, 2024

[72]

T. T-K. Lau, Q. Long, and W. Su. PolarGrad: A class of matrix-gradient optimizers from a unifying preconditioning perspective. ArXiv Preprint: 2505.21799, 2025

[73]

Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278--2324, 1998

[74]

H. Li, A. Rakhlin, and A. Jadbabaie. Convergence of Adam under relaxed assumptions. In NeurIPS, pages 52166--52196, 2023

[75]

H. Li, Y. Dong, and Z. Lin. On the $O(\sqrt{d}/T^{1/4})$ convergence rate of RMSProp and its momentum extension measured by $\ell_1$ norm. The Journal of Machine Learning Research, 26(131):1--25, 2025a

[76]

J. Li and M. Hong. A note on the convergence of Muon. ArXiv Preprint: 2502.02900, 2025

[77]

J. Li, A. Fang, G. Smyrnis, M. Ivgi, M. Jordan, S. Gadre, H. Bansal, E. Guha, S. Keh, K. Arora, et al. DataComp-LM: In search of the next generation of training sets for language models. In NeurIPS, pages 14200--14282, 2024

[78]

Y. Li, B. L. Pandey, R. Sah, A. Han, C. Mostajeran, P. Jawanpuria, and B. Mishra. Intrinsic Muon: Spectral optimization on Riemannian matrix manifolds. ArXiv Preprint: 2605.09238, 2026

[79]

Z. Li, L. Liu, C. Liang, W. Chen, and T. Zhao. NorMuon: Making Muon more efficient and scalable. ArXiv Preprint: 2510.05491, 2025b

[80]

J. Liu, J. Su, X. Yao, Z. Jiang, G. Lai, Y. Du, Y. Qin, W. Xu, E. Lu, J. Yan, et al. Muon is scalable for LLM training. ArXiv Preprint: 2502.16982, 2025a
